Posted to general@incubator.apache.org by fp <fp...@lucene.cn> on 2021/02/27 10:33:01 UTC

[Proposal] lxdb - proposal for Apache Incubation

Dear Apache Incubator Community,


Please accept the following proposal for presentation and discussion:
https://github.com/lucene-cn/lxdb/wiki


LXDB is a high-performance OLAP and full-text search database. It is based on HBase, but replaces HFile with a Lucene index to support more effective secondary indexes. It is also based on Spark SQL, so you can use the SQL API to access data and run OLAP computations. The Lucene index is stored on HDFS (not on local disk).


In our production systems, LXDB supports 200+ clusters, some of which have 1000+ nodes. It ingests 200 billion rows per day (20,000 billion rows in total), and one of the biggest single tables has 200 million Lucene indexes on LXDB.


Doug Cutting, the father of Hadoop, split Nutch into HBase, MapReduce (Hive), HDFS, and Lucene. We have merged these separate projects again: LXDB equals Spark SQL + HBase + Lucene + Parquet + HDFS; it is a super database. It took me 10 years to complete this merging work. The goal, however, is no longer a search engine but a database.





Best regards
  yannian mu




LXDB Proposal
== Abstract ==
LXDB is a high-performance OLAP and full-text search database.


=== It is based on HBase, but replaces HFile with a Lucene index to support more effective secondary indexes ===
We modified the HBase region server and replaced HFile with Lucene: on a put, the data is written to Lucene as a document instead of being written to an HFile.
The Lucene index is stored on the region server. (It is not stored in a separate cluster as in an Elasticsearch + HBase setup, which takes two copies of the data.)
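
A minimal sketch of the write path described above, assuming a toy in-memory inverted index in place of Lucene. The class name `ToyIndexedRegion` and its methods are hypothetical and are not LXDB's or HBase's API; the sketch only illustrates that a single put both stores the row and indexes every column, so no second copy in a separate search cluster is needed.

```python
from collections import defaultdict


class ToyIndexedRegion:
    """Hypothetical stand-in for a region whose store is an inverted index."""

    def __init__(self):
        self.rows = {}                    # row key -> {column: value}
        self.postings = defaultdict(set)  # (column, term) -> set of row keys

    @staticmethod
    def _terms(value):
        # Naive tokenizer: lowercase and split on whitespace.
        return str(value).lower().split()

    def put(self, row_key, columns):
        # A put replaces the whole row: drop the old postings, then re-index.
        old = self.rows.get(row_key)
        if old is not None:
            for col, val in old.items():
                for term in self._terms(val):
                    self.postings[(col, term)].discard(row_key)
        self.rows[row_key] = dict(columns)
        for col, val in columns.items():
            for term in self._terms(val):
                self.postings[(col, term)].add(row_key)

    def get(self, row_key):
        # Key-value access by row key, as with a plain HBase get.
        return self.rows.get(row_key)

    def search(self, column, term):
        # Secondary-index access: any column can be searched by term.
        return sorted(self.postings.get((column, term.lower()), set()))


region = ToyIndexedRegion()
region.put("r1", {"title": "Lucene on HDFS", "city": "Beijing"})
region.put("r2", {"title": "HBase region server", "city": "Beijing"})
before_update = region.search("city", "beijing")
region.put("r1", {"title": "Lucene on HDFS", "city": "Shanghai"})
after_update = region.search("city", "beijing")
```

Updating `r1` removes it from the `beijing` postings, which is the key-value update behaviour the proposal wants to keep from HBase.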


=== It is based on Spark SQL for OLAP ===
We integrated Spark and HBase together. Usage looks like this:
1. Unpack lxdb.tar.gz.
2. Configure the hadoop_config path.
3. Run start-all.sh to start the cluster.
LXDB starts Spark through Hadoop YARN, and each Spark executor process then starts an embedded HBase region server service.


You can operate the LXDB database through the Spark SQL API (Hive) or the MySQL API.
1. The SQL layer uses a Spark RDD plus an HBase scanner to access HBase.
2. The SQL conditions (filters or GROUP BY aggregations) are pushed down to HBase.
3. HBase uses the Lucene index to filter data in the region server.
Spark, HBase, and Lucene are all embedded and integrated together rather than running as separate clusters. That is the difference from a Solr/ES + HBase + Spark solution.
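
The three query steps above can be sketched as follows. This is an illustration under stated assumptions, not LXDB code: the table contents, `region_scan`, and `count_by` are hypothetical, the index is a plain dictionary rather than Lucene, and only the filter is pushed down (the proposal also pushes aggregation down, which this sketch does not attempt).

```python
from collections import Counter, defaultdict

# Hypothetical table held by one region: row key -> columns.
ROWS = {
    "r1": {"city": "beijing", "status": "ok"},
    "r2": {"city": "shanghai", "status": "error"},
    "r3": {"city": "beijing", "status": "error"},
    "r4": {"city": "beijing", "status": "ok"},
}


def build_index(rows):
    """Build a (column, value) -> row keys index, standing in for Lucene."""
    index = defaultdict(set)
    for key, cols in rows.items():
        for col, val in cols.items():
            index[(col, val)].add(key)
    return index


INDEX = build_index(ROWS)


def region_scan(rows, index, equals=None):
    """Storage side: return only rows matching the pushed-down equality filter."""
    if equals is None:
        return [rows[k] for k in sorted(rows)]
    col, val = equals
    return [rows[k] for k in sorted(index.get((col, val), ()))]


def count_by(rows, column):
    """Compute side: the GROUP BY / COUNT step over whatever the scan returned."""
    return Counter(r[column] for r in rows)


# SELECT status, count(*) FROM t WHERE city = 'beijing' GROUP BY status
shipped = region_scan(ROWS, INDEX, equals=("city", "beijing"))
result = count_by(shipped, "status")
```

Because the filter runs next to the data, only the three matching rows leave the storage side instead of all four.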


== Background ==
=== Multiple copies of data ===
Apache HBase + Elasticsearch is the most popular solution for full-text search, but it is weak at online analytical processing (OLAP).
So most of the time a production system uses Spark (or Hive, Impala, or Presto), HBase, and Solr/ES at the same time. Multiple copies of the data are stored in multiple systems, the systems have different APIs, and data consistency is difficult to guarantee. For these reasons we merged Spark, HBase, and Elasticsearch into one project. Its goal is to use one copy of the data, one cluster, and one API to cover OLAP, KV, full-text search, and other database scenarios.


=== Merging and splitting Lucene indexes (HStore) across different machines on HDFS ===
As we all know, Solr/ES stores its files in the local file system, so the number of shards must be fixed. If we store the index on HDFS instead, the index becomes splittable like an HBase HStore: it can be split or merged across machine nodes. This is very useful for a distributed database, because it determines how many resources can be allocated to a table. Most of the time the number of records in a table changes over time, so the number of shards always needs adjusting. An index stored on local disk cannot be split across different machines, but a Lucene index stored on HDFS can.
Whether the number of shards can be flexibly adjusted, and whether the system can scale elastically, is particularly important in a distributed database.
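
As a rough illustration of the idea, assuming each shard is just a dictionary keyed by row key (real Lucene segments are far more involved), the sketch below splits a shard at its middle row key, merges two shards back, and reassigns shards to nodes. All function and node names are hypothetical. The point is that when the data sits on shared storage, moving a shard to another machine only changes the shard-to-node mapping.

```python
def split_shard(shard):
    """Split one shard into two halves at its middle row key."""
    keys = sorted(shard)
    mid = len(keys) // 2
    left = {k: shard[k] for k in keys[:mid]}
    right = {k: shard[k] for k in keys[mid:]}
    return left, right


def merge_shards(a, b):
    """Merge two shards with disjoint key ranges into one."""
    merged = dict(a)
    merged.update(b)
    return merged


def assign(shards, nodes):
    """Round-robin shards over nodes; no data is copied, only the mapping changes."""
    return {i: nodes[i % len(nodes)] for i in range(len(shards))}


shard = {f"row{i:03d}": i for i in range(6)}
left, right = split_shard(shard)
placement = assign([left, right], ["node-a", "node-b"])
restored = merge_shards(left, right)
```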



=== Solving the shortcomings of secondary indexes ===
Some people use HBase secondary indexes, as in the Phoenix project. But those schemes, which are based on the HBase row key, carry a lot of redundancy: you cannot create too many indexes because the data inflation rate is too high. So using a Lucene index instead of a row-key-based secondary index is the best choice.


=== We add a Lucene index for Spark OLAP ===
Most OLAP systems, such as Hive, Spark SQL, Impala, and some MPP databases, suffer from brute-force scanning and poor data timeliness.
1. They use brute-force scans to compute over the data. Another choice is to add an index to the big data; in some cases an index greatly improves performance over the original brute-force scan. I think that, just as in a traditional database, indexing can greatly improve query speed.
2. Most of those databases or systems are offline or batch systems. LXDB's goal is real-time append and real-time KV update, just like HBase.
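
To make the first point concrete, here is a small sketch that counts how many rows each approach has to touch for the same selective filter. The data set is synthetic and the function names are hypothetical; it says nothing about LXDB's real performance, only that an index lets a selective query skip the non-matching rows.

```python
from collections import defaultdict


def brute_force(rows, column, value):
    """Examine every row; return (matching keys, rows examined)."""
    hits = [k for k, r in rows.items() if r[column] == value]
    return sorted(hits), len(rows)


def build_index(rows, column):
    """Build value -> row keys once, ahead of query time."""
    index = defaultdict(list)
    for k, r in rows.items():
        index[r[column]].append(k)
    return index


def indexed(index, value):
    """Look the value up directly; only matching rows are touched."""
    hits = index.get(value, [])
    return sorted(hits), len(hits)


# Synthetic log table: 1 row in 100 has level "error".
rows = {f"r{i}": {"level": "error" if i % 100 == 0 else "info"} for i in range(1000)}
index = build_index(rows, "level")

scan_hits, scan_cost = brute_force(rows, "level", "error")
idx_hits, idx_cost = indexed(index, "error")
```

Both approaches return the same 10 row keys, but the scan examines 1000 rows while the index lookup touches 10. The benefit shrinks as the filter becomes less selective, which matches the hedge in the text that an index helps "in some cases".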


== Future ==
=== Lucene on Parquet ===
In the near future I will change Lucene's tim and tip (inverted index) files and its dvd and dvm files to a format like Parquet or ORC. The goals are:
* To solve the performance problem of traversing a Lucene index.
* To solve the problem that opening a Lucene index requires loading files such as tip into memory, which makes opening the index slow.
* To enable Lucene to store multi-column joint indexes by column, for handling logic such as multi-table joins, materialized views, and multi-field GROUP BY on the inverted index.
The current Lucene index has many problems because of too many file pointers and its single-column design. We want to modify Lucene to make it more suitable for HDFS, so that it serves not only full-text retrieval but also statistical analysis, as a real database-level index. We also want Lucene to be splittable, so that storage can be separated from computation.




=== Supporting all kinds of predicate pushdown computation ===
We find that if we can combine the computation closely with the data, we can get more performance out of the database. An index is only one form of computation pushdown. Storage pushdown is another example: we can store the index on SSD devices and the data on SATA devices. We can store data that is often grouped together in advance, instead of computing it row by row. We can assign important tables or columns to dedicated devices and resources. HBase still lacks these capabilities, and we need to improve it further.


=== Controlling data distribution ===
We can use the row key to control which nodes the data is distributed to, which makes many interesting things possible.
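
One way to picture this, as a sketch only: route each row by a prefix of its row key so that related rows land on the same node. The `#` separator, the node names, and the `node_for` helper are assumptions made for illustration; the proposal does not say how LXDB maps row keys to nodes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(row_key, nodes=NODES, sep="#"):
    """Route by the row-key prefix (the part before `sep`), not the whole key."""
    prefix = row_key.split(sep, 1)[0]
    # Use a stable hash so the mapping is the same on every run and machine.
    digest = hashlib.md5(prefix.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]


# Rows sharing the prefix "user42" are placed together, whatever follows it.
same_day_1 = node_for("user42#2021-02-27")
same_day_2 = node_for("user42#2021-02-28")
```

Co-locating rows this way means a query over one user's data can be answered by a single node.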


=== Resource control, resource isolation ===
Lucene currently does not support resource isolation, but on HDFS we can do it: I can control the priority of SQL queries so that Lucene reads for higher-priority queries get faster I/O resources.
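
A minimal sketch of priority-based I/O scheduling, assuming read requests are queued and served highest priority first. The `PriorityIOQueue` class and the request labels are hypothetical; this is not how LXDB or HDFS schedule I/O, only an illustration of the ordering the text describes.

```python
import heapq
import itertools


class PriorityIOQueue:
    """Serve queued read requests highest priority first, FIFO within a priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, priority, request):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), request))

    def next_request(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = PriorityIOQueue()
queue.submit(1, "batch report read")
queue.submit(9, "interactive query read")
queue.submit(1, "backfill read")
order = [queue.next_request() for _ in range(3)]
```

The interactive query is served first even though it arrived second; the two low-priority reads keep their arrival order.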


== Status ==
In 2011 I released the first open source version, mdrill, at Alibaba. At that time, mdrill used 10 nodes with 48 GB machines to support 400 billion records. The first index on HDFS comes from this version; it was one year ahead of the community. https://github.com/alibaba/mdrill


In 2014 I stopped updating the mdrill project because I joined Tencent. In our team we developed the Hermes project, where we also built Lucene on HDFS. Hermes now imports 1000 billion rows of data per day in real time. It is the largest database I have ever developed. https://plus.tencent.com/bigdata/hermes


In 2018 I set up my own company, Luxin; "Lu Xin" is the Chinese pronunciation of Lucene. As a fan of Lucene, I chose lucene.xin as the company's domain and lucene.cn as its mail domain.
Luxin's first version of LXDB is called LSQL, which means Lucene SQL. It uses Lucene (2.5.3) + HDFS + Spark (1.6.3). It is stable, and about 200+ clusters use LSQL. It processes about 200 billion rows per day, with a total of 20,000 billion rows in one single cluster (1000 nodes).


In 2020, during COVID-19, our team decided to develop the next generation of LSQL, called LXDB (LX = the pronunciation of Lucene). We added HBase to LSQL to solve the update problem. We have now finished the first version of LXDB. https://github.com/lucene-cn/lxdb/wiki







== Known Risks ==
=== Meritocracy ===


LXDB has been deployed in production and is used by more than 200 lines of business. It has demonstrated great performance benefits and has proved to be a better way to do reporting and analysis on big data. Still, we look forward to growing a rich user and developer community.


=== Orphaned products ===


The core developers currently work full-time for Luxin.
LXDB is widely adopted by many companies and individuals, so there is no
realistic chance of it becoming orphaned. We also have a number of Tencent QQ instant messaging groups with 1000 members each.



=== Inexperience with Open Source ===

The core developers are all active users and followers of open source. They are already committers and contributors to the LXDB project. The lead developer, yannian mu, has ten years of experience on open source projects, including jstorm (https://github.com/alibaba/jstorm) and mdrill (https://github.com/alibaba/mdrill).




=== Homogenous Developers ===


Most of the core developers are from Luxin, because LXDB started as a closed source product. Once LXDB is open sourced, we expect it to receive many bug fixes and enhancements from developers not working at Luxin. What we learned from open source, we return to open source.





=== Reliance on Salaried Developers ===


Luxin invested in LXDB as its solution, and some of its key engineers are working full time on the project. In addition, since there is a growing need for scalable Big Data solutions, we look forward to other Apache developers and researchers contributing to the project. Key to addressing the risk of relying on salaried developers from a single entity is to increase the diversity of the contributors and to lobby actively for contributions. Apache LXDB intends to do this.


=== An Excessive Fascination with the Apache Brand ===


LXDB is proposing to enter incubation at Apache in order to help efforts to diversify the committer base, not so much to capitalize on the Apache brand. The LXDB project is already in production use inside Luxin, but is not expected to be a Luxin product for external customers. As such, the LXDB project is not seeking to use the Apache brand as a marketing tool.





=== Documentation ===


Information about LXDB can be found at https://github.com/lucene-cn/lxdb. The following links provide more information about LXDB in open source:


* wiki site: https://github.com/lucene-cn/lxdb/wiki
* Issue Tracking: https://github.com/lucene-cn/lxdb/issues
* Overview: https://github.com/lucene-cn/lxdb/wiki/intro
* Luxin home page: http://www.lucene.xin
* LSQL documentation: http://docs.lucene.xin/lsql/v21/



== Initial Source ==


LXDB will develop its source code under an Apache license at https://github.com/lucene-cn/lxdb.






=== Core Developers ===



Currently most of the core developers of LXDB work in the research team of Luxin.


- yannian mu (dev)
- yu chen (dev)
- guangshi hao (dev)
- wei sun (dev)
- qihua zheng (dev)
- xin wang (dev)
- qingsong liu (dev)
- anxing zhou (Tester)
- jiajun duan (Tester)



== External Dependencies ==

All dependencies are managed using Apache Maven.

Dependency    License               Optional?
lucene        Apache License 2.0    true
zookeeper     Apache License 2.0    true
hbase         Apache License 2.0    true
spark         Apache License 2.0    true
hadoop        Apache License 2.0    true
hive          Apache License 2.0    true




== Required Resources ==


=== Mailing lists ===


 * lxdb-private (PMC discussion)
 * lxdb-dev (developer discussion)
 * lxdb-user (user discussion)
 * lxdb-commits (SCM commits)
 * lxdb-issues (JIRA issue feed)


=== Subversion Directory ===


Instead of Subversion, LXDB prefers Git as its source control
management system: git://git.apache.org/lxdb

Re: [Proposal] lxdb - proposal for Apache Incubation

Posted by lidong dai <da...@gmail.com>.
Hi,
  Kammi’s summary is very comprehensive. Try to open source the project first, and
you'd better find an experienced mentor to help you; it will be very
helpful! Good luck.


Best Regards
---------------
DolphinScheduler(Incubator) PPMC
Lidong Dai
dailidong66@gmail.com
---------------


On Sun, Feb 28, 2021 at 6:52 PM Furkan KAMACI <fu...@gmail.com>
wrote:

> Hi,
>
> You actually have detailed documentation that explains how your approach
> compares to similar systems, along with the resulting performance metrics,
> e.g. reducing storage by 10 to 100 times or having low-latency
> queries.
>
> My advice is (some of it is the same as Sheng's and Liang's):
>
> 1) Find an experienced mentor to guide you.
>
> 2) Start translating your documentation into English.
>
> 3) Open source your project. How can we comment on your project if
> we cannot see anything about it?
>
> 4) Gain contributors to your project. At the least, you should show your
> intention to have committers/contributors from outside your company. Eliminate
> the risk of non-meritocratic management of the project.
>
> 5) Structure your proposal. Explain why people need this project, which
> problems current projects have, and how you managed to handle them. We
> should understand whether it is a bundle of other projects, a completely new
> project, or a wrapper around other projects that eliminates their
> shortcomings.
>
> 6) Find a suitable name for your project so that you do not have to solve
> trademark problems that may cost you time if you enter incubation.
>
> Kind Regards,
> Furkan KAMACI
>
>
> On Sun, Feb 28, 2021 at 1:02 PM Liang Chen <ch...@gmail.com>
> wrote:
>
> > Hi
> >
> > It would be better if you could find an experienced IPMC member to help
> > you prepare the proposal.
> > Based on Sheng Wu's input, I have one more comment: can you please explain
> > how this differs from other similar data analysis DBs? You can
> > consider explaining from a use-case perspective.
> >
> > Regards
> > Liang

Re: [Proposal] lxdb - proposal for Apache Incubation

Posted by Juan Pan <pa...@apache.org>.
Hi,


My +1 for the suggestions and summary from Furkan KAMACI.
I guess they reflect many of the IPMC's concerns.
Some of the items will take you plenty of time to handle,
and I am unsure whether now is the best time for you to propose.
But at least I suppose you now have a direction for improvement.


Sincerely,
Trista



-------------------------------------------------------
Email:panjuan@apache.org
Juan Pan(Trista) Apache ShardingSphere


On 02/28/2021 18:51,Furkan KAMACI<fu...@gmail.com> wrote:
Hi,

Actually you have a detailed documentation which explains which approach
you have compared to similar systems and performance metrics of following
them i.e. reducing storage 10 to the 100 times or having low latency
queries.

My advices are (some of them are same with Sheng's and Liang's ):

1) Find an experienced mentor to guide you.

2) Start to translate your documentation to English.

3) Open source your project. How can we have a comment on your project if
we cannot see anything about it?

4) Gain contributors to your project. At least you should show your
intention to have committers/contributors out of your company. Eliminate
the risk of being non-meritocratic management of the project.

5) Structure your proposal. Explain why people need this project, which
problems do current projects have and how you managed to handle them. We
should understand is it a bundle of other projects, a completely new
project, or a wrapper of other projects which eliminates the shortcomings
of them.

6) Find a suitable name for your project in order to not try to solve
trademark problems that may lose your time if you enter the incubation.

Kind Regards,
Furkan KAMACI


On Sun, Feb 28, 2021 at 1:02 PM Liang Chen <ch...@gmail.com> wrote:

Hi

It would be better if you could find an experienced IPMC member to help you
for preparing the proposal.
Based on Sheng Wu input, i have one more comment : can you please explain
what are the different with other similar data analysis DB?  you can
consider explaining from use cases perspective.

Regards
Liang



--
Sent from: http://apache-incubator-general.996316.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org



Re: Re: [Proposal] lxdb - proposal for Apache Incubation

Posted by "fp@lucene.cn" <fp...@lucene.cn>.
Hi Ming Wen,

I am very sorry; I have reformatted the message.

Here is my project info.
====================

Dear Apache Incubator Community,

Please accept the following proposal for presentation and discussion:
https://github.com/lucene-cn/lxdb/wiki

LXDB is a high-performance OLAP and full-text search database. It is based on HBase, but replaces HFile with a Lucene index to support more effective secondary indexes. It is also based on Spark SQL, so you can use a SQL API to access data and run OLAP computations. The Lucene index is stored on HDFS, not on local disk.

In our production systems, LXDB supports 200+ clusters, some of which have 1000+ nodes. It ingests 200 billion rows per day (20,000 billion rows in total), and one of the biggest single tables has 200 million Lucene indexes on LXDB.

Doug Cutting, the father of Hadoop, split Nutch into HBase, MapReduce (Hive), HDFS, and Lucene. We have merged these separate projects again: LXDB equals Spark SQL + HBase + Lucene + Parquet + HDFS; it is a super database. It took me 10 years to complete this merging work. The purpose, however, is no longer a search engine but a database.


Best regards
  yannian mu


LXDB Proposal
== Abstract ==
LXDB is a high-performance OLAP and full-text search database.

=== It is based on HBase, but replaces HFile with a Lucene index to support more effective secondary indexes ===
We modified the HBase region server to replace HFile with Lucene: on a put, the data is written as a document to Lucene instead of to an HFile.
The Lucene index is stored on the region server. It is not stored in a separate cluster, as in an Elasticsearch + HBase setup, which requires two copies of the data.

=== It is based on Spark SQL for OLAP ===
We integrated Spark and HBase together. Usage is as follows:
1. Unpack lxdb.tar.gz.
2. Configure the hadoop_config path.
3. Run start-all.sh to start the cluster.
LXDB starts Spark through Hadoop YARN, and each Spark executor process then starts an embedded HBase region server service.

You can operate the LXDB database through the Spark SQL API (Hive) or the MySQL API.
1. The SQL engine uses a Spark RDD plus an HBase scanner to access HBase.
2. The SQL conditions (filters or group-by aggregations) are pushed down to HBase.
3. HBase uses the Lucene index to filter data in the region server.
Spark, HBase, and Lucene are all embedded and integrated together rather than run as separate clusters; that is the difference from a Solr/ES + HBase + Spark solution.
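
As an illustration only, the three-step query path above can be sketched in Python. This is not LXDB code: the `Region` class and the `query` function are hypothetical stand-ins for a region server with an embedded index and for the SQL engine, and only equality predicates are modeled.

```python
from collections import defaultdict


class Region:
    """Hypothetical stand-in for one region: rows plus an inverted index."""

    def __init__(self):
        self.rows = {}                  # row key -> dict of column values
        self.index = defaultdict(set)   # (column, value) -> set of row keys

    def put(self, key, row):
        # A write updates the stored row and the index together,
        # so there is no second copy of the data in another system.
        old = self.rows.get(key)
        if old is not None:
            for col, val in old.items():
                self.index[(col, val)].discard(key)
        self.rows[key] = row
        for col, val in row.items():
            self.index[(col, val)].add(key)

    def scan(self, predicates):
        """Evaluate equality predicates inside the region using the index."""
        keys = None
        for col, val in predicates.items():
            hits = self.index.get((col, val), set())
            keys = set(hits) if keys is None else keys & hits
        if keys is None:                # no predicate: fall back to a full scan
            keys = set(self.rows)
        return [self.rows[k] for k in sorted(keys)]


def query(regions, predicates, group_col):
    """Engine side: push predicates to every region, then count per group."""
    counts = defaultdict(int)
    for region in regions:
        for row in region.scan(predicates):
            counts[row[group_col]] += 1
    return dict(counts)
```

The point of the sketch is that the filter runs where the data lives, so the engine only receives rows that already match.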

== Background ==
=== Multiple copies of data ===
Apache HBase + Elasticsearch is the most popular solution for full-text search, but it is weak at online analytical processing (OLAP).
So most production systems use Spark (or Hive, Impala, or Presto), HBase, and Solr/ES at the same time. Multiple copies of the data are stored in multiple systems, each system has a different API, and data consistency is difficult to guarantee. For these reasons we merged Spark, HBase, and Elasticsearch functionality into one project. Its goal is to use one copy of the data, one cluster, and one API to serve OLAP, KV, full-text, and other database scenarios.

=== Merging and splitting of Lucene indexes (HStore) across different machines on HDFS ===
As we all know, Solr/ES stores files on the local file system, so the number of shards must be fixed. If we store the index on HDFS, the index becomes splittable like an HBase HStore: it can be split or merged across machine nodes. This is very useful for a distributed database, because it determines how many resources are allocated to a table. The number of records in a table usually changes over time, so the number of shards always needs adjusting. An index stored locally cannot be split across different machines, but a Lucene index stored on HDFS can.
Whether the number of shards can be adjusted flexibly, that is, whether the system can scale elastically, is particularly important in a distributed database.
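
To make the contrast with a fixed shard count concrete, here is a minimal Python sketch of key-range sharding with split and merge, in the style of HBase regions. It is an assumption-laden toy, not LXDB code: the `RangeSharded` class is hypothetical, keys are assumed unique, and shards are plain in-memory lists rather than index files on HDFS.

```python
import bisect


class RangeSharded:
    """Hypothetical key-range sharding with split and merge."""

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.starts = [""]      # sorted start keys, one per shard
        self.shards = [[]]      # each shard is a sorted list of keys

    def _locate(self, key):
        # The owning shard is the last one whose start key is <= key.
        return bisect.bisect_right(self.starts, key) - 1

    def insert(self, key):
        i = self._locate(key)
        bisect.insort(self.shards[i], key)
        if len(self.shards[i]) > self.max_rows:
            self._split(i)

    def _split(self, i):
        # Cut an oversized shard in half; the right half becomes a new shard.
        shard = self.shards[i]
        mid = len(shard) // 2
        left, right = shard[:mid], shard[mid:]
        self.shards[i:i + 1] = [left, right]
        self.starts.insert(i + 1, right[0])

    def merge(self, i):
        """Merge shard i with its right neighbour."""
        self.shards[i:i + 2] = [self.shards[i] + self.shards[i + 1]]
        del self.starts[i + 1]
```

Because the shard boundaries are just metadata, the number of shards follows the data volume instead of being fixed when the table is created.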

=== Solving the shortcomings of secondary indexes ===
Some people use HBase secondary indexes, as in the Phoenix project. But those schemes are based on the HBase rowkey and carry a lot of redundancy: you cannot create too many indexes, and the data inflation rate is too high. Using a Lucene index instead of rowkey-based secondary indexes is therefore the best choice.

=== We add a Lucene index for Spark OLAP ===
Most OLAP systems, such as Hive, Spark SQL, Impala, and some MPP databases, suffer from brute-force scanning and poor data timeliness.
1. They use brute-force scans to compute results. Another choice is to add an index to the big data. In some cases an index can greatly improve on the performance of the original brute-force scan. I think that, just as in a traditional database, indexing technology can greatly improve database performance.
2. Most of those databases or systems are offline or batch systems. LXDB's goal is real-time append and real-time KV update, just like HBase.

== Future ==
=== Lucene on Parquet ===
In the near future I will change Lucene's tim and tip (inverted index) files and its dvd and dvm files to a format like Parquet or ORC.
This is meant to solve several problems:
* Traversing a Lucene index is slow.
* Opening a Lucene index requires loading files such as tip into memory, which makes opening the index slow.
* Lucene cannot store a multi-column joint index by column. Such an index would support logic such as multi-table joins, materialized views, and multi-field group-by on the inverted index.
* The current Lucene index has many problems caused by too many file pointers and by being single-column.
We want to modify Lucene to make it more suitable for HDFS: not only for full-text retrieval but also better at statistical analysis, a real database-level index. We also want Lucene to be splittable, so that storage can be separated from computation.

=== Supporting all kinds of predicate pushdown computation ===
We find that if we can bring the computation close to the data, we can get more performance out of the database. An index is only one way of pushing computation down. Another example is storage pushdown: we can store the index on SSD devices and the data on SATA devices. We can store data that is often grouped together in advance instead of computing line by line. We can assign important tables or columns to dedicated devices and resources. HBase still lacks these capabilities, and we need to improve them further.

=== Controlling data distribution ===
We can use the row key to control how data is distributed to different nodes, which enables many interesting things.

=== Resource control, resource isolation ===
Lucene currently does not support resource isolation, but on HDFS we can do it. I can control the priority of SQL queries so that Lucene queries with higher priority get faster I/O resources.

== Status ==
In 2011 I released the first open source version, mdrill, at Alibaba. At that time, mdrill used 10 nodes with 48 GB machines to support 400 billion records. The first index on HDFS comes from this version; it was one year ahead of the community. https://github.com/alibaba/mdrill

In 2014 I stopped updating the mdrill project because I joined Tencent. In our team we developed the Hermes project, where we also built Lucene on HDFS. Hermes now imports 1000 billion rows of data per day in real time. It is the largest database I have ever developed. https://plus.tencent.com/bigdata/hermes

In 2018 I set up my own company, called Luxin; Lu Xin is the Chinese pronunciation of Lucene. As a fan of Lucene, I chose lucene.xin as the company's domain and lucene.cn as its mail domain.
Luxin's first version of lxdb is called lsql, which means Lucene SQL. It uses lucene(2.5.3)+hdfs+spark(1.6.3). It is stable, and about 200+ clusters use lsql. It processes about 200 billion rows per day, with a total of 20,000 billion rows in one single cluster (1000 nodes).

In 2020, during the COVID-19 pandemic, our team decided to develop the next generation of lsql, called lxdb (lx = the pronunciation of Lucene). We added HBase to lsql to solve the update problem. We have now finished the first version of lxdb. https://github.com/lucene-cn/lxdb/wiki


== Known Risks ==
=== Meritocracy ===

lxdb has been deployed in production and serves more than 200 lines of business. It has demonstrated great performance benefits and has proved to be a better way to do reporting and analysis on big data. Still, we look forward to growing a rich user and developer community.

=== Orphaned products ===

The core developers currently work full-time for Luxin.
lxdb is widely adopted by many companies and individuals, so there is no
realistic chance of it becoming orphaned. We also have a Tencent QQ instant messaging group with about 1000 members.

=== Inexperience with Open Source ===
The core developers are all active users and followers of open source. They are already committers and contributors to the lxdb project. The developer yannian mu has ten years of experience with open source projects, including jstorm (https://github.com/alibaba/jstorm) and mdrill (https://github.com/alibaba/mdrill).


=== Homogenous Developers ===

Most of the core developers are from Luxin because the product has been closed source. Once lxdb is open sourced, we expect it to receive many bug fixes and enhancements from developers who do not work at Luxin. What we learned from the community, we want to return to the community.


=== Reliance on Salaried Developers ===

Luxin invested in lxdb as the solution, and some of its key engineers are working full time on the project. In addition, since there is a growing Big Data need for scalable solutions, we look forward to other Apache developers and researchers contributing to the project. The key to addressing the risk of relying on salaried developers from a single entity is to increase the diversity of contributors and to recruit them actively; Apache lxdb intends to do this.

=== An Excessive Fascination with the Apache Brand ===

Lxdb is proposing to enter incubation at Apache in order to help efforts to diversify the committer base, not so much to capitalize on the Apache brand. The lxdb project is already in production use inside Luxin, but is not expected to be a Luxin product for external customers. As such, the lxdb project is not seeking to use the Apache brand as a marketing tool.


=== Documentation ===

Information about lxdb can be found at https://github.com/lucene-cn/lxdb. The following links provide more information about lxdb in open source:

* Wiki site: https://github.com/lucene-cn/lxdb/wiki
* Issue tracking: https://github.com/lucene-cn/lxdb/issues
* Overview: https://github.com/lucene-cn/lxdb/wiki/intro
* Luxin home page: http://www.lucene.xin
* lsql documentation: http://docs.lucene.xin/lsql/v21/

=== Initial Source ===

lxdb will develop its source code under an Apache license at https://github.com/lucene-cn/lxdb.


=== Core Developers ===

Currently, most of the core developers of LXDB work in the research team of Luxin.

- yannian mu (dev)
- yu chen (dev)
- guangshi hao (dev)
- wei sun (dev)
- qihua zheng (dev)
- xin wang (dev)
- qingsong liu (dev)
- anxing zhou (Tester)
- jiajun duan (Tester)

== External Dependencies ==
All dependencies are managed using Apache Maven.
Dependency          License                     Optional?
lucene              Apache License 2.0          true
zookeeper           Apache License 2.0          true
hbase               Apache License 2.0          true
spark               Apache License 2.0          true
hadoop              Apache License 2.0          true
hive                Apache License 2.0          true

== Required Resources ==

=== Mailing lists ===

* lxdb-private (PMC discussion)
* lxdb-dev (developer discussion)
* lxdb-user (user discussion)
* lxdb-commits (SCM commits)
* lxdb-issues (JIRA issue feed)

=== Subversion Directory ===

Instead of Subversion, LXDB prefers Git as its source control
management system: git://git.apache.org/lxdb


=== Differences from CarbonData and ClickHouse ===
When CarbonData appeared in 2015, it was a product that shocked me very much. Adding a layer of index to big data is what I have been doing all these years. I did not expect that there would be another team in the world with the same idea. Both projects are based on Hadoop, and both even start up as Spark on YARN.

Both are based on Spark, and the core of each is the data structure underneath Spark. We can improve Spark's speed with a special data format such as an index. Whether the data has an index, and whether that index is stored on local disk or on HDFS, is a significant feature that distinguishes us from other analytical databases such as Hive, Spark SQL, Impala, and some MPP databases. On this point, we are consistent with CarbonData.

Our team later spent some effort testing against CarbonData, and the positioning of the two still differs considerably in some directions. As for ClickHouse, I had not come across it in many projects before. Then one day, when I was recruiting in a chat group, someone asked me: is your product as fast as ClickHouse? That is how I learned that there was such a good product in the industry.


#1 Coarse-grained index vs. fine-grained index, or index stored by block vs. index not stored by block.
We found that CarbonData and ClickHouse write very fast. We tested lxdb and Elasticsearch at the same time; because both are based on Lucene, their write speed is an order of magnitude lower than that of the former two.

#2 Later, we found that the main difference lies in the indexing approach: one is a per-block index, the other is a global index. The per-block index is very fast to write, and it makes it easier to separate the index from computation. CarbonData is even a real cloud native database (ClickHouse data is stored locally, so it is not cloud native). The benefit of the global index, however, is not only faster single-column filtering but also faster multi-condition combined filtering and more convenient updates. A per-block index, if not handled properly, easily degenerates into a full scan, and updates are expensive to implement. A global index can be combined with a BitSet or Bloom filter to evaluate multi-column conditions together, and it is better suited to updates. That is why lxdb and ES support real-time updates, and this is where we differ from CarbonData. In comparison, we also build on HBase, mainly to achieve real-time updates at the KV level. In the future, if lxdb wants to take a step down the cloud native road, it will have to make some innovations and changes to the Lucene index format.

#3 Because lxdb will be bound to HBase in the future, OLTP at the KV level is also a future direction.

#4 In terms of statistical analysis, the docvalues used by Lucene do not perform as well as CarbonData and ClickHouse. For this reason, I spent some effort improving the performance of random reads on HDFS, and the speed can be increased by 100-200 times. But I think modifying the HDFS code would lead to poor compatibility of our product on customer platforms in the future, and would force customers to replace Hadoop with our version, so I did not choose this approach in the end. This is the address of my improvement project: https://github.com/lucene-cn/lxhadoop
An idea that came to me later is to replace the format of Parquet with Lucene's inverted and forward indexes, so that I can perform multi-condition full-text retrieval. The multi-column nature of Parquet lets me avoid the random-read performance problem by traversing the inverted table efficiently.
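
The difference described in point #2 can be illustrated with a small Python sketch. It is a simplified model under my own assumptions, not the actual format of CarbonData, ClickHouse, or Lucene: the per-block index keeps only a min/max pair per block, and the global index maps each value to a bitset of row ids (stored here as a Python integer). All function names are hypothetical.

```python
def build_block_index(values, block_size):
    """Coarse index: one (start, min, max) entry per block of rows."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append((start, min(chunk), max(chunk)))
    return blocks


def block_lookup(values, blocks, block_size, target):
    """Skip blocks whose range excludes target; scan the rest row by row."""
    hits, scanned = [], 0
    for start, lo, hi in blocks:
        if lo <= target <= hi:
            for i in range(start, min(start + block_size, len(values))):
                scanned += 1
                if values[i] == target:
                    hits.append(i)
    return hits, scanned


def build_global_index(values):
    """Fine-grained index: value -> bitset of row ids, held as an int."""
    index = {}
    for row_id, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row_id)
    return index


def bitset_rows(bits):
    """Decode a bitset back into a sorted list of row ids."""
    rows, row_id = [], 0
    while bits:
        if bits & 1:
            rows.append(row_id)
        bits >>= 1
        row_id += 1
    return rows
```

With two global indexes, a multi-column condition is a single bitwise AND, for example `bitset_rows(city_index["x"] & level_index[1])`. With the block index, a value that falls inside every block's min/max range forces every block to be scanned, which is the full-scan case mentioned above.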

=== Differences from Alibaba AnalyticDB ===
I am not particularly familiar with AnalyticDB, so I just looked up some information through a search engine. If I have misunderstood anything, please correct me.
Most of the time the two are really similar. AnalyticDB is an excellent database, but its technical principles can hardly be found on the Internet. From my personal point of view, the differences may be as follows.
#1 AnalyticDB is a cloud native data warehouse in the full sense. This is a feature added in its new edition: it supports the separation of storage and computing and on-demand, time-shared elasticity of resources. The same data can be served by different computing resources on different computing nodes, depending on the computation.
lxdb, however, is not a real cloud native database. Although we store the Lucene index on HDFS, that only separates storage from computing. At present, when a Lucene index is opened for the first time, index information such as tip must be preloaded into memory, so the index has to stay open in a resident process. As a result, lxdb cannot yet decouple computation from the process that holds the index; that is, it cannot distribute computing resources to different processes according to different queries. This has always been a regret for lxdb, and I have been working on it for years.
Cloud native databases currently have great market potential, and we are willing to try. I know that changing Lucene in this way is not difficult, or at least less difficult than integrating Spark, HBase, and Lucene together.
#2 AnalyticDB cannot be self-hosted; it runs only on the cloud platform of its provider and must be purchased together with the underlying cloud environment, which sometimes restricts users. lxdb is based on the Hadoop platform. As long as users have a Hadoop environment, lxdb can start its services directly through YARN. It suits both private deployment and cloud deployment and is not tied to any vendor, so it is relatively open.
#3 AnalyticDB feels more like a batch engine, suited to centralized import and batch query. At least its cloud native mode appears to work this way, or else I did not find the user manual for real-time import. lxdb, in contrast, is a real-time engine with low data latency. Relatively speaking, it is easier for a batch engine to become cloud native, and harder for a real-time engine with millisecond latency to separate storage from computing. That needs a snapshot mechanism that records the data changes at a point in time, so that computation can be split across different nodes.
#4 According to the specifications and restrictions in the official documents, the best configuration is C32. C32 supports fewer than 128 nodes and 1 PB of storage. In production, lxdb runs on 904 nodes with 50 PB of disk capacity and 70% storage utilization. Of course, this comparison may be inaccurate and unfair to ADB.



fp@lucene.cn  yannian mu



 
From: Ming Wen
Date: 2021-02-28 21:18
To: general
Subject: Re: Re: [Proposal] lxdb - proposal for Apache Incubation
Hi, fp,
Your email is hard to read.
Please change to a normal mail client first.
Back to your proposal, the key concern is not technology, but that the IPMC cannot
evaluate a project when we cannot see anything.
 
Thanks,
Ming Wen, Apache APISIX PMC Chair
Twitter: _WenMing
 
 
fp@lucene.cn <fp...@lucene.cn> wrote on Sunday, February 28, 2021 at 9:02 PM:
 
> Hi Furkan Kamaci
>
> Thank you for your suggestions. I will start to improve and prepare.
>
> 1) Find an experienced mentor to guide you.
>
>      todo
>
> 2) Start to translate your documentation to English.
>
> 3) Open source your project. How can we have a comment on your project if
> we cannot see anything about it?
>
>      Give me some time. I have discussed this with my team; my English is too poor.
>
> 4) Gain contributors to your project. At least you should show your
> intention to have committers/contributors out of your company. Eliminate
> the risk of being non-meritocratic management of the project.
>
>      That is what I have to do.
>
> 5) Structure your proposal. Explain why people need this project, which
> problems do current projects have and how you managed to handle them. We
> should understand is it a bundle of other projects, a completely new
> project, or a wrapper of other projects which eliminates the shortcomings
> of them.
>
> 6) Find a suitable name for your project in order to not try to solve
> trademark problems that may lose your time if you enter the incubation.
>
>      OK, I will think of a new name, for example Hydrogen SQL.
>
> fp@lucene.cn  yannian mu
>
> From: Furkan KAMACI
> Date: 2021-02-28 18:51
> To: general
> Subject: Re: [Proposal] lxdb - proposal for Apache Incubation
>
> Hi,
>
> Actually, you have detailed documentation that explains your approach
> compared to similar systems and the resulting performance metrics,
> i.e. reducing storage by 10 to 100 times or having low-latency
> queries.
>
> My advice is as follows (some points are the same as Sheng's and Liang's):
>
> 1) Find an experienced mentor to guide you.
>
> 2) Start to translate your documentation to English.
>
> 3) Open source your project. How can we comment on your project if
> we cannot see anything about it?
>
> 4) Gain contributors to your project. At least you should show your
> intention to have committers/contributors outside your company. Eliminate
> the risk of non-meritocratic management of the project.
>
> 5) Structure your proposal. Explain why people need this project, which
> problems current projects have, and how you handle them. We
> should understand whether it is a bundle of other projects, a completely new
> project, or a wrapper around other projects that eliminates their
> shortcomings.
>
> 6) Find a suitable name for your project so that you do not have to solve
> trademark problems that may cost you time if you enter incubation.
>
> Kind Regards,
> Furkan KAMACI
>
> On Sun, Feb 28, 2021 at 1:02 PM Liang Chen <ch...@gmail.com> wrote:
>
> > Hi
> >
> > It would be better if you could find an experienced IPMC member to help
> > you prepare the proposal.
> > Based on Sheng Wu's input, I have one more comment: can you please explain
> > how it differs from other similar data analysis DBs? You can
> > consider explaining from a use-case perspective.
> >
> > Regards
> > Liang
>

Re: Re: [Proposal] lxdb - proposal for Apache Incubation

Posted by Ming Wen <we...@apache.org>.
Hi, fp,
Your email is hard to read.
Please change to a normal mail client first.
Back to your proposal: the key concern is not the technology. The IPMC cannot
evaluate a project when we cannot see anything.

Thanks,
Ming Wen, Apache APISIX PMC Chair
Twitter: _WenMing


On Sun, Feb 28, 2021 at 9:02 PM, fp@lucene.cn <fp...@lucene.cn> wrote:

> Hi Furkan Kamaci,
>
> Thank you for your suggestions. I will start improving and preparing the
> proposal.
>
> 1) Find an experienced mentor to guide you.
>
>      Todo.
>
> 2) Start to translate your documentation to English.
>
> 3) Open source your project. How can we comment on your project if
> we cannot see anything about it?
>
>      Give me some time. I have discussed this with my team; my English is too poor.
>
> 4) Gain contributors to your project. At least you should show your
> intention to have committers/contributors outside your company. Eliminate
> the risk of non-meritocratic management of the project.
>
>      That is what I have to do.
>
> 5) Structure your proposal. Explain why people need this project, which
> problems current projects have, and how you handle them. We
> should understand whether it is a bundle of other projects, a completely new
> project, or a wrapper around other projects that eliminates their
> shortcomings.
>
> 6) Find a suitable name for your project so that you do not have to solve
> trademark problems that may cost you time if you enter incubation.
>
>      OK, I will think of a new name, for example Hydrogen SQL.
>
> fp@lucene.cn  yannian mu
>
> From: Furkan KAMACI
> Date: 2021-02-28 18:51
> To: general
> Subject: Re: [Proposal] lxdb - proposal for Apache Incubation
>
> Hi,
>
> Actually, you have detailed documentation that explains your approach
> compared to similar systems and the resulting performance metrics,
> i.e. reducing storage by 10 to 100 times or having low-latency
> queries.
>
> My advice is as follows (some points are the same as Sheng's and Liang's):
>
> 1) Find an experienced mentor to guide you.
>
> 2) Start to translate your documentation to English.
>
> 3) Open source your project. How can we comment on your project if
> we cannot see anything about it?
>
> 4) Gain contributors to your project. At least you should show your
> intention to have committers/contributors outside your company. Eliminate
> the risk of non-meritocratic management of the project.
>
> 5) Structure your proposal. Explain why people need this project, which
> problems current projects have, and how you handle them. We
> should understand whether it is a bundle of other projects, a completely new
> project, or a wrapper around other projects that eliminates their
> shortcomings.
>
> 6) Find a suitable name for your project so that you do not have to solve
> trademark problems that may cost you time if you enter incubation.
>
> Kind Regards,
> Furkan KAMACI
>
> On Sun, Feb 28, 2021 at 1:02 PM Liang Chen <ch...@gmail.com> wrote:
>
> > Hi
> >
> > It would be better if you could find an experienced IPMC member to help
> > you prepare the proposal.
> > Based on Sheng Wu's input, I have one more comment: can you please explain
> > how it differs from other similar data analysis DBs? You can
> > consider explaining from a use-case perspective.
> >
> > Regards
> > Liang
>
> > fp wrote
>
>
>
> > > Dear Apache Incubator Community,
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > Please accept the following proposal for presentation and discussion:
>
>
>
> > > https://github.com/lucene-cn/lxdb/wiki
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > LXDB is a high-performance,OLAP,full text search database.it`s base on
>
>
>
> > > hbase,but replaced hfile with lucene index to support more effective
>
>
>
> > > secondary indexes,it`s also base on spark sql,so that you can used sql
>
>
>
> > api
>
>
>
> > > to visit data and do olap calculate. and also the lucene index is store
>
>
>
> > on
>
>
>
> > > hdfs (not local disk).
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > In our Production System, LXDB supported 200+ clusters,some of the
> single
>
>
>
> > > cluster is 1000+ nodes,insert 200 billion rows&nbsp; per day ( 20000
>
>
>
> > > billion rows for total), one of the biggest single table has 200million
>
>
>
> > > lucene index on LXDB.
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > Hadoop`s father Doug Cutting cut nutch into HBase, MapReduce (hive),
>
>
>
> > HDFS,
>
>
>
> > > Lucene.We have merged these separated projects again,LXDB&nbsp;equals
>
>
>
> > > spark sql+hbase+lucene+parquet+hdfs,it is a super database.It took me
> 10
>
>
>
> > > years to complete these merging operations.But the purpose is no
> longer a
>
>
>
> > > search engine, but a database.
>
>
>
> > >
>
>
>
> > >
>
>
>
> > >
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > Best regards
>
>
>
> > > &nbsp; yannian mu
>
>
>
> > >
>
>
>
> > >
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > LXDB Proposal
>
>
>
> > > == Abstract ==
>
>
>
> > > LXDB is a high-performance,OLAP,full text search database.
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > === it`s base on hbase,but replaced hfile with lucene index to support
>
>
>
> > > more effective secondary indexes.===&nbsp;
>
>
>
> > > we modify hbase region server ,we&nbsp; change hfile to lucene,when put
>
>
>
> > > data we put&nbsp; document to lucene instande of&nbsp; put data to
> hfile
>
>
>
> > > lucene index store on region server&nbsp;&nbsp;(it is not sote in
>
>
>
> > > different cluster like elstice search+hbase ,it takes to copy of data)
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > === it`s base on spark sql for olap===&nbsp;
>
>
>
> > > we Integrated spark and hbase together ,it`s useage like this ,
>
>
>
> > > 1.unpackage lxdb.tar.gz&nbsp;
>
>
>
> > > 2.config hadoop_config path,
>
>
>
> > > 3.run start-all.sh to start cluster.&nbsp;
>
>
>
> > > lxdb can startup spark through hadoop yarn ,and then spark executor
>
>
>
> > > process Embedded start hbase region server service .&nbsp;
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > you can operate lxdb database throuth spark sql api(hive) or mysql api.
>
>
>
> > > 1.the sql used spark rdd+hbase scaner&nbsp; to visit hbase .
>
>
>
> > > 2.the sql`s condition (filter or group by agg) will predicate to hbase
> ,
>
>
>
> > > 3.hbase used lucene index to filter data in region server.
>
>
>
> > > all of the spark,hbase,lucene is Embedded Integrated together,it is
>
>
>
> > > not&nbsp; a&nbsp; seperate cluster ,that is the different with solr/es
> +
>
>
>
> > > hbase+spark Solution.
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > == Background ==
>
>
>
> > > === Multiple copies of data ===
>
>
>
> > > Apache HBase+Elastic Search is the most popular Solution on full text
>
>
>
> > > search ,but it`s weak on Online AnalyticalProcessing.
>
>
>
> > > so most of the time the Production System used spark(or hive or impala
> or
>
>
>
> > > presto) ,hbase,solr/es at the same time.Multiple copies of data are
>
>
>
> > stored
>
>
>
> > > in multiple systems,multiple systems has different Api .Data
> consistency
>
>
>
> > > is difficult to guarantee.For the above reasons we merger
>
>
>
> > > spark,hbase,elastic into one project .it`s target is used one copy of
>
>
>
> > > data,one cluster,one api to solve olap,kv,full text...database
> scenarios.
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > === Merging and splitting of lucene indexes(hstore) acrocess different
>
>
>
> > > machine on hdfs ===
>
>
>
> > > As we all know solr/es store file in local fileSystem,it`s shard num
> must
>
>
>
> > > be a fix num,but if we store index on hdfs,the index can split able
> like
>
>
>
> > > hbase hstore,it can split or merge acorss machine nodes ,this is very
>
>
>
> > > usefull for distribute database ,it depend malloc how much resource on
> a
>
>
>
> > > table,most of time the records of a table is different by time by time
> so
>
>
>
> > > the num of shards always need adjust,if index store local it can`t
> split
>
>
>
> > > acroces throw different machine ,but lucene index store on hdfs it`s
> can
>
>
>
> > > do it.
>
>
>
> > > whether the number of pieces can be flexibly adjusted, whether it has
> the
>
>
>
> > > ability of elastic scaling, in a distributed database is particularly
>
>
>
> > > important
>
>
>
> > >
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > === solved Insufficient of&nbsp; secondary indexes ===
>
>
>
> > > some people use hbase secondary index like Phoenix prjoect. but those
>
>
>
> > > programme base on the hbase rowkey has a lot of redundancy,He can't
>
>
>
> > create
>
>
>
> > > too many indexes,Data inflation rate is too high,so used lucene index
>
>
>
> > > instand of secondary is the best chooses.&nbsp;
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > === we add an lucene index for spark olap===&nbsp;
>
>
>
> > > Most of OLAP systems has violent scanning problems and Poor timeliness
> of
>
>
>
> > > data like hive,spark sql,impala or some of the mpp database.
>
>
>
> > > 1.They used violent scans to calculate the data.but another choice is
> add
>
>
>
> > > index to the big data.some of the time using index can greatly improve
>
>
>
> > the
>
>
>
> > > performance of the original brute force scanning. i think&nbsp; that
> just
>
>
>
> > > like the traditional database, indexing technology can greatly improve
>
>
>
> > the
>
>
>
> > > performance of the speed database.
>
>
>
> > > 2.Another problem of thoses database or system, Most of them are an
>
>
>
> > > offline system or batch system,lxdb `s target is realtime append
>
>
>
> > ,realtime
>
>
>
> > > kv update just like hbase.
>
>
>
> > >
>
>
>
> > >
>
>
>
> > > ==future==
>
>
>
> > > === lucene on parquet ===
>
>
>
> > > recenetly i will change lucene&nbsp; tim,tip(invert index) ,dvd,dvm
> files
>
>
>
> > > to&nbsp; like parquet or orc format.
>
>
>
> > > To solve the performance problem of traversing Lucene index.To solve
> the
>
>
>
> > > problem that opening Lucene file needs to load files such as tip into
>
>
>
> > > memory, which leads to slow opening Lucene index file,To enable Lucene
> to
>
>
>
> > > store multi column joint index by column, which is used to handle some
>
>
>
> > > logic such as multi table join and materialized view ,mulity fields
> group
>
>
>
> > > by by invert index,The current Lucene index has many problems because
> of
>
>
>
> > > too many file pointers and single column problems,We want to modify
>
>
>
> > Lucene
>
>
>
> > > to make it more suitable for HDFS, not only for full-text retrieval,
> but
>
>
>
> > > also better at statistical analysis, which is a real database level
>
>
>
> > > index,We want Lucene to be splitable, which can separate storage from
>
>
>
> > > computation.

Re: Re: [Proposal] lxdb - proposal for Apache Incubation

Posted by "fp@lucene.cn" <fp...@lucene.cn>.
Hi Furkan Kamaci,

Thank you for your advice. I will start improving and preparing the proposal.

1) Find an experienced mentor to guide you.

     To do.

2) Start to translate your documentation to English.

3) Open source your project. How can we comment on your project if
we cannot see anything about it?

     Please give me some time. I have discussed this with my team; my English is too poor.

4) Gain contributors to your project. At least you should show your
intention to have committers/contributors from outside your company.
Eliminate the risk of non-meritocratic management of the project.

     That is what I have to do.

5) Structure your proposal. Explain why people need this project, which
problems current projects have, and how you managed to handle them. We
should understand whether it is a bundle of other projects, a completely
new project, or a wrapper around other projects that eliminates their
shortcomings.

6) Find a suitable name for your project so that you do not have to solve
trademark problems that may cost you time once you enter incubation.

     OK, I will think of a new name, for example something like "hydrogen sql".

fp@lucene.cn  yannian mu

From: Furkan KAMACI
Date: 2021-02-28 18:51
To: general
Subject: Re: [Proposal] lxdb - proposal for Apache Incubation

Hi,

Actually, you have detailed documentation that explains your approach
compared to similar systems and the performance gains that follow from it,
e.g. reducing storage by 10 to 100 times or achieving low-latency queries.

My advice is as follows (some points are the same as Sheng's and Liang's):

1) Find an experienced mentor to guide you.

2) Start to translate your documentation to English.

3) Open source your project. How can we comment on your project if
we cannot see anything about it?

4) Gain contributors to your project. At least you should show your
intention to have committers/contributors from outside your company.
Eliminate the risk of non-meritocratic management of the project.

5) Structure your proposal. Explain why people need this project, which
problems current projects have, and how you managed to handle them. We
should understand whether it is a bundle of other projects, a completely
new project, or a wrapper around other projects that eliminates their
shortcomings.

6) Find a suitable name for your project so that you do not have to solve
trademark problems that may cost you time once you enter incubation.

Kind Regards,
Furkan KAMACI




Re: [Proposal] lxdb - proposal for Apache Incubation

Posted by Furkan KAMACI <fu...@gmail.com>.
Hi,

Actually, you have detailed documentation that explains your approach
compared to similar systems and the performance gains that follow from it,
e.g. reducing storage by 10 to 100 times or achieving low-latency queries.

My advice is as follows (some points are the same as Sheng's and Liang's):

1) Find an experienced mentor to guide you.

2) Start to translate your documentation to English.

3) Open source your project. How can we comment on your project if
we cannot see anything about it?

4) Gain contributors to your project. At least you should show your
intention to have committers/contributors from outside your company.
Eliminate the risk of non-meritocratic management of the project.

5) Structure your proposal. Explain why people need this project, which
problems current projects have, and how you managed to handle them. We
should understand whether it is a bundle of other projects, a completely
new project, or a wrapper around other projects that eliminates their
shortcomings.

6) Find a suitable name for your project so that you do not have to solve
trademark problems that may cost you time once you enter incubation.

Kind Regards,
Furkan KAMACI


On Sun, Feb 28, 2021 at 1:02 PM Liang Chen <ch...@gmail.com> wrote:

> Hi
>
> It would be better if you could find an experienced IPMC member to help
> you prepare the proposal.
> Based on Sheng Wu's input, I have one more comment: can you please explain
> how this differs from other similar data analysis databases? You could
> consider explaining it from a use-case perspective.
>
> Regards
> Liang
>
>
> fp wrote
> > Dear Apache Incubator Community,
> >
> >
> > Please accept the following proposal for presentation and discussion:
> > https://github.com/lucene-cn/lxdb/wiki
> >
> >
> > LXDB is a high-performance OLAP and full-text search database. It is
> > based on HBase, but replaces HFile with a Lucene index to support more
> > effective secondary indexes. It is also based on Spark SQL, so you can
> > use the SQL API to access data and run OLAP calculations. The Lucene
> > index is stored on HDFS (not on local disk).
> >
> >
> > In our production systems, LXDB supports 200+ clusters. Some single
> > clusters have 1000+ nodes and insert 200 billion rows per day (20,000
> > billion rows in total), and one of the biggest single tables has
> > 200 million Lucene indexes on LXDB.
> >
> >
> > Doug Cutting, the father of Hadoop, split Nutch into HBase, MapReduce
> > (Hive), HDFS, and Lucene. We have merged these separate projects again:
> > LXDB equals Spark SQL + HBase + Lucene + Parquet + HDFS; it is a super
> > database. It took me 10 years to complete this merging work. The
> > purpose, however, is no longer a search engine but a database.
> >
> >
> >
> >
> >
> > Best regards
> >   yannian mu
> >
> >
> >
> >
> > LXDB Proposal
> > == Abstract ==
> > LXDB is a high-performance OLAP and full-text search database.
> >
> >
> > === It is based on HBase, but replaces HFile with a Lucene index to support more effective secondary indexes ===
> > We modified the HBase region server and replaced HFile with Lucene: on a
> > put, we write a document to Lucene instead of writing data to an HFile.
> > The Lucene index is stored on the region server. It is not stored in a
> > separate cluster as with Elasticsearch + HBase, which requires two
> > copies of the data.
> >
> >
> > === It is based on Spark SQL for OLAP ===
> > We integrated Spark and HBase together. Usage is as follows:
> > 1. Unpack lxdb.tar.gz.
> > 2. Configure the hadoop_config path.
> > 3. Run start-all.sh to start the cluster.
> > LXDB starts Spark through Hadoop YARN, and each Spark executor process
> > then starts an embedded HBase region server service.
> >
> >
> > You can operate the LXDB database through the Spark SQL API (Hive) or
> > the MySQL API.
> > 1. SQL queries use Spark RDDs plus the HBase scanner to access HBase.
> > 2. The SQL conditions (filters or GROUP BY aggregations) are pushed down
> > to HBase.
> > 3. HBase uses the Lucene index to filter data in the region server.
> > Spark, HBase, and Lucene are all embedded and integrated together rather
> > than running as separate clusters; that is the difference from a
> > Solr/ES + HBase + Spark solution.
> >
> >
> > == Background ==
> > === Multiple copies of data ===
> > Apache HBase + Elasticsearch is the most popular solution for full-text
> > search, but it is weak at online analytical processing (OLAP). As a
> > result, production systems usually run Spark (or Hive, Impala, or
> > Presto), HBase, and Solr/ES at the same time. Multiple copies of the
> > data are stored in multiple systems, each system has a different API,
> > and data consistency is difficult to guarantee. For these reasons we
> > merged the roles of Spark, HBase, and Elasticsearch into one project.
> > Its goal is to use one copy of the data, one cluster, and one API to
> > cover OLAP, key-value, full-text, and other database scenarios.
> >
> >
> > === Merging and splitting of Lucene indexes (HStore) across different machines on HDFS ===
> > As is well known, Solr/ES store index files on the local file system, so
> > the number of shards must be fixed. If the index is stored on HDFS, it
> > becomes splittable like an HBase HStore: it can be split or merged
> > across machine nodes. This is very useful for a distributed database,
> > because the shard count determines how many resources are allocated to
> > a table. The number of records in a table usually changes over time, so
> > the number of shards constantly needs adjusting. An index stored locally
> > cannot be split across different machines, but a Lucene index stored on
> > HDFS can.
> > Whether the number of shards can be adjusted flexibly, that is, whether
> > the system can scale elastically, is particularly important in a
> > distributed database.
> >
> >
> >
> > === Addressing the shortcomings of secondary indexes ===
> > Some people use HBase secondary indexes, as in the Phoenix project. But
> > those schemes, which are based on the HBase row key, carry a lot of
> > redundancy: you cannot create many indexes because the data inflation
> > rate is too high. Using a Lucene index instead of secondary indexes is
> > therefore the best choice.
> >
> >
> > === Adding a Lucene index for Spark OLAP ===
> > Most OLAP systems, such as Hive, Spark SQL, Impala, and some MPP
> > databases, rely on brute-force scanning and have poor data timeliness.
> > 1. They compute results with brute-force scans. The alternative is to
> > add an index to the big data. In some cases an index greatly improves on
> > the performance of the original brute-force scan. I think that, just as
> > in a traditional database, indexing can greatly improve query speed.
> > 2. Most of these databases are offline or batch systems. lxdb targets
> > real-time append and real-time KV update, just like HBase.
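
The contrast in point 1 between a brute-force scan and an index lookup can be sketched as follows. This is a toy illustration with made-up rows and column names, not lxdb code: `scan` touches every row, while `build_index` constructs a simple inverted index (value to row ids) once so that later lookups read only the matching entry.

```python
from collections import defaultdict

# Made-up sample rows; "city" and "status" are illustrative column names.
rows = [
    {"id": 0, "city": "beijing", "status": "ok"},
    {"id": 1, "city": "shanghai", "status": "error"},
    {"id": 2, "city": "beijing", "status": "error"},
    {"id": 3, "city": "shenzhen", "status": "ok"},
]


def scan(rows, column, value):
    """Brute force: touch every row and test the predicate."""
    return [r["id"] for r in rows if r[column] == value]


def build_index(rows, column):
    """Inverted index: column value -> list of row ids in insertion order."""
    index = defaultdict(list)
    for r in rows:
        index[r[column]].append(r["id"])
    return index


city_index = build_index(rows, "city")

by_scan = scan(rows, "city", "beijing")
by_index = city_index.get("beijing", [])
```

Both calls return the same row ids. The scan cost grows with the table size on every query, whereas the index pays its cost once at write time, which is the trade-off the proposal describes.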
> >
> >
> > == Future ==
> > === Lucene on Parquet ===
> > In the near future I will change Lucene's tim and tip (inverted index)
> > files and its dvd and dvm files to a format like Parquet or ORC. The
> > goals are:
> > * to solve the performance problem of traversing a Lucene index;
> > * to solve the slow opening of a Lucene index, which is caused by
> > having to load files such as tip into memory;
> > * to let Lucene store multi-column joint indexes by column, for logic
> > such as multi-table joins, materialized views, and multi-field group-by
> > over the inverted index.
> > The current Lucene index has many problems because of too many file
> > pointers and its single-column design. We want to modify Lucene to make
> > it more suitable for HDFS and better not only at full-text retrieval but
> > also at statistical analysis, so that it becomes a real database-level
> > index. We also want Lucene to be splittable, so that storage can be
> > separated from computation.
> >
> >
> >
> >
> > === Supporting all kinds of predicate pushdown ===
> > We find that the more closely the computation is combined with the
> > data, the more performance we can get out of the database. An index is
> > only one form of pushdown. Storage pushdown is another example: we can
> > store the index on SSD devices and the data on SATA devices. We can
> > store data that is often grouped together in pre-grouped form instead of
> > computing row by row. We can give important tables or columns dedicated
> > devices and resources. HBase still lacks these features, and we need to
> > improve them further.
> >
> >
> > === Controlling data distribution ===
> > We can use the row key to direct data to particular nodes, which makes
> > many interesting things possible.
> >
> >
> > === Resource control, resource isolation ===
> > Lucene does not currently support resource isolation, but on HDFS we
> > can do it. I can control the priority of SQL statements so that Lucene
> > queries with higher priority get faster IO resources.
> >
> >
> > == Status ==
> > In 2011 I released my first open source version, mdrill, at Alibaba. At
> > that time mdrill used 10 nodes (48g machines) to support 400 billion
> > records. The first index on HDFS comes from this version; it was one
> > year ahead of the community. https://github.com/alibaba/mdrill
> >
> >
> > In 2014 I stopped updating the mdrill project because I joined Tencent.
> > In our team we developed the Hermes project, which also builds Lucene on
> > HDFS. Hermes now imports 1000 billion rows of data per day in real
> > time. It is the largest database I have ever developed.
> > https://plus.tencent.com/bigdata/hermes
> >
> >
> > In 2018 I set up my own company, Luxin; "Lu Xin" is the Chinese
> > pronunciation of Lucene. As a fan of Lucene, I chose lucene.xin as the
> > company domain and lucene.cn as the mail domain.
> > Luxin's first version of lxdb is called lsql, which means "Lucene SQL".
> > It uses lucene(2.5.3)+hdfs+spark(1.6.3) and is stable. More than 200
> > clusters use lsql. It processes about 200 billion rows per day, with a
> > total of 20000 billion rows in one single cluster (1000 nodes).
> >
> >
> > In 2020, during COVID-19, our team decided to develop the next
> > generation of lsql, called lxdb (lx = the pronunciation of Lucene). We
> > added HBase to lsql to solve the update problem. We have now finished
> > the first version of lxdb. https://github.com/lucene-cn/lxdb/wiki
> >
> >
> >
> >
> >
> >
> >
> > == Known Risks ==
> > === Meritocracy ===
> >
> >
> > lxdb has been deployed in production and serves more than 200 lines of
> > business. It has demonstrated great performance benefits and has proved
> > to be a better way to do reporting and analysis on big data. Still, we
> > look forward to growing a rich user and developer community.
> >
> >
> > === Orphaned products ===
> >
> >
> > The core developers currently work full-time for Luxin.
> > lxdb is widely adopted by many companies and individuals, so there is no
> > realistic chance of it becoming orphaned. We also have a 1000-person
> > Tencent QQ instant messaging group.
> >
> >
> >
> > === Inexperience with Open Source ===
> >
> > The core developers are all active users and followers of open source.
> > They are already committers and contributors to the lxdb project.
> > Developer yannian mu has ten years of experience on open source
> > projects, including jstorm https://github.com/alibaba/jstorm and
> > mdrill https://github.com/alibaba/mdrill
> >
> >
> >
> >
> > === Homogenous Developers ===
> >
> >
> > Most of the core developers are from Luxin, because lxdb began as a
> > closed source product. Once lxdb is open sourced, we expect it to
> > receive many bug fixes and enhancements from developers who do not work
> > at Luxin. What we learned from the community, we want to give back to
> > the community.
> >
> >
> >
> >
> >
> > === Reliance on Salaried Developers ===
> >
> >
> > Luxin has invested in lxdb as its solution, and some of its key
> > engineers are working full time on the project. In addition, since there
> > is a growing Big Data need for scalable solutions, we look forward to
> > other Apache developers and researchers contributing to the project. The
> > key to addressing the risk of relying on salaried developers from a
> > single entity is to increase the diversity of the contributors and to
> > actively lobby for new ones. Apache lxdb intends to do this.
> >
> >
> > === An Excessive Fascination with the Apache Brand ===
> >
> >
> > Lxdb is proposing to enter incubation at Apache in order to help efforts
> > to diversify the committer-base, not so much to capitalize on the Apache
> > brand. The Lxdb project is in production use already inside lxdb, but is
> > not expected to be an lxdb product for external customers. As such, the
> > lxdb project is not seeking to use the Apache brand as a marketing tool.
> >
> >
> >
> >
> >
> > === Documentation ===
> >
> >
> > Information about lxdb can be found at https://github.com/lucene-cn/lxdb.
> > The following links provide more information about lxdb in open source:
> >
> >
> > * wiki site: https://github.com/lucene-cn/lxdb/wiki
> > * Issue Tracking: https://github.com/lucene-cn/lxdb/issues
> > * Overview: https://github.com/lucene-cn/lxdb/wiki/intro
> > * lxin home page: http://www.lucene.xin
> >
> > * lsql document: http://docs.lucene.xin/lsql/v21/
> >
> >
> >
> > == Initial Source ==
> >
> >
> > lxdb will develop its source code under an Apache license at
> > https://github.com/lucene-cn/lxdb.
> >
> >
> >
> >
> >
> >
> > === Core Developers ===
> >
> >
> >
> > Currently most of the core developers of LXDB work in the research
> > team of Luxin.
> >
> >
> > - yannian mu (dev)
> > - yu chen (dev)
> > - guangshi hao (dev)
> > - wei sun (dev)
> > - qihua zheng (dev)
> > - xin wang (dev)
> > - qingsong liu (dev)
> > - anxing zhou (Tester)
> > - jiajun duan (Tester)
> >
> >
> >
> > == External Dependencies ==
> >
> > All dependencies are managed using Apache Maven.
> >
> > Dependency   License              Optional?
> > lucene       Apache License 2.0   true
> > zookeeper    Apache License 2.0   true
> > hbase        Apache License 2.0   true
> > spark        Apache License 2.0   true
> > hadoop       Apache License 2.0   true
> > hive         Apache License 2.0   true
> >
> >
> >
> >
> > == Required Resources ==
> >
> >
> > === Mailing lists ===
> >
> >
> >  * lxdb-private (PMC discussion)
> >  * lxdb-dev (developer discussion)
> >  * lxdb-user (user discussion)
> >  * lxdb-commits (SCM commits)
> >  * lxdb-issues (JIRA issue feed)
> >
> >
> > === Subversion Directory ===
> >
> >
> > Instead of Subversion, LXDB prefers Git as its source control
> > management system: git://git.apache.org/lxdb
>
>
>
>
>
> --
> Sent from: http://apache-incubator-general.996316.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>
>

Re: Re: [Proposal] lxdb - proposal for Apache Incubation

Posted by "fp@lucene.cn" <fp...@lucene.cn>.
Hi Liang Chen,

Thank you very much for taking the time to answer my questions.
My replies are as follows.

======

1: It would be better if you could find an experienced IPMC member to help you
for preparing the proposal.

> I am trying to find IPMC members who are willing to help me. After all, it is a very heavy job.


2: Based on Sheng Wu input, i have one more comment : can you please explain  what are the different with other similar data analysis DB?  you can  consider explaining from use cases perspective.

#### Differences from AnalyticDB ####


> I am not particularly familiar with AnalyticDB, so I only looked up some information through a search engine. If I have misunderstood anything, please correct me.
> Most of the time the two are really similar. AnalyticDB is an excellent database, but its technical principles can hardly be found on the Internet. From my personal point of view, they may differ in the following ways.
#1) AnalyticDB is a cloud native data warehouse in the full sense. I noticed that this is a feature they added recently: it supports the separation of storage and computing, and elastic time-sharing of resources on demand. The same data can be served by different computing resources started on different computing nodes, according to the computation.

However, lxdb is not yet a real cloud native database. Although we store the Lucene index on HDFS, that only separates storage from computing. At present, when Lucene opens an index for the first time, index information such as the tip files must be preloaded into memory, so Lucene has to stay open persistently in a resident process. Therefore lxdb cannot yet separate computing from computing, that is, it cannot distribute computing resources to different processes according to different queries. This has always been a regret for lxdb, so over these years I have kept modifying the Lucene index, preparing to unify its inverted and forward indexes into a cloud native format such as Parquet. That would both improve the performance of inverted-index queries and remove the need for preloading.

At present, cloud native databases have great market potential, and we are willing to try in this direction. I also know that changing Lucene in this way is not difficult, or at least it is less difficult than integrating Spark, HBase, and Lucene together.

#2) AnalyticDB cannot be self-hosted. It runs only on the cloud platform provided by its vendor and must be purchased together with the underlying cloud environment, which sometimes restricts users. lxdb is based on the Hadoop platform. As long as users have a Hadoop environment, lxdb can start its services directly through YARN. It suits both private deployment and deployment on the cloud, and it is not tied to any vendor, so it is relatively open.

#3) AnalyticDB feels to me more like a batch engine, suited to a scenario of one centralized import followed by batch queries. At least its cloud native mode appears to work this way, or perhaps I just did not find the user manual for real-time import. lxdb, in contrast, is a real-time engine with low data latency. Relatively speaking, cloud native is easier for a batch engine, while separating storage from computing is harder for a real-time engine with millisecond latency. Such an engine needs a snapshot mechanism that records the data changes at a given moment before computing can be separated across different nodes.

#4) Working backwards from the official documentation (see the detailed specifications and restrictions), the best configuration, C32, supports fewer than 128 nodes and 1 PB of storage. In production, lxdb currently runs on 904 nodes with 50 PB of disk capacity and 70% storage utilization. Of course, this comparison may be inaccurate and unfair to ADB; it is for reference only.


#### Differences from CarbonData and ClickHouse ####
When CarbonData appeared in 2015, it was a product that shocked me very much. Adding a layer of index to big data is what I have been doing all these years. I did not expect that there would be another team in this world with the same idea as mine. Both are based on Hadoop, and even the startup is based on Spark on YARN.

Both are based on Spark, and at the core both change Spark's underlying data structures, using a unique data format such as an index to speed Spark up. Whether the data has an index, and whether the index is stored on local disk or on HDFS, is therefore a significant feature that distinguishes us from other analytical databases such as Hive, Spark SQL, Impala, and some MPP databases. On this point we are consistent with CarbonData.

Our team later spent some effort on a test against CarbonData, and the positioning in some directions is still very different.

As for ClickHouse, I had not come across it in many projects before. Then one day, when I was recruiting in a chat group, someone asked me: is your product as fast as ClickHouse? That is how I learned that the industry had such a strong product.



#1 Coarse-grained index vs fine-grained index, or index stored by block vs index not stored by block.
In our tests the write speed of CarbonData and ClickHouse was very fast. We tested lxdb and Elasticsearch at the same time, since both are based on Lucene, and found them about an order of magnitude slower than the former two.

#2 Later we found that the main difference lies in the kind of index: one is a per-block index, the other a single global index. The per-block index is very fast to ingest and makes it easier to separate index from computation; CarbonData is even a real cloud native database (ClickHouse data is stored locally, so it cannot be cloud native). A global index has to keep merging segments, which costs ingest performance. Its benefit, however, is not only better single-column filtering but also better multi-condition filtering and easier updates. A per-block index that is not handled properly easily degenerates into a full scan, and updates come at a high cost. A global index can combine multi-column conditions with a BitSet or Bloom filter, and it is better suited to updates, which is why lxdb and ES both support real-time updates. This is where we differ from CarbonData: by comparison we additionally integrate HBase, mainly to achieve real-time updates at the KV level. In the future, if lxdb wants to take a step down the cloud native road, it will have to make some innovations and changes to the Lucene index format.

#3 Because lxdb is bound to HBase, OLTP at the KV level is also a future direction.
#4 In statistical analysis, the docvalues used by Lucene cause a large number of random reads and perform worse than CarbonData and ClickHouse. For this reason I spent some effort improving the performance of random reads on HDFS, and the speed can be increased by 100 to 200 times. But this requires modifying the HDFS code, which would hurt the compatibility of our product on customer platforms and force customers to replace Hadoop with our version, so in the end I did not choose this approach. This is the address of my improvement project: https://github.com/lucene-cn/lxhadoop
An idea that came to me later is to replace the format of Lucene's inverted and forward indexes with Parquet. That way I can still do multi-condition full-text retrieval, and the multi-column nature of Parquet lets me avoid the random-read performance problem by traversing the inverted table efficiently.
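
The difference between a per-block index and a global index described in #2 can be made concrete with a toy sketch. Everything here is an assumption for illustration: the rows, the block size, and the function names are made up, Python integers stand in for bitsets, and neither CarbonData's nor Lucene's real formats are modelled. The per-block index keeps only a min/max per block, while the global index keeps one bitset of row ids per value and combines conditions with bitwise AND/OR.

```python
# Toy table: row id -> (age, city). Values are made up and unsorted.
rows = [(25, "bj"), (31, "sh"), (47, "bj"), (52, "sz"),
        (38, "bj"), (29, "sh"), (61, "bj"), (44, "sz")]
BLOCK = 4  # rows per block


def build_block_index(rows):
    """Per block keep only (start, min_age, max_age): cheap to write."""
    blocks = []
    for start in range(0, len(rows), BLOCK):
        ages = [age for age, _ in rows[start:start + BLOCK]]
        blocks.append((start, min(ages), max(ages)))
    return blocks


def query_blocks(rows, blocks, lo, hi, city):
    """Skip blocks whose age range cannot match, scan the rest row by row."""
    hits, scanned = [], 0
    for start, mn, mx in blocks:
        if mx < lo or mn > hi:
            continue
        for rid in range(start, min(start + BLOCK, len(rows))):
            scanned += 1
            age, c = rows[rid]
            if lo <= age <= hi and c == city:
                hits.append(rid)
    return hits, scanned


def build_global_index(rows):
    """Per value keep a bitset (a Python int) of the matching row ids."""
    by_city, by_age = {}, {}
    for rid, (age, city) in enumerate(rows):
        by_city[city] = by_city.get(city, 0) | (1 << rid)
        by_age[age] = by_age.get(age, 0) | (1 << rid)
    return by_city, by_age


def query_global(by_city, by_age, lo, hi, city):
    """OR the bitsets of all ages in range, then AND with the city bitset."""
    age_bits = 0
    for age, bits in by_age.items():
        if lo <= age <= hi:
            age_bits |= bits
    both = age_bits & by_city.get(city, 0)
    return [rid for rid in range(both.bit_length()) if (both >> rid) & 1]


blocks = build_block_index(rows)
block_hits, scanned = query_blocks(rows, blocks, 30, 50, "bj")
by_city, by_age = build_global_index(rows)
global_hits = query_global(by_city, by_age, 30, 50, "bj")
```

With this unsorted data both blocks overlap the age range, so the per-block query ends up scanning every row, which is the full-scan risk mentioned above. The global index answers the same two-condition query from bitsets alone, at the price of maintaining an entry for every value at write time.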

















fp@lucene.cn  yannian mu

From: Liang Chen
Date: 2021-02-28 18:02
To: general
Subject: Re: [Proposal] lxdb - proposal for Apache Incubation

Hi

It would be better if you could find an experienced IPMC member to help you
for preparing the proposal.
Based on Sheng Wu input, i have one more comment : can you please explain
what are the different with other similar data analysis DB?  you can
consider explaining from use cases perspective.

Regards
Liang

fp wrote



> Dear Apache Incubator Community,



>



>



> Please accept the following proposal for presentation and discussion:



> https://github.com/lucene-cn/lxdb/wiki



>



>



> LXDB is a high-performance,OLAP,full text search database.it`s base on



> hbase,but replaced hfile with lucene index to support more effective



> secondary indexes,it`s also base on spark sql,so that you can used sql api



> to visit data and do olap calculate. and also the lucene index is store on



> hdfs (not local disk).



>



>



> In our Production System, LXDB supported 200+ clusters,some of the single



> cluster is 1000+ nodes,insert 200 billion rows&nbsp; per day ( 20000



> billion rows for total), one of the biggest single table has 200million



> lucene index on LXDB.



>



>



> Hadoop`s father Doug Cutting cut nutch into HBase, MapReduce (hive), HDFS,



> Lucene.We have merged these separated projects again,LXDB&nbsp;equals



> spark sql+hbase+lucene+parquet+hdfs,it is a super database.It took me 10



> years to complete these merging operations.But the purpose is no longer a



> search engine, but a database.



>



>



>



>



>



> Best regards



> &nbsp; yannian mu



>



>



>



>



> LXDB Proposal



> == Abstract ==



> LXDB is a high-performance,OLAP,full text search database.



>



>



> === it`s base on hbase,but replaced hfile with lucene index to support



> more effective secondary indexes.===&nbsp;



> we modify hbase region server ,we&nbsp; change hfile to lucene,when put



> data we put&nbsp; document to lucene instande of&nbsp; put data to hfile



> lucene index store on region server&nbsp;&nbsp;(it is not sote in



> different cluster like elstice search+hbase ,it takes to copy of data)



>



>



> === it`s base on spark sql for olap===&nbsp;



> we Integrated spark and hbase together ,it`s useage like this ,



> 1.unpackage lxdb.tar.gz&nbsp;



> 2.config hadoop_config path,



> 3.run start-all.sh to start cluster.&nbsp;



> lxdb can startup spark through hadoop yarn ,and then spark executor



> process Embedded start hbase region server service .&nbsp;



>



>



> you can operate lxdb database throuth spark sql api(hive) or mysql api.



> 1.the sql used spark rdd+hbase scaner&nbsp; to visit hbase .



> 2.the sql`s condition (filter or group by agg) will predicate to hbase ,



> 3.hbase used lucene index to filter data in region server.



> all of the spark,hbase,lucene is Embedded Integrated together,it is



> not&nbsp; a&nbsp; seperate cluster ,that is the different with solr/es +



> hbase+spark Solution.



>



>



> == Background ==



> === Multiple copies of data ===



> Apache HBase+Elastic Search is the most popular Solution on full text



> search ,but it`s weak on Online AnalyticalProcessing.



> so most of the time the Production System used spark(or hive or impala or



> presto) ,hbase,solr/es at the same time.Multiple copies of data are stored



> in multiple systems,multiple systems has different Api .Data consistency



> is difficult to guarantee.For the above reasons we merger



> spark,hbase,elastic into one project .it`s target is used one copy of



> data,one cluster,one api to solve olap,kv,full text...database scenarios.



>



>



> === Merging and splitting of lucene indexes(hstore) acrocess different



> machine on hdfs ===



> As we all know solr/es store file in local fileSystem,it`s shard num must



> be a fix num,but if we store index on hdfs,the index can split able like



> hbase hstore,it can split or merge acorss machine nodes ,this is very



> usefull for distribute database ,it depend malloc how much resource on a



> table,most of time the records of a table is different by time by time so



> the num of shards always need adjust,if index store local it can`t split



> acroces throw different machine ,but lucene index store on hdfs it`s can



> do it.



> whether the number of pieces can be flexibly adjusted, whether it has the



> ability of elastic scaling, in a distributed database is particularly



> important



>



>



>



> === solved Insufficient of&nbsp; secondary indexes ===



> some people use hbase secondary index like Phoenix prjoect. but those



> programme base on the hbase rowkey has a lot of redundancy,He can't create



> too many indexes,Data inflation rate is too high,so used lucene index



> instand of secondary is the best chooses.&nbsp;



>



>



> === we add an lucene index for spark olap===&nbsp;



> Most of OLAP systems has violent scanning problems and Poor timeliness of



> data like hive,spark sql,impala or some of the mpp database.



> 1.They used violent scans to calculate the data.but another choice is add



> index to the big data.some of the time using index can greatly improve the



> performance of the original brute force scanning. i think&nbsp; that just



> like the traditional database, indexing technology can greatly improve the



> performance of the speed database.



> 2.Another problem of thoses database or system, Most of them are an



> offline system or batch system,lxdb `s target is realtime append ,realtime



> kv update just like hbase.



>



>



> ==future==



> === lucene on parquet ===



> recenetly i will change lucene&nbsp; tim,tip(invert index) ,dvd,dvm files



> to&nbsp; like parquet or orc format.



> To solve the performance problem of traversing Lucene index.To solve the



> problem that opening Lucene file needs to load files such as tip into



> memory, which leads to slow opening Lucene index file,To enable Lucene to



> store multi column joint index by column, which is used to handle some



> logic such as multi table join and materialized view ,mulity fields group



> by by invert index,The current Lucene index has many problems because of



> too many file pointers and single column problems,We want to modify Lucene



> to make it more suitable for HDFS, not only for full-text retrieval, but



> also better at statistical analysis, which is a real database level



> index,We want Lucene to be splitable, which can separate storage from



> computation.



>



>



>



>



> ===&nbsp; supporting all kinds of Predicate pushdown calculation&nbsp;===



> We find that if we can combine the calculation method with the data



> closely, we can give more play to the performance of the database. Index



> is only a way of calculating push down. For example, storage push down, we



> can store the index on the SSD device, and the data part on the SATA



> device. We can store the data that are often grouped together in advance,



> instead of calculating line by line, We can give important tables or



> columns to dedicated devices and resources, but these hbases are still



> lacking, which we need to further improve



>



>



> === Distribution of intervention data ===



> we can used row key to intervention data to different nodes ,it can do



> many interestest things



>



>



> === Resource control, resource isolation ===



> lucene recent is not support resource isolation,but&nbsp; on hdfs&nbsp; we



> can do it , I can control the priority of SQL so that Lucene with higher



> priority can get faster IO resources.



>



>



> == Status ==



> since 2011 I released the first open source version on Alibaba&nbsp; ,At



> that time, mdrill used 10 nodes 48g machines to support 400 billion data.



> the first index on hdfs is from this version.it`s one year ahead of the



> community.&nbsp; https://github.com/alibaba/mdrill .



>



>



> since 2014 i stoped mdrill project update for the reason of i join into



> tencent . in our team we developed&nbsp; hermes project ,we also build



> lucene on hdfs , hermes now realtime import 1000 billion rows of data per



> day.It's the largest database I've ever developed ,



> https://plus.tencent.com/bigdata/hermes



>



>



> since 2018 I set up my own company called luxin, Lu Xin is the Chinese



> pronunciation of Lucene. as a funs of lucene ,luxin company`s domain is



> lucene.xin ,mail domain is lucene.cn.



> luxin`s first version of lxdb is called lsql,it`s means lucene sql.&nbsp;



> it used lucene(2.5.3)+hdfs+spark(1.6.3),it is stable, about 200+ of



> cluster use lsql. it`s process about 200 billions per day ,amount of 20000



> billions rows in one&nbsp; single cluster. (1000 nodes)&nbsp;



>



>



> since 2010 In the case of COVID-19 our team decide to developed the next



> generation of lsql called lxdb(lx=lucene pronunciation&nbsp;). we add



> hbase to lsql To solve the update problem.nowadays we have finish the



> first version of lxdb.&nbsp;https://github.com/lucene-cn/lxdb/wiki



>



>



>



>



>



>



>



> == Known Risks ==



> ==Meritocracy ==



>



>



> lxdb has been deployed in production and is applying more than 200 lines



> of business. It has demonstrated great performance benefits and has proved



> to be a better way for reporting and analysis based big data. Still We



> look forward to growing a rich user and developer community.



>



>



> === Orphaned products ===



>



>



> The core developers currently work full-time for Luxin.



> lxdb is widely adopted by many companies and individuals. There's no



> realistic chance of it becoming orphaned. and we have a number of 1000



> person tencent qq Instant messaging group



>



>



>



> === Inexperience with Open Source===



>



> The core developers are all active users and followers of open source.



> They are already committers and contributors to the lxdb project.&nbsp;



> developed yannian mu has tens years on open source project,&nbsp; jstorm



> https://github.com/alibaba/jstorm and



> mdrill&nbsp;https://github.com/alibaba/mdrill



>



>



>



>



> === Homogenous Developers ===&nbsp;



>



>



> The most of core developers are from luxin for the Closed source products



> reason, but when lxdb was open sourced, lxdb will received a lot of bug



> fixes and enhancements from other developers not working at luxin.Where



> did you learn it from and where did you return it.



>



>



>



>



>



> ===Reliance on Salaried Developers ===



>



>



> Lxin invested in lxdb as the&nbsp; solution and some of its key engineers



> are working full time on the project. In addition, since there is a



> growing Big Data need for scalable solutions, we look forward to other



--
Sent from: http://apache-incubator-general.996316.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [Proposal] lxdb - proposal for Apache Incubation

Posted by Liang Chen <ch...@gmail.com>.
Hi

It would be better if you could find an experienced IPMC member to help you
prepare the proposal.
Based on Sheng Wu's input, I have one more comment: can you please explain
how LXDB differs from other similar data analysis databases? You could
consider explaining this from a use-case perspective.

Regards
Liang




Re: [Proposal] lxdb - proposal for Apache Incubation

Posted by Sheng Wu <wu...@gmail.com>.
Hi

Since you are proposing a new project to a global foundation, you should at
least keep your documentation in English. The links you provided are in
Chinese, which most IPMC members cannot read.
Since this project is closed-source, please provide the dependencies.
You also repeatedly mention the earlier projects: was this project created
100% on your own, or does it include something from Alibaba/Tencent? As the
source is not open, I can't verify this.
Because it is closed-source, we also need you to be clear about whether you
are going to submit an SGA and open the source to the public.

Most importantly, `lucene` is an Apache trademark and an Apache project,
which makes me concerned about a branding violation.

Finally, we (the Incubator) typically expect you to have open-sourced the
project already, and to have at least a small community and first adoption
outside your company.

To join the Incubator, you also need at least 3 IPMC members and 1
Champion (an Apache member or officer) to help you understand the Incubator.

Sheng Wu 吴晟
Twitter, wusheng1108



[Proposal] lxdb - proposal for Apache Incubation

Posted by fp <fp...@lucene.cn>.
Dear Apache Incubator Community,


Please accept the following proposal for presentation and discussion:
https://github.com/lucene-cn/lxdb/wiki


LXDB is a high-performance OLAP and full-text search database. It is based on HBase, but replaces HFile with a Lucene index to support more effective secondary indexes. It is also based on Spark SQL, so you can use a SQL API to access data and run OLAP computations. The Lucene index is stored on HDFS (not on local disk).


In our production systems, LXDB supports 200+ clusters, some of which have 1000+ nodes. It ingests 200 billion rows per day (20,000 billion rows in total), and one of the largest single tables has 200 million Lucene indexes on LXDB.


Hadoop's creator, Doug Cutting, split Nutch into HBase, MapReduce (Hive), HDFS, and Lucene. We have merged these separate projects again: LXDB equals Spark SQL + HBase + Lucene + Parquet + HDFS; it is a super database. It took me 10 years to complete this merging work. The goal, however, is no longer a search engine but a database.




Best regards
yannian mu




LXDB Proposal
== Abstract ==
LXDB is a high-performance OLAP and full-text search database.


=== It is based on HBase, but replaces HFile with a Lucene index to support more effective secondary indexes ===
We modified the HBase region server to replace HFile with Lucene: on a put, the data is written to Lucene as a document instead of being written to an HFile.
The Lucene index is stored on the region server (it is not stored in a separate cluster as with Elasticsearch + HBase, which requires two copies of the data).
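The write path described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not LXDB code: the class `IndexedRegion` and its methods are hypothetical names, and a plain in-memory dictionary stands in for the Lucene index. It shows the idea that a put becomes an indexed document, so the same single copy of the data serves both key lookups and searches on other fields.

```python
from collections import defaultdict


class IndexedRegion:
    """Toy stand-in for a region whose storage is an inverted index (hypothetical)."""

    def __init__(self):
        self.rows = {}                     # row key -> stored fields (the only copy of the data)
        self.postings = defaultdict(set)   # (field, value) -> row keys that contain it

    def put(self, row_key, fields):
        # Drop postings of the previous version so an update stays consistent.
        for field, value in self.rows.get(row_key, {}).items():
            self.postings[(field, value)].discard(row_key)
        self.rows[row_key] = dict(fields)
        for field, value in fields.items():
            self.postings[(field, value)].add(row_key)

    def get(self, row_key):
        # Key-value access by row key.
        return self.rows.get(row_key)

    def search(self, field, value):
        # Secondary-index access by any field, without a second data store.
        return sorted(self.postings.get((field, value), set()))


region = IndexedRegion()
region.put("r1", {"city": "beijing", "level": "error"})
region.put("r2", {"city": "shanghai", "level": "error"})
region.put("r1", {"city": "beijing", "level": "info"})  # KV-style update of r1
```

After the update, searching for `level = error` returns only `r2`, because the stale posting for `r1` was removed as part of the same put.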


=== It is based on Spark SQL for OLAP ===
We integrated Spark and HBase. Usage is as follows:
1. Unpack lxdb.tar.gz.
2. Configure the hadoop_config path.
3. Run start-all.sh to start the cluster.
LXDB starts Spark through Hadoop YARN, and each Spark executor process then starts an embedded HBase region server service.


You can operate the LXDB database through the Spark SQL API (Hive) or the MySQL API.
1. The SQL layer uses Spark RDDs plus the HBase scanner to access HBase.
2. The SQL conditions (filters or GROUP BY aggregations) are pushed down to HBase.
3. HBase uses the Lucene index to filter data in the region server.
Spark, HBase, and Lucene are all embedded and integrated together rather than running as separate clusters; that is the difference from a Solr/ES + HBase + Spark solution.
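The three steps above can be sketched as a toy pushdown model. The names `Region`, `scan`, and `query` are assumptions made for illustration and are not taken from LXDB, Spark, or HBase; the point is only that the filter and the GROUP BY count run next to the data, and the SQL layer merges small partial results.

```python
from collections import Counter, defaultdict


class Region:
    """Toy region: rows plus an equality index on every field (hypothetical)."""

    def __init__(self, rows):
        self.rows = rows                   # row key -> dict of fields
        self.index = defaultdict(set)      # (field, value) -> row keys
        for key, fields in rows.items():
            for field, value in fields.items():
                self.index[(field, value)].add(key)

    def scan(self, where, group_by):
        # Step 3: resolve the filter from the index instead of reading every row.
        matching = self.index.get(where, set())
        # Step 2: the GROUP BY count is computed here, inside the region.
        return Counter(self.rows[key][group_by] for key in matching)


def query(regions, where, group_by):
    # Step 1: the SQL layer fans the request out and merges the partial counts.
    total = Counter()
    for region in regions:
        total += region.scan(where, group_by)
    return dict(total)


regions = [
    Region({"a1": {"level": "error", "host": "h1"},
            "a2": {"level": "info", "host": "h1"}}),
    Region({"b1": {"level": "error", "host": "h2"},
            "b2": {"level": "error", "host": "h1"}}),
]
# Roughly: SELECT host, count(*) FROM t WHERE level = 'error' GROUP BY host
result = query(regions, ("level", "error"), "host")
```

Only per-host counts cross the boundary between a region and the SQL layer, not the matching rows themselves.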


== Background ==
=== Multiple copies of data ===
Apache HBase + Elasticsearch is the most popular solution for full-text search, but it is weak at online analytical processing (OLAP).
So most of the time, production systems use Spark (or Hive, Impala, or Presto), HBase, and Solr/ES at the same time. Multiple copies of the data are stored in multiple systems, each with a different API, and data consistency is difficult to guarantee. For these reasons we merged the roles of Spark, HBase, and Elasticsearch into one project. Its goal is to use one copy of the data, one cluster, and one API to cover OLAP, KV, full-text, and other database scenarios.


=== Merging and splitting of Lucene indexes (HStore) across different machines on HDFS ===
As is well known, Solr/ES store index files on the local file system, so the number of shards must be fixed. If the index is stored on HDFS instead, it becomes splittable like an HBase HStore: it can be split or merged across machine nodes. This is very useful for a distributed database because it determines how many resources can be allocated to a table. The number of records in a table usually changes over time, so the number of shards often needs adjusting. An index stored on local disk cannot be split across machines, but a Lucene index stored on HDFS can.
Whether the number of shards can be adjusted flexibly, that is, whether the system can scale elastically, is particularly important in a distributed database.
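
As a rough illustration of why shared storage makes this easier, the sketch below models only the routing metadata for key-range shards. It is a minimal sketch under stated assumptions, not LXDB's mechanism: ShardMap and its methods are hypothetical names, and it assumes the index files stay on shared storage so that a split or merge mainly changes which node serves which key range. A real system would also have to split or merge the index files themselves.

```python
from bisect import bisect_right


class ShardMap:
    """Hypothetical routing table: shard i covers keys in [starts[i], starts[i + 1])."""

    def __init__(self, first_owner):
        self.starts = [""]            # sorted start keys; "" sorts before every key
        self.owners = [first_owner]   # owners[i] is the node serving shard i

    def locate(self, key):
        # The owning shard is the last one whose start key is <= key.
        return bisect_right(self.starts, key) - 1

    def split(self, shard, split_key, new_owner):
        # Cut one shard in two at split_key; the upper half moves to new_owner.
        lower = self.starts[shard]
        upper = self.starts[shard + 1] if shard + 1 < len(self.starts) else None
        if split_key <= lower or (upper is not None and split_key >= upper):
            raise ValueError("split_key must fall strictly inside the shard")
        self.starts.insert(shard + 1, split_key)
        self.owners.insert(shard + 1, new_owner)

    def merge(self, shard):
        # Fold shard + 1 into shard; the merged range keeps the lower shard's owner.
        if shard + 1 >= len(self.starts):
            raise ValueError("no right-hand neighbour to merge with")
        del self.starts[shard + 1]
        del self.owners[shard + 1]


shards = ShardMap("node-0")
shards.split(0, "m", "node-1")     # table grew: two shards on two nodes
owner_before = shards.owners[shards.locate("zebra")]
shards.merge(0)                    # table shrank: back to one shard
owner_after = shards.owners[shards.locate("zebra")]
```

With a fixed shard count there is no equivalent of split or merge; the table would have to be re-indexed to change its parallelism.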


=== Solving the insufficiency of secondary indexes ===
Some people use HBase secondary indexes, as in the Phoenix project. But those schemes are based on the HBase row key and carry a lot of redundancy: you cannot create many indexes because the data inflation rate becomes too high. We therefore believe that using a Lucene index instead of row-key-based secondary indexes is the best choice.


=== We add a Lucene index for Spark OLAP ===
Most OLAP systems, such as Hive, Spark SQL, Impala, and some MPP databases, suffer from brute-force scanning and poor data timeliness.
1. They use brute-force scans to compute over the data. An alternative is to add an index to the big data. In some cases an index greatly improves on the performance of the original brute-force scan. I think that, just as in a traditional database, indexing technology can greatly improve database speed.
2. Most of those databases or systems are offline or batch systems. LXDB's target is real-time append and real-time KV update, just like HBase.
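
The first point can be illustrated with a small comparison. This is a minimal sketch on synthetic data, not a benchmark of LXDB: the rows, the latency_ms column, and both filter functions are made up for illustration, and a sorted list with binary search stands in for a real index. It counts how many rows each approach has to examine to answer the same range filter.

```python
from bisect import bisect_left, bisect_right

# Synthetic table: 10,000 rows with a numeric column.
rows = [{"id": i, "latency_ms": (i * 37) % 1000} for i in range(10_000)]


def scan_filter(table, low, high):
    """Brute-force scan: every row is examined."""
    examined = 0
    hits = []
    for row in table:
        examined += 1
        if low <= row["latency_ms"] <= high:
            hits.append(row["id"])
    return sorted(hits), examined


# Build the index once: row positions ordered by column value.
order = sorted(range(len(rows)), key=lambda i: rows[i]["latency_ms"])
sorted_values = [rows[i]["latency_ms"] for i in order]


def index_filter(low, high):
    """Index lookup: binary search for the range, then touch only matching rows."""
    start = bisect_left(sorted_values, low)
    end = bisect_right(sorted_values, high)
    hits = [rows[i]["id"] for i in order[start:end]]
    return sorted(hits), end - start


scan_hits, scan_examined = scan_filter(rows, 990, 999)
index_hits, index_examined = index_filter(990, 999)
```

Both functions return the same row ids, but the scan examines all 10,000 rows while the index lookup touches only the 100 that match. The gain shrinks as the filter becomes less selective, which is why an index helps only some of the time.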


== Future ==
=== Lucene on Parquet ===
In the near future I plan to change Lucene's tim and tip (inverted index), dvd, and dvm files to a format like Parquet or ORC. The goals are:
1. To solve the performance problem of traversing the Lucene index.
2. To solve the slow opening of Lucene index files, which is caused by having to load files such as tip into memory.
3. To enable Lucene to store multi-column joint indexes by column, which can serve logic such as multi-table joins, materialized views, and multi-field group-by over the inverted index.
The current Lucene index has many problems caused by too many file pointers and its single-column design. We want to modify Lucene to make it more suitable for HDFS and good at statistical analysis as well as full-text retrieval, so that it becomes a true database-level index. We also want Lucene to be splittable, so that storage can be separated from computation.


=== Supporting all kinds of predicate pushdown calculation ===
We find that combining the calculation method closely with the data lets the database perform better. An index is only one form of pushdown. Storage pushdown is another example: we can store the index on SSD devices and the data on SATA devices. We can store data that is often grouped together in pre-grouped form instead of calculating row by row. We can assign important tables or columns to dedicated devices and resources. HBase still lacks these capabilities, and we need to improve it further.


=== Intervening in data distribution ===
We can use the row key to steer data to specific nodes, which makes many interesting things possible.
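
One way to picture row-key-based placement is the sketch below. It is only an illustration under assumptions that are not in the proposal: the "|" separator, the NODES list, and the node_for function are hypothetical, and hashing the key prefix is just one possible placement rule. Rows that share a prefix, for example the same user, end up on the same node.

```python
import hashlib

# Hypothetical cluster members.
NODES = ["node-0", "node-1", "node-2"]


def node_for(row_key, nodes=NODES):
    """Route a row by the part of its key before the first '|'."""
    prefix = row_key.split("|", 1)[0]
    # A stable hash keeps the placement identical across processes and restarts.
    digest = hashlib.md5(prefix.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]


same_user_a = node_for("user42|2021-02-27")
same_user_b = node_for("user42|2021-02-28")
```

Because both keys start with user42, same_user_a and same_user_b name the same node, so a query restricted to one user only needs to visit that node.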


=== Resource control, resource isolation ===
Lucene currently does not support resource isolation, but on HDFS we can add it: I can control the priority of SQL queries so that Lucene reads for higher-priority queries get IO resources sooner.
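
The idea of priority-controlled IO can be sketched with a priority queue. This is a minimal illustration, not LXDB's scheduler: IoScheduler, submit, and next_request are hypothetical names, and the request strings are placeholders. Read requests from higher-priority queries are served first, and requests of equal priority keep their arrival order.

```python
import heapq
import itertools


class IoScheduler:
    """Hypothetical queue that serves read requests by query priority."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()   # tie-breaker: first come, first served

    def submit(self, priority, request):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), request))

    def next_request(self):
        # Returns the most urgent pending request, or None when idle.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = IoScheduler()
scheduler.submit(1, "batch report: read segment A")
scheduler.submit(9, "dashboard query: read segment B")
scheduler.submit(1, "batch report: read segment C")

served = []
request = scheduler.next_request()
while request is not None:
    served.append(request)
    request = scheduler.next_request()
```

Here the dashboard read is served before both batch reads even though it arrived second. A production scheduler would also need some protection against starving low-priority queries, which this sketch omits.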


== Status ==
In 2011 I released the first open source version at Alibaba. At that time, mdrill used 10 nodes with 48 GB of memory each to support 400 billion records. The first index on HDFS came from this version, one year ahead of the community. https://github.com/alibaba/mdrill


In 2014 I stopped updating the mdrill project because I joined Tencent. In our team we developed the Hermes project, which also builds Lucene on HDFS. Hermes now imports 1000 billion rows of data per day in real time. It is the largest database I have ever developed: https://plus.tencent.com/bigdata/hermes


In 2018 I set up my own company called Luxin; "Lu Xin" is the Chinese pronunciation of Lucene. As a fan of Lucene, I chose lucene.xin as the company's domain and lucene.cn as its mail domain.
Luxin's first version of LXDB is called lsql, which means "Lucene SQL". It uses lucene(2.5.3)+hdfs+spark(1.6.3) and is stable; more than 200 clusters use lsql. It processes about 200 billion rows per day, for a total of 20000 billion rows in a single cluster (1000 nodes).


In 2020, during the COVID-19 pandemic, our team decided to develop the next generation of lsql, called LXDB (lx = the pronunciation of Lucene). We added HBase to lsql to solve the update problem. We have now finished the first version of LXDB: https://github.com/lucene-cn/lxdb/wiki




== Known Risks ==
=== Meritocracy ===


LXDB has been deployed in production and serves more than 200 lines of business. It has demonstrated great performance benefits and has proved to be a better way to do reporting and analysis on big data. Still, we look forward to growing a rich user and developer community.
=== Orphaned products ===


The core developers currently work full-time for Luxin.
LXDB is widely adopted by many companies and individuals. There's no realistic chance of it becoming orphaned. We also have a number of 1000-person Tencent QQ instant messaging groups.


=== Inexperience with Open Source ===
The core developers are all active users and followers of open source. They are already committers and contributors to the LXDB project. Developer yannian mu has ten years of experience with open source projects, including jstorm https://github.com/alibaba/jstorm and mdrill https://github.com/alibaba/mdrill




=== Homogenous Developers ===


Most of the core developers are from Luxin because the product was previously closed source. Now that LXDB is open source, we expect it to receive many bug fixes and enhancements from developers not working at Luxin. What we learned from the community, we return to the community.




=== Reliance on Salaried Developers ===


Luxin has invested in LXDB as its solution, and some of its key engineers are working full time on the project. In addition, since there is a growing big data need for scalable solutions, we look forward to other Apache developers and researchers contributing to the project. A key to addressing the risk of relying on salaried developers from a single entity is to increase the diversity of the contributors and to actively lobby for contributions. Apache LXDB intends to do this.


=== An Excessive Fascination with the Apache Brand ===


LXDB is proposing to enter incubation at Apache in order to help efforts to diversify the committer base, not so much to capitalize on the Apache brand. The LXDB project is already in production use inside Luxin, but is not expected to be a Luxin product for external customers. As such, the LXDB project is not seeking to use the Apache brand as a marketing tool.




=== Documentation ===


Information about LXDB can be found at https://github.com/lucene-cn/lxdb. The following links provide more information about LXDB in open source:


* wiki site: https://github.com/lucene-cn/lxdb/wiki
* Issue Tracking: https://github.com/lucene-cn/lxdb/issues
* Overview: https://github.com/lucene-cn/lxdb/wiki/intro
* Luxin home page: http://www.lucene.xin
* lsql document: http://docs.lucene.xin/lsql/v21/


== Initial Source ==


LXDB source code will be developed under an Apache license at https://github.com/lucene-cn/lxdb.




=== Core Developers ===


Currently most of the core developers of LXDB work in the research team of Luxin.


- yannian mu (dev)
- yu chen (dev)
- guangshi hao (dev)
- wei sun (dev)
- qihua zheng (dev)
- xin wang (dev)
- qingsong liu (dev)
- anxing zhou (Tester)
- jiajun duan (Tester)


== External Dependencies ==
All dependencies are managed using Apache Maven.
Dependency    License               Optional?
lucene        Apache License 2.0    true
zookeeper     Apache License 2.0    true
hbase         Apache License 2.0    true
spark         Apache License 2.0    true
hadoop        Apache License 2.0    true
hive          Apache License 2.0    true


== Required Resources ==


=== Mailing lists ===


 * lxdb-private (PMC discussion)
 * lxdb-dev (developer discussion)
 * lxdb-user (user discussion)
 * lxdb-commits (SCM commits)
 * lxdb-issues (JIRA issue feed)


=== Subversion Directory ===


Instead of Subversion, LXDB prefers Git as its source control management system: git://git.apache.org/lxdb