You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@doris.apache.org by GitBox <gi...@apache.org> on 2018/11/23 06:48:51 UTC

[GitHub] lide-reed commented on a change in pull request #345: Update README.md

lide-reed commented on a change in pull request #345: Update README.md
URL: https://github.com/apache/incubator-doris/pull/345#discussion_r235855214

##########
File path: README.md
##########
@@ -1,215 +1,127 @@
-# Introduction to Apache Doris (incubating)
+# Apache Doris (incubating)

-Apache Doris is an MPP-based interactive SQL data warehousing for reporting and analysis. Doris mainly integrates the technology of Google Mesa and Apache Impala. Unlike other popular SQL-on-Hadoop systems, Doris is designed to be a simple and single tightly coupled system, not depending on other systems. Doris not only provides high concurrent low latency point query performance, but also provides high throughput queries of ad-hoc analysis. Doris not only provides batch data loading, but also provides near real-time mini-batch data loading. Doris also provides high availability, reliability, fault tolerance, and scalability. The simplicity (of developing, deploying and using) and meeting many data serving requirements in single system are the main features of Doris.
+Doris is an MPP-based interactive SQL data warehousing for reporting and analysis. It open-sourced by Baidu.

-## 1. Background
+## 1. License

-In Baidu, the largest Chinese search engine, we run a two-tiered data warehousing system for data processing, reporting and analysis. Similar to lambda architecture, the whole data warehouse comprises data processing and data serving. Data processing does the heavy lifting of big data: cleaning data, merging and transforming it, analyzing it and preparing it for use by end user queries; data serving is designed to serve queries against that data for different use cases. Currently data processing includes batch data processing and stream data processing technology, like Hadoop, Spark and Storm; Doris is a SQL data warehouse for serving online and interactive data reporting and analysis querying.
+[Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)

-Prior to Doris, different tools were deployed to solve diverse requirements in many ways. For example, the advertising platform needs to provide some detailed statistics associated with each served ad for every advertiser. The platform must support continuous updates, both new rows and incremental updates to existing rows within minutes. It must support latency-sensitive users serving live customer reports with very low latency requirements and batch ad-hoc multiple dimensions data analysis requiring very high throughput. In the past,this platform was built on top of sharded MySQL. But with the growth of data, MySQL cannot meet the requirements. Then, based on our existing KV system, we developed our own proprietary distributed statistical database. But, the simple KV storage was not efficient on scan performance. Because the system depends on many other systems, it is very complex to operate and maintain. Using RPC API, more complex querying usually required code programming, but users wants an MPP SQL engine. In addition to advertising system, a large number of internal BI Reporting / Analysis, also used a variety of tools. Some used the combination of SparkSQL / Impala + HDFS / HBASE. Some used MySQL to store the results that were prepared by distributed MapReduce computing. Some also bought commercial databases to use.
+## 2. Technology
+Doris mainly integrates the technology of Google Mesa and Apache Impala, and it based on a column-oriented storage engine and can communicate by MySQL client.

-However, when a use case requires the simultaneous availability of capabilities that cannot all be provided by a single tool, users were forced to build hybrid architectures that stitch multiple tools together. Users often choose to ingest and update data in one storage system, but later reorganize this data to optimize for an analytical reporting use-case served from another. Our users had been successfully deploying and maintaining these hybrid architectures, but we believe that they shouldn't need to accept their inherent complexity. A storage system built to provide great performance across a broad range of workloads provides a more elegant solution to the problems that hybrid architectures aim to solve. Doris is the solution. Doris is designed to be a simple and single tightly coupled system, not depending on other systems. Doris provides high concurrent low latency point query performance, but also provides high throughput queries of ad-hoc analysis. Doris provides bulk-batch data loading, but also provides near real-time mini-batch data loading. Doris also provides high availability, reliability, fault tolerance, and scalability.
+## 3. User cases
+Doris not only provides high concurrent low latency point query performance, but also provides high throughput queries of ad-hoc analysis.

-Generally speaking, Doris is the technology combination of Google Mesa and Apache Impala. Mesa is a highly scalable analytic data storage system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy complex and challenging set of users' and systems' requirements, including near real-time data ingestion and query ability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. At present, by virtue of its superior performance and rich functionality, Impala has been comparable to many commercial MPP database query engine. Mesa can satisfy the needs of many of our storage requirements, however Mesa itself does not provide a SQL query engine; Impala is a very good MPP SQL query engine, but the lack of a perfect distributed storage engine. So in the end we chose the combination of these two technologies.
+Doris not only provides batch data loading, but also provides near real-time mini-batch data loading.

-Learning from Mesa's data model, we developed a distributed storage engine. Unlike Mesa, this storage engine does not rely on any distributed file system. Then we deeply integrate this storage engine with Impala query engine. Query compiling, query execution coordination and catalog management of storage engine are integrated to be frontend daemon; query execution and data storage are integrated to be backend daemon. With this integration, we implemented a single, full-featured, high performance state the art of MPP database, as well as maintaining the simplicity.
+Doris also provides high availability, reliability, fault tolerance, and scalability.

-## 2. System Overview
+The simplicity (of developing, deploying and using) and meeting many data serving requirements in single system are the main features of Doris.

-Doris' implementation consists of two daemons: frontend (FE) and backend (BE). The following figures gives the overview of architecture and usage.
+## 4. Compile and install

-![](./docs/resources/palo_architecture.jpg)
+Currently support Docker environment and Linux OS:
+Docker（Linux/Windows/Mac), Ubuntu 16.04+ and CentOS 7.5+

Review comment:
We need to resolve llvm compile issue.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@doris.apache.org
For additional commands, e-mail: dev-help@doris.apache.org