You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@doris.apache.org by GitBox <gi...@apache.org> on 2019/07/22 06:59:47 UTC
[GitHub] [incubator-doris] wuyunfeng opened a new issue #1525: Expose data pruned-filter-scan ability for Integrating Spark and Doris

wuyunfeng opened a new issue #1525: Expose data pruned-filter-scan ability for Integrating Spark and Doris
URL: https://github.com/apache/incubator-doris/issues/1525
 
 
   ### Background
   1. Recently years,  Machine Learning  intensively are integrated to computing system such as Spark MLib for Spark etc.  if Doris expose the data for Computing System such as Spark,  then we can make full use of the Computing System's MLib gains a competitive edge with state-of-the art Machine Learning on large data sets, collaborate, and productionize models at massive scale
   
   2. When Doris processing large query, if memory exceed the memory limitation, this query would fail, but Spark can write the intermediate result back to Disk gracefully， the Integration for Spark and Doris not only resolved this problem, but also Spark can take full advantage of the scan performance of the column-stride of Doris Storage Engine which can provide extra push-down filter ability save a lot of network IO .
   
   3. Nowadays, enterprise  store lots data on different Storage Service such as Mysql、Elasticsearch、Doris、HDFS、NFS、Table etc. They need a way to analyze all this data with conjunctive query，Spark already get through with almost these Service except Doris.
   
   ### How to realize
   
   1. Doris FE  is responsible for pruning the related tablet, decide which predicate can used to filter data, providing single node query plan fragment, then packed all those things and return client without scheduling these to backend instance in contrast to before. In this way, FE no longer  would coordinate the query lifecycle
   
   2. Doris BE is responsible for assembling all parameters from client, such as tabletIds、version、encoded plan fragment etc which mostly generated by `Doris FE `, and then execute this plan fragment instance, all the row results return by this plan fragment would be pushed to a attached blocking queue firstly,  when client call `get_next` to iterate the result, fetch this batched result from  blocking queue and answer the client.
   
   ### Additional  API 
   #### Doris FE  HTTP Transport Protocol
   1.  GET Table Schema
   
   ```
   GET /{cluster}/{database}/{table}/_schema
   ```
   
   2. GET Query Plan
   
   ```
   POST /{cluster}/{database}/{table}/_query_plan
   {
     "sql": "select k2, k4 from table where k1 > 2 and k3 > 4"
   }
   ```
   #### Doris BE  Thrift Transport Protocol
   
   ```
   // scan service expose ability of scanning data ability to other compute system
   service **TDorisExternalService** {
       // doris will build  a scan context for this session, context_id returned if success
       TScanOpenResult open(1: TScanOpenParams params);
   
       // return the batch_size of data
       TScanBatchResult getNext(1: TScanNextBatchParams params);
   
       // release the context resource associated with the context_id
       TScanCloseResult close(1: TScanCloseParams params);
   }
   ```
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@doris.apache.org
For additional commands, e-mail: dev-help@doris.apache.org