Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2019/10/23 09:31:21 UTC
[GitHub] [incubator-doris] wuyunfeng opened a new issue #2048: [POC] Add doc value scan optimization for Doris on Elasticsearch

wuyunfeng opened a new issue #2048: [POC] Add doc value scan optimization for Doris on Elasticsearch
URL: https://github.com/apache/incubator-doris/issues/2048
 
 
   # Test Data Set
   
   ## Elasticsearch index schema:
   
   ```
   {
      "test": {
         "mappings": {
            "doc": {
               "properties": {
                  "k1": {
                     "type": "date",
                     "format": "yyyyMMdd"
                  },
                  "k2": {
                     "type": "keyword"
                  },
                  "k3": {
                     "type": "text",
                     "analyzer": "wordseg-segment"
                  },
                  "k4": {
                     "type": "integer"
                  }
               }
            }
         }
      }
   }
   ```
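
As a rough illustration of which columns a doc value scan could cover, the sketch below walks the mapping above and lists the fields that carry doc values. It assumes the usual Elasticsearch default that every field type in this schema except `text` has doc values enabled unless the mapping sets `"doc_values": false`. The helper name `doc_value_fields` is hypothetical and is not part of Doris or Elasticsearch.

```python
# The mapping from the issue, written as a Python dict.
MAPPING = {
    "test": {
        "mappings": {
            "doc": {
                "properties": {
                    "k1": {"type": "date", "format": "yyyyMMdd"},
                    "k2": {"type": "keyword"},
                    "k3": {"type": "text", "analyzer": "wordseg-segment"},
                    "k4": {"type": "integer"},
                }
            }
        }
    }
}

# Assumption: among the types used here, only `text` has no doc values.
NO_DOC_VALUES_TYPES = {"text"}


def doc_value_fields(mapping, index, doc_type):
    """Return the sorted names of fields that can be read from doc values."""
    props = mapping[index]["mappings"][doc_type]["properties"]
    return sorted(
        name
        for name, spec in props.items()
        if spec.get("type") not in NO_DOC_VALUES_TYPES
        and spec.get("doc_values", True)
    )


print(doc_value_fields(MAPPING, "test", "doc"))  # ['k1', 'k2', 'k4']
```

Under these assumptions `k3` would still have to come from `_source`, so a query that selects it could not be served from doc values alone.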
   
   ## Document example:
   
   ```
   "_source": {
      "k1": "20190617",
      "k2": "1462853203791",
      "k3": "Elastic Beats 是一组轻量型的数据采集器，可以方便地将数据发送给 Elasticsearch 服务。由于是轻量型的，Beats 不会产生太多的运行时开销，因此，可以在硬件资源有限的设备（如 IoT 设备、边缘设备或嵌入式设备）上运行和收集数据。如果您需要收集数据，但没有资源来运行资源密集型数据收集器，那么 Beats 会是您最佳的选择。这种无处不在（涵盖所有联网设备）的数据收集方式，让您能够快速检测到异常情况做出反应，例如系统范围内的问题和安全事件等。
   当然，Beats 并不局限于资源有限的系统，它们还可用于具有更多可用硬件资源的系统",
      "k4": 1
   }
   ```
   
   ### The `test` index has about 4.5 million documents
   
   ## Performance comparison:
   
   ### _source:
   
   #### set batch_size = 10000
   
   SQL statement | Time consumed
   ---|---
   select count(\*) from test | 2min 48s
   select count(\*) from test where esquery('k4', '{"match": {"k2": "格力空调"}}'); | 1s
   
   - 20745 documents satisfy the match query: {"match": {"k2": "格力空调"}}
   
   ### doc_value:
   
   #### set batch_size = 10000
   
   SQL statement | Time consumed
   ---|---
   select count(\*) from test | 7.56s
   select count(\*) from test where esquery('k4', '{"match": {"k2": "格力空调"}}'); | 0.15s
   
   - 20745 documents satisfy the match query: {"match": {"k2": "格力空调"}}
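
For a rough sense of scale, the timings reported in the two tables can be turned into speedup factors. This is plain arithmetic on the reported numbers, with 2min 48s taken as 168s. The dictionary keys are shortened labels, not the exact SQL statements.

```python
# (seconds reading _source, seconds reading doc_value), from the tables above.
TIMINGS = {
    "count(*) full scan": (168.0, 7.56),  # 2min 48s = 168s
    "count(*) with esquery match": (1.0, 0.15),
}


def speedup(source_seconds, doc_value_seconds):
    """How many times faster the doc_value read is than the _source read."""
    return source_seconds / doc_value_seconds


for label, (source_s, doc_value_s) in TIMINGS.items():
    print(f"{label}: {speedup(source_s, doc_value_s):.1f}x")
# count(*) full scan: 22.2x
# count(*) with esquery match: 6.7x
```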
   
   # Conclusion
   
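   In analytical scenarios, reading from `doc_value` performs much better than reading from `_source`: in this test the full `count(*)` scan dropped from 2min 48s to 7.56s, and the filtered query from 1s to 0.15s.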
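
The difference between the two read paths can be sketched at the level of the Elasticsearch search request body. The example below is a minimal illustration, not necessarily the request Doris builds. The function names are hypothetical, mapping `batch_size` onto the `size` parameter is an assumption, and the `match_all` query stands in for whatever filter is pushed down. `_source` and `docvalue_fields` are standard search body parameters.

```python
import json


def source_scan_body(fields, batch_size):
    """Request body that fetches the listed fields from the stored _source."""
    return {
        "size": batch_size,  # assumption: batch_size maps to `size`
        "_source": fields,
        "query": {"match_all": {}},
    }


def doc_value_scan_body(fields, batch_size):
    """Request body that skips _source and reads column values from doc values."""
    return {
        "size": batch_size,  # assumption: batch_size maps to `size`
        "_source": False,  # do not load or return the _source document
        "docvalue_fields": fields,
        "query": {"match_all": {}},
    }


# k3 is a `text` field, so only k1, k2 and k4 are requested from doc values.
print(json.dumps(doc_value_scan_body(["k1", "k2", "k4"], 10000), indent=2))
```

Because `k3` in the example document is by far the largest field, skipping `_source` avoids loading and parsing it for every hit, which is one plausible reason for the gap measured above.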
   
   I will keep working on this PR.
