Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/04/02 09:45:32 UTC

[GitHub] [incubator-doris] zbtzbtzbt opened a new pull request #8829: [Enhancement] Opt Hyperloglog

zbtzbtzbt opened a new pull request #8829:
URL: https://github.com/apache/incubator-doris/pull/8829


   # Proposed changes
   
   
   
   ## Problem Summary:
   
   At Meituan, PR https://github.com/apache/incubator-doris/pull/6625 was reverted due to an OOM problem.
   We are now reworking the old HyperLogLog implementation, building on PR https://github.com/apache/incubator-doris/pull/8555.
   In our tests, the result performs better than both the old HLL and the HLL on apache:master.
   
   Changes summary:
   
   - use SIMD max to speed up the heavy function `_merge_registers`
   - use phmap::flat_hash_set rather than std::set
   - replace std::max
   - other small changes
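   
   To illustrate the first item: merging two HLL sketches means taking the per-register maximum, which maps directly onto a packed byte-wise max instruction. The snippet below is only a minimal sketch of that idea, not the code in this PR. The function name `merge_registers_sketch`, the one-byte-per-register layout, and the choice of SSE2 are assumptions made for illustration.
   
   ```cpp
   #include <cstddef>
   #include <cstdint>
   #if defined(__SSE2__)
   #include <emmintrin.h>
   #endif
   
   // Hypothetical helper (not the PR's actual code): merge `src` into `dst`
   // by keeping the larger value of each register pair.
   // Assumes one uint8_t per HLL register.
   inline void merge_registers_sketch(uint8_t* dst, const uint8_t* src, size_t n) {
       size_t i = 0;
   #if defined(__SSE2__)
       // Process 16 registers per iteration with an unsigned byte-wise max.
       for (; i + 16 <= n; i += 16) {
           __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst + i));
           __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
           _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), _mm_max_epu8(a, b));
       }
   #endif
       // Scalar tail; this is also the full fallback when SSE2 is unavailable.
       for (; i < n; ++i) {
           dst[i] = dst[i] > src[i] ? dst[i] : src[i];
       }
   }
   ```
   
   The scalar loop avoids `std::max` and uses a plain conditional, which compilers can usually auto-vectorize on targets without the intrinsic path.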
   
   
   ------
   
   ## test data:
   ### query speed test1
   
   - opt_hll: 402.7 secs
   - apache_master: 580.5 secs
   
   ```
   select 
       dt, HLL_UNION_AGG(union_id)
   from
       xxx
   where
       dt = 20211130 and hour <= 19 and minute > 10
   group by dt;
   ```
   
   <img width="1098" alt="opt_hll_for_incubator_doris" src="https://user-images.githubusercontent.com/35688959/161377191-e757ae67-9cf7-4460-b98d-eda56efb5cbb.png">
   
   
   <img width="1100" alt="apache_doris_hll" src="https://user-images.githubusercontent.com/35688959/161377190-89cf5e5a-d20f-4507-ae26-5c4079106c9f.png">
   
   
   ### query speed test2
   
   - opt_hll: 1 min 31 secs
   - apache_master: 2 min 02 secs
   
   ```
   insert into xxx_table
   select a,
          b,
          dt,
          xxx xxx
          union_id
     from (
       select
          xxx xxx
          cast(unhex(id) as hll) union_id
     from _tmp a
       where dt = 20211130
    ) c;
   ```
   
   ### memory consumption test1
   
   total memory cost:
   
   - opt_hll: 90.56G
   - apache_master: 99.78G
   
   <img width="801" alt="compare1" src="https://user-images.githubusercontent.com/35688959/161376265-9f56ffda-1202-4d6f-9b1b-cc20996db89d.png">
   
   ### memory consumption test2
   
   total memory cost:
   
   - opt_hll: 90.56G
   - apache_master: 99.78G
   
   <img width="799" alt="hll2" src="https://user-images.githubusercontent.com/35688959/161376881-3b0434e8-5f2b-46d6-a702-b6e55c53b1d0.png">
   
   
   
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (Yes/No/I Don't know)
   2. Have unit tests been added: (Yes/No/No Need)
   3. Has documentation been added or modified: (Yes/No/No Need)
   4. Does it need to update dependencies: (Yes/No)
   5. Are there any changes that cannot be rolled back: (Yes/No)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
For additional commands, e-mail: commits-help@doris.apache.org