You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/07/23 15:13:30 UTC

[GitHub] [incubator-druid] leventov opened a new issue #5526: Simplify index merging algorithm

leventov opened a new issue #5526: Simplify index merging algorithm
URL: https://github.com/apache/incubator-druid/issues/5526
 
 
   Current process seems to be overcomplicated.
   
   The current steps:
   
    1) Produce per-index bitmaps, using unsorted per-index encoding and per-index row numbers: `IncrementalIndexAdapter.processRows()` (called from `IncrementalIndexAdapter` constructor)
    2) Sort per-index dictionary. See `StringDimensionIndexer.SortedDimensionDictionary`: it has three O(N)-sized structures: for
      a) sorted-to-unsorted encoding conversion,
      b) unsorted-to-sorted encoding conversion,
      c) values in the sorted order itself (for iteration later; however it's materialization is not needed for sure, the list from `DimensionDictionary` and sorted-to-unsorted encoding could be combined)
     Additionally, when preparing those three structures, *another* O(N) structure is temporarily created (a TreeMap)
    3) When merging per-index sorted dictionaries, O(N) per-index-sorted-to-global-sorted encoding conversion structures are created: see `IndexMerger.DictionaryMergeIterator`
    4) Before global merge of rows, encoded row values are converted *twice*: firstly, per-index-unsorted-to-sorted, using the structure from 2.a), and secondly, per-index-sorted-to-globally-sorted, using the structure from 3).
    5) When merging rows of indexes into a global row stream, per-index-row-number-to-global-row-number conversion structures are created: see `IndexMergerV9. mergeIndexesAndWriteColumns()`.
    6) After writing out the global stream of rows, per-index bitmaps are merged. This merge requires the structures from 2.b) and 5). See `StringDimensionMergerV9.mergeBitmaps()`.
   
   I don't see why it couldn't be simplified to the following:
   
    1) Create a sorted-to-unsorted encoding conversion. It could be done by creating `int[]`, fill it with 0, 1, .., and then sort it with strategy, delegating comparison to the `DimensionDictionary`'s array of String values. I. e. one O(N) structure is created on this step.
    2) When merging per-index sorted dictionaries (as explained above, it could be done by iterating the structure, created on the previous step, and composing it with `DimensionDictionary`'s array), create per-index-unsorted-to-global-sorted encoding conversion: O(N) structure.
    3) Before global merge of rows, convert encoded row values *once*, using the structure, created on the previous step.
    4) While merging and writing out a global row stream, global bitmaps are created as we go.
   
   The main question is why we need to create per-index bitmaps in the very beginning, then create a lot of auxiliary structures to keep them usable until the very end, and them merge. If we could(?) *just* create bitmaps in the end.
   
   @jon-wei @gianm @jihoonson 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org