Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2019/06/14 20:34:16 UTC

[GitHub] [incubator-pinot] buchireddy commented on issue #4317: Support variable length Offline Dictionary Indexes for bytes, strings and maps to save on storage

URL: https://github.com/apache/incubator-pinot/issues/4317#issuecomment-502254249
 
 
   I've implemented a solution based on the approach discussed in the description and benchmarked it with a **String dictionary** to compare the lookup latency and storage footprint of the VariableLength dictionary against the FixedLength dictionary. Here are the results and observations.
   
   **Time taken to lookup 10M values:**
   <img width="882" alt="TimeVsDictionarySizesCharts" src="https://user-images.githubusercontent.com/945283/59487242-edc20280-8e30-11e9-92d7-1891379d4639.png">
   **Observation:** As the graph shows, the variable length dictionary gives much better lookup latencies as the strings get bigger. However, when the dictionary size (cardinality) is >1M and the string sizes are small (<100), the FixedLength dictionary has better lookup latencies.
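   
   For readers unfamiliar with what a lookup involves, here is a minimal sketch of one plausible variable length layout: unpadded bytes plus an offset array, with binary search for value-to-id lookups. The class name, the offset array, and the use of `String.compareTo` are all assumptions made for illustration; this is not the code that was benchmarked, and it makes no claim about why the measured latencies differ.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: the names and the layout are assumptions,
// not Pinot's actual dictionary implementation.
class VarLengthDictionarySketch {
    private final byte[] data;    // concatenated UTF-8 bytes of all entries, no padding
    private final int[] offsets;  // offsets[i] = start of entry i; offsets[n] = end of data

    // sortedValues must already be in ascending String order.
    VarLengthDictionarySketch(String[] sortedValues) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        offsets = new int[sortedValues.length + 1];
        for (int i = 0; i < sortedValues.length; i++) {
            offsets[i] = out.size();
            byte[] bytes = sortedValues[i].getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
        }
        offsets[sortedValues.length] = out.size();
        data = out.toByteArray();
    }

    // Dictionary id -> value: read two offsets, then decode that slice of the data.
    String get(int dictId) {
        int start = offsets[dictId];
        return new String(data, start, offsets[dictId + 1] - start, StandardCharsets.UTF_8);
    }

    // Value -> dictionary id via binary search over the sorted entries; -1 if absent.
    int indexOf(String value) {
        int lo = 0;
        int hi = offsets.length - 2;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = get(mid).compareTo(value);
            if (cmp == 0) {
                return mid;
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        VarLengthDictionarySketch dict =
            new VarLengthDictionarySketch(new String[] {"apple", "fig", "watermelon"});
        System.out.println(dict.get(1));
        System.out.println(dict.indexOf("watermelon"));
        System.out.println(dict.indexOf("kiwi"));
    }
}
```

   In this sketch every read goes through the offset array first, whereas a fixed length layout can compute an entry's position directly from its id and the entry width.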
   
   **Storage requirements of VarLength dictionary:**
   Since the variable length dictionary doesn't do any padding, it saves space in all cases where the strings in the dictionary aren't of equal length. Hence, this graph plots the % storage savings with the VarLength dictionary instead of absolute values.
   <img width="612" alt="DictSizeVsStorageSavingsChart" src="https://user-images.githubusercontent.com/945283/59487464-9ff9ca00-8e31-11e9-9e60-13228766662e.png">
   **Observation:** If the strings in the dictionary are of different lengths, the VarLength dictionary saves 40% of the space compared to the fixed length dictionary.
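   
   To make the padding arithmetic concrete, here is a rough sketch that compares the two layouts for a given set of strings. It assumes the fixed length layout pads every entry to the longest value, and that the variable length layout stores one 4-byte offset per entry plus a final end offset. Both the helper names and the offset overhead are assumptions for illustration, not details taken from the implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Pinot.
class DictionaryStorageSketch {

    // Fixed length layout: every entry is padded to the longest value's byte length.
    static long fixedLengthBytes(String[] values) {
        int maxLen = 0;
        for (String v : values) {
            maxLen = Math.max(maxLen, v.getBytes(StandardCharsets.UTF_8).length);
        }
        return (long) maxLen * values.length;
    }

    // Variable length layout (assumed): unpadded bytes plus a 4-byte offset per
    // entry and one extra offset marking the end of the data.
    static long varLengthBytes(String[] values) {
        long dataBytes = 0;
        for (String v : values) {
            dataBytes += v.getBytes(StandardCharsets.UTF_8).length;
        }
        return dataBytes + 4L * (values.length + 1);
    }

    // Percentage saved by the variable length layout; negative when it is larger.
    static double savingsPercent(String[] values) {
        long fixed = fixedLengthBytes(values);
        if (fixed == 0) {
            return 0.0;
        }
        return 100.0 * (fixed - varLengthBytes(values)) / fixed;
    }

    public static void main(String[] args) {
        // Toy input, unrelated to the benchmark data.
        String[] values = {"a", "bb", "a-much-longer-dictionary-value"};
        System.out.println(fixedLengthBytes(values)); // 3 entries * 30 bytes = 90
        System.out.println(varLengthBytes(values));   // 33 data bytes + 16 offset bytes = 49
        System.out.printf("%.1f%%%n", savingsPercent(values));
    }
}
```

   Under these assumptions the savings depend on how far the typical string length falls below the maximum; when all strings have the same length, the offset array makes the variable length layout slightly larger in this sketch.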
   
   Thanks again to @kishoreg for all the guidance on this.
   
   P.S. All raw values from the benchmarking are available at https://docs.google.com/spreadsheets/d/1iOLyhD4AUZw3JsdOkmH6h36KWYalVeUBIVcty6Pnv0E/edit?usp=sharing so feel free to copy or comment on the results.
   
