You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/01/29 00:23:30 UTC

[GitHub] [druid] ccaominh opened a new pull request #9278: Speed up joins on indexed tables with string keys

ccaominh opened a new pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278
 
 
   ### Description
   
   When joining on index tables with string keys, caching the computation of row id to row numbers improves performance on the `JoinAndLookupBenchmark.joinIndexTableStringKey`* benchmarks by about 10% if the column cache is enabled an by about 100% if the column cache is disabled. 
   
   #### Before
   
   ```
   Benchmark                            (columnCacheSizeBytes)   Score   Error  Units
   joinIndexedTableStringKey                                 0  41.899 ± 0.688  ms/op
   joinIndexedTableStringKey                             16384  22.707 ± 0.309  ms/op
   joinIndexedTableStringKeyWithFilter                       0  41.879 ± 0.507  ms/op
   joinIndexedTableStringKeyWithFilter                   16384  22.314 ± 0.114  ms/op
   ```
   
   #### After
   
   ```
   Benchmark                            (columnCacheSizeBytes)   Score   Error  Units
   joinIndexedTableStringKey                                 0  20.527 ± 0.751  ms/op
   joinIndexedTableStringKey                             16384  20.804 ± 0.206  ms/op
   joinIndexedTableStringKeyWithFilter                       0  21.374 ± 0.299  ms/op
   joinIndexedTableStringKeyWithFilter                   16384  19.723 ± 0.390  ms/op
   ```
   
   (See https://github.com/apache/druid/pull/9267 for the `JoinAndLookupBenchmark` implementation.)
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] ccaominh commented on issue #9278: Speed up joins on indexed tables with string keys

Posted by GitBox <gi...@apache.org>.
ccaominh commented on issue #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#issuecomment-579887920
 
 
   > How do these numbers compare to the standard lookup case?
   
   Lookups are about 5x faster:
   
   ```
   Benchmark                               (columnCacheSizeBytes)  Score   Error  Units
   lookupVirtualColumnStringKey                                 0  3.374 ± 0.065  ms/op
   lookupVirtualColumnStringKey                             16384  3.364 ± 0.054  ms/op
   lookupVirtualColumnStringKeyWithFilter                       0  4.553 ± 0.114  ms/op
   lookupVirtualColumnStringKeyWithFilter                   16384  5.529 ± 0.305  ms/op
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r372180417
 
 

 ##########
 File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
 ##########
 @@ -240,13 +243,36 @@ private void advanceCurrentRow()
    */
   private static class ConditionMatcherFactory implements ColumnProcessorFactory<Supplier<IntIterator>>
   {
+    private static final int MAX_NUM_CACHE = 10;
+    private static final int CACHE_MAX_SIZE = 1000;
+
     private final ValueType keyType;
     private final IndexedTable.Index index;
 
+    // DimensionSelector -> (int) dimension id -> (IntList) row numbers
+    private final LoadingCache<DimensionSelector, LoadingCache<Integer, IntList>> dimensionCaches;
+
     ConditionMatcherFactory(ValueType keyType, IndexedTable.Index index)
     {
       this.keyType = keyType;
       this.index = index;
+
+      this.dimensionCaches =
+          Caffeine.newBuilder()
+                  .maximumSize(MAX_NUM_CACHE)
+                  .build(
+                      selector ->
+                          Caffeine.newBuilder()
 
 Review comment:
   How does this compare to the LruEvalCache in SingleStringInputCachingExpressionColumnValueSelector? I think they're trying to solve the same problem, so whichever approach works better, both that class and this one should use the best approach.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r374411013
 
 

 ##########
 File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
 ##########
 @@ -308,4 +353,112 @@ public ValueType defaultType()
       return () -> IntIterators.EMPTY_ITERATOR;
     }
   }
+
+  @VisibleForTesting
+  static class LruLoadingHashMap<K, V> extends LinkedHashMap<K, V>
+  {
+    private final int maxSize;
+    private final Function<K, V> loader;
+
+    LruLoadingHashMap(int maxSize, Function<K, V> loader)
+    {
+      super(capacity(maxSize));
+      this.maxSize = maxSize;
+      this.loader = loader;
+    }
+
+    V getAndLoadIfAbsent(K key)
+    {
+      return computeIfAbsent(key, loader);
+    }
+
+    @Override
+    protected boolean removeEldestEntry(Map.Entry eldest)
+    {
+      return size() > maxSize;
+    }
+
+    private static int capacity(int expectedSize)
+    {
+      // This is the calculation used in JDK8 to resize when a putAll happens; it seems to be the most conservative
+      // calculation we can make. 0.75 is the default load factor.
+      return (int) ((float) expectedSize / 0.75F + 1.0F);
+    }
+  }
+
+  private interface Int2IntListMap
+  {
+    IntList getAndLoadIfAbsent(int key);
+  }
+
+  /**
+   * Lookup table for keys in the range from 0 to maxSize - 1
+   */
+  @VisibleForTesting
+  static class Int2IntListLookupTable implements Int2IntListMap
+  {
+    private final IntList[] lookup;
+    private final IntFunction<IntList> loader;
+
+    Int2IntListLookupTable(int maxSize, IntFunction<IntList> loader)
+    {
+      this.loader = loader;
+      this.lookup = new IntList[maxSize];
+      Arrays.fill(lookup, null);
 
 Review comment:
   I thought Java guaranteed that object arrays will start with all `null`s.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r374409826
 
 

 ##########
 File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
 ##########
 @@ -263,8 +299,17 @@ public ValueType defaultType()
         final IndexedInts row = selector.getRow();
 
         if (row.size() == 1) {
-          final String key = selector.lookupName(row.get(0));
-          return index.find(key).iterator();
+          int dimensionId = row.get(0);
+          final IntList rowNumbers;
+          if (selector.getValueCardinality() == DimensionDictionarySelector.CARDINALITY_UNKNOWN) {
 
 Review comment:
   This won't change from row to row, so it's likely going to be better to check it outside hot code (i.e. return a different `Supplier<IntIterator>` based on the result of calling `selector.getValueCardinality()`).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9278: Speed up joins on indexed tables with string keys

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#issuecomment-579574495
 
 
   How do these numbers compare to the standard lookup case?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r372179952
 
 

 ##########
 File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
 ##########
 @@ -262,8 +288,9 @@ public ValueType defaultType()
         final IndexedInts row = selector.getRow();
 
         if (row.size() == 1) {
-          final String key = selector.lookupName(row.get(0));
-          return index.find(key).iterator();
+          int dimensionId = row.get(0);
+          //noinspection ConstantConditions (cache cannot return nulls since nulls are never stored in cache)
+          return dimensionCaches.get(selector).get(dimensionId).iterator();
 
 Review comment:
   This is only safe when the selector has a 'real' dictionary, i.e. when `selector.getValueCardinality()` does not return `DimensionSelector.CARDINALITY_UNKNOWN`. If it is unknown then the ids are not valid outside the context of a specific row. So, in that case you'll need to fall back to the slower code.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r372180206
 
 

 ##########
 File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
 ##########
 @@ -240,13 +243,36 @@ private void advanceCurrentRow()
    */
   private static class ConditionMatcherFactory implements ColumnProcessorFactory<Supplier<IntIterator>>
   {
+    private static final int MAX_NUM_CACHE = 10;
+    private static final int CACHE_MAX_SIZE = 1000;
+
     private final ValueType keyType;
     private final IndexedTable.Index index;
 
+    // DimensionSelector -> (int) dimension id -> (IntList) row numbers
+    private final LoadingCache<DimensionSelector, LoadingCache<Integer, IntList>> dimensionCaches;
+
     ConditionMatcherFactory(ValueType keyType, IndexedTable.Index index)
     {
       this.keyType = keyType;
       this.index = index;
+
+      this.dimensionCaches =
+          Caffeine.newBuilder()
+                  .maximumSize(MAX_NUM_CACHE)
+                  .build(
+                      selector ->
+                          Caffeine.newBuilder()
+                                  .maximumSize(CACHE_MAX_SIZE)
+                                  .build(dimensionId -> getRowNumbers(selector, dimensionId))
+                  );
+
+    }
+
+    private IntList getRowNumbers(DimensionSelector selector, int dimensionId)
+    {
+      final String key = selector.lookupName(dimensionId);
 
 Review comment:
   For each selector, `dimensionId`s are contiguous and go from `0` (inclusive) to `selector.getValueCardinality()` (exclusive). If the cardinality is less than `CACHE_MAX_SIZE` then you could use an `IntList[]` as a cache. It should be faster.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] ccaominh commented on a change in pull request #9278: Speed up joins on indexed tables with string keys

Posted by GitBox <gi...@apache.org>.
ccaominh commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r373751136
 
 

 ##########
 File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
 ##########
 @@ -240,13 +243,36 @@ private void advanceCurrentRow()
    */
   private static class ConditionMatcherFactory implements ColumnProcessorFactory<Supplier<IntIterator>>
   {
+    private static final int MAX_NUM_CACHE = 10;
+    private static final int CACHE_MAX_SIZE = 1000;
+
     private final ValueType keyType;
     private final IndexedTable.Index index;
 
+    // DimensionSelector -> (int) dimension id -> (IntList) row numbers
+    private final LoadingCache<DimensionSelector, LoadingCache<Integer, IntList>> dimensionCaches;
+
     ConditionMatcherFactory(ValueType keyType, IndexedTable.Index index)
     {
       this.keyType = keyType;
       this.index = index;
+
+      this.dimensionCaches =
+          Caffeine.newBuilder()
+                  .maximumSize(MAX_NUM_CACHE)
+                  .build(
+                      selector ->
+                          Caffeine.newBuilder()
 
 Review comment:
   On `JoinAndLookupBenchmark.joinIndexedTableStringKey`, using the `LruEvalCache` approach is about 10% faster than using `Caffeine`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] ccaominh commented on issue #9278: Speed up joins on indexed tables with string keys

Posted by GitBox <gi...@apache.org>.
ccaominh commented on issue #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#issuecomment-580981349
 
 
   Benchmark numbers with the improved cache implementation:
   
   #### Original
   ```
   Benchmark                               (columnCacheSizeBytes)   Score   Error  Units
   joinIndexedTableStringKey                                    0  42.339 ± 0.323  ms/op
   joinIndexedTableStringKey                                16384  22.634 ± 0.247  ms/op
   joinIndexedTableStringKeyWithFilter                          0  38.170 ± 0.268  ms/op
   joinIndexedTableStringKeyWithFilter                      16384  24.974 ± 0.462  ms/op
   lookupVirtualColumnStringKey                                 0   4.040 ± 0.042  ms/op
   lookupVirtualColumnStringKey                             16384   3.194 ± 0.011  ms/op
   lookupVirtualColumnStringKeyWithFilter                       0   5.288 ± 0.027  ms/op
   lookupVirtualColumnStringKeyWithFilter                   16384   5.175 ± 0.042  ms/op
   ```
   
   #### Int2IntListLruCache
   ```
   Benchmark                               (columnCacheSizeBytes)   Score   Error  Units
   joinIndexedTableStringKey                                    0  14.557 ± 0.151  ms/op
   joinIndexedTableStringKey                                16384  15.167 ± 0.075  ms/op
   joinIndexedTableStringKeyWithFilter                          0  16.093 ± 0.517  ms/op
   joinIndexedTableStringKeyWithFilter                      16384  16.434 ± 0.185  ms/op
   ```
   
   #### Int2IntListLookupTable
   ```
   Benchmark                               (columnCacheSizeBytes)   Score   Error  Units
   joinIndexedTableStringKey                                    0  12.759 ± 0.125  ms/op
   joinIndexedTableStringKey                                16384  12.439 ± 0.481  ms/op
   joinIndexedTableStringKeyWithFilter                          0  14.139 ± 0.127  ms/op
   joinIndexedTableStringKeyWithFilter                      16384  14.163 ± 0.176  ms/op
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r372601023
 
 

 ##########
 File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
 ##########
 @@ -240,13 +243,36 @@ private void advanceCurrentRow()
    */
   private static class ConditionMatcherFactory implements ColumnProcessorFactory<Supplier<IntIterator>>
   {
+    private static final int MAX_NUM_CACHE = 10;
+    private static final int CACHE_MAX_SIZE = 1000;
+
     private final ValueType keyType;
     private final IndexedTable.Index index;
 
+    // DimensionSelector -> (int) dimension id -> (IntList) row numbers
+    private final LoadingCache<DimensionSelector, LoadingCache<Integer, IntList>> dimensionCaches;
+
     ConditionMatcherFactory(ValueType keyType, IndexedTable.Index index)
     {
       this.keyType = keyType;
       this.index = index;
+
+      this.dimensionCaches =
+          Caffeine.newBuilder()
+                  .maximumSize(MAX_NUM_CACHE)
+                  .build(
+                      selector ->
+                          Caffeine.newBuilder()
+                                  .maximumSize(CACHE_MAX_SIZE)
+                                  .build(dimensionId -> getRowNumbers(selector, dimensionId))
+                  );
+
+    }
+
+    private IntList getRowNumbers(DimensionSelector selector, int dimensionId)
+    {
+      final String key = selector.lookupName(dimensionId);
 
 Review comment:
   (Assuming the dictionary is real, i.e., `selector.getValueCardinality() != DimensionSelector.CARDINALITY_UNKNOWN`)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm merged pull request #9278: Speed up joins on indexed tables with string keys

Posted by GitBox <gi...@apache.org>.
gianm merged pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org