You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/01/29 00:23:30 UTC
[GitHub] [druid] ccaominh opened a new pull request #9278: Speed up joins on
indexed tables with string keys
ccaominh opened a new pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278
### Description
When joining on index tables with string keys, caching the computation of row id to row numbers improves performance on the `JoinAndLookupBenchmark.joinIndexTableStringKey`* benchmarks by about 10% if the column cache is enabled an by about 100% if the column cache is disabled.
#### Before
```
Benchmark (columnCacheSizeBytes) Score Error Units
joinIndexedTableStringKey 0 41.899 ± 0.688 ms/op
joinIndexedTableStringKey 16384 22.707 ± 0.309 ms/op
joinIndexedTableStringKeyWithFilter 0 41.879 ± 0.507 ms/op
joinIndexedTableStringKeyWithFilter 16384 22.314 ± 0.114 ms/op
```
#### After
```
Benchmark (columnCacheSizeBytes) Score Error Units
joinIndexedTableStringKey 0 20.527 ± 0.751 ms/op
joinIndexedTableStringKey 16384 20.804 ± 0.206 ms/op
joinIndexedTableStringKeyWithFilter 0 21.374 ± 0.299 ms/op
joinIndexedTableStringKeyWithFilter 16384 19.723 ± 0.390 ms/op
```
(See https://github.com/apache/druid/pull/9267 for the `JoinAndLookupBenchmark` implementation.)
<hr>
This PR has:
- [x] been self-reviewed.
- [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] ccaominh commented on issue #9278: Speed up joins on
indexed tables with string keys
Posted by GitBox <gi...@apache.org>.
ccaominh commented on issue #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#issuecomment-579887920
> How do these numbers compare to the standard lookup case?
Lookups are about 5x faster:
```
Benchmark (columnCacheSizeBytes) Score Error Units
lookupVirtualColumnStringKey 0 3.374 ± 0.065 ms/op
lookupVirtualColumnStringKey 16384 3.364 ± 0.054 ms/op
lookupVirtualColumnStringKeyWithFilter 0 4.553 ± 0.114 ms/op
lookupVirtualColumnStringKeyWithFilter 16384 5.529 ± 0.305 ms/op
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] gianm commented on a change in pull request #9278: Speed up
joins on indexed tables with string keys
Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r372180417
##########
File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
##########
@@ -240,13 +243,36 @@ private void advanceCurrentRow()
*/
private static class ConditionMatcherFactory implements ColumnProcessorFactory<Supplier<IntIterator>>
{
+ private static final int MAX_NUM_CACHE = 10;
+ private static final int CACHE_MAX_SIZE = 1000;
+
private final ValueType keyType;
private final IndexedTable.Index index;
+ // DimensionSelector -> (int) dimension id -> (IntList) row numbers
+ private final LoadingCache<DimensionSelector, LoadingCache<Integer, IntList>> dimensionCaches;
+
ConditionMatcherFactory(ValueType keyType, IndexedTable.Index index)
{
this.keyType = keyType;
this.index = index;
+
+ this.dimensionCaches =
+ Caffeine.newBuilder()
+ .maximumSize(MAX_NUM_CACHE)
+ .build(
+ selector ->
+ Caffeine.newBuilder()
Review comment:
How does this compare to the LruEvalCache in SingleStringInputCachingExpressionColumnValueSelector? I think they're trying to solve the same problem, so whichever approach works better, both that class and this one should use the best approach.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] gianm commented on a change in pull request #9278: Speed up
joins on indexed tables with string keys
Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r374411013
##########
File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
##########
@@ -308,4 +353,112 @@ public ValueType defaultType()
return () -> IntIterators.EMPTY_ITERATOR;
}
}
+
+ @VisibleForTesting
+ static class LruLoadingHashMap<K, V> extends LinkedHashMap<K, V>
+ {
+ private final int maxSize;
+ private final Function<K, V> loader;
+
+ LruLoadingHashMap(int maxSize, Function<K, V> loader)
+ {
+ super(capacity(maxSize));
+ this.maxSize = maxSize;
+ this.loader = loader;
+ }
+
+ V getAndLoadIfAbsent(K key)
+ {
+ return computeIfAbsent(key, loader);
+ }
+
+ @Override
+ protected boolean removeEldestEntry(Map.Entry eldest)
+ {
+ return size() > maxSize;
+ }
+
+ private static int capacity(int expectedSize)
+ {
+ // This is the calculation used in JDK8 to resize when a putAll happens; it seems to be the most conservative
+ // calculation we can make. 0.75 is the default load factor.
+ return (int) ((float) expectedSize / 0.75F + 1.0F);
+ }
+ }
+
+ private interface Int2IntListMap
+ {
+ IntList getAndLoadIfAbsent(int key);
+ }
+
+ /**
+ * Lookup table for keys in the range from 0 to maxSize - 1
+ */
+ @VisibleForTesting
+ static class Int2IntListLookupTable implements Int2IntListMap
+ {
+ private final IntList[] lookup;
+ private final IntFunction<IntList> loader;
+
+ Int2IntListLookupTable(int maxSize, IntFunction<IntList> loader)
+ {
+ this.loader = loader;
+ this.lookup = new IntList[maxSize];
+ Arrays.fill(lookup, null);
Review comment:
I thought Java guaranteed that object arrays will start with all `null`s.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] gianm commented on a change in pull request #9278: Speed up
joins on indexed tables with string keys
Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r374409826
##########
File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
##########
@@ -263,8 +299,17 @@ public ValueType defaultType()
final IndexedInts row = selector.getRow();
if (row.size() == 1) {
- final String key = selector.lookupName(row.get(0));
- return index.find(key).iterator();
+ int dimensionId = row.get(0);
+ final IntList rowNumbers;
+ if (selector.getValueCardinality() == DimensionDictionarySelector.CARDINALITY_UNKNOWN) {
Review comment:
This won't change from row to row, so it's likely going to be better to check it outside hot code (i.e. return a different `Supplier<IntIterator>` based on the result of calling `selector.getValueCardinality()`).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] gianm commented on issue #9278: Speed up joins on indexed
tables with string keys
Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#issuecomment-579574495
How do these numbers compare to the standard lookup case?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] gianm commented on a change in pull request #9278: Speed up
joins on indexed tables with string keys
Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r372179952
##########
File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
##########
@@ -262,8 +288,9 @@ public ValueType defaultType()
final IndexedInts row = selector.getRow();
if (row.size() == 1) {
- final String key = selector.lookupName(row.get(0));
- return index.find(key).iterator();
+ int dimensionId = row.get(0);
+ //noinspection ConstantConditions (cache cannot return nulls since nulls are never stored in cache)
+ return dimensionCaches.get(selector).get(dimensionId).iterator();
Review comment:
This is only safe when the selector has a 'real' dictionary, i.e. when `selector.getValueCardinality()` does not return `DimensionSelector.CARDINALITY_UNKNOWN`. If it is unknown then the ids are not valid outside the context of a specific row. So, in that case you'll need to fall back to the slower code.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] gianm commented on a change in pull request #9278: Speed up
joins on indexed tables with string keys
Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r372180206
##########
File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
##########
@@ -240,13 +243,36 @@ private void advanceCurrentRow()
*/
private static class ConditionMatcherFactory implements ColumnProcessorFactory<Supplier<IntIterator>>
{
+ private static final int MAX_NUM_CACHE = 10;
+ private static final int CACHE_MAX_SIZE = 1000;
+
private final ValueType keyType;
private final IndexedTable.Index index;
+ // DimensionSelector -> (int) dimension id -> (IntList) row numbers
+ private final LoadingCache<DimensionSelector, LoadingCache<Integer, IntList>> dimensionCaches;
+
ConditionMatcherFactory(ValueType keyType, IndexedTable.Index index)
{
this.keyType = keyType;
this.index = index;
+
+ this.dimensionCaches =
+ Caffeine.newBuilder()
+ .maximumSize(MAX_NUM_CACHE)
+ .build(
+ selector ->
+ Caffeine.newBuilder()
+ .maximumSize(CACHE_MAX_SIZE)
+ .build(dimensionId -> getRowNumbers(selector, dimensionId))
+ );
+
+ }
+
+ private IntList getRowNumbers(DimensionSelector selector, int dimensionId)
+ {
+ final String key = selector.lookupName(dimensionId);
Review comment:
For each selector, `dimensionId`s are contiguous and go from `0` (inclusive) to `selector.getValueCardinality()` (exclusive). If the cardinality is less than `CACHE_MAX_SIZE` then you could use an `IntList[]` as a cache. It should be faster.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] ccaominh commented on a change in pull request #9278: Speed
up joins on indexed tables with string keys
Posted by GitBox <gi...@apache.org>.
ccaominh commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r373751136
##########
File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
##########
@@ -240,13 +243,36 @@ private void advanceCurrentRow()
*/
private static class ConditionMatcherFactory implements ColumnProcessorFactory<Supplier<IntIterator>>
{
+ private static final int MAX_NUM_CACHE = 10;
+ private static final int CACHE_MAX_SIZE = 1000;
+
private final ValueType keyType;
private final IndexedTable.Index index;
+ // DimensionSelector -> (int) dimension id -> (IntList) row numbers
+ private final LoadingCache<DimensionSelector, LoadingCache<Integer, IntList>> dimensionCaches;
+
ConditionMatcherFactory(ValueType keyType, IndexedTable.Index index)
{
this.keyType = keyType;
this.index = index;
+
+ this.dimensionCaches =
+ Caffeine.newBuilder()
+ .maximumSize(MAX_NUM_CACHE)
+ .build(
+ selector ->
+ Caffeine.newBuilder()
Review comment:
On `JoinAndLookupBenchmark.joinIndexedTableStringKey`, using the `LruEvalCache` approach is about 10% faster than using `Caffeine`.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] ccaominh commented on issue #9278: Speed up joins on
indexed tables with string keys
Posted by GitBox <gi...@apache.org>.
ccaominh commented on issue #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#issuecomment-580981349
Benchmark numbers with the improved cache implementation:
#### Original
```
Benchmark (columnCacheSizeBytes) Score Error Units
joinIndexedTableStringKey 0 42.339 ± 0.323 ms/op
joinIndexedTableStringKey 16384 22.634 ± 0.247 ms/op
joinIndexedTableStringKeyWithFilter 0 38.170 ± 0.268 ms/op
joinIndexedTableStringKeyWithFilter 16384 24.974 ± 0.462 ms/op
lookupVirtualColumnStringKey 0 4.040 ± 0.042 ms/op
lookupVirtualColumnStringKey 16384 3.194 ± 0.011 ms/op
lookupVirtualColumnStringKeyWithFilter 0 5.288 ± 0.027 ms/op
lookupVirtualColumnStringKeyWithFilter 16384 5.175 ± 0.042 ms/op
```
#### Int2IntListLruCache
```
Benchmark (columnCacheSizeBytes) Score Error Units
joinIndexedTableStringKey 0 14.557 ± 0.151 ms/op
joinIndexedTableStringKey 16384 15.167 ± 0.075 ms/op
joinIndexedTableStringKeyWithFilter 0 16.093 ± 0.517 ms/op
joinIndexedTableStringKeyWithFilter 16384 16.434 ± 0.185 ms/op
```
#### Int2IntListLookupTable
```
Benchmark (columnCacheSizeBytes) Score Error Units
joinIndexedTableStringKey 0 12.759 ± 0.125 ms/op
joinIndexedTableStringKey 16384 12.439 ± 0.481 ms/op
joinIndexedTableStringKeyWithFilter 0 14.139 ± 0.127 ms/op
joinIndexedTableStringKeyWithFilter 16384 14.163 ± 0.176 ms/op
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] gianm commented on a change in pull request #9278: Speed up
joins on indexed tables with string keys
Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278#discussion_r372601023
##########
File path: processing/src/main/java/org/apache/druid/segment/join/table/IndexedTableJoinMatcher.java
##########
@@ -240,13 +243,36 @@ private void advanceCurrentRow()
*/
private static class ConditionMatcherFactory implements ColumnProcessorFactory<Supplier<IntIterator>>
{
+ private static final int MAX_NUM_CACHE = 10;
+ private static final int CACHE_MAX_SIZE = 1000;
+
private final ValueType keyType;
private final IndexedTable.Index index;
+ // DimensionSelector -> (int) dimension id -> (IntList) row numbers
+ private final LoadingCache<DimensionSelector, LoadingCache<Integer, IntList>> dimensionCaches;
+
ConditionMatcherFactory(ValueType keyType, IndexedTable.Index index)
{
this.keyType = keyType;
this.index = index;
+
+ this.dimensionCaches =
+ Caffeine.newBuilder()
+ .maximumSize(MAX_NUM_CACHE)
+ .build(
+ selector ->
+ Caffeine.newBuilder()
+ .maximumSize(CACHE_MAX_SIZE)
+ .build(dimensionId -> getRowNumbers(selector, dimensionId))
+ );
+
+ }
+
+ private IntList getRowNumbers(DimensionSelector selector, int dimensionId)
+ {
+ final String key = selector.lookupName(dimensionId);
Review comment:
(Assuming the dictionary is real, i.e., `selector.getValueCardinality() != DimensionSelector.CARDINALITY_UNKNOWN`)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org
[GitHub] [druid] gianm merged pull request #9278: Speed up joins on indexed
tables with string keys
Posted by GitBox <gi...@apache.org>.
gianm merged pull request #9278: Speed up joins on indexed tables with string keys
URL: https://github.com/apache/druid/pull/9278
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org