Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/04 22:59:14 UTC
[GitHub] [druid] clintropolis commented on pull request #10224: Segment backed broadcast join IndexedTable

clintropolis commented on pull request #10224:
URL: https://github.com/apache/druid/pull/10224#issuecomment-668866723


   I made some fixes to this PR: while testing, I ran into a buffer leak, because the columns created when creating the `Reader` were never closed. To fix this, I've modified `HashJoinSegmentStorageAdapter` to feed a `Closer` that runs in the sequence baggage, similar to `QueryableIndexCursorSequenceBuilder`, and threaded it through the join matcher/cursors down into `IndexedTableColumnSelectorFactory`, `IndexedTableColumnValueSelector`, and `IndexedTableDimensionSelector`, so that the `IndexedTable.Reader` they create can be closed once the cursor sequence has been processed.
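   
   To illustrate the idea of threading a closer down to whatever creates the readers, here is a minimal, self-contained sketch. Every class name in it (`SketchCloser`, `SketchReader`, `SketchSelector`, `CloserSketchDemo`) is hypothetical and is not the code in this PR; it also assumes the closer releases resources in reverse registration order, as Guava-style closers do.
   
   ```java
   import java.io.Closeable;
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.util.ArrayDeque;
   import java.util.ArrayList;
   import java.util.Deque;
   import java.util.List;
   
   // Hypothetical stand-in for a closer: collects resources, closes them in reverse order.
   final class SketchCloser implements Closeable
   {
     private final Deque<Closeable> stack = new ArrayDeque<>();
   
     <C extends Closeable> C register(C resource)
     {
       stack.push(resource);
       return resource;
     }
   
     @Override
     public void close() throws IOException
     {
       IOException first = null;
       while (!stack.isEmpty()) {
         try {
           stack.pop().close();
         }
         catch (IOException e) {
           // Keep closing the rest; remember the first failure and attach the others.
           if (first == null) {
             first = e;
           } else {
             first.addSuppressed(e);
           }
         }
       }
       if (first != null) {
         throw first;
       }
     }
   }
   
   // Hypothetical reader that holds a resource (a buffer, say) until it is closed.
   final class SketchReader implements Closeable
   {
     private final String column;
     private final List<String> log;
   
     SketchReader(String column, List<String> log)
     {
       this.column = column;
       this.log = log;
     }
   
     @Override
     public void close()
     {
       log.add("closed " + column);
     }
   }
   
   // Hypothetical selector: it creates a reader and hands ownership to the closer it was given.
   final class SketchSelector
   {
     final SketchReader reader;
   
     SketchSelector(String column, SketchCloser closer, List<String> log)
     {
       this.reader = closer.register(new SketchReader(column, log));
     }
   }
   
   final class CloserSketchDemo
   {
     // Builds two selectors, then closes the shared closer as sequence baggage would.
     static List<String> run()
     {
       List<String> log = new ArrayList<>();
       SketchCloser closer = new SketchCloser();
       new SketchSelector("a", closer, log);
       new SketchSelector("b", closer, log);
       try {
         closer.close();
       }
       catch (IOException e) {
         throw new UncheckedIOException(e);
       }
       return log;
     }
   }
   ```
   
   The point of the sketch is only the ownership hand-off: the selectors never close their readers themselves, and a single `close()` at the end of processing releases everything that was registered along the way.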
   
   I've also added a new method to the `IndexedTable` interface:
   ```java
     @Nullable
     default ColumnSelectorFactory makeColumnSelectorFactory(ReadableOffset offset, boolean descending, Closer closer)
     {
       return null;
     }
   ```
   
   to allow it to directly provide a `ColumnSelectorFactory` to `IndexedTableJoinMatcher` instead of using `IndexedTableColumnSelectorFactory`. This lets `BroadcastSegmentIndexedTable` supply a `QueryableIndexColumnSelectorFactory` directly from the underlying segment, avoiding the indirection of going through a `Reader` in the standard selectors. The performance improvement seems pretty significant, roughly 77 ms/op down to 20 ms/op in the runs below (these come from some benchmarks I've been playing with that are not in this PR yet and will be added later; projection '6' in this case selects 3 columns from the rhs table):
   
   before:
   ```
   Benchmark                                                 (lhsJoinColumn)  (projection)  (rhsJoinColumn)  (rowsPerSegment)  (rowsPerTableSegment)  (tableType)  Mode  Cnt    Score    Error  Units
   IndexedTableBenchmark.hashJoinCursorColumnValueSelectors          dimZipf             6       dimUniform             50000                  50000      segment  avgt    5   77.403 ± 57.426  ms/op
   ```
   
   after:
   ```
   Benchmark                                                 (lhsJoinColumn)  (projection)  (rhsJoinColumn)  (rowsPerSegment)  (rowsPerTableSegment)  (tableType)  Mode  Cnt    Score    Error  Units
   IndexedTableBenchmark.hashJoinCursorColumnValueSelectors          dimZipf             6       dimUniform             50000                  50000      segment  avgt    5  20.060 ± 1.627  ms/op
   ```
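   
   For readers unfamiliar with the nullable-default pattern used by the new `IndexedTable` method, a rough sketch of how a caller might pick between the two paths follows. All types here (`SketchSelectorFactory`, `SketchIndexedTable`, `ReaderBackedTable`, `SegmentBackedTable`, `SketchMatcher`) are hypothetical simplifications with the real parameters omitted, so this is not how `IndexedTableJoinMatcher` is actually written.
   
   ```java
   // Hypothetical, heavily simplified stand-in for a column selector factory.
   interface SketchSelectorFactory
   {
     String describe();
   }
   
   // Hypothetical table interface: the default returns null, meaning "no direct factory available".
   interface SketchIndexedTable
   {
     default SketchSelectorFactory makeColumnSelectorFactory()
     {
       return null;
     }
   }
   
   // A table that relies on the default and therefore gets the generic path.
   final class ReaderBackedTable implements SketchIndexedTable
   {
   }
   
   // A table that can hand out a factory straight from its underlying storage.
   final class SegmentBackedTable implements SketchIndexedTable
   {
     @Override
     public SketchSelectorFactory makeColumnSelectorFactory()
     {
       return () -> "segment-direct";
     }
   }
   
   final class SketchMatcher
   {
     // Prefer the table's own factory; fall back to the generic reader-backed one on null.
     static SketchSelectorFactory selectorFactoryFor(SketchIndexedTable table)
     {
       SketchSelectorFactory direct = table.makeColumnSelectorFactory();
       return direct != null ? direct : () -> "reader-backed";
     }
   }
   ```
   
   Because the method is a default on the interface, existing table implementations keep working unchanged, and only tables that can do better need to override it.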
   
   I'm sure there are a lot more improvements that could be made, but I'll save those for future work.

