Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2018/11/03 02:00:30 UTC

[GitHub] Ben-Zvi opened a new pull request #1522: DRILL-6735: Implement Semi-Join for the Hash-Join operator

URL: https://github.com/apache/drill/pull/1522
 
 
   This work builds on "DRILL-6798: Planner changes to support semi-join" and makes the Hash-Join operator perform the semi-join internally, by not probing/joining against duplicate build-side keys.
   
   The work is broken into three commits. The **first commit**:
   Part of the change is to pass the `semiJoin` flag down from the planner (via `HashJoinPOP`, etc.).
   The main change finishes probing immediately after an outer row finds its first match, without checking for more matches - see `executeProbePhase()` (sketched below). The loop then continues with the next outer row.
   Another change is to not output the (key) columns from the inner side: they are skipped when building the output schema, and `numberOfBuildSideColumns` is set to zero so that copying of the outer columns into the outgoing container starts from its first column.
   Another improvement is that the Hash-Join "Helper" is no longer allocated or used.
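   A minimal, hypothetical sketch of the semi-join probe logic described above (simplified, made-up names such as `SemiJoinProbeSketch` and `probe`; the real logic lives in `executeProbePhase()`): for a semi-join, an outer row stops probing after its first build-side match instead of walking all duplicate matches.

   ```java
   import java.util.*;

   public class SemiJoinProbeSketch {

     // Build side represented as key -> list of matching build-side rows.
     static int probe(int[] outerKeys, Map<Integer, List<String>> buildSide, boolean semiJoin) {
       int outputRows = 0;
       for (int key : outerKeys) {
         List<String> matches = buildSide.get(key);
         if (matches == null) {
           continue;                    // no match: nothing is emitted for this outer row
         }
         if (semiJoin) {
           outputRows++;                // emit the outer row once (outer columns only)
           continue;                    // done with this outer row: duplicate matches are not walked
         }
         outputRows += matches.size();  // a regular inner join emits one row per duplicate match
       }
       return outputRows;
     }

     public static void main(String[] args) {
       Map<Integer, List<String>> build = new HashMap<>();
       build.put(7, Arrays.asList("a", "b", "c"));      // key 7 appears three times on the build side
       int[] outer = {7, 9};
       System.out.println(probe(outer, build, false));  // inner join: 3 output rows
       System.out.println(probe(outer, build, true));   // semi-join:  1 output row
     }
   }
   ```
   Since no chain of duplicate matches ever needs to be walked for a semi-join, the Hash-Join "Helper" is no longer needed, as noted above.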
   The **second commit**: 
     Addresses cases of many duplicate-key inner rows by using the hash table from the very beginning to detect duplicate incoming rows and simply skip them (i.e., not copy them into the partitions); a simplified illustration follows below. If spilling occurs, the hash table is reset and then reused; if there is no spill, the hash table is used as is (no need to build it again).
     A new option was added to control this feature. The feature adds some overhead (e.g., hash-table resizing), but can save a lot of storage space and the related overhead.
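   A hypothetical illustration of the "skip duplicates" idea (simplified; `buildPartition` and `keysSeen` are made-up names standing in for Drill's hash table and partition code): while the build side is read, a row whose key is already present in the hash table is never copied into the partition, so duplicate rows take no space and never spill.

   ```java
   import java.util.*;

   public class SkipDuplicatesSketch {

     static List<int[]> buildPartition(List<int[]> incomingBuildRows, boolean skipDuplicates) {
       Set<Integer> keysSeen = new HashSet<>();     // stands in for the real hash table
       List<int[]> partition = new ArrayList<>();
       for (int[] row : incomingBuildRows) {
         int key = row[0];                          // assume the key is the first column
         boolean newKey = keysSeen.add(key);        // hash-table insert; false if the key already exists
         if (skipDuplicates && !newKey) {
           continue;                                // duplicate key: do not copy the row into the partition
         }
         partition.add(row);                        // copy the row into the in-memory partition
       }
       // On a spill the real hash table is reset and rebuilt; with no spill it
       // is reused as is for the probe phase.
       return partition;
     }

     public static void main(String[] args) {
       List<int[]> build = Arrays.asList(new int[]{1, 10}, new int[]{1, 11}, new int[]{2, 20});
       System.out.println(buildPartition(build, true).size());   // 2 - the duplicate of key 1 was skipped
       System.out.println(buildPartition(build, false).size());  // 3 - all rows copied
     }
   }
   ```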
   The **third commit**: 
     Tries to address the overhead of the "skip duplicates" feature by collecting initial run-time statistics and then turning the feature off if there are not enough duplicates (less than 20%). The decision is made after reading about half of the minimum hash-table size (in rows) in each partition; a sketch of the decision follows below.
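   A hypothetical sketch of that run-time decision (the class/method names and the 0.20 constant are assumptions for illustration, not the actual option names): after sampling some rows in a partition, skipping is kept only if the observed fraction of duplicates reaches the threshold.

   ```java
   public class SkipDuplicatesDecision {

     static final double DUPLICATE_THRESHOLD = 0.20;   // assumed 20% cut-off from the description above

     /** Decide whether to keep skipping duplicates, given counts from the sampled rows. */
     static boolean keepSkipping(long rowsSampled, long duplicatesFound) {
       if (rowsSampled == 0) {
         return true;                                   // nothing sampled yet: leave the feature on
       }
       double duplicateRatio = (double) duplicatesFound / rowsSampled;
       return duplicateRatio >= DUPLICATE_THRESHOLD;
     }

     public static void main(String[] args) {
       System.out.println(keepSkipping(4000, 500));     // 12.5% duplicates -> false, stop skipping
       System.out.println(keepSkipping(4000, 1500));    // 37.5% duplicates -> true, keep skipping
     }
   }
   ```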
   
    **Comments:**
   
   1. The Memory-Calculator was not changed, but may need to be: the Hash-Join Helper is no longer used (less memory), but with "skip duplicates" the Hash-Table is allocated (and grows) early, like in the Hash Aggregate, so it needs to be accounted for.
   2. The SI-Intersect operation is somewhat similar (it also ignores inner duplicates), but this work did not merge the two (e.g., Intersect still builds the Helper, etc.).
   3. Another future possibility is making the Merge-Join support Semi Join.
   4. Also fixed a `JsonProperty` of `HashTableConfig`; see Vlad's comment in PR #1248 (DRILL-6027).
   5. Here are results from some initial performance testing (embedded mode, on a Mac):
       (All tests are simple self-join semi-joins, with no spilling.)
       For **4.8M** rows with *distinct keys*:
       - Old: 31 sec.
       - Semi: 16 sec.
       - Skip-duplicates: 20 sec.
       - With the third commit (stop skipping): 17 sec.
   
   For **2.8M** rows with only **18K** distinct keys (i.e., about 150 duplicates per key):
       - Old: 2.4 sec.
       - Semi: 3.1 sec.
       - Skip-duplicates: 2.2 sec.
       - With the third commit (stop skipping): 2.4 sec.
   
   
   
   
      
