Posted to dev@drill.apache.org by "Ben Becker (JIRA)" <ji...@apache.org> on 2013/08/16 05:55:52 UTC

[jira] [Created] (DRILL-173) Join operator should reuse ValueVectors when duplicate keys are present

Ben Becker created DRILL-173:
--------------------------------

             Summary: Join operator should reuse ValueVectors when duplicate keys are present
                 Key: DRILL-173
                 URL: https://issues.apache.org/jira/browse/DRILL-173
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: Alpha
            Reporter: Ben Becker


There are cases where joining two record batches can result in redundant work.  Consider a merge join performed on two tables (*t1* and *t2*) with duplicate keys on both sides:

h5. t1
|| key || value ||
| 2 | 'a' |
| 2 | 'b' |

h5. t2
|| key || value ||
| 2 | 'A' |
| 2 | 'B' |
| 2 | 'C' |

The result contains the cross product of all rows with key 2, i.e. 2 x 3 = 6 rows:

|| key || t1.value || t2.value ||
| 2 | 'a' | 'A' |
| 2 | 'a' | 'B' |
| 2 | 'a' | 'C' |
| 2 | 'b' | 'A' |
| 2 | 'b' | 'B' |
| 2 | 'b' | 'C' |

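The current implementation iteratively copies t2.value from the incoming vectors, one value at a time, on every pass over the duplicate keys.  Ideally, the t2.value vector would be constructed iteratively only on the first pass; on later passes over the same key, the already-built run of values can simply be copied.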
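
A minimal sketch of the idea in Java, for illustration only. Plain arrays stand in for ValueVectors, and the names DuplicateKeyJoinSketch, Joined and joinRun are hypothetical and are not part of Drill's API. It assumes the two runs of rows sharing a key have already been located by the merge join; the t2.value run is built value by value for the first t1 row and then bulk-copied for each remaining t1 row.

```java
import java.util.Arrays;

// Illustrative sketch: plain arrays stand in for Drill's ValueVectors.
// None of these class or method names exist in Drill.
final class DuplicateKeyJoinSketch {

    /** Output columns of the join: key, t1.value, t2.value. */
    static final class Joined {
        final int[] key;
        final String[] left;
        final String[] right;

        Joined(int[] key, String[] left, String[] right) {
            this.key = key;
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Joins one run of rows that share the same key: every value in
     * leftValues is paired with every value in rightValues.
     */
    static Joined joinRun(int key, String[] leftValues, String[] rightValues) {
        int n = leftValues.length;
        int m = rightValues.length;
        int[] outKey = new int[n * m];
        String[] outLeft = new String[n * m];
        String[] outRight = new String[n * m];
        if (n == 0 || m == 0) {
            return new Joined(outKey, outLeft, outRight);
        }

        // First pass: build the t2.value run value by value.
        for (int j = 0; j < m; j++) {
            outRight[j] = rightValues[j];
        }

        // Later passes: reuse the run built above with one bulk copy
        // per remaining t1 row instead of m single-value copies.
        for (int i = 1; i < n; i++) {
            System.arraycopy(outRight, 0, outRight, i * m, m);
        }

        // The key and t1.value are constant within each block of m rows.
        for (int i = 0; i < n; i++) {
            Arrays.fill(outKey, i * m, (i + 1) * m, key);
            Arrays.fill(outLeft, i * m, (i + 1) * m, leftValues[i]);
        }
        return new Joined(outKey, outLeft, outRight);
    }
}
```

Calling joinRun(2, new String[] {"a", "b"}, new String[] {"A", "B", "C"}) produces the six rows of the result table above. Whether a bulk copy of this kind is cheaper in practice depends on how the real vectors are laid out in memory, which this sketch does not model.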
