You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by arina-ielchiieva <gi...@git.apache.org> on 2017/03/23 14:19:05 UTC

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

GitHub user arina-ielchiieva opened a pull request:

    https://github.com/apache/drill/pull/794

    DRILL-5375: Nested loop join: return correct result for left join

    With this fix nested loop join will correctly process INNER and LEFT joins with non-equality conditions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arina-ielchiieva/drill DRILL-5375

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/794.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #794
    
----
commit 71628e70a525d9bd27b4f5f56259dce84c75154d
Author: Arina Ielchiieva <ar...@gmail.com>
Date:   2017-03-22T15:07:23Z

    DRILL-5375: Nested loop join: return correct result for left join

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #794: DRILL-5375: Nested loop join: return correct result for le...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on the issue:

    https://github.com/apache/drill/pull/794
  
    
    Thanks for bringing up this point. I have done some investigation and found out that implicit casts for nested loop join are already included during materialization.
    Join condition is transformed into FunctionCall [1] which is later on materialized using ExpressionTreeMaterializer [2]. 
    `ExpressionTreeMaterializer.visitFunctionCall` includes section which implicit casts [3].
    Actually these casts are more enhanced that during hash and merge joins.
    For example, during hash and merge joins only casts between numeric types, date and timestamp, varchar and varbinary are supported, i.e.
    join by int and varchar columns won't be performed. The following error will be returned: `Join only supports implicit casts between 1. Numeric data
     2. Varchar, Varbinary data 3. Date, Timestamp data Left type: INT, Right type: VARCHAR. Add explicit casts to avoid this error`. 
    In our case nested loop join will be able to perform join by int and varchar columns without adding explicit casts.
    
    [1] https://github.com/arina-ielchiieva/drill/blob/71628e70a525d9bd27b4f5f56259dce84c75154d/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/NestedLoopJoinPrel.java#L95
    [2] https://github.com/arina-ielchiieva/drill/blob/71628e70a525d9bd27b4f5f56259dce84c75154d/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinBatch.java#L265
    [3] https://github.com/apache/drill/blob/9411b26ece34ed8b2f498deea5e41f1901eb1013/exec/java-exec/src/main/java/org/apache/drill/exec/expr/ExpressionTreeMaterializer.java#L362


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r108035893
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillOptiq.java ---
    @@ -70,27 +70,65 @@
       private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(DrillOptiq.class);
     
       /**
    -   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax.
    +   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax using one input.
    +   *
    +   * @param context parse context which contains planner settings
    +   * @param input data input
    +   * @param expr expression to be converted
    +   * @return converted expression
        */
       public static LogicalExpression toDrill(DrillParseContext context, RelNode input, RexNode expr) {
    -    final RexToDrill visitor = new RexToDrill(context, input);
    +    return toDrill(context, Lists.newArrayList(input), expr);
    +  }
    +
    +  /**
    +   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax using multiple inputs.
    +   *
    +   * @param context parse context which contains planner settings
    +   * @param inputs multiple data inputs
    +   * @param expr expression to be converted
    +   * @return converted expression
    +   */
    +  public static LogicalExpression toDrill(DrillParseContext context, List<RelNode> inputs, RexNode expr) {
    +    final RexToDrill visitor = new RexToDrill(context, inputs);
         return expr.accept(visitor);
       }
     
       private static class RexToDrill extends RexVisitorImpl<LogicalExpression> {
    -    private final RelNode input;
    +    private final List<RelNode> inputs;
         private final DrillParseContext context;
    +    private final List<RelDataTypeField> fieldList;
     
    -    RexToDrill(DrillParseContext context, RelNode input) {
    +    RexToDrill(DrillParseContext context, List<RelNode> inputs) {
           super(true);
           this.context = context;
    -      this.input = input;
    +      this.inputs = inputs;
    +      this.fieldList = Lists.newArrayList();
    +      /*
    +         Fields are enumerated by their presence order in input. Details {@link org.apache.calcite.rex.RexInputRef}.
    +         Thus we can merge field list from several inputs by adding them into the list in order of appearance.
    +         Each field index in the list will match field index in the RexInputRef instance which will allow us
    +         to retrieve field from filed list by index in {@link #visitInputRef(RexInputRef)} method. Example:
    +
    +         Query: select t1.c1, t2.c1. t2.c2 from t1 inner join t2 on t1.c1 between t2.c1 and t2.c2
    +
    +         Input 1: $0
    +         Input 2: $1, $2
    +
    +         Result: $0, $1, $2
    +       */
    +      for (RelNode input : inputs) {
    --- End diff --
    
    I have only allowed to pass multiple inputs instead of one as it needed to non-equi joins plus I have merged all fields into one list from all inputs for performance improvement. You are correct it's Calcite that ensures fields numeration. I have just added comment explaining this behavior to ease code review and as some help for future developers.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #794: DRILL-5375: Nested loop join: return correct result for le...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on the issue:

    https://github.com/apache/drill/pull/794
  
    A general comment about filter evaluation: In HashJoin and MergeJoin we do implicit casting of one of the join columns to process the join condition (see invocations of JoinUtils.addLeastRestrictiveCasts()).  Since you are adding filter evaluation to NestedLoopJoin, I would think we need this there as well.  However, I am ok if you want to create a separate JIRA for doing that.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r108035986
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PlannerSettings.java ---
    @@ -105,6 +103,29 @@
       public static final PositiveLongValidator PARQUET_ROWGROUP_FILTER_PUSHDOWN_PLANNING_THRESHOLD = new PositiveLongValidator(PARQUET_ROWGROUP_FILTER_PUSHDOWN_PLANNING_THRESHOLD_KEY,
           Long.MAX_VALUE, 10000);
     
    +  /*
    +     Enables rules that re-write query joins in the most optimal way.
    +     Though its turned on be default and its value in query optimization is undeniable, user may want turn off such
    +     optimization to leave join order indicated in sql query unchanged.
    +
    +      For example:
    +      Currently only nested loop join allows non-equi join conditions usage.
    +      During planning stage nested loop join will be chosen when non-equi join is detected
    +      and {@link #NLJOIN_FOR_SCALAR} set to false. Though query performance may not be the most optimal in such case,
    +      user may use such workaround to execute queries with non-equi joins.
    +
    +      Nested loop join allows only INNER and LEFT join usage and implies that right input is smaller that left input.
    +      During LEFT join when join optimization is enabled and detected that right input is larger that left,
    +      join will be optimized: left and right inputs will be flipped and LEFT join type will be changed to RIGHT one.
    +      If query contains non-equi joins, after such optimization it will fail, since nested loop does not allow
    +      RIGHT join. In this case if user accepts probability of non optimal performance, he may turn off join optimization.
    +      Turning off join optimization, makes sense only if user are not sure that right output is less or equal to left,
    +      otherwise join optimization can be left turned on.
    +
    +      Note: once hash and merge joins will allow non-equi join conditions,
    +      the need to turn off join optimization may go away.
    +   */
    +  public static final BooleanValidator JOIN_OPTIMIZATION = new BooleanValidator("planner.enable_join_optimization", true);
    --- End diff --
    
    JOIN_OPTIMIZATION enables two rules `DRILL_JOIN_TO_MULTIJOIN_RULE` and `DRILL_LOPT_OPTIMIZE_JOIN_RULE` which are applicable for any types of joins. That's why naming is quite broad, I believe these two rules are not only responsible for join swap but for all other join optimization techniques. In our use case, user may want to disable them when he doesn't won't join swap to be performed but there may other reasons. Though as I have noted, when we implement non-equality joins for hash and merge joins, we may remove this configuration parameter.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r109205170
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PlannerSettings.java ---
    @@ -105,6 +103,29 @@
       public static final PositiveLongValidator PARQUET_ROWGROUP_FILTER_PUSHDOWN_PLANNING_THRESHOLD = new PositiveLongValidator(PARQUET_ROWGROUP_FILTER_PUSHDOWN_PLANNING_THRESHOLD_KEY,
           Long.MAX_VALUE, 10000);
     
    +  /*
    +     Enables rules that re-write query joins in the most optimal way.
    +     Though its turned on be default and its value in query optimization is undeniable, user may want turn off such
    +     optimization to leave join order indicated in sql query unchanged.
    +
    +      For example:
    +      Currently only nested loop join allows non-equi join conditions usage.
    +      During planning stage nested loop join will be chosen when non-equi join is detected
    +      and {@link #NLJOIN_FOR_SCALAR} set to false. Though query performance may not be the most optimal in such case,
    +      user may use such workaround to execute queries with non-equi joins.
    +
    +      Nested loop join allows only INNER and LEFT join usage and implies that right input is smaller that left input.
    +      During LEFT join when join optimization is enabled and detected that right input is larger that left,
    +      join will be optimized: left and right inputs will be flipped and LEFT join type will be changed to RIGHT one.
    +      If query contains non-equi joins, after such optimization it will fail, since nested loop does not allow
    +      RIGHT join. In this case if user accepts probability of non optimal performance, he may turn off join optimization.
    +      Turning off join optimization, makes sense only if user are not sure that right output is less or equal to left,
    +      otherwise join optimization can be left turned on.
    +
    +      Note: once hash and merge joins will allow non-equi join conditions,
    +      the need to turn off join optimization may go away.
    +   */
    +  public static final BooleanValidator JOIN_OPTIMIZATION = new BooleanValidator("planner.enable_join_optimization", true);
    --- End diff --
    
    Makes sense, I have opened the Jira https://issues.apache.org/jira/browse/DRILL-5408.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r108382080
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java ---
    @@ -40,132 +41,133 @@
       // Record count of the left batch currently being processed
       private int leftRecordCount = 0;
     
    -  // List of record counts  per batch in the hyper container
    +  // List of record counts per batch in the hyper container
       private List<Integer> rightCounts = null;
     
       // Output batch
       private NestedLoopJoinBatch outgoing = null;
     
    -  // Next right batch to process
    -  private int nextRightBatchToProcess = 0;
    -
    -  // Next record in the current right batch to process
    -  private int nextRightRecordToProcess = 0;
    -
    -  // Next record in the left batch to process
    -  private int nextLeftRecordToProcess = 0;
    +  // Iteration status tracker
    +  private IterationStatusTracker tracker = new IterationStatusTracker();
     
       /**
        * Method initializes necessary state and invokes the doSetup() to set the
    -   * input and output value vector references
    +   * input and output value vector references.
    +   *
        * @param context Fragment context
        * @param left Current left input batch being processed
        * @param rightContainer Hyper container
    +   * @param rightCounts Counts for each right container
        * @param outgoing Output batch
        */
    -  public void setupNestedLoopJoin(FragmentContext context, RecordBatch left,
    +  public void setupNestedLoopJoin(FragmentContext context,
    +                                  RecordBatch left,
                                       ExpandableHyperContainer rightContainer,
                                       LinkedList<Integer> rightCounts,
                                       NestedLoopJoinBatch outgoing) {
         this.left = left;
    -    leftRecordCount = left.getRecordCount();
    +    this.leftRecordCount = left.getRecordCount();
         this.rightCounts = rightCounts;
         this.outgoing = outgoing;
     
         doSetup(context, rightContainer, left, outgoing);
       }
     
       /**
    -   * This method is the core of the nested loop join. For every record on the right we go over
    -   * the left batch and produce the cross product output
    +   * Main entry point for producing the output records. Thin wrapper around populateOutgoingBatch(), this method
    +   * controls which left batch we are processing and fetches the next left input batch once we exhaust the current one.
    +   *
    +   * @param joinType join type (INNER ot LEFT)
    +   * @return the number of records produced in the output batch
    +   */
    +  public int outputRecords(JoinRelType joinType) {
    +    int outputIndex = 0;
    +    while (leftRecordCount != 0) {
    +      outputIndex = populateOutgoingBatch(joinType, outputIndex);
    +      if (outputIndex >= NestedLoopJoinBatch.MAX_BATCH_SIZE) {
    +        break;
    +      }
    +      // reset state and get next left batch
    +      resetAndGetNextLeft();
    +    }
    +    return outputIndex;
    +  }
    +
    +  /**
    +   * This method is the core of the nested loop join.For each left batch record looks for matching record
    +   * from the list of right batches. Match is checked by calling {@link #doEval(int, int, int)} method.
    +   * If matching record is found both left and right records are written into output batch,
    +   * otherwise if join type is LEFT, than only left record is written, right batch record values will be null.
    +   *
    +   * @param joinType join type (INNER or LEFT)
        * @param outputIndex index to start emitting records at
        * @return final outputIndex after producing records in the output batch
        */
    -  private int populateOutgoingBatch(int outputIndex) {
    -
    -    // Total number of batches on the right side
    -    int totalRightBatches = rightCounts.size();
    -
    -    // Total number of records on the left
    -    int localLeftRecordCount = leftRecordCount;
    -
    -    /*
    -     * The below logic is the core of the NLJ. To have better performance we copy the instance members into local
    -     * method variables, once we are done with the loop we need to update the instance variables to reflect the new
    -     * state. To avoid code duplication of resetting the instance members at every exit point in the loop we are using
    -     * 'goto'
    -     */
    -    int localNextRightBatchToProcess = nextRightBatchToProcess;
    -    int localNextRightRecordToProcess = nextRightRecordToProcess;
    -    int localNextLeftRecordToProcess = nextLeftRecordToProcess;
    -
    -    outer: {
    -
    -      for (; localNextRightBatchToProcess< totalRightBatches; localNextRightBatchToProcess++) { // for every batch on the right
    -        int compositeIndexPart = localNextRightBatchToProcess << 16;
    -        int rightRecordCount = rightCounts.get(localNextRightBatchToProcess);
    -
    -        for (; localNextRightRecordToProcess < rightRecordCount; localNextRightRecordToProcess++) { // for every record in this right batch
    -          for (; localNextLeftRecordToProcess < localLeftRecordCount; localNextLeftRecordToProcess++) { // for every record in the left batch
    -
    +  private int populateOutgoingBatch(JoinRelType joinType, int outputIndex) {
    +    // copy index and match counters as local variables to speed up processing
    +    int nextRightBatchToProcess = tracker.getNextRightBatchToProcess();
    +    int nextRightRecordToProcess = tracker.getNextRightRecordToProcess();
    +    int nextLeftRecordToProcess = tracker.getNextLeftRecordToProcess();
    +    boolean rightRecordMatched = tracker.isRightRecordMatched();
    +
    +    outer:
    +    // for every record in the left batch
    +    for (; nextLeftRecordToProcess < leftRecordCount; nextLeftRecordToProcess++) {
    +      // for every batch on the right
    +      for (; nextRightBatchToProcess < rightCounts.size(); nextRightBatchToProcess++) {
    +        int rightRecordCount = rightCounts.get(nextRightBatchToProcess);
    --- End diff --
    
    Here we keep track of right records within one right record batch. We have list of right records counts `List<Integer> rightCounts`. This list is populated with record counts from each right input record batch [1], `getRecordCount()` returns int [2]. 
    
    [1] https://github.com/apache/drill/blob/2af709f43d01f341b2a52c6473ea49d6761fdc61/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinBatch.java#L349
    
    [2] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L229


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r107982807
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinBatch.java ---
    @@ -214,26 +226,62 @@ private boolean hasMore(IterOutcome outcome) {
     
       /**
        * Method generates the runtime code needed for NLJ. Other than the setup method to set the input and output value
    -   * vector references we implement two more methods
    -   * 1. emitLeft()  -> Project record from the left side
    -   * 2. emitRight() -> Project record from the right side (which is a hyper container)
    +   * vector references we implement three more methods
    +   * 1. doEval() -> Evaluates if record from left side matches record from the right side
    +   * 2. emitLeft() -> Project record from the left side
    +   * 3. emitRight() -> Project record from the right side (which is a hyper container)
        * @return the runtime generated class that implements the NestedLoopJoin interface
    -   * @throws IOException
    -   * @throws ClassTransformationException
        */
    -  private NestedLoopJoin setupWorker() throws IOException, ClassTransformationException {
    -    final CodeGenerator<NestedLoopJoin> nLJCodeGenerator = CodeGenerator.get(NestedLoopJoin.TEMPLATE_DEFINITION, context.getFunctionRegistry(), context.getOptions());
    +  private NestedLoopJoin setupWorker() throws IOException, ClassTransformationException, SchemaChangeException {
    +    final CodeGenerator<NestedLoopJoin> nLJCodeGenerator = CodeGenerator.get(
    +        NestedLoopJoin.TEMPLATE_DEFINITION, context.getFunctionRegistry(), context.getOptions());
         nLJCodeGenerator.plainJavaCapable(true);
         // Uncomment out this line to debug the generated code.
     //    nLJCodeGenerator.saveCodeForDebugging(true);
         final ClassGenerator<NestedLoopJoin> nLJClassGenerator = nLJCodeGenerator.getRoot();
     
    +    // generate doEval
    +    final ErrorCollector collector = new ErrorCollectorImpl();
    +
    +
    +    /*
    +        Logical expression may contain fields from left and right batches. During code generation (materialization)
    +        we need to indicate from which input field should be taken. Mapping sets can work with only one input at a time.
    +        But non-equality expressions can be complex:
    +          select t1.c1, t2.c1, t2.c2 from t1 inner join t2 on t1.c1 between t2.c1 and t2.c2
    +        or even contain self join which can not be transformed into filter since OR clause is present
    +          select *from t1 inner join t2 on t1.c1 >= t2.c1 or t1.c3 <> t1.c4
    +
    +        In this case logical expression can not be split according to input presence (like during equality joins
    --- End diff --
    
    I am not sure the distinction is accurate.  Even for equality condition, you could have in your example above the columns coming from the same table.  i.e instead of  ...OR t1.c3 <> t1.c4  one could have OR t1.c3 = t1.c4 which is a valid condition.   


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r107957723
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillOptiq.java ---
    @@ -70,27 +70,65 @@
       private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(DrillOptiq.class);
     
       /**
    -   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax.
    +   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax using one input.
    +   *
    +   * @param context parse context which contains planner settings
    +   * @param input data input
    +   * @param expr expression to be converted
    +   * @return converted expression
        */
       public static LogicalExpression toDrill(DrillParseContext context, RelNode input, RexNode expr) {
    -    final RexToDrill visitor = new RexToDrill(context, input);
    +    return toDrill(context, Lists.newArrayList(input), expr);
    +  }
    +
    +  /**
    +   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax using multiple inputs.
    +   *
    +   * @param context parse context which contains planner settings
    +   * @param inputs multiple data inputs
    +   * @param expr expression to be converted
    +   * @return converted expression
    +   */
    +  public static LogicalExpression toDrill(DrillParseContext context, List<RelNode> inputs, RexNode expr) {
    +    final RexToDrill visitor = new RexToDrill(context, inputs);
         return expr.accept(visitor);
       }
     
       private static class RexToDrill extends RexVisitorImpl<LogicalExpression> {
    -    private final RelNode input;
    +    private final List<RelNode> inputs;
         private final DrillParseContext context;
    +    private final List<RelDataTypeField> fieldList;
     
    -    RexToDrill(DrillParseContext context, RelNode input) {
    +    RexToDrill(DrillParseContext context, List<RelNode> inputs) {
           super(true);
           this.context = context;
    -      this.input = input;
    +      this.inputs = inputs;
    +      this.fieldList = Lists.newArrayList();
    +      /*
    +         Fields are enumerated by their presence order in input. Details {@link org.apache.calcite.rex.RexInputRef}.
    +         Thus we can merge field list from several inputs by adding them into the list in order of appearance.
    +         Each field index in the list will match field index in the RexInputRef instance which will allow us
    +         to retrieve field from filed list by index in {@link #visitInputRef(RexInputRef)} method. Example:
    +
    +         Query: select t1.c1, t2.c1. t2.c2 from t1 inner join t2 on t1.c1 between t2.c1 and t2.c2
    +
    +         Input 1: $0
    +         Input 2: $1, $2
    +
    +         Result: $0, $1, $2
    +       */
    +      for (RelNode input : inputs) {
    --- End diff --
    
    Calcite already ensures that the field ordinal numbers from the left and right inputs of a join are in ascending order.  In your example  t1 inner join t2 on t1.c1 between t2.c1 and t2.c2  will first be converted to a canonical form using range predicates:  t1.c1 >= t2.c1 AND t1.c1 <= t2.c2 .  If t1 is the left input of join then the field index of t1.c1 = $0.  On the right input suppose the row type is [t2.c1, t2.c2] then field index will be $1, $2.   Were you seeing a different behavior that motivated this change ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r108213635
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillOptiq.java ---
    @@ -70,27 +70,65 @@
       private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(DrillOptiq.class);
     
       /**
    -   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax.
    +   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax using one input.
    +   *
    +   * @param context parse context which contains planner settings
    +   * @param input data input
    +   * @param expr expression to be converted
    +   * @return converted expression
        */
       public static LogicalExpression toDrill(DrillParseContext context, RelNode input, RexNode expr) {
    -    final RexToDrill visitor = new RexToDrill(context, input);
    +    return toDrill(context, Lists.newArrayList(input), expr);
    +  }
    +
    +  /**
    +   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax using multiple inputs.
    +   *
    +   * @param context parse context which contains planner settings
    +   * @param inputs multiple data inputs
    +   * @param expr expression to be converted
    +   * @return converted expression
    +   */
    +  public static LogicalExpression toDrill(DrillParseContext context, List<RelNode> inputs, RexNode expr) {
    +    final RexToDrill visitor = new RexToDrill(context, inputs);
         return expr.accept(visitor);
       }
     
       private static class RexToDrill extends RexVisitorImpl<LogicalExpression> {
    -    private final RelNode input;
    +    private final List<RelNode> inputs;
         private final DrillParseContext context;
    +    private final List<RelDataTypeField> fieldList;
     
    -    RexToDrill(DrillParseContext context, RelNode input) {
    +    RexToDrill(DrillParseContext context, List<RelNode> inputs) {
           super(true);
           this.context = context;
    -      this.input = input;
    +      this.inputs = inputs;
    +      this.fieldList = Lists.newArrayList();
    +      /*
    +         Fields are enumerated by their presence order in input. Details {@link org.apache.calcite.rex.RexInputRef}.
    +         Thus we can merge field list from several inputs by adding them into the list in order of appearance.
    +         Each field index in the list will match field index in the RexInputRef instance which will allow us
    +         to retrieve field from filed list by index in {@link #visitInputRef(RexInputRef)} method. Example:
    +
    +         Query: select t1.c1, t2.c1. t2.c2 from t1 inner join t2 on t1.c1 between t2.c1 and t2.c2
    +
    +         Input 1: $0
    +         Input 2: $1, $2
    +
    +         Result: $0, $1, $2
    +       */
    +      for (RelNode input : inputs) {
    --- End diff --
    
    I guess I am still unclear about the reason for merging the multiple lists into one.  Later, you will have to determine which input each $n reference belonged to, isn't it ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #794: DRILL-5375: Nested loop join: return correct result for le...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on the issue:

    https://github.com/apache/drill/pull/794
  
    A few follow-up comments which are non-blockers.  Overall LGTM.  +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r108210211
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/BatchReference.java ---
    @@ -0,0 +1,73 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to you under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.expr;
    +
    +import com.google.common.base.Preconditions;
    +
    +/**
    + * Holder class that contains batch naming, batch  and record index. Batch index is used when batch is hyper container.
    + * Used to distinguish batches in non-equi conditions during expression materialization.
    + * Mostly used for nested loop join which allows non equi-join.
    + *
    + * Example:
    + * BatchReference{batchName='leftBatch', batchIndex='leftIndex', recordIndex='leftIndex'}
    + * BatchReference{batchName='rightContainer', batchIndex='rightBatchIndex', recordIndex='rightRecordIndexWithinBatch'}
    + *
    + */
    +public final class BatchReference {
    --- End diff --
    
    Can you add some comments about when the batch reference should be set ? for every batch or once per OK_NEW_SCHEMA status ? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r109205650
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillOptiq.java ---
    @@ -70,27 +70,65 @@
       private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(DrillOptiq.class);
     
       /**
    -   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax.
    +   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax using one input.
    +   *
    +   * @param context parse context which contains planner settings
    +   * @param input data input
    +   * @param expr expression to be converted
    +   * @return converted expression
        */
       public static LogicalExpression toDrill(DrillParseContext context, RelNode input, RexNode expr) {
    -    final RexToDrill visitor = new RexToDrill(context, input);
    +    return toDrill(context, Lists.newArrayList(input), expr);
    +  }
    +
    +  /**
    +   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax using multiple inputs.
    +   *
    +   * @param context parse context which contains planner settings
    +   * @param inputs multiple data inputs
    +   * @param expr expression to be converted
    +   * @return converted expression
    +   */
    +  public static LogicalExpression toDrill(DrillParseContext context, List<RelNode> inputs, RexNode expr) {
    +    final RexToDrill visitor = new RexToDrill(context, inputs);
         return expr.accept(visitor);
       }
     
       private static class RexToDrill extends RexVisitorImpl<LogicalExpression> {
    -    private final RelNode input;
    +    private final List<RelNode> inputs;
         private final DrillParseContext context;
    +    private final List<RelDataTypeField> fieldList;
     
    -    RexToDrill(DrillParseContext context, RelNode input) {
    +    RexToDrill(DrillParseContext context, List<RelNode> inputs) {
           super(true);
           this.context = context;
    -      this.input = input;
    +      this.inputs = inputs;
    +      this.fieldList = Lists.newArrayList();
    +      /*
    +         Fields are enumerated by their presence order in input. Details {@link org.apache.calcite.rex.RexInputRef}.
    +         Thus we can merge field list from several inputs by adding them into the list in order of appearance.
    +         Each field index in the list will match field index in the RexInputRef instance which will allow us
    +         to retrieve field from filed list by index in {@link #visitInputRef(RexInputRef)} method. Example:
    +
    +         Query: select t1.c1, t2.c1. t2.c2 from t1 inner join t2 on t1.c1 between t2.c1 and t2.c2
    +
    +         Input 1: $0
    +         Input 2: $1, $2
    +
    +         Result: $0, $1, $2
    +       */
    +      for (RelNode input : inputs) {
    --- End diff --
    
    The core change was to allow more than one input since we are visiting join condition which can have left and right inputs fields combined.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/794


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r108421451
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/BatchReference.java ---
    @@ -0,0 +1,73 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to you under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.expr;
    +
    +import com.google.common.base.Preconditions;
    +
    +/**
    + * Holder class that contains batch naming, batch  and record index. Batch index is used when batch is hyper container.
    + * Used to distinguish batches in non-equi conditions during expression materialization.
    + * Mostly used for nested loop join which allows non equi-join.
    + *
    + * Example:
    + * BatchReference{batchName='leftBatch', batchIndex='leftIndex', recordIndex='leftIndex'}
    + * BatchReference{batchName='rightContainer', batchIndex='rightBatchIndex', recordIndex='rightRecordIndexWithinBatch'}
    + *
    + */
    +public final class BatchReference {
    --- End diff --
    
    `BatchReference` instance can be created during batch initialization (ex: instance of `AbstractRecordBatch`) since naming of batches used won't change during data processing. Though info from batch references will be used during schema build (i.e. once per OK_NEW_SCHEMA).
    Added this info into `BatchReference` java doc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r107960020
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PlannerSettings.java ---
    @@ -105,6 +103,29 @@
       public static final PositiveLongValidator PARQUET_ROWGROUP_FILTER_PUSHDOWN_PLANNING_THRESHOLD = new PositiveLongValidator(PARQUET_ROWGROUP_FILTER_PUSHDOWN_PLANNING_THRESHOLD_KEY,
           Long.MAX_VALUE, 10000);
     
    +  /*
    +     Enables rules that re-write query joins in the most optimal way.
    +     Though its turned on be default and its value in query optimization is undeniable, user may want turn off such
    +     optimization to leave join order indicated in sql query unchanged.
    +
    +      For example:
    +      Currently only nested loop join allows non-equi join conditions usage.
    +      During planning stage nested loop join will be chosen when non-equi join is detected
    +      and {@link #NLJOIN_FOR_SCALAR} set to false. Though query performance may not be the most optimal in such case,
    +      user may use such workaround to execute queries with non-equi joins.
    +
    +      Nested loop join allows only INNER and LEFT join usage and implies that right input is smaller that left input.
    +      During LEFT join when join optimization is enabled and detected that right input is larger that left,
    +      join will be optimized: left and right inputs will be flipped and LEFT join type will be changed to RIGHT one.
    +      If query contains non-equi joins, after such optimization it will fail, since nested loop does not allow
    +      RIGHT join. In this case if user accepts probability of non optimal performance, he may turn off join optimization.
    +      Turning off join optimization, makes sense only if user are not sure that right output is less or equal to left,
    +      otherwise join optimization can be left turned on.
    +
    +      Note: once hash and merge joins will allow non-equi join conditions,
    +      the need to turn off join optimization may go away.
    +   */
    +  public static final BooleanValidator JOIN_OPTIMIZATION = new BooleanValidator("planner.enable_join_optimization", true);
    --- End diff --
    
    The name 'join_optimization' is misleading since that term is quite broad and includes both ordering of multiple joins and the left vs right inputs of a join.  Here we are talking about the latter only and for a specific join type.  For HashJoins, we have a planner option 'enable_hashjoin_swap'.  It sounds to me that your new option is targeted for the same thing for nested loop join.  Would be better to call it 'enable_nljoin_swap'.   Does that convey the intent you want or is there something more ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r109197541
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/PlannerSettings.java ---
    @@ -105,6 +103,29 @@
       public static final PositiveLongValidator PARQUET_ROWGROUP_FILTER_PUSHDOWN_PLANNING_THRESHOLD = new PositiveLongValidator(PARQUET_ROWGROUP_FILTER_PUSHDOWN_PLANNING_THRESHOLD_KEY,
           Long.MAX_VALUE, 10000);
     
    +  /*
    +     Enables rules that re-write query joins in the most optimal way.
    +     Though its turned on be default and its value in query optimization is undeniable, user may want turn off such
    +     optimization to leave join order indicated in sql query unchanged.
    +
    +      For example:
    +      Currently only nested loop join allows non-equi join conditions usage.
    +      During planning stage nested loop join will be chosen when non-equi join is detected
    +      and {@link #NLJOIN_FOR_SCALAR} set to false. Though query performance may not be the most optimal in such case,
    +      user may use such workaround to execute queries with non-equi joins.
    +
    +      Nested loop join allows only INNER and LEFT join usage and implies that right input is smaller that left input.
    +      During LEFT join when join optimization is enabled and detected that right input is larger that left,
    +      join will be optimized: left and right inputs will be flipped and LEFT join type will be changed to RIGHT one.
    +      If query contains non-equi joins, after such optimization it will fail, since nested loop does not allow
    +      RIGHT join. In this case if user accepts probability of non optimal performance, he may turn off join optimization.
    +      Turning off join optimization, makes sense only if user are not sure that right output is less or equal to left,
    +      otherwise join optimization can be left turned on.
    +
    +      Note: once hash and merge joins will allow non-equi join conditions,
    +      the need to turn off join optimization may go away.
    +   */
    +  public static final BooleanValidator JOIN_OPTIMIZATION = new BooleanValidator("planner.enable_join_optimization", true);
    --- End diff --
    
    Ah, you added this option to enable/disable the *logical* join rules.  Since NestedLoopJoin is a physical join implementation, from the comments I interpreted that this was intended for the swapping of left and right inputs of the (physical) NL join, which is why I mentioned about hashjoin_swap option.   It seems to me that if there is an LEFT OUTER JOIN and condition is non-equality, then we should not allow changing to a Right Outer Join by flipping the left and right sides, since that would make the query fail.   What do you think ?
    I suppose we could keep your boolean option for this PR and address the left outer join issue separately.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r109204353
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinBatch.java ---
    @@ -214,26 +226,62 @@ private boolean hasMore(IterOutcome outcome) {
     
       /**
        * Method generates the runtime code needed for NLJ. Other than the setup method to set the input and output value
    -   * vector references we implement two more methods
    -   * 1. emitLeft()  -> Project record from the left side
    -   * 2. emitRight() -> Project record from the right side (which is a hyper container)
    +   * vector references we implement three more methods
    +   * 1. doEval() -> Evaluates if record from left side matches record from the right side
    +   * 2. emitLeft() -> Project record from the left side
    +   * 3. emitRight() -> Project record from the right side (which is a hyper container)
        * @return the runtime generated class that implements the NestedLoopJoin interface
    -   * @throws IOException
    -   * @throws ClassTransformationException
        */
    -  private NestedLoopJoin setupWorker() throws IOException, ClassTransformationException {
    -    final CodeGenerator<NestedLoopJoin> nLJCodeGenerator = CodeGenerator.get(NestedLoopJoin.TEMPLATE_DEFINITION, context.getFunctionRegistry(), context.getOptions());
    +  private NestedLoopJoin setupWorker() throws IOException, ClassTransformationException, SchemaChangeException {
    +    final CodeGenerator<NestedLoopJoin> nLJCodeGenerator = CodeGenerator.get(
    +        NestedLoopJoin.TEMPLATE_DEFINITION, context.getFunctionRegistry(), context.getOptions());
         nLJCodeGenerator.plainJavaCapable(true);
         // Uncomment out this line to debug the generated code.
     //    nLJCodeGenerator.saveCodeForDebugging(true);
         final ClassGenerator<NestedLoopJoin> nLJClassGenerator = nLJCodeGenerator.getRoot();
     
    +    // generate doEval
    +    final ErrorCollector collector = new ErrorCollectorImpl();
    +
    +
    +    /*
    +        Logical expression may contain fields from left and right batches. During code generation (materialization)
    +        we need to indicate from which input field should be taken. Mapping sets can work with only one input at a time.
    +        But non-equality expressions can be complex:
    +          select t1.c1, t2.c1, t2.c2 from t1 inner join t2 on t1.c1 between t2.c1 and t2.c2
    +        or even contain self join which can not be transformed into filter since OR clause is present
    +          select *from t1 inner join t2 on t1.c1 >= t2.c1 or t1.c3 <> t1.c4
    +
    +        In this case logical expression can not be split according to input presence (like during equality joins
    --- End diff --
    
    Agree. I have updated the comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r108640631
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillOptiq.java ---
    @@ -70,27 +70,65 @@
       private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(DrillOptiq.class);
     
       /**
    -   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax.
    +   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax using one input.
    +   *
    +   * @param context parse context which contains planner settings
    +   * @param input data input
    +   * @param expr expression to be converted
    +   * @return converted expression
        */
       public static LogicalExpression toDrill(DrillParseContext context, RelNode input, RexNode expr) {
    -    final RexToDrill visitor = new RexToDrill(context, input);
    +    return toDrill(context, Lists.newArrayList(input), expr);
    +  }
    +
    +  /**
    +   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax using multiple inputs.
    +   *
    +   * @param context parse context which contains planner settings
    +   * @param inputs multiple data inputs
    +   * @param expr expression to be converted
    +   * @return converted expression
    +   */
    +  public static LogicalExpression toDrill(DrillParseContext context, List<RelNode> inputs, RexNode expr) {
    +    final RexToDrill visitor = new RexToDrill(context, inputs);
         return expr.accept(visitor);
       }
     
       private static class RexToDrill extends RexVisitorImpl<LogicalExpression> {
    -    private final RelNode input;
    +    private final List<RelNode> inputs;
         private final DrillParseContext context;
    +    private final List<RelDataTypeField> fieldList;
     
    -    RexToDrill(DrillParseContext context, RelNode input) {
    +    RexToDrill(DrillParseContext context, List<RelNode> inputs) {
           super(true);
           this.context = context;
    -      this.input = input;
    +      this.inputs = inputs;
    +      this.fieldList = Lists.newArrayList();
    +      /*
    +         Fields are enumerated by their presence order in input. Details {@link org.apache.calcite.rex.RexInputRef}.
    +         Thus we can merge field list from several inputs by adding them into the list in order of appearance.
    +         Each field index in the list will match field index in the RexInputRef instance which will allow us
    +         to retrieve field from filed list by index in {@link #visitInputRef(RexInputRef)} method. Example:
    +
    +         Query: select t1.c1, t2.c1. t2.c2 from t1 inner join t2 on t1.c1 between t2.c1 and t2.c2
    +
    +         Input 1: $0
    +         Input 2: $1, $2
    +
    +         Result: $0, $1, $2
    +       */
    +      for (RelNode input : inputs) {
    --- End diff --
    
    Yes, in `public LogicalExpression visitInputRef(RexInputRef inputRef)` we determine to which input field belongs to. Before that we had only one input thus we did simple get operation `input.getRowType().getFieldList().get(index)` but now we have two inputs so we have to get operation on one input and if field in not found try in the second.  I could iterate over two inputs and do get operation and once filed is found break the loop OR I could merge filed list in one and do simple get operation `fieldList.get(index)`. For performance reasons, I decided to merge filed lists in constructor and use them in `public LogicalExpression visitInputRef(RexInputRef inputRef)` rather than iterating over them for each field.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r109371613
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java ---
    @@ -40,132 +41,133 @@
       // Record count of the left batch currently being processed
       private int leftRecordCount = 0;
     
    -  // List of record counts  per batch in the hyper container
    +  // List of record counts per batch in the hyper container
       private List<Integer> rightCounts = null;
     
       // Output batch
       private NestedLoopJoinBatch outgoing = null;
     
    -  // Next right batch to process
    -  private int nextRightBatchToProcess = 0;
    -
    -  // Next record in the current right batch to process
    -  private int nextRightRecordToProcess = 0;
    -
    -  // Next record in the left batch to process
    -  private int nextLeftRecordToProcess = 0;
    +  // Iteration status tracker
    +  private IterationStatusTracker tracker = new IterationStatusTracker();
     
       /**
        * Method initializes necessary state and invokes the doSetup() to set the
    -   * input and output value vector references
    +   * input and output value vector references.
    +   *
        * @param context Fragment context
        * @param left Current left input batch being processed
        * @param rightContainer Hyper container
    +   * @param rightCounts Counts for each right container
        * @param outgoing Output batch
        */
    -  public void setupNestedLoopJoin(FragmentContext context, RecordBatch left,
    +  public void setupNestedLoopJoin(FragmentContext context,
    +                                  RecordBatch left,
                                       ExpandableHyperContainer rightContainer,
                                       LinkedList<Integer> rightCounts,
                                       NestedLoopJoinBatch outgoing) {
         this.left = left;
    -    leftRecordCount = left.getRecordCount();
    +    this.leftRecordCount = left.getRecordCount();
         this.rightCounts = rightCounts;
         this.outgoing = outgoing;
     
         doSetup(context, rightContainer, left, outgoing);
       }
     
       /**
    -   * This method is the core of the nested loop join. For every record on the right we go over
    -   * the left batch and produce the cross product output
    +   * Main entry point for producing the output records. Thin wrapper around populateOutgoingBatch(), this method
    +   * controls which left batch we are processing and fetches the next left input batch once we exhaust the current one.
    +   *
    +   * @param joinType join type (INNER ot LEFT)
    +   * @return the number of records produced in the output batch
    +   */
    +  public int outputRecords(JoinRelType joinType) {
    +    int outputIndex = 0;
    +    while (leftRecordCount != 0) {
    +      outputIndex = populateOutgoingBatch(joinType, outputIndex);
    +      if (outputIndex >= NestedLoopJoinBatch.MAX_BATCH_SIZE) {
    +        break;
    +      }
    +      // reset state and get next left batch
    +      resetAndGetNextLeft();
    +    }
    +    return outputIndex;
    +  }
    +
    +  /**
    +   * This method is the core of the nested loop join.For each left batch record looks for matching record
    +   * from the list of right batches. Match is checked by calling {@link #doEval(int, int, int)} method.
    +   * If matching record is found both left and right records are written into output batch,
    +   * otherwise if join type is LEFT, than only left record is written, right batch record values will be null.
    +   *
    +   * @param joinType join type (INNER or LEFT)
        * @param outputIndex index to start emitting records at
        * @return final outputIndex after producing records in the output batch
        */
    -  private int populateOutgoingBatch(int outputIndex) {
    -
    -    // Total number of batches on the right side
    -    int totalRightBatches = rightCounts.size();
    -
    -    // Total number of records on the left
    -    int localLeftRecordCount = leftRecordCount;
    -
    -    /*
    -     * The below logic is the core of the NLJ. To have better performance we copy the instance members into local
    -     * method variables, once we are done with the loop we need to update the instance variables to reflect the new
    -     * state. To avoid code duplication of resetting the instance members at every exit point in the loop we are using
    -     * 'goto'
    -     */
    -    int localNextRightBatchToProcess = nextRightBatchToProcess;
    -    int localNextRightRecordToProcess = nextRightRecordToProcess;
    -    int localNextLeftRecordToProcess = nextLeftRecordToProcess;
    -
    -    outer: {
    -
    -      for (; localNextRightBatchToProcess< totalRightBatches; localNextRightBatchToProcess++) { // for every batch on the right
    -        int compositeIndexPart = localNextRightBatchToProcess << 16;
    -        int rightRecordCount = rightCounts.get(localNextRightBatchToProcess);
    -
    -        for (; localNextRightRecordToProcess < rightRecordCount; localNextRightRecordToProcess++) { // for every record in this right batch
    -          for (; localNextLeftRecordToProcess < localLeftRecordCount; localNextLeftRecordToProcess++) { // for every record in the left batch
    -
    +  private int populateOutgoingBatch(JoinRelType joinType, int outputIndex) {
    +    // copy index and match counters as local variables to speed up processing
    +    int nextRightBatchToProcess = tracker.getNextRightBatchToProcess();
    +    int nextRightRecordToProcess = tracker.getNextRightRecordToProcess();
    +    int nextLeftRecordToProcess = tracker.getNextLeftRecordToProcess();
    +    boolean rightRecordMatched = tracker.isRightRecordMatched();
    +
    +    outer:
    +    // for every record in the left batch
    +    for (; nextLeftRecordToProcess < leftRecordCount; nextLeftRecordToProcess++) {
    +      // for every batch on the right
    +      for (; nextRightBatchToProcess < rightCounts.size(); nextRightBatchToProcess++) {
    +        int rightRecordCount = rightCounts.get(nextRightBatchToProcess);
    --- End diff --
    
    No. I have created Jira https://issues.apache.org/jira/browse/DRILL-5407 to fix this issue since it was present before.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r109193083
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinBatch.java ---
    @@ -214,26 +226,62 @@ private boolean hasMore(IterOutcome outcome) {
     
       /**
        * Method generates the runtime code needed for NLJ. Other than the setup method to set the input and output value
    -   * vector references we implement two more methods
    -   * 1. emitLeft()  -> Project record from the left side
    -   * 2. emitRight() -> Project record from the right side (which is a hyper container)
    +   * vector references we implement three more methods
    +   * 1. doEval() -> Evaluates if record from left side matches record from the right side
    +   * 2. emitLeft() -> Project record from the left side
    +   * 3. emitRight() -> Project record from the right side (which is a hyper container)
        * @return the runtime generated class that implements the NestedLoopJoin interface
    -   * @throws IOException
    -   * @throws ClassTransformationException
        */
    -  private NestedLoopJoin setupWorker() throws IOException, ClassTransformationException {
    -    final CodeGenerator<NestedLoopJoin> nLJCodeGenerator = CodeGenerator.get(NestedLoopJoin.TEMPLATE_DEFINITION, context.getFunctionRegistry(), context.getOptions());
    +  private NestedLoopJoin setupWorker() throws IOException, ClassTransformationException, SchemaChangeException {
    +    final CodeGenerator<NestedLoopJoin> nLJCodeGenerator = CodeGenerator.get(
    +        NestedLoopJoin.TEMPLATE_DEFINITION, context.getFunctionRegistry(), context.getOptions());
         nLJCodeGenerator.plainJavaCapable(true);
         // Uncomment out this line to debug the generated code.
     //    nLJCodeGenerator.saveCodeForDebugging(true);
         final ClassGenerator<NestedLoopJoin> nLJClassGenerator = nLJCodeGenerator.getRoot();
     
    +    // generate doEval
    +    final ErrorCollector collector = new ErrorCollectorImpl();
    +
    +
    +    /*
    +        Logical expression may contain fields from left and right batches. During code generation (materialization)
    +        we need to indicate from which input field should be taken. Mapping sets can work with only one input at a time.
    +        But non-equality expressions can be complex:
    +          select t1.c1, t2.c1, t2.c2 from t1 inner join t2 on t1.c1 between t2.c1 and t2.c2
    +        or even contain self join which can not be transformed into filter since OR clause is present
    +          select *from t1 inner join t2 on t1.c1 >= t2.c1 or t1.c3 <> t1.c4
    +
    +        In this case logical expression can not be split according to input presence (like during equality joins
    --- End diff --
    
    To avoid confusion you could list couple of example categories:  
    1. Join on non-equijoin predicates:  t1 inner join t2 on  (t1.c1 between t2.c1 AND t2.c2) AND (...) 
    2. Join with an OR predicate: t1 inner join t2 on on t1.c1 = t2.c1 OR t1.c2 = t2.c2
    
    The other category where a join predicate includes self-join could probably be left out since there are quite a few variations there - if there are 2 tables but the join condition only specifies 1 table, then it would be a cartesian join with the second table. If the self join occurs in combination with an AND it would be treated differently compared with OR etc..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by arina-ielchiieva <gi...@git.apache.org>.
Github user arina-ielchiieva commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r108036357
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinBatch.java ---
    @@ -214,26 +226,62 @@ private boolean hasMore(IterOutcome outcome) {
     
       /**
        * Method generates the runtime code needed for NLJ. Other than the setup method to set the input and output value
    -   * vector references we implement two more methods
    -   * 1. emitLeft()  -> Project record from the left side
    -   * 2. emitRight() -> Project record from the right side (which is a hyper container)
    +   * vector references we implement three more methods
    +   * 1. doEval() -> Evaluates if record from left side matches record from the right side
    +   * 2. emitLeft() -> Project record from the left side
    +   * 3. emitRight() -> Project record from the right side (which is a hyper container)
        * @return the runtime generated class that implements the NestedLoopJoin interface
    -   * @throws IOException
    -   * @throws ClassTransformationException
        */
    -  private NestedLoopJoin setupWorker() throws IOException, ClassTransformationException {
    -    final CodeGenerator<NestedLoopJoin> nLJCodeGenerator = CodeGenerator.get(NestedLoopJoin.TEMPLATE_DEFINITION, context.getFunctionRegistry(), context.getOptions());
    +  private NestedLoopJoin setupWorker() throws IOException, ClassTransformationException, SchemaChangeException {
    +    final CodeGenerator<NestedLoopJoin> nLJCodeGenerator = CodeGenerator.get(
    +        NestedLoopJoin.TEMPLATE_DEFINITION, context.getFunctionRegistry(), context.getOptions());
         nLJCodeGenerator.plainJavaCapable(true);
         // Uncomment out this line to debug the generated code.
     //    nLJCodeGenerator.saveCodeForDebugging(true);
         final ClassGenerator<NestedLoopJoin> nLJClassGenerator = nLJCodeGenerator.getRoot();
     
    +    // generate doEval
    +    final ErrorCollector collector = new ErrorCollectorImpl();
    +
    +
    +    /*
    +        Logical expression may contain fields from left and right batches. During code generation (materialization)
    +        we need to indicate from which input field should be taken. Mapping sets can work with only one input at a time.
    +        But non-equality expressions can be complex:
    +          select t1.c1, t2.c1, t2.c2 from t1 inner join t2 on t1.c1 between t2.c1 and t2.c2
    +        or even contain self join which can not be transformed into filter since OR clause is present
    +          select *from t1 inner join t2 on t1.c1 >= t2.c1 or t1.c3 <> t1.c4
    +
    +        In this case logical expression can not be split according to input presence (like during equality joins
    --- End diff --
    
    The thing is that inequality join is not only join that has `t1.c3 <> t1.c4` but also the one that has `OR`.
    For example, currently the following query `select * from t1 inner join t2 on t1.c1 = t2.c1 or t1.c2 = t2.c2` will fail which the following error: `UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due to either a cartesian join or an inequality join`.
    
    The main idea of my comment is that I don't bother if it's equality or inequality join, I just materialize the whole expression with fields from two inputs and to find out which input field is I add batch indication. If you want I can remove the comment, if it's confusing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r109193949
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java ---
    @@ -40,132 +41,133 @@
       // Record count of the left batch currently being processed
       private int leftRecordCount = 0;
     
    -  // List of record counts  per batch in the hyper container
    +  // List of record counts per batch in the hyper container
       private List<Integer> rightCounts = null;
     
       // Output batch
       private NestedLoopJoinBatch outgoing = null;
     
    -  // Next right batch to process
    -  private int nextRightBatchToProcess = 0;
    -
    -  // Next record in the current right batch to process
    -  private int nextRightRecordToProcess = 0;
    -
    -  // Next record in the left batch to process
    -  private int nextLeftRecordToProcess = 0;
    +  // Iteration status tracker
    +  private IterationStatusTracker tracker = new IterationStatusTracker();
     
       /**
        * Method initializes necessary state and invokes the doSetup() to set the
    -   * input and output value vector references
    +   * input and output value vector references.
    +   *
        * @param context Fragment context
        * @param left Current left input batch being processed
        * @param rightContainer Hyper container
    +   * @param rightCounts Counts for each right container
        * @param outgoing Output batch
        */
    -  public void setupNestedLoopJoin(FragmentContext context, RecordBatch left,
    +  public void setupNestedLoopJoin(FragmentContext context,
    +                                  RecordBatch left,
                                       ExpandableHyperContainer rightContainer,
                                       LinkedList<Integer> rightCounts,
                                       NestedLoopJoinBatch outgoing) {
         this.left = left;
    -    leftRecordCount = left.getRecordCount();
    +    this.leftRecordCount = left.getRecordCount();
         this.rightCounts = rightCounts;
         this.outgoing = outgoing;
     
         doSetup(context, rightContainer, left, outgoing);
       }
     
       /**
    -   * This method is the core of the nested loop join. For every record on the right we go over
    -   * the left batch and produce the cross product output
    +   * Main entry point for producing the output records. Thin wrapper around populateOutgoingBatch(), this method
    +   * controls which left batch we are processing and fetches the next left input batch once we exhaust the current one.
    +   *
    +   * @param joinType join type (INNER ot LEFT)
    +   * @return the number of records produced in the output batch
    +   */
    +  public int outputRecords(JoinRelType joinType) {
    +    int outputIndex = 0;
    +    while (leftRecordCount != 0) {
    +      outputIndex = populateOutgoingBatch(joinType, outputIndex);
    +      if (outputIndex >= NestedLoopJoinBatch.MAX_BATCH_SIZE) {
    +        break;
    +      }
    +      // reset state and get next left batch
    +      resetAndGetNextLeft();
    +    }
    +    return outputIndex;
    +  }
    +
    +  /**
    +   * This method is the core of the nested loop join.For each left batch record looks for matching record
    +   * from the list of right batches. Match is checked by calling {@link #doEval(int, int, int)} method.
    +   * If matching record is found both left and right records are written into output batch,
    +   * otherwise if join type is LEFT, than only left record is written, right batch record values will be null.
    +   *
    +   * @param joinType join type (INNER or LEFT)
        * @param outputIndex index to start emitting records at
        * @return final outputIndex after producing records in the output batch
        */
    -  private int populateOutgoingBatch(int outputIndex) {
    -
    -    // Total number of batches on the right side
    -    int totalRightBatches = rightCounts.size();
    -
    -    // Total number of records on the left
    -    int localLeftRecordCount = leftRecordCount;
    -
    -    /*
    -     * The below logic is the core of the NLJ. To have better performance we copy the instance members into local
    -     * method variables, once we are done with the loop we need to update the instance variables to reflect the new
    -     * state. To avoid code duplication of resetting the instance members at every exit point in the loop we are using
    -     * 'goto'
    -     */
    -    int localNextRightBatchToProcess = nextRightBatchToProcess;
    -    int localNextRightRecordToProcess = nextRightRecordToProcess;
    -    int localNextLeftRecordToProcess = nextLeftRecordToProcess;
    -
    -    outer: {
    -
    -      for (; localNextRightBatchToProcess< totalRightBatches; localNextRightBatchToProcess++) { // for every batch on the right
    -        int compositeIndexPart = localNextRightBatchToProcess << 16;
    -        int rightRecordCount = rightCounts.get(localNextRightBatchToProcess);
    -
    -        for (; localNextRightRecordToProcess < rightRecordCount; localNextRightRecordToProcess++) { // for every record in this right batch
    -          for (; localNextLeftRecordToProcess < localLeftRecordCount; localNextLeftRecordToProcess++) { // for every record in the left batch
    -
    +  private int populateOutgoingBatch(JoinRelType joinType, int outputIndex) {
    +    // copy index and match counters as local variables to speed up processing
    +    int nextRightBatchToProcess = tracker.getNextRightBatchToProcess();
    +    int nextRightRecordToProcess = tracker.getNextRightRecordToProcess();
    +    int nextLeftRecordToProcess = tracker.getNextLeftRecordToProcess();
    +    boolean rightRecordMatched = tracker.isRightRecordMatched();
    +
    +    outer:
    +    // for every record in the left batch
    +    for (; nextLeftRecordToProcess < leftRecordCount; nextLeftRecordToProcess++) {
    +      // for every batch on the right
    +      for (; nextRightBatchToProcess < rightCounts.size(); nextRightBatchToProcess++) {
    +        int rightRecordCount = rightCounts.get(nextRightBatchToProcess);
    --- End diff --
    
    Does that mean the query profile correctly show NxM rows processed in the nested loop join ? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r108208895
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java ---
    @@ -40,132 +41,133 @@
       // Record count of the left batch currently being processed
       private int leftRecordCount = 0;
     
    -  // List of record counts  per batch in the hyper container
    +  // List of record counts per batch in the hyper container
       private List<Integer> rightCounts = null;
     
       // Output batch
       private NestedLoopJoinBatch outgoing = null;
     
    -  // Next right batch to process
    -  private int nextRightBatchToProcess = 0;
    -
    -  // Next record in the current right batch to process
    -  private int nextRightRecordToProcess = 0;
    -
    -  // Next record in the left batch to process
    -  private int nextLeftRecordToProcess = 0;
    +  // Iteration status tracker
    +  private IterationStatusTracker tracker = new IterationStatusTracker();
     
       /**
        * Method initializes necessary state and invokes the doSetup() to set the
    -   * input and output value vector references
    +   * input and output value vector references.
    +   *
        * @param context Fragment context
        * @param left Current left input batch being processed
        * @param rightContainer Hyper container
    +   * @param rightCounts Counts for each right container
        * @param outgoing Output batch
        */
    -  public void setupNestedLoopJoin(FragmentContext context, RecordBatch left,
    +  public void setupNestedLoopJoin(FragmentContext context,
    +                                  RecordBatch left,
                                       ExpandableHyperContainer rightContainer,
                                       LinkedList<Integer> rightCounts,
                                       NestedLoopJoinBatch outgoing) {
         this.left = left;
    -    leftRecordCount = left.getRecordCount();
    +    this.leftRecordCount = left.getRecordCount();
         this.rightCounts = rightCounts;
         this.outgoing = outgoing;
     
         doSetup(context, rightContainer, left, outgoing);
       }
     
       /**
    -   * This method is the core of the nested loop join. For every record on the right we go over
    -   * the left batch and produce the cross product output
    +   * Main entry point for producing the output records. Thin wrapper around populateOutgoingBatch(), this method
    +   * controls which left batch we are processing and fetches the next left input batch once we exhaust the current one.
    +   *
    +   * @param joinType join type (INNER ot LEFT)
    +   * @return the number of records produced in the output batch
    +   */
    +  public int outputRecords(JoinRelType joinType) {
    +    int outputIndex = 0;
    +    while (leftRecordCount != 0) {
    +      outputIndex = populateOutgoingBatch(joinType, outputIndex);
    +      if (outputIndex >= NestedLoopJoinBatch.MAX_BATCH_SIZE) {
    +        break;
    +      }
    +      // reset state and get next left batch
    +      resetAndGetNextLeft();
    +    }
    +    return outputIndex;
    +  }
    +
    +  /**
    +   * This method is the core of the nested loop join.For each left batch record looks for matching record
    +   * from the list of right batches. Match is checked by calling {@link #doEval(int, int, int)} method.
    +   * If matching record is found both left and right records are written into output batch,
    +   * otherwise if join type is LEFT, than only left record is written, right batch record values will be null.
    +   *
    +   * @param joinType join type (INNER or LEFT)
        * @param outputIndex index to start emitting records at
        * @return final outputIndex after producing records in the output batch
        */
    -  private int populateOutgoingBatch(int outputIndex) {
    -
    -    // Total number of batches on the right side
    -    int totalRightBatches = rightCounts.size();
    -
    -    // Total number of records on the left
    -    int localLeftRecordCount = leftRecordCount;
    -
    -    /*
    -     * The below logic is the core of the NLJ. To have better performance we copy the instance members into local
    -     * method variables, once we are done with the loop we need to update the instance variables to reflect the new
    -     * state. To avoid code duplication of resetting the instance members at every exit point in the loop we are using
    -     * 'goto'
    -     */
    -    int localNextRightBatchToProcess = nextRightBatchToProcess;
    -    int localNextRightRecordToProcess = nextRightRecordToProcess;
    -    int localNextLeftRecordToProcess = nextLeftRecordToProcess;
    -
    -    outer: {
    -
    -      for (; localNextRightBatchToProcess< totalRightBatches; localNextRightBatchToProcess++) { // for every batch on the right
    -        int compositeIndexPart = localNextRightBatchToProcess << 16;
    -        int rightRecordCount = rightCounts.get(localNextRightBatchToProcess);
    -
    -        for (; localNextRightRecordToProcess < rightRecordCount; localNextRightRecordToProcess++) { // for every record in this right batch
    -          for (; localNextLeftRecordToProcess < localLeftRecordCount; localNextLeftRecordToProcess++) { // for every record in the left batch
    -
    +  private int populateOutgoingBatch(JoinRelType joinType, int outputIndex) {
    +    // copy index and match counters as local variables to speed up processing
    +    int nextRightBatchToProcess = tracker.getNextRightBatchToProcess();
    +    int nextRightRecordToProcess = tracker.getNextRightRecordToProcess();
    +    int nextLeftRecordToProcess = tracker.getNextLeftRecordToProcess();
    +    boolean rightRecordMatched = tracker.isRightRecordMatched();
    +
    +    outer:
    +    // for every record in the left batch
    +    for (; nextLeftRecordToProcess < leftRecordCount; nextLeftRecordToProcess++) {
    +      // for every batch on the right
    +      for (; nextRightBatchToProcess < rightCounts.size(); nextRightBatchToProcess++) {
    +        int rightRecordCount = rightCounts.get(nextRightBatchToProcess);
    --- End diff --
    
    For keeping track of the number of records on the right that are processed due to the O(n^2) join, we should have a  non-local metric which would be a Long.  Currently, it seems the query profile for NLJoin will be incorrect. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #794: DRILL-5375: Nested loop join: return correct result...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/794#discussion_r109186694
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillOptiq.java ---
    @@ -70,27 +70,65 @@
       private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(DrillOptiq.class);
     
       /**
    -   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax.
    +   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax using one input.
    +   *
    +   * @param context parse context which contains planner settings
    +   * @param input data input
    +   * @param expr expression to be converted
    +   * @return converted expression
        */
       public static LogicalExpression toDrill(DrillParseContext context, RelNode input, RexNode expr) {
    -    final RexToDrill visitor = new RexToDrill(context, input);
    +    return toDrill(context, Lists.newArrayList(input), expr);
    +  }
    +
    +  /**
    +   * Converts a tree of {@link RexNode} operators into a scalar expression in Drill syntax using multiple inputs.
    +   *
    +   * @param context parse context which contains planner settings
    +   * @param inputs multiple data inputs
    +   * @param expr expression to be converted
    +   * @return converted expression
    +   */
    +  public static LogicalExpression toDrill(DrillParseContext context, List<RelNode> inputs, RexNode expr) {
    +    final RexToDrill visitor = new RexToDrill(context, inputs);
         return expr.accept(visitor);
       }
     
       private static class RexToDrill extends RexVisitorImpl<LogicalExpression> {
    -    private final RelNode input;
    +    private final List<RelNode> inputs;
         private final DrillParseContext context;
    +    private final List<RelDataTypeField> fieldList;
     
    -    RexToDrill(DrillParseContext context, RelNode input) {
    +    RexToDrill(DrillParseContext context, List<RelNode> inputs) {
           super(true);
           this.context = context;
    -      this.input = input;
    +      this.inputs = inputs;
    +      this.fieldList = Lists.newArrayList();
    +      /*
    +         Fields are enumerated by their presence order in input. Details {@link org.apache.calcite.rex.RexInputRef}.
    +         Thus we can merge field list from several inputs by adding them into the list in order of appearance.
    +         Each field index in the list will match field index in the RexInputRef instance which will allow us
    +         to retrieve field from filed list by index in {@link #visitInputRef(RexInputRef)} method. Example:
    +
    +         Query: select t1.c1, t2.c1. t2.c2 from t1 inner join t2 on t1.c1 between t2.c1 and t2.c2
    +
    +         Input 1: $0
    +         Input 2: $1, $2
    +
    +         Result: $0, $1, $2
    +       */
    +      for (RelNode input : inputs) {
    --- End diff --
    
    Ok, I see.  Performance-wise it is a minor thing but It is more about working with the existing visitInputRef() which takes one input. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---