Posted to issues@drill.apache.org by "Sorabh Hamirwasia (JIRA)" <ji...@apache.org> on 2018/01/31 21:31:00 UTC
[jira] [Created] (DRILL-6128) Wrong Result with Nested Loop Join

Sorabh Hamirwasia created DRILL-6128:
----------------------------------------

             Summary: Wrong Result with Nested Loop Join
                 Key: DRILL-6128
                 URL: https://issues.apache.org/jira/browse/DRILL-6128
             Project: Apache Drill
          Issue Type: Bug
          Components: Execution - Relational Operators
            Reporter: Sorabh Hamirwasia
            Assignee: Sorabh Hamirwasia


Nested Loop Join produces wrong results if there are multiple batches on the right side. It builds an ExpandableHyperContainer to hold all the right-side batches. Then, for each record on the left side, it evaluates the join condition against all records on the right side and emits output wherever the condition is satisfied. The main loop inside [populateOutgoingBatch|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/NestedLoopJoinTemplate.java#L106] calls *doEval* with the correct indexes to evaluate records on both sides. In the generated code of *doEval*, however, for some reason a right shift of 16 is applied to rightBatchIndex (sample shared below).
{code:java}
public boolean doEval(int leftIndex, int rightBatchIndex, int rightRecordIndexWithinBatch)
 throws SchemaChangeException
{
  {
   IntHolder out3 = new IntHolder();
   {
     out3 .value = vv0 .getAccessor().get((leftIndex));
   }
   IntHolder out7 = new IntHolder();
   {
     out7 .value = vv4[((rightBatchIndex)>>>16)].getAccessor().get(((rightRecordIndexWithinBatch)& 65535));
   }

......
......
}{code}
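
The effect of the shift can be checked in isolation. The sketch below is not Drill code: the class and method names are hypothetical, and the "combined index" layout (batch in the upper 16 bits, record in the lower 16 bits) is only an assumption about what the generated expression might have been written for. It shows that a plain batch index passed through {{>>> 16}} collapses to 0 for every value up to 65535, while a combined index decodes as intended.

```java
final class CompositeIndexSketch {

    // Assumed layout (not confirmed by the report): upper 16 bits = batch, lower 16 bits = record.
    static int encode(int batchIndex, int recordIndex) {
        return (batchIndex << 16) | (recordIndex & 0xFFFF);
    }

    // Same expression shape as in the generated doEval: index >>> 16.
    static int batchOf(int index) {
        return index >>> 16;
    }

    // Same expression shape as in the generated doEval: index & 65535.
    static int recordOf(int index) {
        return index & 65535;
    }

    public static void main(String[] args) {
        // A plain batch index loses its value when shifted.
        for (int rightBatchIndex : new int[] {0, 1, 2, 65535, 65536}) {
            System.out.println(rightBatchIndex + " >>> 16 = " + batchOf(rightBatchIndex));
        }
        // A combined index round-trips through the same expressions.
        int combined = encode(1, 2);
        System.out.println("batch=" + batchOf(combined) + ", record=" + recordOf(combined));
    }
}
```

If the caller passes a plain batch index, as the report describes, every batch index below 65536 selects batch 0.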
 

When the actual loop is processing the second batch, the right-shifted index inside the eval method becomes 0, so the condition is evaluated against the first right batch again. So if there is more than one batch on the right side, doEval will always use the first batch for condition evaluation (for any batch index up to 65535). The output data, however, is taken from the correct batch, which leads to problems such as out-of-bounds errors and wrong data. Cases can be:

Here *rightBatchIndex* is the index of the right batch to consider, and *rightRecordIndexWithinBatch* is the index of the record within the right batch at rightBatchIndex.

1) The first right batch arrives with zero records and with OK_NEW_SCHEMA (say, because of a filter in the operator tree). The next right batch has more than zero records. When we call doEval for the second batch (*rightBatchIndex = 1*) and its first record (*rightRecordIndexWithinBatch = 0*), the actual evaluation happens against the first batch (since *rightBatchIndex >>> 16 = 0*). Accessing the record at *rightRecordIndexWithinBatch* in the first batch throws *IndexOutOfBoundsException*, since the first batch has no records.

2) Say there are 2 batches on the right side, the first containing 3 records (with id_right=1/2/3) and the second also containing 3 records (with id_right=10/20/30), and 1 batch on the left side with 3 records (with id_left=1/2/3). In this case the NestedLoopJoin (with an equality condition) ends up producing 6 records instead of 3. It produces the first 3 records from matches between the left records and the records of the first right batch. But while processing the 2nd right batch it again evaluates id_left=id_right against the first batch, finds matches again, and produces another 3 records. *Example* (here the second right batch has id_right=4 for all three records):

*Left Batch Data:*

 
{code:java}
Batch1:

{
 "id_left": 1,
 "cost_left": 11,
 "name_left": "item11"
}
{
 "id_left": 2,
 "cost_left": 21,
 "name_left": "item21"
}
{
 "id_left": 3,
 "cost_left": 31,
 "name_left": "item31"
}{code}
 

*Right Batch Data:*

 
{code:java}
Batch 1:
{
 "id_right": 1,
 "cost_right": 10,
 "name_right": "item1"
}
{
 "id_right": 2,
 "cost_right": 20,
 "name_right": "item2"
}
{
 "id_right": 3,
 "cost_right": 30,
 "name_right": "item3"
}
{code}
 

 
{code:java}
Batch 2:
{
 "id_right": 4,
 "cost_right": 40,
 "name_right": "item4"
}
{
 "id_right": 4,
 "cost_right": 40,
 "name_right": "item4"
}
{
 "id_right": 4,
 "cost_right": 40,
 "name_right": "item4"
}{code}
 

*Produced output:*
{code:java}
{
 "id_left": 1,
 "cost_left": 11,
 "name_left": "item11",
 "id_right": 1,
 "cost_right": 10,
 "name_right": "item1"
}
{
 "id_left": 1,
 "cost_left": 11,
 "name_left": "item11",
 "id_right": 4,
 "cost_right": 40,
 "name_right": "item4"
}
{
 "id_left": 2,
 "cost_left": 21,
 "name_left": "item21",
 "id_right": 2,
 "cost_right": 20,
 "name_right": "item2"
}
{
 "id_left": 2,
 "cost_left": 21,
 "name_left": "item21",
 "id_right": 4,
 "cost_right": 40,
 "name_right": "item4"
}
{
 "id_left": 3,
 "cost_left": 31,
 "name_left": "item31",
 "id_right": 3,
 "cost_right": 30,
 "name_right": "item3"
}
{
 "id_left": 3,
 "cost_left": 31,
 "name_left": "item31",
 "id_right": 4,
 "cost_right": 40,
 "name_right": "item4"
}{code}
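
Both cases can be reproduced with a small stand-alone simulation. This is only a sketch under stated assumptions, not the Drill implementation: plain int arrays stand in for the value vectors, only the id columns are modelled, the loop order (left record, then right batch, then right record) is assumed, and all names ({{NestedLoopJoinShiftSketch}}, {{shiftedEval}}, {{directEval}}, {{join}}) are hypothetical. {{shiftedEval}} copies the index expressions from the generated doEval shown earlier; {{directEval}} uses the batch index as passed.

```java
import java.util.ArrayList;
import java.util.List;

final class NestedLoopJoinShiftSketch {

    // Hypothetical stand-in for the generated doEval signature.
    interface Eval {
        boolean test(int[] left, int[][] right, int leftIndex,
                     int rightBatchIndex, int rightRecordIndexWithinBatch);
    }

    // Same index expressions as the generated code: batch index shifted right by 16.
    static boolean shiftedEval(int[] left, int[][] right, int leftIndex,
                               int rightBatchIndex, int rightRecordIndexWithinBatch) {
        return left[leftIndex]
            == right[rightBatchIndex >>> 16][rightRecordIndexWithinBatch & 65535];
    }

    // Uses the batch index exactly as the caller passed it.
    static boolean directEval(int[] left, int[][] right, int leftIndex,
                              int rightBatchIndex, int rightRecordIndexWithinBatch) {
        return left[leftIndex] == right[rightBatchIndex][rightRecordIndexWithinBatch];
    }

    // Evaluates the condition via eval, but always emits data from the batch actually being visited.
    static List<String> join(int[] left, int[][] right, Eval eval) {
        List<String> out = new ArrayList<>();
        for (int l = 0; l < left.length; l++) {
            for (int b = 0; b < right.length; b++) {
                for (int r = 0; r < right[b].length; r++) {
                    if (eval.test(left, right, l, b, r)) {
                        out.add(left[l] + "-" + right[b][r]);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] idLeft = {1, 2, 3};
        int[][] idRight = {{1, 2, 3}, {4, 4, 4}};
        // Six pairs with the shift, three without it.
        System.out.println(join(idLeft, idRight, NestedLoopJoinShiftSketch::shiftedEval));
        System.out.println(join(idLeft, idRight, NestedLoopJoinShiftSketch::directEval));
    }
}
```

With the data from the example, the shifted variant yields the six id_left/id_right pairs listed under the produced output, and the direct variant yields only the three pairs from the first right batch. If the first right batch is empty, for example {{new int[][] {{}, {1}}}}, the shifted variant throws an ArrayIndexOutOfBoundsException when it reaches the second batch, which corresponds to case 1.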
 


