You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Deepak Jaiswal <dj...@hortonworks.com> on 2017/07/24 18:47:28 UTC

Review Request 61087: HIVE-16965 SMB join may produce incorrect results

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/
-----------------------------------------------------------

Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.


Bugs: HIVE-16965
    https://issues.apache.org/jira/browse/HIVE-16965


Repository: hive-git


Description
-------

Usually, in a JOIN with multiple inputs (partitions), the inputs are read sequentially, however, incase of SMB join, the inputs are read based on key ordering. This invalidates the current IOContext assumption that the input path once set wont change unless the input changes.
This was resulting in incorrect partition information in results as it is derived from the input path in IOContext.
The new logic changes the input path as and when input changes.


Diffs (updated)
-----

  itests/src/test/resources/testconfiguration.properties f66e19be3e 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java add7d08c40 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 698fa7f69e 
  ql/src/test/queries/clientpositive/smb_join1.q PRE-CREATION 
  ql/src/test/results/clientpositive/llap/smb_join1.q.out PRE-CREATION 


Diff: https://reviews.apache.org/r/61087/diff/1/


Testing (updated)
-------

Added a new test.


Thanks,

Deepak Jaiswal


Re: Review Request 61087: HIVE-16965 SMB join may produce incorrect results

Posted by Gopal V <go...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/#review181246
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java
Lines 66 (patched)
<https://reviews.apache.org/r/61087/#comment256769>

    IdentityHashMap - don't trust the hashCode() for KeyValueReader to be safe.



ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java
Line 109 (original), 130 (patched)
<https://reviews.apache.org/r/61087/#comment256770>

    Clear the prev and IOContext refs - interrupts do leave leaky state behind sometimes


- Gopal V


On July 24, 2017, 6:47 p.m., Deepak Jaiswal wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61087/
> -----------------------------------------------------------
> 
> (Updated July 24, 2017, 6:47 p.m.)
> 
> 
> Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.
> 
> 
> Bugs: HIVE-16965
>     https://issues.apache.org/jira/browse/HIVE-16965
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Usually, in a JOIN with multiple inputs (partitions), the inputs are read sequentially, however, incase of SMB join, the inputs are read based on key ordering. This invalidates the current IOContext assumption that the input path once set wont change unless the input changes.
> This was resulting in incorrect partition information in results as it is derived from the input path in IOContext.
> The new logic changes the input path as and when input changes.
> 
> 
> Diffs
> -----
> 
>   itests/src/test/resources/testconfiguration.properties f66e19be3e 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java add7d08c40 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 698fa7f69e 
>   ql/src/test/queries/clientpositive/smb_join1.q PRE-CREATION 
>   ql/src/test/results/clientpositive/llap/smb_join1.q.out PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/61087/diff/1/
> 
> 
> Testing
> -------
> 
> Added a new test.
> 
> 
> Thanks,
> 
> Deepak Jaiswal
> 
>


Re: Review Request 61087: HIVE-16965 SMB join may produce incorrect results

Posted by Gopal V <go...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/#review181369
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java
Lines 86 (patched)
<https://reviews.apache.org/r/61087/#comment256909>

    That is not always true.
    
    splits.get(1) could have a different path.
    
    You might want to add a loop + assert there.


- Gopal V


On July 25, 2017, 8:01 p.m., Deepak Jaiswal wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61087/
> -----------------------------------------------------------
> 
> (Updated July 25, 2017, 8:01 p.m.)
> 
> 
> Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.
> 
> 
> Bugs: HIVE-16965
>     https://issues.apache.org/jira/browse/HIVE-16965
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Usually, in a JOIN with multiple inputs (partitions), the inputs are read sequentially, however, incase of SMB join, the inputs are read based on key ordering. This invalidates the current IOContext assumption that the input path once set wont change unless the input changes.
> This was resulting in incorrect partition information in results as it is derived from the input path in IOContext.
> The new logic changes the input path as and when input changes.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java add7d08c40 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 698fa7f69e 
>   ql/src/test/results/clientpositive/llap/llap_smb.q.out 87b33db805 
> 
> 
> Diff: https://reviews.apache.org/r/61087/diff/3/
> 
> 
> Testing
> -------
> 
> Added a new test.
> 
> 
> Thanks,
> 
> Deepak Jaiswal
> 
>


Re: Review Request 61087: HIVE-16965 SMB join may produce incorrect results

Posted by Deepak Jaiswal <dj...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/
-----------------------------------------------------------

(Updated July 27, 2017, 9:25 p.m.)


Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.


Changes
-------

Fixed the assert introduced in last rev. to compare the path values instead of comparing the path objects.


Bugs: HIVE-16965
    https://issues.apache.org/jira/browse/HIVE-16965


Repository: hive-git


Description
-------

Usually, in a JOIN with multiple inputs (partitions), the inputs are read sequentially, however, incase of SMB join, the inputs are read based on key ordering. This invalidates the current IOContext assumption that the input path once set wont change unless the input changes.
This was resulting in incorrect partition information in results as it is derived from the input path in IOContext.
The new logic changes the input path as and when input changes.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java add7d08c40 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 698fa7f69e 
  ql/src/test/results/clientpositive/llap/llap_smb.q.out 87b33db805 


Diff: https://reviews.apache.org/r/61087/diff/5/

Changes: https://reviews.apache.org/r/61087/diff/4-5/


Testing
-------

Added a new test.


Thanks,

Deepak Jaiswal


Re: Review Request 61087: HIVE-16965 SMB join may produce incorrect results

Posted by Deepak Jaiswal <dj...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/
-----------------------------------------------------------

(Updated July 27, 2017, 8:37 p.m.)


Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.


Changes
-------

Added a better assert to verify uniqueness of the paths for a given input.


Bugs: HIVE-16965
    https://issues.apache.org/jira/browse/HIVE-16965


Repository: hive-git


Description
-------

Usually, in a JOIN with multiple inputs (partitions), the inputs are read sequentially, however, incase of SMB join, the inputs are read based on key ordering. This invalidates the current IOContext assumption that the input path once set wont change unless the input changes.
This was resulting in incorrect partition information in results as it is derived from the input path in IOContext.
The new logic changes the input path as and when input changes.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java add7d08c40 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 698fa7f69e 
  ql/src/test/results/clientpositive/llap/llap_smb.q.out 87b33db805 


Diff: https://reviews.apache.org/r/61087/diff/4/

Changes: https://reviews.apache.org/r/61087/diff/3-4/


Testing
-------

Added a new test.


Thanks,

Deepak Jaiswal


Re: Review Request 61087: HIVE-16965 SMB join may produce incorrect results

Posted by Deepak Jaiswal <dj...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/
-----------------------------------------------------------

(Updated July 25, 2017, 8:01 p.m.)


Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.


Changes
-------

Use llap_smb.q as main test instead of smb_join1.q
Remove assert based on failing tests. Irrespective of number of splits the path is same.


Bugs: HIVE-16965
    https://issues.apache.org/jira/browse/HIVE-16965


Repository: hive-git


Description
-------

Usually, in a JOIN with multiple inputs (partitions), the inputs are read sequentially, however, incase of SMB join, the inputs are read based on key ordering. This invalidates the current IOContext assumption that the input path once set wont change unless the input changes.
This was resulting in incorrect partition information in results as it is derived from the input path in IOContext.
The new logic changes the input path as and when input changes.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java add7d08c40 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 698fa7f69e 
  ql/src/test/results/clientpositive/llap/llap_smb.q.out 87b33db805 


Diff: https://reviews.apache.org/r/61087/diff/3/

Changes: https://reviews.apache.org/r/61087/diff/2-3/


Testing
-------

Added a new test.


Thanks,

Deepak Jaiswal


Re: Review Request 61087: HIVE-16965 SMB join may produce incorrect results

Posted by Deepak Jaiswal <dj...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61087/
-----------------------------------------------------------

(Updated July 24, 2017, 9:35 p.m.)


Review request for hive, Gopal V, Jason Dere, and Sergey Shelukhin.


Changes
-------

Implemented comments from Sergey and Gopal.


Bugs: HIVE-16965
    https://issues.apache.org/jira/browse/HIVE-16965


Repository: hive-git


Description
-------

Usually, in a JOIN with multiple inputs (partitions), the inputs are read sequentially, however, incase of SMB join, the inputs are read based on key ordering. This invalidates the current IOContext assumption that the input path once set wont change unless the input changes.
This was resulting in incorrect partition information in results as it is derived from the input path in IOContext.
The new logic changes the input path as and when input changes.


Diffs (updated)
-----

  itests/src/test/resources/testconfiguration.properties f66e19be3e 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/MapRecordSource.java add7d08c40 
  ql/src/java/org/apache/hadoop/hive/ql/exec/tez/tools/KeyValueInputMerger.java 698fa7f69e 
  ql/src/test/queries/clientpositive/smb_join1.q PRE-CREATION 
  ql/src/test/results/clientpositive/llap/smb_join1.q.out PRE-CREATION 


Diff: https://reviews.apache.org/r/61087/diff/2/

Changes: https://reviews.apache.org/r/61087/diff/1-2/


Testing
-------

Added a new test.


Thanks,

Deepak Jaiswal