You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Jimmy Xiang <jx...@cloudera.com> on 2014/10/13 23:33:09 UTC

Review Request 26661: HIVE-7873 Re-enable lazy HiveBaseFunctionResultList

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26661/
-----------------------------------------------------------

Review request for hive and Xuefu Zhang.


Bugs: HIVE-7873
    https://issues.apache.org/jira/browse/HIVE-7873


Repository: hive-git


Description
-------

Re-enabled lazy HiveBaseFunctionResultList. A separate RowContainer is used to work around the no-write-after-read limitation of RowContainer. The patch also fixed a concurrency issue in HiveKVResultCache. Synchronized is used instead of reentrant lock since I assume there won't be many threads to access the cache.


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java 0df2580 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java a6b9037 
  ql/src/test/org/apache/hadoop/hive/ql/exec/spark/TestHiveKVResultCache.java 496a11f 

Diff: https://reviews.apache.org/r/26661/diff/


Testing
-------

Unit test, some simple perf test.


Thanks,

Jimmy Xiang


Re: Review Request 26661: HIVE-7873 Re-enable lazy HiveBaseFunctionResultList

Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26661/#review56698
-----------------------------------------------------------

Ship it!


Ship It!

- Xuefu Zhang


On Oct. 15, 2014, 12:55 p.m., Jimmy Xiang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26661/
> -----------------------------------------------------------
> 
> (Updated Oct. 15, 2014, 12:55 p.m.)
> 
> 
> Review request for hive and Xuefu Zhang.
> 
> 
> Bugs: HIVE-7873
>     https://issues.apache.org/jira/browse/HIVE-7873
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Re-enabled lazy HiveBaseFunctionResultList. A separate RowContainer is used to work around the no-write-after-read limitation of RowContainer. The patch also fixed a concurrency issue in HiveKVResultCache. Synchronized is used instead of reentrant lock since I assume there won't be many threads to access the cache.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java 0df2580 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java a6b9037 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/spark/TestHiveKVResultCache.java 496a11f 
> 
> Diff: https://reviews.apache.org/r/26661/diff/
> 
> 
> Testing
> -------
> 
> Unit test, some simple perf test.
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>


Re: Review Request 26661: HIVE-7873 Re-enable lazy HiveBaseFunctionResultList

Posted by Jimmy Xiang <jx...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26661/
-----------------------------------------------------------

(Updated Oct. 15, 2014, 12:55 p.m.)


Review request for hive and Xuefu Zhang.


Changes
-------

The new patch made HiveKVResultCache fully thread safe as Xuefu suggested.


Bugs: HIVE-7873
    https://issues.apache.org/jira/browse/HIVE-7873


Repository: hive-git


Description
-------

Re-enabled lazy HiveBaseFunctionResultList. A separate RowContainer is used to work around the no-write-after-read limitation of RowContainer. The patch also fixed a concurrency issue in HiveKVResultCache. Synchronized is used instead of reentrant lock since I assume there won't be many threads to access the cache.


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java 0df2580 
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java a6b9037 
  ql/src/test/org/apache/hadoop/hive/ql/exec/spark/TestHiveKVResultCache.java 496a11f 

Diff: https://reviews.apache.org/r/26661/diff/


Testing
-------

Unit test, some simple perf test.


Thanks,

Jimmy Xiang


Re: Review Request 26661: HIVE-7873 Re-enable lazy HiveBaseFunctionResultList

Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26661/#review56487
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java
<https://reviews.apache.org/r/26661/#comment96841>

    I think it makes sense to make the class fully thread safe, for the sake of clarity and confidence. If there is no concurrent read, synchronization cost can be ignored.



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java
<https://reviews.apache.org/r/26661/#comment96842>

    Yeah. Let's make this thread safe for read also. Otherwise, it's error prone and hard to understand.
    
    We also need to make clear() thread safe, probably.


- Xuefu Zhang


On Oct. 13, 2014, 9:33 p.m., Jimmy Xiang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26661/
> -----------------------------------------------------------
> 
> (Updated Oct. 13, 2014, 9:33 p.m.)
> 
> 
> Review request for hive and Xuefu Zhang.
> 
> 
> Bugs: HIVE-7873
>     https://issues.apache.org/jira/browse/HIVE-7873
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Re-enabled lazy HiveBaseFunctionResultList. A separate RowContainer is used to work around the no-write-after-read limitation of RowContainer. The patch also fixed a concurrency issue in HiveKVResultCache. Synchronized is used instead of reentrant lock since I assume there won't be many threads to access the cache.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java 0df2580 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java a6b9037 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/spark/TestHiveKVResultCache.java 496a11f 
> 
> Diff: https://reviews.apache.org/r/26661/diff/
> 
> 
> Testing
> -------
> 
> Unit test, some simple perf test.
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>


Re: Review Request 26661: HIVE-7873 Re-enable lazy HiveBaseFunctionResultList

Posted by Jimmy Xiang <jx...@cloudera.com>.

> On Oct. 13, 2014, 10:26 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java, line 49
> > <https://reviews.apache.org/r/26661/diff/1/?file=719717#file719717line49>
> >
> >     Nit: could we put -1L or skip it and suppress the warning instead of a random number for this?

Sure. Will change it to -1L


> On Oct. 13, 2014, 10:26 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java, line 47
> > <https://reviews.apache.org/r/26661/diff/1/?file=719718#file719718line47>
> >
> >     Do you mean one thread can access hasNext() and next(), while multiple threads can concurrently access add()? The doc here is a little unclear on this.

Yes, that's what I meant. Let me enhance it a little.

Should we make the class fully thread safe instead?


> On Oct. 13, 2014, 10:26 p.m., Xuefu Zhang wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/exec/spark/TestHiveKVResultCache.java, line 143
> > <https://reviews.apache.org/r/26661/diff/1/?file=719719#file719719line143>
> >
> >     Same here.

Changed to -1L


> On Oct. 13, 2014, 10:26 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java, line 158
> > <https://reviews.apache.org/r/26661/diff/1/?file=719718#file719718line158>
> >
> >     Same here, cursor and container are protected resources and they need to be in the synchnized block.

Same assumption as above. The add() method never changes "cursor". So if "cursor" is not 0, then it must be the iteration thread itself, so we don't need the synchronization.


> On Oct. 13, 2014, 10:26 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java, line 129
> > <https://reviews.apache.org/r/26661/diff/1/?file=719718#file719718line129>
> >
> >     I think the condition check should be also in the synchronized block, as it accesses protected resources: container.rowCount() and cursor.

I assume just one thread does the iteration, while there could be multiple threads doing add() concurrently. So if cursor is not 0, other threads won't touch "container". They will use the backup one.
If cursor is 0, other threads are changing rowCount, and this condition check doesn't return true, the next block should get the right value.  I need to make a little change here.

Should we make the whole class thread safe?


- Jimmy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26661/#review56449
-----------------------------------------------------------


On Oct. 13, 2014, 9:33 p.m., Jimmy Xiang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26661/
> -----------------------------------------------------------
> 
> (Updated Oct. 13, 2014, 9:33 p.m.)
> 
> 
> Review request for hive and Xuefu Zhang.
> 
> 
> Bugs: HIVE-7873
>     https://issues.apache.org/jira/browse/HIVE-7873
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Re-enabled lazy HiveBaseFunctionResultList. A separate RowContainer is used to work around the no-write-after-read limitation of RowContainer. The patch also fixed a concurrency issue in HiveKVResultCache. Synchronized is used instead of reentrant lock since I assume there won't be many threads to access the cache.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java 0df2580 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java a6b9037 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/spark/TestHiveKVResultCache.java 496a11f 
> 
> Diff: https://reviews.apache.org/r/26661/diff/
> 
> 
> Testing
> -------
> 
> Unit test, some simple perf test.
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>


Re: Review Request 26661: HIVE-7873 Re-enable lazy HiveBaseFunctionResultList

Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26661/#review56449
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java
<https://reviews.apache.org/r/26661/#comment96779>

    Nit: could we put -1L or skip it and suppress the warning instead of a random number for this?



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java
<https://reviews.apache.org/r/26661/#comment96780>

    Do you mean one thread can access hasNext() and next(), while multiple threads can concurrently access add()? The doc here is a little unclear on this.



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java
<https://reviews.apache.org/r/26661/#comment96782>

    I think the condition check should be also in the synchronized block, as it accesses protected resources: container.rowCount() and cursor.



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java
<https://reviews.apache.org/r/26661/#comment96783>

    Same here, cursor and container are protected resources and they need to be in the synchnized block.



ql/src/test/org/apache/hadoop/hive/ql/exec/spark/TestHiveKVResultCache.java
<https://reviews.apache.org/r/26661/#comment96784>

    Same here.


- Xuefu Zhang


On Oct. 13, 2014, 9:33 p.m., Jimmy Xiang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26661/
> -----------------------------------------------------------
> 
> (Updated Oct. 13, 2014, 9:33 p.m.)
> 
> 
> Review request for hive and Xuefu Zhang.
> 
> 
> Bugs: HIVE-7873
>     https://issues.apache.org/jira/browse/HIVE-7873
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Re-enabled lazy HiveBaseFunctionResultList. A separate RowContainer is used to work around the no-write-after-read limitation of RowContainer. The patch also fixed a concurrency issue in HiveKVResultCache. Synchronized is used instead of reentrant lock since I assume there won't be many threads to access the cache.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java 0df2580 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java a6b9037 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/spark/TestHiveKVResultCache.java 496a11f 
> 
> Diff: https://reviews.apache.org/r/26661/diff/
> 
> 
> Testing
> -------
> 
> Unit test, some simple perf test.
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>