You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Ashutosh Chauhan <ha...@apache.org> on 2016/02/16 22:20:21 UTC

Review Request 43626: HIVE-12988

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43626/
-----------------------------------------------------------

Review request for hive and Prasanth_J.


Bugs: HIVE-12988
    https://issues.apache.org/jira/browse/HIVE-12988


Repository: hive-git


Description
-------

Improve dynamic partition loading IV Parallelize copyFiles()


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java a92c002 

Diff: https://reviews.apache.org/r/43626/diff/


Testing
-------

Regression suite


Thanks,

Ashutosh Chauhan


Re: Review Request 43626: HIVE-12988

Posted by Ashutosh Chauhan <ha...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43626/
-----------------------------------------------------------

(Updated March 9, 2016, 9:35 p.m.)


Review request for hive and Prasanth_J.


Changes
-------

Addressed Prashanth's feedback


Bugs: HIVE-12988
    https://issues.apache.org/jira/browse/HIVE-12988


Repository: hive-git


Description
-------

Improve dynamic partition loading IV Parallelize copyFiles()


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 5098851 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java ad17096 
  shims/common/src/main/java/org/apache/hadoop/fs/ProxyFileSystem.java cb1e2b7 

Diff: https://reviews.apache.org/r/43626/diff/


Testing
-------

Regression suite


Thanks,

Ashutosh Chauhan


Re: Review Request 43626: HIVE-12988

Posted by j....@gmail.com.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/43626/#review120265
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 1499)
<https://reviews.apache.org/r/43626/#comment181697>

    nit: I don't see newFiles modified outside. Can this be made as an return argument from Hive.copyFiles()? Instead of InOut argument.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 1746)
<https://reviews.apache.org/r/43626/#comment181698>

    nit: same here. newFiles not modified anywhere other than copyFiles.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2592)
<https://reviews.apache.org/r/43626/#comment181695>

    I think the thread pool size should be bounded. Cached thread pool creates threads on-demand. On sudden burst of events this could end up creating too many threads that increases context switch times.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2593)
<https://reviews.apache.org/r/43626/#comment181699>

    Can you also name the thread pool with setNameFormat()?



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2643)
<https://reviews.apache.org/r/43626/#comment181701>

    I think this should happen inside the callable and not in notification listener. The cost of creating path is cheaper when compared to moving or copying. 
    
    The default  method of addCallback() handles notification in the same thread that sent the notification. As per documentation, "Note: If the callback is slow or heavyweight, consider supplying an executor. If you do not supply an executor, addCallback will use a direct executor, which carries some caveats for heavier operations. For example, the callback may run on an unpredictable or undesirable thread".



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2645)
<https://reviews.apache.org/r/43626/#comment181706>

    I can see 2 expensive operations here
    1) listStatus
    2) mvFile
    
    I think both should be handled by the threadpool.
    Typical sequence of operations would be 
     - For each directory, submit to threadpool to compute listStatus. This returns List<Future>
     - Iterate the futures, invoke future.get() to get List<Path> for each directories. If get() throws exception then shutdown the executor and log exception.
     - Iterator the List<Path>, submit to threadpool to perform mvFile. This returns List<Future>
     - Iterate the futures, invoke futures.get() to get void/boolean. Same way of exception handling here as well. 
     
     I think addCallback is good approach but it adds complications when multi-level futures are involved like above and waiting can be tricking like the 1 hour waiting below.


- Prasanth_J


On Feb. 16, 2016, 9:20 p.m., Ashutosh Chauhan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/43626/
> -----------------------------------------------------------
> 
> (Updated Feb. 16, 2016, 9:20 p.m.)
> 
> 
> Review request for hive and Prasanth_J.
> 
> 
> Bugs: HIVE-12988
>     https://issues.apache.org/jira/browse/HIVE-12988
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Improve dynamic partition loading IV Parallelize copyFiles()
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java a92c002 
> 
> Diff: https://reviews.apache.org/r/43626/diff/
> 
> 
> Testing
> -------
> 
> Regression suite
> 
> 
> Thanks,
> 
> Ashutosh Chauhan
> 
>