You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Sahil Takiar <ta...@gmail.com> on 2017/01/18 03:17:57 UTC

Review Request 55661: HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/
-----------------------------------------------------------

Review request for hive and Sergio Pena.


Bugs: HIVE-15546
    https://issues.apache.org/jira/browse/HIVE-15546


Repository: hive-git


Description
-------

HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel


Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20 
  ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa 

Diff: https://reviews.apache.org/r/55661/diff/


Testing
-------

Unit tests added


Thanks,

Sahil Takiar


Re: Review Request 55661: HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

Posted by Sahil Takiar <ta...@gmail.com>.

> On Jan. 18, 2017, 9:59 p.m., Sergio Pena wrote:
> > Can this be considered a S3 performance only? or is it fine to have many RPC calls in case the path is on HDFS?

We could make this S3 / blobstore specific, but this code block will also create empty files for each empty partition it finds, which can take some amount of time on HDFS.

It's the same number of RPC calls to HDFS, they are just all done in parallel. The value of mapred.dfsclient.parallelism.max probably needs to be tuned depending on if your running on HDFS or S3.


> On Jan. 18, 2017, 9:59 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java, line 2992
> > <https://reviews.apache.org/r/55661/diff/1/?file=1607288#file1607288line2992>
> >
> >     1. what happens if numThread is a negative number?
> >     2. do we need to execute a threadpool if numThreads is just 1?

Will fix.


> On Jan. 18, 2017, 9:59 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java, line 3016
> > <https://reviews.apache.org/r/55661/diff/1/?file=1607288#file1607288line3016>
> >
> >     Can this happen? I followed the workflow when file is null, and found that createDummyFileForEmptyPartition() may fail when attempts to call path.toString(). Is that right? What can we do to avoid a NPE there?

Good point. I think we can log a warning, and then continue on to the next file. I don't see why a null file would be added to the `pathToAliases` but it is possible. Will fix.


- Sahil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/#review162184
-----------------------------------------------------------


On Jan. 18, 2017, 3:17 a.m., Sahil Takiar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55661/
> -----------------------------------------------------------
> 
> (Updated Jan. 18, 2017, 3:17 a.m.)
> 
> 
> Review request for hive and Sergio Pena.
> 
> 
> Bugs: HIVE-15546
>     https://issues.apache.org/jira/browse/HIVE-15546
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa 
> 
> Diff: https://reviews.apache.org/r/55661/diff/
> 
> 
> Testing
> -------
> 
> Unit tests added
> 
> 
> Thanks,
> 
> Sahil Takiar
> 
>


Re: Review Request 55661: HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/#review162184
-----------------------------------------------------------



Can this be considered a S3 performance only? or is it fine to have many RPC calls in case the path is on HDFS?


ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java (line 2992)
<https://reviews.apache.org/r/55661/#comment233436>

    1. what happens if numThread is a negative number?
    2. do we need to execute a threadpool if numThreads is just 1?



ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java (line 3016)
<https://reviews.apache.org/r/55661/#comment233439>

    Can this happen? I followed the workflow when file is null, and found that createDummyFileForEmptyPartition() may fail when attempts to call path.toString(). Is that right? What can we do to avoid a NPE there?


- Sergio Pena


On Jan. 18, 2017, 3:17 a.m., Sahil Takiar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55661/
> -----------------------------------------------------------
> 
> (Updated Jan. 18, 2017, 3:17 a.m.)
> 
> 
> Review request for hive and Sergio Pena.
> 
> 
> Bugs: HIVE-15546
>     https://issues.apache.org/jira/browse/HIVE-15546
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa 
> 
> Diff: https://reviews.apache.org/r/55661/diff/
> 
> 
> Testing
> -------
> 
> Unit tests added
> 
> 
> Thanks,
> 
> Sahil Takiar
> 
>


Re: Review Request 55661: HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

Posted by Sergio Pena <se...@cloudera.com>.

> On Jan. 18, 2017, 11:24 p.m., Sergio Pena wrote:
> > Ship It!

Thanks. The code looks good, and the RPC calls are fine for me.
Please upload another file to the JIRA to re-run the tests.


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/#review162206
-----------------------------------------------------------


On Jan. 18, 2017, 11:11 p.m., Sahil Takiar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55661/
> -----------------------------------------------------------
> 
> (Updated Jan. 18, 2017, 11:11 p.m.)
> 
> 
> Review request for hive and Sergio Pena.
> 
> 
> Bugs: HIVE-15546
>     https://issues.apache.org/jira/browse/HIVE-15546
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa 
> 
> Diff: https://reviews.apache.org/r/55661/diff/
> 
> 
> Testing
> -------
> 
> Unit tests added
> 
> 
> Thanks,
> 
> Sahil Takiar
> 
>


Re: Review Request 55661: HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/#review162206
-----------------------------------------------------------


Ship it!




Ship It!

- Sergio Pena


On Jan. 18, 2017, 11:11 p.m., Sahil Takiar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55661/
> -----------------------------------------------------------
> 
> (Updated Jan. 18, 2017, 11:11 p.m.)
> 
> 
> Review request for hive and Sergio Pena.
> 
> 
> Bugs: HIVE-15546
>     https://issues.apache.org/jira/browse/HIVE-15546
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa 
> 
> Diff: https://reviews.apache.org/r/55661/diff/
> 
> 
> Testing
> -------
> 
> Unit tests added
> 
> 
> Thanks,
> 
> Sahil Takiar
> 
>


Re: Review Request 55661: HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel

Posted by Sahil Takiar <ta...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/
-----------------------------------------------------------

(Updated Jan. 18, 2017, 11:11 p.m.)


Review request for hive and Sergio Pena.


Bugs: HIVE-15546
    https://issues.apache.org/jira/browse/HIVE-15546


Repository: hive-git


Description
-------

HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel


Diffs (updated)
-----

  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20 
  ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa 

Diff: https://reviews.apache.org/r/55661/diff/


Testing
-------

Unit tests added


Thanks,

Sahil Takiar