You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Sahil Takiar <ta...@gmail.com> on 2017/01/18 03:17:57 UTC
Review Request 55661: HIVE-15546: Optimize Utilities.getInputPaths()
so each listStatus of a partition is done in parallel
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/
-----------------------------------------------------------
Review request for hive and Sergio Pena.
Bugs: HIVE-15546
https://issues.apache.org/jira/browse/HIVE-15546
Repository: hive-git
Description
-------
HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel
Diffs
-----
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20
ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa
Diff: https://reviews.apache.org/r/55661/diff/
Testing
-------
Unit tests added
Thanks,
Sahil Takiar
Re: Review Request 55661: HIVE-15546: Optimize
Utilities.getInputPaths() so
each listStatus of a partition is done in parallel
Posted by Sahil Takiar <ta...@gmail.com>.
> On Jan. 18, 2017, 9:59 p.m., Sergio Pena wrote:
> > Can this be considered a S3 performance only? or is it fine to have many RPC calls in case the path is on HDFS?
We could make this S3 / blobstore specific, but this code block will also create empty files for each empty partition it finds, which can take some amount of time on HDFS.
It's the same number of RPC calls to HDFS, they are just all done in parallel. The value of mapred.dfsclient.parallelism.max probably needs to be tuned depending on if your running on HDFS or S3.
> On Jan. 18, 2017, 9:59 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java, line 2992
> > <https://reviews.apache.org/r/55661/diff/1/?file=1607288#file1607288line2992>
> >
> > 1. what happens if numThread is a negative number?
> > 2. do we need to execute a threadpool if numThreads is just 1?
Will fix.
> On Jan. 18, 2017, 9:59 p.m., Sergio Pena wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java, line 3016
> > <https://reviews.apache.org/r/55661/diff/1/?file=1607288#file1607288line3016>
> >
> > Can this happen? I followed the workflow when file is null, and found that createDummyFileForEmptyPartition() may fail when attempts to call path.toString(). Is that right? What can we do to avoid a NPE there?
Good point. I think we can log a warning, and then continue on to the next file. I don't see why a null file would be added to the `pathToAliases` but it is possible. Will fix.
- Sahil
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/#review162184
-----------------------------------------------------------
On Jan. 18, 2017, 3:17 a.m., Sahil Takiar wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55661/
> -----------------------------------------------------------
>
> (Updated Jan. 18, 2017, 3:17 a.m.)
>
>
> Review request for hive and Sergio Pena.
>
>
> Bugs: HIVE-15546
> https://issues.apache.org/jira/browse/HIVE-15546
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel
>
>
> Diffs
> -----
>
> ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20
> ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa
>
> Diff: https://reviews.apache.org/r/55661/diff/
>
>
> Testing
> -------
>
> Unit tests added
>
>
> Thanks,
>
> Sahil Takiar
>
>
Re: Review Request 55661: HIVE-15546: Optimize
Utilities.getInputPaths() so
each listStatus of a partition is done in parallel
Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/#review162184
-----------------------------------------------------------
Can this be considered a S3 performance only? or is it fine to have many RPC calls in case the path is on HDFS?
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java (line 2992)
<https://reviews.apache.org/r/55661/#comment233436>
1. what happens if numThread is a negative number?
2. do we need to execute a threadpool if numThreads is just 1?
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java (line 3016)
<https://reviews.apache.org/r/55661/#comment233439>
Can this happen? I followed the workflow when file is null, and found that createDummyFileForEmptyPartition() may fail when attempts to call path.toString(). Is that right? What can we do to avoid a NPE there?
- Sergio Pena
On Jan. 18, 2017, 3:17 a.m., Sahil Takiar wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55661/
> -----------------------------------------------------------
>
> (Updated Jan. 18, 2017, 3:17 a.m.)
>
>
> Review request for hive and Sergio Pena.
>
>
> Bugs: HIVE-15546
> https://issues.apache.org/jira/browse/HIVE-15546
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel
>
>
> Diffs
> -----
>
> ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20
> ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa
>
> Diff: https://reviews.apache.org/r/55661/diff/
>
>
> Testing
> -------
>
> Unit tests added
>
>
> Thanks,
>
> Sahil Takiar
>
>
Re: Review Request 55661: HIVE-15546: Optimize
Utilities.getInputPaths() so
each listStatus of a partition is done in parallel
Posted by Sergio Pena <se...@cloudera.com>.
> On Jan. 18, 2017, 11:24 p.m., Sergio Pena wrote:
> > Ship It!
Thanks. The code looks good, and the RPC calls are fine for me.
Please upload another file to the JIRA to re-run the tests.
- Sergio
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/#review162206
-----------------------------------------------------------
On Jan. 18, 2017, 11:11 p.m., Sahil Takiar wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55661/
> -----------------------------------------------------------
>
> (Updated Jan. 18, 2017, 11:11 p.m.)
>
>
> Review request for hive and Sergio Pena.
>
>
> Bugs: HIVE-15546
> https://issues.apache.org/jira/browse/HIVE-15546
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel
>
>
> Diffs
> -----
>
> ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20
> ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa
>
> Diff: https://reviews.apache.org/r/55661/diff/
>
>
> Testing
> -------
>
> Unit tests added
>
>
> Thanks,
>
> Sahil Takiar
>
>
Re: Review Request 55661: HIVE-15546: Optimize
Utilities.getInputPaths() so
each listStatus of a partition is done in parallel
Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/#review162206
-----------------------------------------------------------
Ship it!
Ship It!
- Sergio Pena
On Jan. 18, 2017, 11:11 p.m., Sahil Takiar wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55661/
> -----------------------------------------------------------
>
> (Updated Jan. 18, 2017, 11:11 p.m.)
>
>
> Review request for hive and Sergio Pena.
>
>
> Bugs: HIVE-15546
> https://issues.apache.org/jira/browse/HIVE-15546
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel
>
>
> Diffs
> -----
>
> ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20
> ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa
>
> Diff: https://reviews.apache.org/r/55661/diff/
>
>
> Testing
> -------
>
> Unit tests added
>
>
> Thanks,
>
> Sahil Takiar
>
>
Re: Review Request 55661: HIVE-15546: Optimize
Utilities.getInputPaths() so
each listStatus of a partition is done in parallel
Posted by Sahil Takiar <ta...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55661/
-----------------------------------------------------------
(Updated Jan. 18, 2017, 11:11 p.m.)
Review request for hive and Sergio Pena.
Bugs: HIVE-15546
https://issues.apache.org/jira/browse/HIVE-15546
Repository: hive-git
Description
-------
HIVE-15546: Optimize Utilities.getInputPaths() so each listStatus of a partition is done in parallel
Diffs (updated)
-----
ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 0161c20
ql/src/test/org/apache/hadoop/hive/ql/exec/TestUtilities.java bd067aa
Diff: https://reviews.apache.org/r/55661/diff/
Testing
-------
Unit tests added
Thanks,
Sahil Takiar