You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Sergio Pena <se...@cloudera.com> on 2016/11/21 22:29:44 UTC
Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing
the old rows with the new ones
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/
-----------------------------------------------------------
Review request for hive.
Bugs: HIVE-15199
https://issues.apache.org/jira/browse/HIVE-15199
Repository: hive-git
Description
-------
The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
Diffs
-----
common/src/java/org/apache/hadoop/hive/common/FileUtils.java 1d8c04160c35e48781b20f8e6e14760c19df9ca5
itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
Diff: https://reviews.apache.org/r/53966/diff/
Testing
-------
Thanks,
Sergio Pena
Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is
replacing the old rows with the new ones
Posted by Sergio Pena <se...@cloudera.com>.
> On Nov. 21, 2016, 10:48 p.m., Aihua Xu wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java, line 2789
> > <https://reviews.apache.org/r/53966/diff/1/?file=1568247#file1568247line2789>
> >
> > Do you think that will be a performance impact for HDFS if there are many files in the folder? Looks like recursive is not needed, at least for HDFS since it's to mv to 'destPath = new Path(destf, name + ("_copy_" + counter) + filetype);'?
> >
> > The other changes are more like refactoring, correct?
The recursive is needed because the copyFiles() may also copies directories.
for (FileStatus src : srcs) {
...
if (src.isDirectory()) {
files = srcFs.listStatus(src.getPath(), FileUtils.HIDDEN_FILES_PATH_FILTER);
}
...
for (final FileStatus srcFile : files) {
...
}
}
You may see a bad performance for the first INSERT INTO statements, but you'll see it much better if repeated INSERT INTO are executed.
This is not true for the current code. Here's an example with numbers:
1) Call INSERT INTO only once (creates 100 files)
BEFORE:
100 rename calls
AFTER:
1 listFiles
100 exists
100 renames
2) Call INSERT INTO 100 times (creates 100 files) (performance of last 100th INSERT INTO)
BEFORE:
1,000 rename calls (each file will call rename() 100 times until HDFS returns true)
AFTER:
1 listFiles
100 exists
100 renames
I could try to save exists() calls in we are dealing with HDFS.
- Sergio
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156517
-----------------------------------------------------------
On Nov. 21, 2016, 10:29 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53966/
> -----------------------------------------------------------
>
> (Updated Nov. 21, 2016, 10:29 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-15199
> https://issues.apache.org/jira/browse/HIVE-15199
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/common/FileUtils.java 1d8c04160c35e48781b20f8e6e14760c19df9ca5
> itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
> itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
> ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
>
> Diff: https://reviews.apache.org/r/53966/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>
>
Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is
replacing the old rows with the new ones
Posted by Aihua Xu <ax...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156517
-----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2789)
<https://reviews.apache.org/r/53966/#comment226704>
Do you think that will be a performance impact for HDFS if there are many files in the folder? Looks like recursive is not needed, at least for HDFS since it's to mv to 'destPath = new Path(destf, name + ("_copy_" + counter) + filetype);'?
The other changes are more like refactoring, correct?
- Aihua Xu
On Nov. 21, 2016, 10:29 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53966/
> -----------------------------------------------------------
>
> (Updated Nov. 21, 2016, 10:29 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-15199
> https://issues.apache.org/jira/browse/HIVE-15199
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/common/FileUtils.java 1d8c04160c35e48781b20f8e6e14760c19df9ca5
> itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
> itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
> ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
>
> Diff: https://reviews.apache.org/r/53966/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>
>
Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is
replacing the old rows with the new ones
Posted by Yongzhi Chen <yc...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156644
-----------------------------------------------------------
Ship it!
Ship It!
- Yongzhi Chen
On Nov. 21, 2016, 11:54 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53966/
> -----------------------------------------------------------
>
> (Updated Nov. 21, 2016, 11:54 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-15199
> https://issues.apache.org/jira/browse/HIVE-15199
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/common/FileUtils.java 1d8c04160c35e48781b20f8e6e14760c19df9ca5
> itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
> itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
> ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
>
> Diff: https://reviews.apache.org/r/53966/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>
>
Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is
replacing the old rows with the new ones
Posted by Sergio Pena <se...@cloudera.com>.
> On Nov. 22, 2016, 9:27 p.m., Yongzhi Chen wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java, line 2983
> > <https://reviews.apache.org/r/53966/diff/2/?file=1568277#file1568277line2983>
> >
> > if (isBlobStoragePath && !destFs.exists(destFilePath)
> > then the second condition :
> > !destFs.rename(sourcePath, destFilePath) will be evaluated. I assume you do not want that be called, right
I wanted to be called. For HDFS, the rename() will return 'false' if the destination file already exists, but in S3, the rename will return 'true' even if the file already exists, so that's why I use 'exists' for S3 only.
- Sergio
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156639
-----------------------------------------------------------
On Nov. 21, 2016, 11:54 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53966/
> -----------------------------------------------------------
>
> (Updated Nov. 21, 2016, 11:54 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-15199
> https://issues.apache.org/jira/browse/HIVE-15199
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/common/FileUtils.java 1d8c04160c35e48781b20f8e6e14760c19df9ca5
> itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
> itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
> ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
>
> Diff: https://reviews.apache.org/r/53966/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>
>
Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is
replacing the old rows with the new ones
Posted by Yongzhi Chen <yc...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156639
-----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2951)
<https://reviews.apache.org/r/53966/#comment226831>
if (isBlobStoragePath && !destFs.exists(destFilePath)
then the second condition :
!destFs.rename(sourcePath, destFilePath) will be evaluated. I assume you do not want that be called, right
- Yongzhi Chen
On Nov. 21, 2016, 11:54 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53966/
> -----------------------------------------------------------
>
> (Updated Nov. 21, 2016, 11:54 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-15199
> https://issues.apache.org/jira/browse/HIVE-15199
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/common/FileUtils.java 1d8c04160c35e48781b20f8e6e14760c19df9ca5
> itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
> itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
> ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
>
> Diff: https://reviews.apache.org/r/53966/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>
>
Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is
replacing the old rows with the new ones
Posted by Illya Yalovyy <ya...@amazon.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156738
-----------------------------------------------------------
The patch looks good to me. We probably want to add a integration test for it in hive-blobstore package.
- Illya Yalovyy
On Nov. 22, 2016, 10:35 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53966/
> -----------------------------------------------------------
>
> (Updated Nov. 22, 2016, 10:35 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-15199
> https://issues.apache.org/jira/browse/HIVE-15199
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
>
>
> Diffs
> -----
>
> itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
> itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
> ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
>
> Diff: https://reviews.apache.org/r/53966/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>
>
Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is
replacing the old rows with the new ones
Posted by Yongzhi Chen <yc...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156659
-----------------------------------------------------------
The latest patch solved all the issues Illya Yalovyy pointed out, the fix looks good.
+1
- Yongzhi Chen
On Nov. 22, 2016, 10:35 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53966/
> -----------------------------------------------------------
>
> (Updated Nov. 22, 2016, 10:35 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-15199
> https://issues.apache.org/jira/browse/HIVE-15199
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
>
>
> Diffs
> -----
>
> itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
> itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
> ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
>
> Diff: https://reviews.apache.org/r/53966/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>
>
Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is
replacing the old rows with the new ones
Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/
-----------------------------------------------------------
(Updated Nov. 22, 2016, 10:35 p.m.)
Review request for hive.
Changes
-------
Addressed issues from Illya Yalovyy.
Also, I sticked to the if (!exists || !rename) condition on S3, and not using the listFiles() to avoid OOM issues with concurrent HS2 requests. We can design a better performance work in a different JIRA.
Bugs: HIVE-15199
https://issues.apache.org/jira/browse/HIVE-15199
Repository: hive-git
Description
-------
The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
Diffs (updated)
-----
itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
Diff: https://reviews.apache.org/r/53966/diff/
Testing
-------
Thanks,
Sergio Pena
Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is
replacing the old rows with the new ones
Posted by Sergio Pena <se...@cloudera.com>.
> On Nov. 22, 2016, 9:30 p.m., Illya Yalovyy wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java, line 2789
> > <https://reviews.apache.org/r/53966/diff/2/?file=1568277#file1568277line2789>
> >
> > Scalability concern:
> >
> > On some real datasets, it could be millions of elements in that list. If it happens in HS2 with many cocurrent connection this jvm can easily go down with OOM Exceptions. I would suggest reconsider that approach.
You're right, I did not see that case. Probably it would better to stick with the if (!exists && !rename) condition. This would be slow when doing many repeated INSERT INTO, but it will not have problems with memory.
- Sergio
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156629
-----------------------------------------------------------
On Nov. 21, 2016, 11:54 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53966/
> -----------------------------------------------------------
>
> (Updated Nov. 21, 2016, 11:54 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-15199
> https://issues.apache.org/jira/browse/HIVE-15199
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/common/FileUtils.java 1d8c04160c35e48781b20f8e6e14760c19df9ca5
> itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
> itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
> ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
>
> Diff: https://reviews.apache.org/r/53966/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>
>
Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is
replacing the old rows with the new ones
Posted by Sergio Pena <se...@cloudera.com>.
> On Nov. 22, 2016, 9:30 p.m., Illya Yalovyy wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java, line 2953
> > <https://reviews.apache.org/r/53966/diff/2/?file=1568277#file1568277line2953>
> >
> > is "copy" part of the file name misleading? It is not actually a copy of an original file.
Correct, but that's what the old code is using:
for (int counter = 1; !destFs.rename(srcP,destPath); counter++) {
destPath = new Path(destf, name + ("_copy_" + counter) + filetype);
}
The filename could be better, but I'm using the same one to fix the issue as I will need to investigate more in the code to see if that does not affect something.
- Sergio
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156629
-----------------------------------------------------------
On Nov. 21, 2016, 11:54 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53966/
> -----------------------------------------------------------
>
> (Updated Nov. 21, 2016, 11:54 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-15199
> https://issues.apache.org/jira/browse/HIVE-15199
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/common/FileUtils.java 1d8c04160c35e48781b20f8e6e14760c19df9ca5
> itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
> itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
> ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
>
> Diff: https://reviews.apache.org/r/53966/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>
>
Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is
replacing the old rows with the new ones
Posted by Illya Yalovyy <ya...@amazon.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156629
-----------------------------------------------------------
common/src/java/org/apache/hadoop/hive/common/FileUtils.java (line 1007)
<https://reviews.apache.org/r/53966/#comment226819>
Can we use an explicit type as return type? Something like
final class NameAndType {
final String name;
final String type;
}
common/src/java/org/apache/hadoop/hive/common/FileUtils.java (lines 1010 - 1019)
<https://reviews.apache.org/r/53966/#comment226820>
return new NameAndType(FilenameUtils.getBaseName(filename), FilenameUtils.getExtension(filename));
https://commons.apache.org/proper/commons-io/javadocs/api-1.4/org/apache/commons/io/FilenameUtils.html
common/src/java/org/apache/hadoop/hive/common/FileUtils.java (line 1015)
<https://reviews.apache.org/r/53966/#comment226821>
extra ";" is not needed.
common/src/java/org/apache/hadoop/hive/common/FileUtils.java (line 1019)
<https://reviews.apache.org/r/53966/#comment226835>
This utility method should be covered with unit tests. Please make sure you have covered cases like:
s3://mybucket.test/foo/bar/00000_0
ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2786)
<https://reviews.apache.org/r/53966/#comment226822>
Scalability concern:
On some real datasets, it could be millions of elements in that list. If it happens in HS2 with many cocurrent connection this jvm can easily go down with OOM Exceptions. I would suggest reconsider that approach.
ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2921)
<https://reviews.apache.org/r/53966/#comment226828>
is "copy" part of the file name misleading? It is not actually a copy of an original file.
ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2923)
<https://reviews.apache.org/r/53966/#comment226825>
Just a note:
filename + "." + filetype is 10x faster than String.format("%s%s", filename, filetype).
Also it seems like "." is missing.
ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2928)
<https://reviews.apache.org/r/53966/#comment226826>
"." is missing between name and type.
ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2943)
<https://reviews.apache.org/r/53966/#comment226830>
FilenameUtils can do the job:
https://commons.apache.org/proper/commons-io/javadocs/api-1.4/org/apache/commons/io/FilenameUtils.html
- Illya Yalovyy
On Nov. 21, 2016, 11:54 p.m., Sergio Pena wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53966/
> -----------------------------------------------------------
>
> (Updated Nov. 21, 2016, 11:54 p.m.)
>
>
> Review request for hive.
>
>
> Bugs: HIVE-15199
> https://issues.apache.org/jira/browse/HIVE-15199
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/common/FileUtils.java 1d8c04160c35e48781b20f8e6e14760c19df9ca5
> itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
> itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
> ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
>
> Diff: https://reviews.apache.org/r/53966/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Sergio Pena
>
>
Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is
replacing the old rows with the new ones
Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/
-----------------------------------------------------------
(Updated Nov. 21, 2016, 11:54 p.m.)
Review request for hive.
Changes
-------
Attach a new patch that do not call destFs.exists when HDFS is used. This is to avoid performance penalties when HDFS Is used.
Only change is on Hive.mvFile():
boolean isBlobStoragePath = BlobStorageUtils.isBlobStoragePath(conf, destDirPath);
while ((isBlobStoragePath && destFs.exists(destFilePath)) || !destFs.rename(sourcePath, destFilePath)) {
destFilePath = createCopyFilePath(destDirPath, name, type, ++counter);
}
Bugs: HIVE-15199
https://issues.apache.org/jira/browse/HIVE-15199
Repository: hive-git
Description
-------
The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.
Diffs (updated)
-----
common/src/java/org/apache/hadoop/hive/common/FileUtils.java 1d8c04160c35e48781b20f8e6e14760c19df9ca5
itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1
itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0
ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe
Diff: https://reviews.apache.org/r/53966/diff/
Testing
-------
Thanks,
Sergio Pena