Posted to dev@hive.apache.org by Sergio Pena <se...@cloudera.com> on 2016/11/21 22:29:44 UTC

Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/
-----------------------------------------------------------

Review request for hive.


Bugs: HIVE-15199
    https://issues.apache.org/jira/browse/HIVE-15199


Repository: hive-git


Description
-------

The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.


Diffs
-----

  common/src/java/org/apache/hadoop/hive/common/FileUtils.java 1d8c04160c35e48781b20f8e6e14760c19df9ca5 
  itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1 
  itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe 

Diff: https://reviews.apache.org/r/53966/diff/


Testing
-------


Thanks,

Sergio Pena


Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

Posted by Sergio Pena <se...@cloudera.com>.

> On Nov. 21, 2016, 10:48 p.m., Aihua Xu wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java, line 2789
> > <https://reviews.apache.org/r/53966/diff/1/?file=1568247#file1568247line2789>
> >
> >     Do you think there will be a performance impact on HDFS if there are many files in the folder? It looks like the recursive listing is not needed, at least for HDFS, since the move target is 'destPath = new Path(destf, name + ("_copy_" + counter) + filetype);'?
> >     
> >     The other changes are mostly refactoring, correct?

The recursive listing is needed because copyFiles() may also copy directories:

for (FileStatus src : srcs) {
  ...
  if (src.isDirectory()) {
    // a directory source expands into all of its (non-hidden) files
    files = srcFs.listStatus(src.getPath(), FileUtils.HIDDEN_FILES_PATH_FILTER);
  }
  ...
  for (final FileStatus srcFile : files) {
    ...
  }
}

You may see worse performance for the first INSERT INTO statement, but performance improves considerably when INSERT INTO is executed repeatedly.
That is not true for the current code. Here's an example with numbers:

1) Call INSERT INTO only once (creates 100 files)

   BEFORE:
     100 rename calls 
   AFTER:
     1 listFiles 
     100 exists 
     100 renames
     
2) Call INSERT INTO 100 times (creates 100 files) (performance of the last, 100th INSERT INTO)
 
   BEFORE:
     1,000 rename calls (each file will call rename() 100 times until HDFS returns true)
   AFTER:
     1 listFiles
     100 exists
     100 renames
     
I could try to save the exists() calls if we are dealing with HDFS.
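As an illustration of the retry loop being counted above, the same idea can be sketched against a local filesystem, where java.nio's Files.move without REPLACE_EXISTING fails when the destination exists, much like HDFS rename() returning false. This is only a sketch with made-up names, not Hive's code:

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch of the "_copy_N" naming loop: keep attempting the move
// and bump the counter each time the destination name is already taken,
// mimicking HDFS rename() returning false for an existing destination.
public class CopyCounterSketch {

    static Path moveWithCopySuffix(Path src, Path destDir, String name) throws IOException {
        Path dest = destDir.resolve(name);
        for (int counter = 1; ; counter++) {
            try {
                return Files.move(src, dest);  // like HDFS rename(): fails if dest exists
            } catch (FileAlreadyExistsException e) {
                dest = destDir.resolve(name + "_copy_" + counter);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sketch");
        for (int i = 0; i < 3; i++) {
            Path src = Files.createTempFile("src", null);
            // prints 000000_0, then 000000_0_copy_1, then 000000_0_copy_2
            System.out.println(moveWithCopySuffix(src, dir, "000000_0").getFileName());
        }
    }
}
```

Note that each repeated INSERT INTO restarts the counter from 1, which is what makes the retry count grow with the number of files already in the destination.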


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156517
-----------------------------------------------------------




Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

Posted by Aihua Xu <ax...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156517
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2789)
<https://reviews.apache.org/r/53966/#comment226704>

    Do you think there will be a performance impact on HDFS if there are many files in the folder? It looks like the recursive listing is not needed, at least for HDFS, since the move target is 'destPath = new Path(destf, name + ("_copy_" + counter) + filetype);'?
    
    The other changes are mostly refactoring, correct?


- Aihua Xu




Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

Posted by Yongzhi Chen <yc...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156644
-----------------------------------------------------------


Ship it!




Ship It!

- Yongzhi Chen




Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

Posted by Sergio Pena <se...@cloudera.com>.

> On Nov. 22, 2016, 9:27 p.m., Yongzhi Chen wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java, line 2983
> > <https://reviews.apache.org/r/53966/diff/2/?file=1568277#file1568277line2983>
> >
> >     if (isBlobStoragePath && !destFs.exists(destFilePath))
> >     then the second condition:
> >     !destFs.rename(sourcePath, destFilePath) will be evaluated. I assume you do not want that to be called, right?

I want it to be called. For HDFS, rename() returns 'false' if the destination file already exists, but on S3 rename() returns 'true' even if the file already exists, which is why I use 'exists' for S3 only.
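The evaluation order being discussed here can be sketched with the FileSystem calls stubbed out as suppliers (the method and class names are illustrative, not Hive's). On HDFS the exists() check is short-circuited away and a failed rename() drives the retry; on S3 the exists() check must trigger the retry, because rename() "succeeds" even when the destination is taken:

```java
import java.util.function.BooleanSupplier;

// Illustrative: why the guard is written `(isBlobStore && exists) || !rename`.
public class ShortCircuitSketch {
    static boolean needsRetry(boolean isBlobStore, BooleanSupplier exists, BooleanSupplier rename) {
        return (isBlobStore && exists.getAsBoolean()) || !rename.getAsBoolean();
    }

    public static void main(String[] args) {
        // HDFS: exists() is never evaluated; rename() returning false forces a retry.
        System.out.println(needsRetry(false,
                () -> { throw new AssertionError("exists() skipped on HDFS"); },
                () -> false));                                        // true
        // S3, destination taken: exists() forces the retry and rename() is never attempted.
        System.out.println(needsRetry(true,
                () -> true,
                () -> { throw new AssertionError("rename() must not run"); })); // true
        // S3, destination free: rename() runs and succeeds, so no retry.
        System.out.println(needsRetry(true, () -> false, () -> true)); // false
    }
}
```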


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156639
-----------------------------------------------------------




Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

Posted by Yongzhi Chen <yc...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156639
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2951)
<https://reviews.apache.org/r/53966/#comment226831>

    if (isBlobStoragePath && !destFs.exists(destFilePath))
    then the second condition:
    !destFs.rename(sourcePath, destFilePath) will be evaluated. I assume you do not want that to be called, right?


- Yongzhi Chen




Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

Posted by Illya Yalovyy <ya...@amazon.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156738
-----------------------------------------------------------



The patch looks good to me. We probably want to add an integration test for it in the hive-blobstore package.

- Illya Yalovyy




Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

Posted by Yongzhi Chen <yc...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156659
-----------------------------------------------------------



The latest patch solves all the issues Illya Yalovyy pointed out; the fix looks good.
+1

- Yongzhi Chen




Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/
-----------------------------------------------------------

(Updated Nov. 22, 2016, 10:35 p.m.)


Review request for hive.


Changes
-------

Addressed the issues raised by Illya Yalovyy.

Also, I stuck with the if (!exists || !rename) condition on S3 instead of using listFiles(), to avoid OOM issues with concurrent HS2 requests. We can work on better performance in a separate JIRA.


Bugs: HIVE-15199
    https://issues.apache.org/jira/browse/HIVE-15199


Repository: hive-git


Description
-------

The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.


Diffs (updated)
-----

  itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1 
  itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe 

Diff: https://reviews.apache.org/r/53966/diff/


Testing
-------


Thanks,

Sergio Pena


Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

Posted by Sergio Pena <se...@cloudera.com>.

> On Nov. 22, 2016, 9:30 p.m., Illya Yalovyy wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java, line 2789
> > <https://reviews.apache.org/r/53966/diff/2/?file=1568277#file1568277line2789>
> >
> >     Scalability concern:
> >     
> >     On some real datasets, there could be millions of elements in that list. If this happens in HS2 with many concurrent connections, the JVM can easily go down with OOM exceptions. I would suggest reconsidering that approach.

You're right, I did not see that case. It would probably be better to stick with the if (!exists && !rename) condition. This will be slow when doing many repeated INSERT INTO statements, but it will not have memory problems.


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156629
-----------------------------------------------------------




Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

Posted by Sergio Pena <se...@cloudera.com>.

> On Nov. 22, 2016, 9:30 p.m., Illya Yalovyy wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java, line 2953
> > <https://reviews.apache.org/r/53966/diff/2/?file=1568277#file1568277line2953>
> >
> >     Is the "copy" part of the file name misleading? It is not actually a copy of the original file.

Correct, but that's what the old code uses:

for (int counter = 1; !destFs.rename(srcP, destPath); counter++) {
    // rename() failed (destination exists), so try the next _copy_N name
    destPath = new Path(destf, name + ("_copy_" + counter) + filetype);
}

The filename could be better, but I'm keeping the same one for this fix, as I would need to investigate the code further to make sure changing it does not affect anything.


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156629
-----------------------------------------------------------




Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

Posted by Illya Yalovyy <ya...@amazon.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/#review156629
-----------------------------------------------------------




common/src/java/org/apache/hadoop/hive/common/FileUtils.java (line 1007)
<https://reviews.apache.org/r/53966/#comment226819>

    Can we use an explicit type as return type? Something like 
    final class NameAndType {
      final String name;
      final String type;
    }



common/src/java/org/apache/hadoop/hive/common/FileUtils.java (lines 1010 - 1019)
<https://reviews.apache.org/r/53966/#comment226820>

    return new NameAndType(FilenameUtils.getBaseName(filename), FilenameUtils.getExtension(filename));
    
    https://commons.apache.org/proper/commons-io/javadocs/api-1.4/org/apache/commons/io/FilenameUtils.html



common/src/java/org/apache/hadoop/hive/common/FileUtils.java (line 1015)
<https://reviews.apache.org/r/53966/#comment226821>

    extra ";" is not needed.



common/src/java/org/apache/hadoop/hive/common/FileUtils.java (line 1019)
<https://reviews.apache.org/r/53966/#comment226835>

    This utility method should be covered with unit tests. Please make sure you have covered cases like:
    
    s3://mybucket.test/foo/bar/00000_0



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2786)
<https://reviews.apache.org/r/53966/#comment226822>

    Scalability concern:
    
    On some real datasets, there could be millions of elements in that list. If this happens in HS2 with many concurrent connections, the JVM can easily go down with OOM exceptions. I would suggest reconsidering that approach.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2921)
<https://reviews.apache.org/r/53966/#comment226828>

    Is the "copy" part of the file name misleading? It is not actually a copy of the original file.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2923)
<https://reviews.apache.org/r/53966/#comment226825>

    Just a note:
    filename + "." + filetype is about 10x faster than String.format("%s%s", filename, filetype).
    
    Also, it seems like a "." is missing.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2928)
<https://reviews.apache.org/r/53966/#comment226826>

    "." is missing between name and type.



ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java (line 2943)
<https://reviews.apache.org/r/53966/#comment226830>

    FilenameUtils can do the job:
    https://commons.apache.org/proper/commons-io/javadocs/api-1.4/org/apache/commons/io/FilenameUtils.html


- Illya Yalovyy




Re: Review Request 53966: HIVE-15199: INSERT INTO data on S3 is replacing the old rows with the new ones

Posted by Sergio Pena <se...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53966/
-----------------------------------------------------------

(Updated Nov. 21, 2016, 11:54 p.m.)


Review request for hive.


Changes
-------

Attached a new patch that does not call destFs.exists() when HDFS is used, to avoid a performance penalty on HDFS.

The only change is in Hive.mvFile():

      boolean isBlobStoragePath = BlobStorageUtils.isBlobStoragePath(conf, destDirPath);
      while ((isBlobStoragePath && destFs.exists(destFilePath)) || !destFs.rename(sourcePath, destFilePath)) {
        destFilePath = createCopyFilePath(destDirPath, name, type, ++counter);
      }


Bugs: HIVE-15199
    https://issues.apache.org/jira/browse/HIVE-15199


Repository: hive-git


Description
-------

The patch helps execute repeated INSERT INTO statements on S3 tables when the scratch directory is on S3.


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/common/FileUtils.java 1d8c04160c35e48781b20f8e6e14760c19df9ca5 
  itests/hive-blobstore/src/test/queries/clientpositive/insert_into.q 919ff7d9c7cb40062d68b876d6acbc8efb8a8cf1 
  itests/hive-blobstore/src/test/results/clientpositive/insert_into.q.out c25d0c4eec6983b6869e2eba711b39ba91a4c6e0 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 61b8bd0ac40cffcd6dca0fc874940066bc0aeffe 

Diff: https://reviews.apache.org/r/53966/diff/


Testing
-------


Thanks,

Sergio Pena