Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/01/20 14:30:10 UTC

[GitHub] [iceberg] steveloughran opened a new pull request #2125: Improve HadoopCatalog performance/scalability #2124

steveloughran opened a new pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125


   Move to RemoteIterator for scanning directories.
   It's not as elegant as using the Java 8 streams API, but it works with
   the prefetching that the s3a and (soon) abfs connectors do, and it
   bails out more efficiently.
   
   Because each directory is probed with its own getFileStatus and list calls, the cost of the outer listing can be hidden entirely behind those inner probes, at least when the listing has more than one page of results *and* the implementation is prefetching.
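
A minimal sketch of the early bail-out described above, not the code from this pull request. To stay runnable without Hadoop on the classpath it declares a stand-in interface, `RemoteIteratorSketch`, with the same shape as `org.apache.hadoop.fs.RemoteIterator` (both methods may throw `IOException`). `ListingSketch`, `CountingIterator`, `anyMatch`, and the file names are hypothetical and exist only for illustration; in real code the iterator would come from a call such as `FileSystem.listStatusIterator(path)`.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Stand-in with the same shape as org.apache.hadoop.fs.RemoteIterator:
// both methods may throw IOException, since advancing can fetch a remote page.
interface RemoteIteratorSketch<T> {
  boolean hasNext() throws IOException;

  T next() throws IOException;
}

final class ListingSketch {
  private ListingSketch() {}

  // Hypothetical helper: wraps an in-memory list and counts how many entries
  // were pulled, standing in for a paged, possibly prefetching, listing.
  static final class CountingIterator<T> implements RemoteIteratorSketch<T> {
    private final Iterator<T> delegate;
    private int consumed = 0;

    CountingIterator(List<T> entries) {
      this.delegate = entries.iterator();
    }

    @Override
    public boolean hasNext() {
      return delegate.hasNext();
    }

    @Override
    public T next() {
      consumed++;
      return delegate.next();
    }

    int consumed() {
      return consumed;
    }
  }

  // Returns true as soon as one entry matches; the rest of the listing is never pulled.
  static <T> boolean anyMatch(RemoteIteratorSketch<T> listing, Predicate<T> filter)
      throws IOException {
    while (listing.hasNext()) {
      if (filter.test(listing.next())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) throws IOException {
    CountingIterator<String> listing = new CountingIterator<>(
        Arrays.asList("v1.metadata.json", "v2.metadata.json", "snap-1.avro"));
    boolean found = anyMatch(listing, name -> name.endsWith(".metadata.json"));
    System.out.println("found=" + found + ", entries consumed=" + listing.consumed());
  }
}
```

An array-returning listing has to materialize every entry before the caller can inspect the first one. With an iterator, the loop can stop at the first match, and a prefetching implementation is free to fetch the next page in the background while the caller works on the current one.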
   
   Also extended the check for access errors to look for AccessDeniedException; that's to support other filesystems and to prepare for HADOOP-15710.
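
A hedged sketch of what such a check could look like, not the code from this pull request. It assumes the exception in question is `java.nio.file.AccessDeniedException` and that a denied probe should be treated the same way as a missing path; `AccessErrorSketch`, `Probe`, and `probeOrFalse` are hypothetical names introduced for illustration.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.AccessDeniedException;

final class AccessErrorSketch {
  private AccessErrorSketch() {}

  // Hypothetical stand-in for a filesystem probe such as a getFileStatus or list call.
  interface Probe {
    boolean run() throws IOException;
  }

  // Treats "missing" and "not permitted" alike (assumption): the probe simply
  // reports false. Any other IOException is rethrown to the caller.
  static boolean probeOrFalse(Probe probe) throws IOException {
    try {
      return probe.run();
    } catch (FileNotFoundException | AccessDeniedException e) {
      return false;
    }
  }
}
```

Matching on the exception type rather than on an error-message string keeps the check independent of any one connector's wording, which is one way to read "support other filesystems" above.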
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] steveloughran commented on pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125#issuecomment-767850160


   ok. I thought those imports were in use. Will fix; just a bit distracted today. 




[GitHub] [iceberg] rdblue commented on pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125#issuecomment-767019464


   A couple more checkstyle errors:
   
   ```
   Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:32:8: Unused import - java.util.stream.Collectors. [UnusedImports]
   Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:33:8: Unused import - java.util.stream.Stream. [UnusedImports]
   ```




[GitHub] [iceberg] steveloughran commented on pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125#issuecomment-766858807


   Updated to fix the checkstyle errors. Also did a quick fix for Azure exception reporting. This patch will continue to work with branches with and without the Hadoop changes in
    https://github.com/apache/hadoop/pull/2648




[GitHub] [iceberg] rdblue edited a comment on pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
rdblue edited a comment on pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125#issuecomment-767019464


   A couple more checkstyle errors:
   
   ```
   Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:32:8: Unused import - java.util.stream.Collectors. [UnusedImports]
   Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:33:8: Unused import - java.util.stream.Stream. [UnusedImports]
   ```
   
   The JDK 11 test failure is a flaky test with Hive that we need to track down.




[GitHub] [iceberg] rdblue commented on a change in pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125#discussion_r562315209



##########
File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java
##########
@@ -181,7 +185,7 @@ private boolean isTableDir(Path path) {
     // still a namespace.
     try {
       return fs.listStatus(metadataPath, TABLE_FILTER).length >= 1;
-    } catch (FileNotFoundException e) {
+    } catch (FileNotFoundException  e) {

Review comment:
       Accidental whitespace change?






[GitHub] [iceberg] rdblue commented on pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125#issuecomment-769262053


   Looks good. Thanks for fixing this!




[GitHub] [iceberg] steveloughran commented on a change in pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
steveloughran commented on a change in pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125#discussion_r566215727



##########
File path: core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java
##########
@@ -181,7 +185,7 @@ private boolean isTableDir(Path path) {
     // still a namespace.
     try {
       return fs.listStatus(metadataPath, TABLE_FILTER).length >= 1;
-    } catch (FileNotFoundException e) {
+    } catch (FileNotFoundException  e) {

Review comment:
       fixed






[GitHub] [iceberg] rdblue commented on pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125#issuecomment-765596523


   Tests look like they're passing to me?




[GitHub] [iceberg] rdblue merged pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125


   




[GitHub] [iceberg] steveloughran commented on pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125#issuecomment-769737942


   thanks!




[GitHub] [iceberg] rdblue commented on pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125#issuecomment-765057986


   Looks good to me overall. Any reason why it is a draft?




[GitHub] [iceberg] steveloughran commented on pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125#issuecomment-765338835


   > Looks good to me overall. Any reason why it is a draft?
   
   Because for some reason some of the tests are failing, and I'm just getting through your build process before I start bothering people for reviews.
   
   One of the listIterator changes isn't working, and I want to understand why.




[GitHub] [iceberg] rdblue commented on pull request #2125: Improve HadoopCatalog performance/scalability #2124

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #2125:
URL: https://github.com/apache/iceberg/pull/2125#issuecomment-765058405


   Looks like there are checkstyle issues to fix:
   
   ```
   Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:32:8: Unused import - java.util.stream.Collectors. [UnusedImports]
   Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:33:8: Unused import - java.util.stream.Stream. [UnusedImports]
   Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:177: Line is longer than 120 characters (found 124). [LineLength]
   Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:177:15: '||' should be on the previous line. [OperatorWrap]
   Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:229:20: Local variable name 's' must match pattern '^[a-z][a-zA-Z0-9]+$'. [LocalVariableName]
   ```
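
The LocalVariableName error above quotes the pattern that local variable names must match. As a small sketch, the same regular expression can be tried directly to see why `s` is rejected: the pattern requires a lowercase first letter followed by at least one more alphanumeric character. `LocalVariableNameSketch` and `isAllowed` are hypothetical names used only for illustration, and this is not how Checkstyle itself is invoked.

```java
import java.util.regex.Pattern;

final class LocalVariableNameSketch {
  private LocalVariableNameSketch() {}

  // The pattern quoted in the LocalVariableName error message.
  static final Pattern LOCAL_NAME = Pattern.compile("^[a-z][a-zA-Z0-9]+$");

  // True when the whole name matches the pattern.
  static boolean isAllowed(String name) {
    return LOCAL_NAME.matcher(name).matches();
  }
}
```

Under this pattern a one-letter name such as `s` fails, while a longer name such as `status` passes.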

