You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/03 21:36:58 UTC

[GitHub] [iceberg] amogh-jahagirdar opened a new pull request, #6354: Spark: Check fileIO instead of reading location when determining locality enabled

amogh-jahagirdar opened a new pull request, #6354:
URL: https://github.com/apache/iceberg/pull/6354

   When checking if locality should be enabled, the fileIO instance should be used rather than reading the location


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho merged pull request #6354: Spark: Check fileIO instead of reading location when determining locality enabled

Posted by GitBox <gi...@apache.org>.
szehon-ho merged PR #6354:
URL: https://github.com/apache/iceberg/pull/6354


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a diff in pull request #6354: Spark: Check fileIO instead of reading location when determining locality enabled

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on code in PR #6354:
URL: https://github.com/apache/iceberg/pull/6354#discussion_r1043674439


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java:
##########
@@ -67,10 +67,9 @@ public boolean caseSensitive() {
   }
 
   public boolean localityEnabled() {
-    InputFile file = table.io().newInputFile(table.location());
-
-    if (file instanceof HadoopInputFile) {
-      String scheme = ((HadoopInputFile) file).getFileSystem().getScheme();
+    if (table.io() instanceof HadoopFileIO) {
+      HadoopInputFile file = (HadoopInputFile) table.io().newInputFile(table.location());
+      String scheme = file.getFileSystem().getScheme();
       boolean defaultValue = LOCALITY_WHITELIST_FS.contains(scheme);
       return PropertyUtil.propertyAsBoolean(readOptions, SparkReadOptions.LOCALITY, defaultValue);
     }

Review Comment:
   Yea it's done that way in Flink SourceUtil.  I guess there they have to catch the declared IOException as they instantiate a Hadoop Path, but here we do not.  Not sure if there's that much value to change to the Flink's way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6354: Spark: Check fileIO instead of reading location when determining locality enabled

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6354:
URL: https://github.com/apache/iceberg/pull/6354#discussion_r1038876030


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java:
##########
@@ -67,10 +67,9 @@ public boolean caseSensitive() {
   }
 
   public boolean localityEnabled() {
-    InputFile file = table.io().newInputFile(table.location());
-
-    if (file instanceof HadoopInputFile) {
-      String scheme = ((HadoopInputFile) file).getFileSystem().getScheme();
+    if (table.io() instanceof HadoopFileIO) {
+      HadoopInputFile file = (HadoopInputFile) table.io().newInputFile(table.location());
+      String scheme = file.getFileSystem().getScheme();
       boolean defaultValue = LOCALITY_WHITELIST_FS.contains(scheme);
       return PropertyUtil.propertyAsBoolean(readOptions, SparkReadOptions.LOCALITY, defaultValue);
     }

Review Comment:
   Another approach is to treat this as best effort and in case of any errors during this process just return false. This also looks to be done in the flink case as well but I'm not sure if this is desired for Spark; I think it should be since in the worst case whether locality is enabled or not cannot be determined and just opt for assuming it's not. This shouldn't have any correctness impact, but maybe @aokolnychyi @RussellSpitzer can help me out here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #6354: Spark: Check fileIO instead of reading location when determining locality enabled

Posted by GitBox <gi...@apache.org>.
amogh-jahagirdar commented on code in PR #6354:
URL: https://github.com/apache/iceberg/pull/6354#discussion_r1038876030


##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java:
##########
@@ -67,10 +67,9 @@ public boolean caseSensitive() {
   }
 
   public boolean localityEnabled() {
-    InputFile file = table.io().newInputFile(table.location());
-
-    if (file instanceof HadoopInputFile) {
-      String scheme = ((HadoopInputFile) file).getFileSystem().getScheme();
+    if (table.io() instanceof HadoopFileIO) {
+      HadoopInputFile file = (HadoopInputFile) table.io().newInputFile(table.location());
+      String scheme = file.getFileSystem().getScheme();
       boolean defaultValue = LOCALITY_WHITELIST_FS.contains(scheme);
       return PropertyUtil.propertyAsBoolean(readOptions, SparkReadOptions.LOCALITY, defaultValue);
     }

Review Comment:
   Another approach is to treat this as best effort and in case of any errors during this process just return false. This also looks to be done in the flink case as well but I'm not sure if this is desired for Spark; I think it should be since in the worst case whether locality is enabled or not cannot be determined and just opt for assuming it's not. This shouldn't have any correctness impact, but maybe @aokolnychyi @RussellSpitzer can validate this assumption.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on pull request #6354: Spark: Check fileIO instead of reading location when determining locality enabled

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on PR #6354:
URL: https://github.com/apache/iceberg/pull/6354#issuecomment-1344869084

   Merged, thanks @amogh-jahagirdar for the change


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org