You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/12/31 11:00:24 UTC

[GitHub] [incubator-iceberg] jerryshao opened a new pull request #721: Add PreferredLocations support for Iceberg Spark Source

jerryshao opened a new pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721
 
 
   This addresses the issue of #697 to add PreferredLocations support for Iceberg Spark ReadTask. This PR is against master branch, I will backport to spark-3 branch once ready. 
   
   This patch is verified manually under Spark 2.4 and Hadoop 2.7 (I didn't add an UT since it is a little complicated to mock FileSystem and others).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r367018956
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -131,6 +142,16 @@
     this.splitLookback = options.get("lookback").map(Integer::parseInt).orElse(null);
     this.splitOpenFileCost = options.get("file-open-cost").map(Long::parseLong).orElse(null);
 
+    String scheme = "no_exist";
+    try {
+      FileSystem fs = FileSystem.get(SparkSession.active().sparkContext().hadoopConfiguration());
+      scheme = fs.getScheme().toLowerCase(Locale.ENGLISH);
+    } catch (IOException ioe) {
+      LOG.warn("Failed to get Hadoop Filesystem", ioe);
 
 Review comment:
   That makes sense!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r364109948
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -287,19 +309,21 @@ public String toString() {
     private final Broadcast<FileIO> io;
     private final Broadcast<EncryptionManager> encryptionManager;
     private final boolean caseSensitive;
+    private final boolean enableLocality;
 
 Review comment:
   How about LocalitySensitive or LocalityAwareness? Just a thought from `caseSensitive`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r367011915
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -131,6 +142,16 @@
     this.splitLookback = options.get("lookback").map(Integer::parseInt).orElse(null);
     this.splitOpenFileCost = options.get("file-open-cost").map(Long::parseLong).orElse(null);
 
+    String scheme = "no_exist";
+    try {
+      FileSystem fs = FileSystem.get(SparkSession.active().sparkContext().hadoopConfiguration());
 
 Review comment:
   Should we also validate that the FileIO is a HadoopFileIO?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r370272452
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -131,6 +143,21 @@
     this.splitLookback = options.get("lookback").map(Integer::parseInt).orElse(null);
     this.splitOpenFileCost = options.get("file-open-cost").map(Long::parseLong).orElse(null);
 
+    if (io.getValue() instanceof HadoopFileIO) {
+      String scheme = "no_exist";
+      try {
+        FileSystem fs = new Path(table.location()).getFileSystem(
+            SparkSession.active().sparkContext().hadoopConfiguration());
 
 Review comment:
   This is minor and can be done in a follow-up.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue merged pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r370271209
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -131,6 +143,21 @@
     this.splitLookback = options.get("lookback").map(Integer::parseInt).orElse(null);
     this.splitOpenFileCost = options.get("file-open-cost").map(Long::parseLong).orElse(null);
 
+    if (io.getValue() instanceof HadoopFileIO) {
+      String scheme = "no_exist";
+      try {
+        FileSystem fs = new Path(table.location()).getFileSystem(
+            SparkSession.active().sparkContext().hadoopConfiguration());
 
 Review comment:
   Do we need to get a FileSystem instance to get the scheme of the table location? Seems like we might be able to just parse it as a URI, or use `split(":")[0]` instead. That would avoid getting the active Spark context and would eliminate the need for the try/catch.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r367011590
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -308,6 +332,29 @@ private ReadTask(CombinedScanTask task, String tableSchemaString, String expecte
         encryptionManager.value(), caseSensitive);
     }
 
+    @Override
+    public String[] preferredLocations() {
+      if (!localityPreferred) {
+        return new String[0];
+      }
+
+      Configuration conf = SparkSession.active().sparkContext().hadoopConfiguration();
+      Set<String> locations = Sets.newHashSet();
+      for (FileScanTask f : task.files()) {
+        Path path = new Path(f.file().path().toString());
+        try {
+          FileSystem fs = path.getFileSystem(conf);
+          for (BlockLocation b : fs.getFileBlockLocations(path, f.start(), f.length())) {
+            locations.addAll(Arrays.asList(b.getHosts()));
+          }
+        } catch (IOException ioe) {
+          LOG.warn("Failed to get block locations for path {}", path, ioe);
 
 Review comment:
   I think I agree with the existing warn behavior. If we can't get locality info from the FS it shouldn't fail the job. It is a preference, not a requirement right?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r365977533
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -308,6 +332,29 @@ private ReadTask(CombinedScanTask task, String tableSchemaString, String expecte
         encryptionManager.value(), caseSensitive);
     }
 
+    @Override
+    public String[] preferredLocations() {
+      if (!localityPreferred) {
+        return new String[0];
+      }
+
+      Configuration conf = SparkSession.active().sparkContext().hadoopConfiguration();
+      Set<String> locations = Sets.newHashSet();
+      for (FileScanTask f : task.files()) {
+        Path path = new Path(f.file().path().toString());
+        try {
+          FileSystem fs = path.getFileSystem(conf);
+          for (BlockLocation b : fs.getFileBlockLocations(path, f.start(), f.length())) {
+            locations.addAll(Arrays.asList(b.getHosts()));
+          }
+        } catch (IOException ioe) {
+          LOG.warn("Failed to get block locations for path {}", path, ioe);
 
 Review comment:
   Since `localityPreferred` flag is `true`.  We should throw a `org.apache.iceberg.exceptions.RuntimeIOException` here. 
   Also we can use `Util.getFS` to get the FileSystem class

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] jerryshao commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
jerryshao commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r364208520
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -287,19 +309,21 @@ public String toString() {
     private final Broadcast<FileIO> io;
     private final Broadcast<EncryptionManager> encryptionManager;
     private final boolean caseSensitive;
+    private final boolean enableLocality;
 
 Review comment:
   Let me think a better name, `localityAwareness` seems too long.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r367016233
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -308,6 +332,29 @@ private ReadTask(CombinedScanTask task, String tableSchemaString, String expecte
         encryptionManager.value(), caseSensitive);
     }
 
+    @Override
+    public String[] preferredLocations() {
 
 Review comment:
   In other formats, splits typically fetch block locations as they were accessed. I think there is an assumption that split planning happens in parallel and that working with splits is cheap after that point.
   
   My concern is that this defers fetching block locations until they are accessed, which will make a previously cheap call (retrieve locality from an instance field) into an expensive RPC call (ask the NN). I think it may be better to fetch this information when the `ReadTask` is created into a `transient` field. That way, the expensive part is done in parallel during split planning.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
rdsr commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r365979398
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -131,6 +142,16 @@
     this.splitLookback = options.get("lookback").map(Integer::parseInt).orElse(null);
     this.splitOpenFileCost = options.get("file-open-cost").map(Long::parseLong).orElse(null);
 
+    String scheme = "no_exist";
+    try {
+      FileSystem fs = FileSystem.get(SparkSession.active().sparkContext().hadoopConfiguration());
+      scheme = fs.getScheme().toLowerCase(Locale.ENGLISH);
+    } catch (IOException ioe) {
+      LOG.warn("Failed to get Hadoop Filesystem", ioe);
 
 Review comment:
   Does it make sense to use the method `org.apache.hadoop.fs.FileSystem#getDefaultUri` instead? That does not throw an exception and does not need to get a `FileSystem` class. It gets the information from the `DEFAULT_URI`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r367017892
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -131,6 +142,16 @@
     this.splitLookback = options.get("lookback").map(Integer::parseInt).orElse(null);
     this.splitOpenFileCost = options.get("file-open-cost").map(Long::parseLong).orElse(null);
 
+    String scheme = "no_exist";
+    try {
+      FileSystem fs = FileSystem.get(SparkSession.active().sparkContext().hadoopConfiguration());
+      scheme = fs.getScheme().toLowerCase(Locale.ENGLISH);
+    } catch (IOException ioe) {
+      LOG.warn("Failed to get Hadoop Filesystem", ioe);
 
 Review comment:
   Actually, there's no guarantee that the table uses the default file system. I think this should check the file system of the table's `location` instead.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r370271209
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -131,6 +143,21 @@
     this.splitLookback = options.get("lookback").map(Integer::parseInt).orElse(null);
     this.splitOpenFileCost = options.get("file-open-cost").map(Long::parseLong).orElse(null);
 
+    if (io.getValue() instanceof HadoopFileIO) {
+      String scheme = "no_exist";
+      try {
+        FileSystem fs = new Path(table.location()).getFileSystem(
+            SparkSession.active().sparkContext().hadoopConfiguration());
 
 Review comment:
   Do we need to get a FileSystem instance to get the scheme of the table location? Seems like we might be able to just parse it as a URI, or use `split(":")[0]` instead. That would avoid getting the active Spark context.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] fbocse commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
fbocse commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r363768943
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -308,6 +332,29 @@ private ReadTask(CombinedScanTask task, String tableSchemaString, String expecte
         encryptionManager.value(), caseSensitive);
     }
 
+    @Override
+    public String[] preferredLocations() {
+      if (!enableLocality) {
+        return new String[0];
+      }
+
+      Configuration conf = SparkSession.active().sparkContext().hadoopConfiguration();
+      Set<String> locations = Sets.newHashSet();
+      for (FileScanTask f : task.files()) {
+        Path path = new Path(f.file().path().toString());
+        try {
+          FileSystem fs = path.getFileSystem(conf);
+          for (BlockLocation b : fs.getFileBlockLocations(path, f.start(), f.length())) {
+            locations.addAll(Arrays.asList(b.getHosts()));
 
 Review comment:
   Any chance `getFileBlockLocations ` returns `null` so we'd end up with a NPE for `b.getHosts()`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] jerryshao commented on issue #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
jerryshao commented on issue #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#issuecomment-575964753
 
 
   Thanks all for the feedback, sorry for late response. I will address them soon.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] jerryshao commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
jerryshao commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r368273069
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -131,6 +142,16 @@
     this.splitLookback = options.get("lookback").map(Integer::parseInt).orElse(null);
     this.splitOpenFileCost = options.get("file-open-cost").map(Long::parseLong).orElse(null);
 
+    String scheme = "no_exist";
+    try {
+      FileSystem fs = FileSystem.get(SparkSession.active().sparkContext().hadoopConfiguration());
+      scheme = fs.getScheme().toLowerCase(Locale.ENGLISH);
+    } catch (IOException ioe) {
+      LOG.warn("Failed to get Hadoop Filesystem", ioe);
 
 Review comment:
   Sure, let me change it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [incubator-iceberg] jerryshao commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source

Posted by GitBox <gi...@apache.org>.
jerryshao commented on a change in pull request #721: Add PreferredLocations support for Iceberg Spark Source
URL: https://github.com/apache/incubator-iceberg/pull/721#discussion_r364103673
 
 

 ##########
 File path: spark/src/main/java/org/apache/iceberg/spark/source/Reader.java
 ##########
 @@ -308,6 +332,29 @@ private ReadTask(CombinedScanTask task, String tableSchemaString, String expecte
         encryptionManager.value(), caseSensitive);
     }
 
+    @Override
+    public String[] preferredLocations() {
+      if (!enableLocality) {
+        return new String[0];
+      }
+
+      Configuration conf = SparkSession.active().sparkContext().hadoopConfiguration();
+      Set<String> locations = Sets.newHashSet();
+      for (FileScanTask f : task.files()) {
+        Path path = new Path(f.file().path().toString());
+        try {
+          FileSystem fs = path.getFileSystem(conf);
+          for (BlockLocation b : fs.getFileBlockLocations(path, f.start(), f.length())) {
+            locations.addAll(Arrays.asList(b.getHosts()));
 
 Review comment:
   I checked the Hadoop code, I don't think this will return a null value here, on the contrary, it will return an empty array if there's no hosts.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org