You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/18 04:52:24 UTC

[GitHub] [iceberg] shardulm94 opened a new pull request #1957: Spark: Print file location in case of error during reads

shardulm94 opened a new pull request #1957:
URL: https://github.com/apache/iceberg/pull/1957


   Make it easier to find out exactly which file was being read when we encountered an error either opening the task or when reading rows from the task
   
   Heres an example of how it looks now
   ```
   Exception: Error reading file: hdfs://namenode/table/path/part-r-00001.orc
   	at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:97)
   	at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:49)
   	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$12$$anon$1.hasNext(WholeStageCodegenExec.scala:634)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
   	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
   	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
   	at org.apache.spark.scheduler.Task.run(Task.scala:109)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:429)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.IllegalArgumentException: Can not promote LONG type to INTEGER
   	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:441)
   	at org.apache.iceberg.orc.ORCSchemaUtil.buildOrcProjection(ORCSchemaUtil.java:301)
   	at org.apache.iceberg.orc.ORCSchemaUtil.buildOrcProjection(ORCSchemaUtil.java:275)
   	at org.apache.iceberg.orc.ORCSchemaUtil.buildOrcProjection(ORCSchemaUtil.java:275)
   	at org.apache.iceberg.orc.ORCSchemaUtil.buildOrcProjection(ORCSchemaUtil.java:258)
   	at org.apache.iceberg.orc.OrcIterable.iterator(OrcIterable.java:97)
   	at org.apache.iceberg.spark.source.RowDataReader.open(RowDataReader.java:100)
   	at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:84)
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue merged pull request #1957: Spark: Print file location in case of error during reads

Posted by GitBox <gi...@apache.org>.

rdblue merged pull request #1957:
URL: https://github.com/apache/iceberg/pull/1957


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1957: Spark: Print file location in case of error during reads

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1957:
URL: https://github.com/apache/iceberg/pull/1957#discussion_r546003880



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/BaseDataReader.java
##########
@@ -77,16 +78,30 @@
   }
 
   public boolean next() throws IOException {
-    while (true) {
-      if (currentIterator.hasNext()) {
-        this.current = currentIterator.next();
-        return true;
-      } else if (tasks.hasNext()) {
-        this.currentIterator.close();
-        this.currentIterator = open(tasks.next());
+    try {
+      while (true) {
+        if (currentIterator.hasNext()) {
+          this.current = currentIterator.next();
+          return true;
+        } else if (tasks.hasNext()) {
+          this.currentIterator.close();
+          this.currentTask = tasks.next();
+          this.currentIterator = open(currentTask);
+        } else {
+          this.currentIterator.close();
+          return false;
+        }
+      }
+    } catch (IOException | RuntimeException e) {
+      if (currentTask == null || currentTask.isDataTask()) {
+        throw e;
       } else {
-        this.currentIterator.close();
-        return false;
+        String message = String.format("Error reading file: %s", getInputFile(currentTask).location());
+        if (e instanceof IOException) {
+          throw new IOException(message, e);
+        } else {
+          throw new RuntimeException(message, e);
+        }

Review comment:
       The main problem I have with this is that it alters the exception by discarding its type. Callers can no longer catch exceptions and handle errors. For example, a caller might catch a `SocketException` and retry, but let an `EOFException` propagate.
   
   Is there another way to attach the data? What about logging the file path and exception instead of modifying the exception?
   
   Another option is to create a new exception and attach it as suppressed, then throw the original unmodified.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] shardulm94 commented on a change in pull request #1957: Spark: Print file location in case of error during reads

Posted by GitBox <gi...@apache.org>.

shardulm94 commented on a change in pull request #1957:
URL: https://github.com/apache/iceberg/pull/1957#discussion_r546077547



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/BaseDataReader.java
##########
@@ -77,16 +78,30 @@
   }
 
   public boolean next() throws IOException {
-    while (true) {
-      if (currentIterator.hasNext()) {
-        this.current = currentIterator.next();
-        return true;
-      } else if (tasks.hasNext()) {
-        this.currentIterator.close();
-        this.currentIterator = open(tasks.next());
+    try {
+      while (true) {
+        if (currentIterator.hasNext()) {
+          this.current = currentIterator.next();
+          return true;
+        } else if (tasks.hasNext()) {
+          this.currentIterator.close();
+          this.currentTask = tasks.next();
+          this.currentIterator = open(currentTask);
+        } else {
+          this.currentIterator.close();
+          return false;
+        }
+      }
+    } catch (IOException | RuntimeException e) {
+      if (currentTask == null || currentTask.isDataTask()) {
+        throw e;
       } else {
-        this.currentIterator.close();
-        return false;
+        String message = String.format("Error reading file: %s", getInputFile(currentTask).location());
+        if (e instanceof IOException) {
+          throw new IOException(message, e);
+        } else {
+          throw new RuntimeException(message, e);
+        }

Review comment:
       Logging the error makes sense! Thanks for the suggestion.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1957: Spark: Print file location in case of error during reads

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1957:
URL: https://github.com/apache/iceberg/pull/1957#discussion_r546290574



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/BaseDataReader.java
##########
@@ -56,6 +56,7 @@
 
   private CloseableIterator<T> currentIterator;
   private T current = null;
+  private FileScanTask currentTask = null;

Review comment:
       Sorry, I missed that this is getting set in `next`. I thought that change was just to indentation. Nevermind!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a change in pull request #1957: Spark: Print file location in case of error during reads

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on a change in pull request #1957:
URL: https://github.com/apache/iceberg/pull/1957#discussion_r546048471



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/BaseDataReader.java
##########
@@ -77,16 +78,30 @@
   }
 
   public boolean next() throws IOException {
-    while (true) {
-      if (currentIterator.hasNext()) {
-        this.current = currentIterator.next();
-        return true;
-      } else if (tasks.hasNext()) {
-        this.currentIterator.close();
-        this.currentIterator = open(tasks.next());
+    try {
+      while (true) {
+        if (currentIterator.hasNext()) {
+          this.current = currentIterator.next();
+          return true;
+        } else if (tasks.hasNext()) {
+          this.currentIterator.close();
+          this.currentTask = tasks.next();
+          this.currentIterator = open(currentTask);
+        } else {
+          this.currentIterator.close();
+          return false;
+        }
+      }
+    } catch (IOException | RuntimeException e) {
+      if (currentTask == null || currentTask.isDataTask()) {
+        throw e;
       } else {
-        this.currentIterator.close();
-        return false;
+        String message = String.format("Error reading file: %s", getInputFile(currentTask).location());
+        if (e instanceof IOException) {
+          throw new IOException(message, e);
+        } else {
+          throw new RuntimeException(message, e);
+        }

Review comment:
       +1 To Logging an error




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1957: Spark: Print file location in case of error during reads

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1957:
URL: https://github.com/apache/iceberg/pull/1957#discussion_r546001602



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/BaseDataReader.java
##########
@@ -56,6 +56,7 @@
 
   private CloseableIterator<T> currentIterator;
   private T current = null;
+  private FileScanTask currentTask = null;

Review comment:
       Did you intend to set this in the constructor?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] shardulm94 commented on a change in pull request #1957: Spark: Print file location in case of error during reads

Posted by GitBox <gi...@apache.org>.

shardulm94 commented on a change in pull request #1957:
URL: https://github.com/apache/iceberg/pull/1957#discussion_r546178998



##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/BaseDataReader.java
##########
@@ -56,6 +56,7 @@
 
   private CloseableIterator<T> currentIterator;
   private T current = null;
+  private FileScanTask currentTask = null;

Review comment:
       Not really, does it matter if it is in the constructor? It is going to be set to null in the constructor anyway. Similar to the `current` variable above it.

##########
File path: spark/src/main/java/org/apache/iceberg/spark/source/BaseDataReader.java
##########
@@ -77,16 +78,30 @@
   }
 
   public boolean next() throws IOException {
-    while (true) {
-      if (currentIterator.hasNext()) {
-        this.current = currentIterator.next();
-        return true;
-      } else if (tasks.hasNext()) {
-        this.currentIterator.close();
-        this.currentIterator = open(tasks.next());
+    try {
+      while (true) {
+        if (currentIterator.hasNext()) {
+          this.current = currentIterator.next();
+          return true;
+        } else if (tasks.hasNext()) {
+          this.currentIterator.close();
+          this.currentTask = tasks.next();
+          this.currentIterator = open(currentTask);
+        } else {
+          this.currentIterator.close();
+          return false;
+        }
+      }
+    } catch (IOException | RuntimeException e) {
+      if (currentTask == null || currentTask.isDataTask()) {
+        throw e;
       } else {
-        this.currentIterator.close();
-        return false;
+        String message = String.format("Error reading file: %s", getInputFile(currentTask).location());
+        if (e instanceof IOException) {
+          throw new IOException(message, e);
+        } else {
+          throw new RuntimeException(message, e);
+        }

Review comment:
       Done




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org