Posted to commits@spark.apache.org by gu...@apache.org on 2021/10/08 00:04:56 UTC

[spark] branch master updated: [SPARK-29871][ML] Catch all exceptions for handling invalid images in image source

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 0ecc71b  [SPARK-29871][ML] Catch all exceptions for handling invalid images in image source
0ecc71b is described below

commit 0ecc71bbf979f13e7260af93c4bffa8c133dc9ea
Author: Hyukjin Kwon <gu...@apache.org>
AuthorDate: Fri Oct 8 09:04:13 2021 +0900

    [SPARK-29871][ML] Catch all exceptions for handling invalid images in image source
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the test failure:
    
    ```
    Running tests...
    ----------------------------------------------------------------------
    test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest) ... ERROR (12.050s)
    
    ======================================================================
    ERROR [12.050s]: test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
    File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests/test_image.py", line 35, in test_read_images
    self.assertEqual(df.count(), 4)
    File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/dataframe.py", line 507, in count
    return int(self._jdf.count())
    File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in _call_
    answer, self.gateway_client, self.target_id, self.name)
    File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py", line 98, in deco
    return f(*a, **kw)
    File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
    py4j.protocol.Py4JJavaError: An error occurred while calling o32.count.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, amp-jenkins-worker-05.amp, executor driver): javax.imageio.IIOException: Unsupported Image Type
    at com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1079)
    at com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1050)
    at javax.imageio.ImageIO.read(ImageIO.java:1448)
    at javax.imageio.ImageIO.read(ImageIO.java:1352)
    ```
    
    This exception apparently occurs when handling malformed images with the `dropInvalid` option set: the `try`/`catch` around `ImageIO.read` does not catch `javax.imageio.IIOException`, because it is not a `RuntimeException`, so an invalid image fails the task instead of being dropped.
    
    In fact, the bytes are already in memory, so no real I/O exception can occur during `ImageIO.read`. Therefore, this PR proposes catching all exceptions when reading an image, so that malformed images are handled properly.
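    
    As a rough illustration of the difference (a minimal sketch, not the patched code itself): `failingRead`, `oldHandler` and `newHandler` below are hypothetical names, and `failingRead` is only a stand-in for `ImageIO.read` hitting a malformed image.
    
    ```scala
    import java.io.IOException
    import javax.imageio.IIOException
    
    object CatchSketch {
      // Hypothetical stand-in for `ImageIO.read` on a malformed image: always throws.
      def failingRead(): AnyRef = throw new IIOException("Unsupported Image Type")
    
      // Handler shape before this change: only RuntimeException is mapped to null.
      def oldHandler(read: () => AnyRef): AnyRef =
        try read() catch { case _: RuntimeException => null }
    
      // Handler shape after this change: any Throwable is mapped to null.
      def newHandler(read: () => AnyRef): AnyRef =
        try read() catch { case _: Throwable => null }
    
      def main(args: Array[String]): Unit = {
        // IIOException extends IOException, so the old handler does not match it.
        val escaped =
          try { oldHandler(() => failingRead()); false }
          catch { case _: IOException => true }
        println(s"old handler lets IIOException escape: $escaped")
        println(s"new handler returns null: ${newHandler(() => failingRead()) == null}")
      }
    }
    ```
    
    Catching `Throwable` is the broadest possible choice and mirrors what this PR does; the sketch does not try to argue for or against narrower alternatives.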
    
    I am not yet sure why the test is flaky rather than consistently failing. However, the fix should be correct.
    
    ### Why are the changes needed?
    
    To fix the flaky tests; see https://github.com/apache/spark/runs/3802639160 for an example failure.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Users will be able to read data containing malformed images when the `dropInvalid` option is enabled, even if `javax.imageio.IIOException` (or another unexpected non-runtime exception) is thrown.
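    
    The effect can be pictured with a JDK-only sketch that keeps decodable inputs and drops the rest. This is an assumption-laden illustration, not how the image source is implemented: `decode` and `tinyPng` are hypothetical helpers, and the inputs are built in memory so no files are needed.
    
    ```scala
    import java.awt.image.BufferedImage
    import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
    import javax.imageio.ImageIO
    
    object DropInvalidSketch {
      // Hypothetical helper: Some(image) when the bytes decode, None otherwise.
      // `ImageIO.read` returns null when no reader recognizes the input, hence Option(...).
      def decode(bytes: Array[Byte]): Option[BufferedImage] =
        try Option(ImageIO.read(new ByteArrayInputStream(bytes)))
        catch { case _: Throwable => None }
    
      // Builds a tiny valid PNG (2x3 pixels) entirely in memory.
      def tinyPng(): Array[Byte] = {
        val out = new ByteArrayOutputStream()
        ImageIO.write(new BufferedImage(2, 3, BufferedImage.TYPE_INT_RGB), "png", out)
        out.toByteArray
      }
    
      def main(args: Array[String]): Unit = {
        val inputs = Seq(tinyPng(), "garbage".getBytes("UTF-8"), Array.emptyByteArray)
        // Invalid inputs are dropped instead of aborting the whole pass.
        val decoded = inputs.flatMap(decode)
        println(s"kept ${decoded.size} of ${inputs.size} inputs")
      }
    }
    ```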
    
    ### How was this patch tested?
    
    Existing unit tests. We should track whether the tests are still flaky.
    
    Closes #34187 from HyukjinKwon/SPARK-29871.
    
    Authored-by: Hyukjin Kwon <gu...@apache.org>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
index 37b7159..242496f 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
@@ -133,9 +133,12 @@ object ImageSchema {
     val img = try {
       ImageIO.read(new ByteArrayInputStream(bytes))
     } catch {
-      // Catch runtime exception because `ImageIO` may throw unexpected `RuntimeException`.
-      // But do not catch the declared `IOException` (regarded as FileSystem failure)
-      case _: RuntimeException => null
+      // Note that:
+      // - At this point, the files are already read from the files as bytes. Therefore,
+      //   no real I/O exceptions are expected.
+      // - `ImageIO.read` can throw `javax.imageio.IIOException` that is technically
+      //   a runtime exception but it inherits IOException.
+      case _: Throwable => null
     }
 
     if (img == null) {
