Posted to commits@spark.apache.org by gu...@apache.org on 2021/10/08 00:04:56 UTC
[spark] branch master updated: [SPARK-29871][ML] Catch all
exceptions for handling invalid images in image source
This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 0ecc71b [SPARK-29871][ML] Catch all exceptions for handling invalid images in image source
0ecc71b is described below
commit 0ecc71bbf979f13e7260af93c4bffa8c133dc9ea
Author: Hyukjin Kwon <gu...@apache.org>
AuthorDate: Fri Oct 8 09:04:13 2021 +0900
[SPARK-29871][ML] Catch all exceptions for handling invalid images in image source
### What changes were proposed in this pull request?
This PR fixes the test failure:
```
Running tests...
----------------------------------------------------------------------
test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest) ... ERROR (12.050s)
======================================================================
ERROR [12.050s]: test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests/test_image.py", line 35, in test_read_images
    self.assertEqual(df.count(), 4)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/dataframe.py", line 507, in count
    return int(self._jdf.count())
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py", line 98, in deco
    return f(*a, **kw)
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o32.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, amp-jenkins-worker-05.amp, executor driver): javax.imageio.IIOException: Unsupported Image Type
    at com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1079)
    at com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1050)
    at javax.imageio.ImageIO.read(ImageIO.java:1448)
    at javax.imageio.ImageIO.read(ImageIO.java:1352)
```
This exception apparently occurs when handling malformed images with the `dropInvalid` option set: the `catch` clause around `ImageIO.read` only handles `RuntimeException`, so it misses the `javax.imageio.IIOException` thrown for an invalid image, which is not a `RuntimeException`.
In fact, the bytes are already in memory at this point, so a real I/O exception would not happen during `ImageIO.read`. Therefore, this PR proposes to catch all exceptions when reading an image so that malformed images are handled properly.
I am not yet sure why the test is flaky rather than consistently failing. However, the fix should be correct.
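The difference between the two handlers can be sketched in isolation. The snippet below is only an illustration, not the code in `ImageSchema.scala`: `DecodeSketch`, `decodeOld`, `decodeNew`, and `failingRead` are hypothetical names, and the `IIOException` is thrown by hand to stand in for a reader that rejects an image.

```scala
import java.awt.image.BufferedImage
import java.io.{ByteArrayInputStream, IOException}
import javax.imageio.{IIOException, ImageIO}

// Hypothetical helper object, for illustration only.
object DecodeSketch {
  // Pre-patch style handler: only RuntimeException is mapped to "invalid image".
  def decodeOld(read: => BufferedImage): Option[BufferedImage] =
    try Option(read) catch { case _: RuntimeException => None }

  // Post-patch style handler: every Throwable is mapped to "invalid image".
  def decodeNew(read: => BufferedImage): Option[BufferedImage] =
    try Option(read) catch { case _: Throwable => None }

  // Stand-in for a reader that rejects an image (assumption: thrown manually here).
  def failingRead: BufferedImage = throw new IIOException("Unsupported Image Type")

  def main(args: Array[String]): Unit = {
    val e: Throwable = new IIOException("Unsupported Image Type")
    // IIOException is a subclass of IOException, not of RuntimeException.
    println(s"is IOException: ${e.isInstanceOf[IOException]}")
    println(s"is RuntimeException: ${e.isInstanceOf[RuntimeException]}")

    // The old handler lets the exception escape to the caller.
    val escaped =
      try { decodeOld(failingRead); false }
      catch { case _: IIOException => true }
    println(s"escaped old handler: $escaped")

    // The new handler turns it into an empty result.
    println(s"new handler result empty: ${decodeNew(failingRead).isEmpty}")

    // Bytes that are not an image also end up as an empty result.
    val garbage = "not an image".getBytes("UTF-8")
    println(decodeNew(ImageIO.read(new ByteArrayInputStream(garbage))).isEmpty)
  }
}
```

Because the input is an in-memory byte array, treating every failure as an invalid image does not hide a file system error, which is the reasoning given above.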
### Why are the changes needed?
To fix the flaky tests, see https://github.com/apache/spark/runs/3802639160 as an example.
### Does this PR introduce _any_ user-facing change?
With the `dropInvalid` option enabled, users would be able to read malformed data even when `javax.imageio.IIOException` (or another unexpected non-runtime exception) is thrown.
### How was this patch tested?
Existing unit tests. We should track whether the tests are still flaky.
Closes #34187 from HyukjinKwon/SPARK-29871.
Authored-by: Hyukjin Kwon <gu...@apache.org>
Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
index 37b7159..242496f 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
@@ -133,9 +133,12 @@ object ImageSchema {
     val img = try {
       ImageIO.read(new ByteArrayInputStream(bytes))
     } catch {
-      // Catch runtime exception because `ImageIO` may throw unexpected `RuntimeException`.
-      // But do not catch the declared `IOException` (regarded as FileSystem failure)
-      case _: RuntimeException => null
+      // Note that:
+      // - At this point, the files are already read from the files as bytes. Therefore,
+      // no real I/O exceptions are expected.
+      // - `ImageIO.read` can throw `javax.imageio.IIOException` that is technically
+      // a runtime exception but it inherits IOException.
+      case _: Throwable => null
     }
 
     if (img == null) {