Posted to issues@spark.apache.org by "Neven Jovic (Jira)" <ji...@apache.org> on 2022/02/25 13:46:00 UTC

[jira] [Created] (SPARK-38329) High I/O wait when Spark Structured Streaming checkpoint changed to EFS

Neven Jovic created SPARK-38329:
-----------------------------------

             Summary: High I/O wait when Spark Structured Streaming checkpoint changed to EFS
                 Key: SPARK-38329
                 URL: https://issues.apache.org/jira/browse/SPARK-38329
             Project: Spark
          Issue Type: Question
          Components: EC2, Input/Output, PySpark, Structured Streaming
    Affects Versions: 2.4.6
            Reporter: Neven Jovic
         Attachments: Screenshot from 2022-02-25 14-16-11.png

I'm currently running a Spark Structured Streaming application written in Python (PySpark), where the source is a Kafka topic and the sink is MongoDB. I changed the checkpoint location to Amazon EFS, which is shared across all Spark workers, and since then I/O wait has increased, averaging 8%.

 

!image-2022-02-25-14-42-31-904.png!

Currently 6000 messages arrive in Kafka every second, and every once in a while I get a WARN message:
{quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up files for HDFSStateStoreProvider[id = (op=0,part=90),dir = file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For input string: ""
{quote}
I'm not sure whether that message has anything to do with the high I/O wait. Is this behavior expected, or is it something to be concerned about?
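
As a possible way to narrow down the warning, the sketch below lists files in a state directory whose names would not parse as a numeric version. It rests on an assumption, not on anything confirmed in this report: that the cleanup step reads two-part names of the form <version>.<suffix> (for example 1.delta) and fails when <version> is empty, as it would be for a hidden file such as one left behind by a network file system. The function name suspicious_state_files is hypothetical.

```python
import os


def suspicious_state_files(state_dir):
    """Return names shaped like "<prefix>.<suffix>" whose prefix is not a number."""
    flagged = []
    for name in sorted(os.listdir(state_dir)):
        parts = name.split(".")
        # Assumption: only two-part names are treated as "<version>.<suffix>".
        # An empty or non-numeric prefix would then fail to parse as a version.
        if len(parts) == 2 and not parts[0].isdigit():
            flagged.append(name)
    return flagged
```

Calling suspicious_state_files("/mnt/efs_max_io/spark/state/0/90") on a worker, using the directory from the warning above, would show whether such files exist. An empty result would suggest the warning has a different cause.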
 


