Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2020/01/09 03:03:00 UTC

[jira] [Updated] (SPARK-30462) Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects

     [ https://issues.apache.org/jira/browse/SPARK-30462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim updated SPARK-30462:
---------------------------------
    Affects Version/s: 3.0.0

> Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-30462
>                 URL: https://issues.apache.org/jira/browse/SPARK-30462
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.3, 2.4.4, 3.0.0
>            Reporter: Vladimir Yankov
>            Priority: Critical
>
> Hi,
> With the current implementation of Spark Structured Streaming, it does not seem possible to keep a stream running continuously and writing millions of files without increasing the Spark driver's memory to dozens of GB.
> In our scenario we are using Spark Structured Streaming to consume messages from a Kafka cluster, transform them, and write them as compressed Parquet files to an S3 object store service.
> Every 30 seconds a new micro-batch writes hundreds of objects, which over time adds up to millions of objects in S3 (a minimal sketch of such a query is included at the end of this description).
> As all written objects are recorded in the _spark_metadata log, the compact files there grow to gigabytes, which eventually fills up the Spark driver's memory and leads to OOM errors.
> We need a way to configure Spark Structured Streaming to run without loading all of the historically accumulated metadata into memory.
> Regularly resetting the _spark_metadata and checkpoint folders is not an option in our use case, as we rely on the information in _spark_metadata as a register of the written objects for faster querying and search.
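> For illustration, here is a minimal sketch of the kind of query we run (broker, topic, bucket, and application names below are placeholders, not our actual configuration):
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.streaming.Trigger
>
> val spark = SparkSession.builder().appName("kafka-to-parquet").getOrCreate()
>
> // Consume messages from a Kafka cluster.
> val input = spark.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", "kafka-broker:9092") // placeholder broker
>   .option("subscribe", "events")                          // placeholder topic
>   .load()
>
> // Transform: shown here as just casting the payload to a string;
> // the real job applies more involved transformations.
> val transformed = input.selectExpr("CAST(value AS STRING) AS value")
>
> // Write compressed Parquet files to S3 in 30-second micro-batches.
> val query = transformed.writeStream
>   .format("parquet")
>   .option("path", "s3a://our-bucket/output")              // placeholder bucket
>   .option("checkpointLocation", "s3a://our-bucket/checkpoints/kafka-to-parquet")
>   .option("compression", "snappy")
>   .trigger(Trigger.ProcessingTime("30 seconds"))
>   .start()
>
> query.awaitTermination()
> {code}
> As far as we can tell, the file sink log's cleanup settings (spark.sql.streaming.fileSink.log.deletion and spark.sql.streaming.fileSink.log.cleanupDelay) only remove superseded delta files; each periodic .compact file still carries every entry ever written, so these settings do not bound the metadata the driver has to load.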



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org