Posted to reviews@spark.apache.org by "ragnarok56 (via GitHub)" <gi...@apache.org> on 2024/03/02 20:25:48 UTC

[PR] [SPARK-44924][SS] Add config for FileStreamSource cached files [spark]

ragnarok56 opened a new pull request, #45362:
URL: https://github.com/apache/spark/pull/45362

   ### What changes were proposed in this pull request?
   This change adds two configuration options, `maxCachedFiles` and `discardCachedFilesRatio`, to the streaming file source. These values were originally introduced in https://github.com/apache/spark/pull/27620 but were hardcoded to 10,000 and 0.2, respectively.
   
   ### Why are the changes needed?
   Under certain workloads with large `maxFilesPerTrigger` settings, the 10,000-file cap on the cached input file list can leave a cluster underutilized and make jobs take longer to finish when each batch is slow to complete. For example, a job with `maxFilesPerTrigger` set to 100,000 would process all 100k files in batch 1, then only 10k in batch 2, yet both batches could take just as long because some of the files cause skewed processing times. The cluster then spends nearly the same amount of time while processing only 1/10 of the files it could have.
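
The batch-size pattern described above can be illustrated with a small toy model. This is only a sketch, not Spark's implementation: the function name `simulate_batch_sizes` and its parameters are hypothetical, and the rule for discarding a small leftover is an assumption about how the ratio is applied (leftover files are dropped when they number fewer than `maxFilesPerTrigger` times the ratio).

```python
def simulate_batch_sizes(total_files, max_files_per_trigger,
                         max_cached_files=10_000, discard_ratio=0.2):
    """Toy model of per-batch file counts; NOT Spark's actual code."""
    remaining = total_files  # files not yet processed
    cached = 0               # unread files currently held in the cache
    sizes = []
    while remaining > 0:
        # Serve the batch from the cache if it is non-empty;
        # otherwise list every remaining file again.
        available = cached if cached > 0 else remaining
        batch = min(available, max_files_per_trigger)
        leftover = available - batch
        remaining -= batch
        # Assumed rule: a leftover below the threshold is discarded,
        # otherwise it is cached up to the configured cap.
        if leftover < max_files_per_trigger * discard_ratio:
            cached = 0
        else:
            cached = min(leftover, max_cached_files)
        sizes.append(batch)
    return sizes


# Hardcoded cap of 10,000: every second batch shrinks to 10k files.
print(simulate_batch_sizes(250_000, 100_000))
# [100000, 10000, 100000, 10000, 30000]

# Cap raised to match maxFilesPerTrigger: batches stay full.
print(simulate_batch_sizes(250_000, 100_000, max_cached_files=100_000))
# [100000, 100000, 50000]
```

Under these assumptions, raising the cache cap to the trigger size removes the undersized batches, which is the effect the new configuration is meant to allow.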
   
   ### Does this PR introduce _any_ user-facing change?
   Updated the documentation for Structured Streaming sources to describe the new configuration options.
   
   ### How was this patch tested?
   New and existing unit tests. 
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

