Posted to reviews@spark.apache.org by "VitoMakarevich (via GitHub)" <gi...@apache.org> on 2023/09/12 17:59:53 UTC

[GitHub] [spark] VitoMakarevich opened a new pull request, #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

VitoMakarevich opened a new pull request, #42893:
URL: https://github.com/apache/spark/pull/42893

   ### What changes were proposed in this pull request?
   The description is listed in [Jira SPARK-44459](https://issues.apache.org/jira/browse/SPARK-44459).
   I faced an issue when running ~90 streaming queries in parallel in one driver. The driver memory footprint (max heap and used heap) kept increasing until it reached the node limit (in my case ~110 GB) and the driver was killed. When I checked the heap dump, I observed hundreds of millions of `java.lang.ref.Finalizer` and `java.util.zip.*` entries. Digging into the code, I found that on Java 8 many classes in `java.util.zip.*` have a non-empty `finalize` method.
   This, in turn, leads to an accumulation of references to objects awaiting GC: only the `Finalizer` thread (**one per JVM**) can run a non-empty `finalize` method, and only after that can the GC release the memory. Worse, many classes in `java.util.zip.*` also hold references to native (C) memory, so more than just heap memory accumulates. I verified the finalization queue was the issue by checking the JMX bean that reports its size.
   I confirmed this comes from the Spark package by narrowing down the problematic part: when I turned off the Spark UI, the problem disappeared. As far as I can tell, the issue comes from `netty` using `java.util.zip` heavily; with a large number of tasks the allocation rate is enormous, and in multithreaded code the single `Finalizer` thread cannot keep up with the growing `Reference` queue.
   There is a [detailed article with figures](https://medium.com/@vitali.makarevich/spark-structured-streaming-and-java-util-zip-and-finalize-method-83181c6bc86f) - let me know if you want me to put it all here.
   I fixed it by running `System.runFinalization()` with logic similar to Spark's existing periodic cleanup; memory consumption normalized and the application can now run for weeks. It would be good for others to get the same fix instead of spending weeks finding the cause.
   If you want, I can add a configuration value controlling whether this logic runs, so existing users can avoid any potential bottleneck.
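   A minimal sketch of the kind of periodic background cleanup described above (the class name and the scheduling period are illustrative, not the actual patch):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch: a daemon thread that periodically drains the finalization queue. */
public class PeriodicFinalizer {
    public static ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "periodic-finalizer");
            t.setDaemon(true); // must not keep the JVM alive on shutdown
            return t;
        });
        // System.runFinalization() blocks until pending finalize() methods have run,
        // analogous to the periodic System.gc() Spark's cleaner already performs.
        scheduler.scheduleWithFixedDelay(
            System::runFinalization, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) {
        ScheduledExecutorService s = start(30);
        s.shutdown();
    }
}
```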
   
   
   ### Why are the changes needed?
   Java 8 environments with similar conditions (e.g. many streaming queries running in parallel with many tasks) should hit the same problem.
   
   
   ### Does this PR introduce _any_ user-facing change?
   There is no user-facing behavior change, but in theory, under some circumstances, clients may see a reduction in performance.
   
   
   ### How was this patch tested?
   I have an application running this in production. I don't expect any existing test to fail, since this logic only improves the cleanup process and is not covered by any existing test. I don't expect performance/benchmarks to degrade (although I admit there may be problems with some combinations of JVM/GC).
   There may be a way to check the finalization queue size from code (similar to the JMX bean I used to catch the issue) to confirm the effect, but it is a very low-level operation with little room to work incorrectly.
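   For reference, the finalization-queue size mentioned above is exposed through the standard `MemoryMXBean`; a small probe (illustrative, not part of the patch) looks like:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

/** Sketch: read the pending-finalization count used to diagnose the issue. */
public class FinalizationQueueProbe {
    public static int pendingCount() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        // Approximate number of objects whose finalize() has not yet run.
        return memory.getObjectPendingFinalizationCount();
    }

    public static void main(String[] args) {
        System.out.println("pending finalizations: " + pendingCount());
        System.runFinalization(); // drain the queue, then re-check
        System.out.println("after runFinalization: " + pendingCount());
    }
}
```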
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] VitoMakarevich commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.
VitoMakarevich commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1725834322

   I would not say it's specific to the deployment. We have a pretty common deployment - AWS EMR on EKS - running on Java 8. What makes the app different from others is running 90 structured streaming queries in parallel, but I can imagine others doing the same. 
   I could make this a library, but don't you think the best place for users to learn about bottlenecks is official documentation rather than Stack Overflow/Medium/Reddit?




[GitHub] [spark] VitoMakarevich commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.
VitoMakarevich commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1718048192

   I tried this setting back then: it was ~200 for a ~40-50 GB heap and it didn't help; then I tried ~1000 with the same heap - same failure; then I extended memory to ~100 GB and tried `1000` - no effect. I was using G1GC.




[GitHub] [spark] VitoMakarevich commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.
VitoMakarevich commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1723627334

   @srowen @LuciferYang 
   Guys, will you continue the discussion? Maybe we can deliver it under a config flag, disabled by default, with some docs around it, so only users who are struggling (like me) will enable it?




[GitHub] [spark] VitoMakarevich commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.
VitoMakarevich commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1718054124

   It's hard to say they were from netty; let me lay out the facts and assumptions I made. Checking the heap dump, I found:
   1. There were hundreds of millions of entries in the finalizer queue / `java.util.zip.*` entries.
   2. It's impossible to find where they come from once they are GCed (moved to the finalizer queue), since the only remaining reference to the `java.util.zip.*` objects comes from the finalizer queue.
   3. When I turned off the Spark UI, the issue disappeared. I played with settings like the number of tasks/stages/jobs and so on, but it didn't help.
   But I did find at least a couple of thousand references to `java.util.zip.*` from `netty` - that's why I assume the finalizer queue is ultimately fed by `netty` - but I may be wrong here. 
   
   




[GitHub] [spark] VitoMakarevich commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.
VitoMakarevich commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1718060859

   I think it's coming from `netty`, but it may be any other package. The fact is that using the `java.util.zip` package on Java 8 means using non-empty `finalize` methods, which were later deprecated; in e.g. Java 11 the package was rewritten with empty `finalize` methods. So whoever uses `java.util.zip` on Java 8 can cause this problem, but it becomes visible only when the JVM cannot keep up with the allocation rate - so probably only in big applications.
   I verified the mechanism when I first saw millions of objects there: I triggered finalization on the running application via the JMX bean, and immediately afterward the queue size was 0 and memory usage dropped by dozens of GBs.
   The second fix I made was a background thread running `System.runFinalization()` at the same frequency as Spark's cleanup - and the driver became stable.




[GitHub] [spark] srowen commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1717687532

   Isn't this just a different way of saying you need different GC settings? You want finalizers to run sooner. I don't think we should really be doing this - even the current call to System.gc() is sketchy. Doesn't that trigger finalizers where possible, though?




[GitHub] [spark] srowen commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1718053663

   You're also saying that this change does resolve it? Because I also wonder if this is just a memory leak from netty or somewhere, in which case finalizers wouldn't help much.




[GitHub] [spark] VitoMakarevich commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.
VitoMakarevich commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1724244869

   Is there any question/example I can answer to convince you? As I understand it, there is no way to reproduce this without provisioning a TB-size cluster, unfortunately.
   And what do you think about adding it under a configuration flag, so it affects only users who choose to enable it?




[GitHub] [spark] mridulm commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "mridulm (via GitHub)" <gi...@apache.org>.
mridulm commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1724562144

   If this is specific to this deployment, as @srowen mentioned, why not do this in user code or a library?
   You can run a thread which periodically does this.
   




[GitHub] [spark] LuciferYang commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1716901262

   I think regularly executing `runFinalization()` may have the following effects on the system:
   
   1. Impact on performance: the overhead of executing `runFinalization()` could affect system performance, especially if a large number of objects define a `finalize` method, since each call takes time to process those objects.
   
   2. Impact on memory usage: normally, GC and the execution of `finalize` methods are managed automatically by the JVM, i.e. they run only when an object is no longer reachable and system resources are tight. Regularly calling `runFinalization()` might trigger extra work when it is not needed, possibly leading to temporary increases in memory usage.
   
   3. Possible change in expected behavior: if the application relies on specific timing for the execution of `finalize` methods, then regularly executing `runFinalization()` could disrupt that expectation.
   
   I have a cautious attitude towards this PR; I would prefer to find the key points and change them to use approaches like `try finally` to clean up resources in a timely manner. Also, the minimum supported Java version for Apache Spark 4.0 will be Java 17 - does this problem still exist there?
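   For concreteness, the `try finally` approach suggested here looks like the sketch below (names are illustrative; note that on Java 8 the `Deflater` object still passes through the `Finalizer` thread, but calling `end()` releases the native memory eagerly so its `finalize` becomes cheap):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

/** Sketch: releasing zip native memory eagerly instead of waiting for finalize(). */
public class EagerZipCleanup {
    public static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.size();
        } finally {
            // Frees the native (C) memory immediately rather than leaving it
            // to the single Finalizer thread.
            deflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] data = "example payload".getBytes();
        System.out.println("compressed to " + compressedSize(data) + " bytes");
    }
}
```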




[GitHub] [spark] VitoMakarevich commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "VitoMakarevich (via GitHub)" <gi...@apache.org>.
VitoMakarevich commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1718041681

   My reasoning was that Spark already calls `System.gc()`, so it should not be a problem to add a similar call next to it.
   As for the comments:
   1. `Impact on performance` - indeed, in some circumstances it may be a problem, but it's better than failing with OOM after millions of objects accumulate. Also, it's done by a single thread, so the blast radius is limited. And it can be added under a feature flag, so users who struggle with a similar problem can enable it (probably a minority of users, but they need a fix as well).
   2. `Impact on memory usage` - you're saying the standard approach usually works, and I'm sure it does, but there are cases where it's not enough. The same can be said about `System.gc()` - usually you should not invoke it since it's costly, yet it's already in the codebase.
   3. `Possible change in expected behavior` - as far as I know, there is no way to define logic that depends on finalization timing; it's JVM-internal, and it's different from `ReferenceQueue` - the mechanism designed for controllable release of resources - which is why `finalization` is considered evil.
   
   As for `try finally` - that is certainly the best way to do it, but as I said, the problem comes from a third-party library (as I understood, `netty`) and the Java implementation, so the pattern cannot be applied there. And even if you surround a block with `try finally`, objects with a non-empty `finalize` still go through the `Finalizer` thread, so there is no way to prevent it on Java 8.
   As for the Spark upgrade - you know better; maybe it's not a problem in the long term, but as usual with Java, many users will live on Java 8 for years, as they do now.
   As for GC settings - it may look strange, but I did not find `any` mechanism to control the finalization queue in `any` GC. If you know of one, it's certainly up to users to tune it for their application. Another argument for adding this call: while users can tune GC, there is still a `System.gc()` in the Spark codebase.




[GitHub] [spark] srowen commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1723630406

   I'm just not convinced this is a good way to address this problem, which seems somehow specific to your setup or usage, and which might have side effects. It's not crazy, but it feels hacky.




Re: [PR] [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1871639150

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




Re: [PR] [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup
URL: https://github.com/apache/spark/pull/42893




[GitHub] [spark] srowen commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1718045609

   I see, you're saying the finalizers are coming from netty, and they're contributing to OOM.
   I'm curious if something like `-XX:MaxGCPauseMillis` helps your case - turn it down a lot.
   That, too, has perf implications of course.
   
   Do you generally have low memory available to workers / the driver? I'm trying to figure out why this doesn't seem to arise regularly before we try to make a relatively aggressive change here. Running the finalizer to _clear memory_ doesn't quite feel right.




[GitHub] [spark] HyukjinKwon commented on pull request #42893: [SPARK-44459][Core] Add System.runFinalization() to periodic cleanup

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #42893:
URL: https://github.com/apache/spark/pull/42893#issuecomment-1716868766

   cc @rednaxelafx FYI

