You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/02/10 01:57:06 UTC

[GitHub] [spark] gerashegalov opened a new pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

gerashegalov opened a new pull request #31540:
URL: https://github.com/apache/spark/pull/31540


   This PR is a fix for the JLS 17.5.3 violation identified in
   @zsxwing's 19/Feb/19 11:47 comment on the JIRA.
   
   ### What changes were proposed in this pull request?
   - Use a var field to hold the state of the collection accumulator 
   
   ### Why are the changes needed?
   AccumulatorV2 auto-registration of accumulator during readObject doesn't work with final fields that are post-processed outside readObject. As it stands incompletely initialized objects are published to heartbeat thread. This leads to sporadic exceptions knocking out executors which increases the cost of the jobs. We observe such failures multiple times every hour.
   
   ### Does this PR introduce _any_ user-facing change?
   None
   
   ### How was this patch tested?
   - this is a concurrency bug that is almost impossible to reproduce as a quick unit test. 
   - By trial and error I crafted a command https://github.com/NVIDIA/spark-rapids/pull/1688 that reproduces the issue on my dev box several times per hour, with the first occurrence often within a few minutes    
   - existing unit tests in *`AccumulatorV2Suite` and *`LiveEntitySuite`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen closed pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

srowen closed pull request #31540:
URL: https://github.com/apache/spark/pull/31540


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gerashegalov commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

gerashegalov commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-777098664


   Thanks for taking a look @mridulm 
   
   correct, this patch does not address the problem in general, it just mitigates it for `CollectionAccumulator` and its subclass `PythonAccumulatorV2`. Any solution relying on readObject to publish the object including the one you propose will publish the object too early.
   
   A generally correct solution that will work for user-defined Accumulators as well must not rely on readObject imo, and it is more involved.
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-776392734


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-776392734


   Can one of the admins verify this patch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-782705679


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39894/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-782695713


   **[Test build #135314 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135314/testReport)** for PR 31540 at commit [`c9918ab`](https://github.com/apache/spark/commit/c9918ab29d570eb098aba4104c22945e29944a01).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-782725442


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135314/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

srowen commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-782681238






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gerashegalov commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

gerashegalov commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-837830422


   @zhengruifeng can you provide a minimum code reproducing for NPEs you are observing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-782710527


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39894/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-837702523


   > This does not necessarily solve the issue that @zsxwing detailed - the issue here is `registerAccumulator` should not be called in `readObject` before subclasses have completed readObject.
   > 
   > One possible solution would be to introduce two methods.
   > 
   > a) A protected method `doHandleDriverSideAccumulator()` in `AccumulatorV2` - which has all the code after `defaultReadObject` in readObject.
   > b) Call `handleDriverSideAccumulator` after `defaultReadObject` in `AccumulatorV2`. In `AccumulatorV2`, this protected method will simply delegate to `doHandleDriverSideAccumulator`.
   > c) In subclasses with local state, override `doHandleDriverSideAccumulator` to make it do nothing - and after readObject in subclass is done, invoke `doHandleDriverSideAccumulator`
   > 
   > This will ensure AccumulatorV2 and subclasses will register only after state has been initialized.
   > (Rough sketch, please change logic/names/etc as relevant).
   > 
   > Note, there are other accumulators with local state; we should do this for all.
   > Thoughts ?
   
   +1
   
   I recently impl some accv2 (some complex statistics containing transient lazy vars and using collections like openhashmap/array/etc) in my work, there are lots of NPE which make task probablly fail. I has tried the method like this PR, but it do not help evidently.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-782725442


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/135314/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #31540:
URL: https://github.com/apache/spark/pull/31540#discussion_r578541850



##########
File path: core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala
##########
@@ -449,39 +449,46 @@ class DoubleAccumulator extends AccumulatorV2[jl.Double, jl.Double] {
  * @since 2.0.0
  */
 class CollectionAccumulator[T] extends AccumulatorV2[T, java.util.List[T]] {
-  private val _list: java.util.List[T] = Collections.synchronizedList(new ArrayList[T]())
+  private var _list: java.util.List[T] = _

Review comment:
       Does a `lazy val` work here? not sure what it generates under the hood or whether it would also fix the issue. It would be simpler than handling init lazily, but this is no big deal.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-782716352


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39894/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-782695713


   **[Test build #135314 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135314/testReport)** for PR 31540 at commit [`c9918ab`](https://github.com/apache/spark/commit/c9918ab29d570eb098aba4104c22945e29944a01).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on a change in pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

srowen commented on a change in pull request #31540:
URL: https://github.com/apache/spark/pull/31540#discussion_r579662690



##########
File path: core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala
##########
@@ -449,39 +449,46 @@ class DoubleAccumulator extends AccumulatorV2[jl.Double, jl.Double] {
  * @since 2.0.0
  */
 class CollectionAccumulator[T] extends AccumulatorV2[T, java.util.List[T]] {
-  private val _list: java.util.List[T] = Collections.synchronizedList(new ArrayList[T]())
+  private var _list: java.util.List[T] = _
+
+  private def getOrCreate = {
+    _list = Option(_list).getOrElse(new java.util.ArrayList[T]())
+    _list
+  }
 
   /**
    * Returns false if this accumulator instance has any values in it.
    */
-  override def isZero: Boolean = _list.isEmpty
+  override def isZero: Boolean = this.synchronized(getOrCreate.isEmpty)
 
   override def copyAndReset(): CollectionAccumulator[T] = new CollectionAccumulator
 
   override def copy(): CollectionAccumulator[T] = {
     val newAcc = new CollectionAccumulator[T]
-    _list.synchronized {
-      newAcc._list.addAll(_list)
+    this.synchronized {

Review comment:
       Obviously we can't sync on _list anymore. This changes to sync on the object itself, which should be fine AFAIK - I don't think anything else depends on the lock of the accumulator itself and it's only manipulating its own state while holding the lock.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] srowen commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

srowen commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-782786865


   Merged to master/3.1


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mridulm edited a comment on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

mridulm edited a comment on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-776442655

This does not necessarily solve the issue that @zsxwing detailed - the issue here is `registerAccumulator` should not be called in `readObject` before subclasses have completed readObject.

One possible solution would be to introduce two methods.

a) A protected method `doHandleDriverSideAccumulator()` in `AccumulatorV2` - which has all the code after `defaultReadObject` in readObject.
b) Call `handleDriverSideAccumulator` after `defaultReadObject` in `AccumulatorV2`. In `AccumulatorV2`, this protected method will simply delegate to `doHandleDriverSideAccumulator`.
c) In subclasses with local state, override `doHandleDriverSideAccumulator` to make it do nothing - and after readObject in subclass is done, invoke `doHandleDriverSideAccumulator`

This will ensure AccumulatorV2 and subclasses will register only after state has been initialized.
(Rough sketch, please change logic/names/etc as relevant).

Thoughts ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-782718561


   **[Test build #135314 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/135314/testReport)** for PR 31540 at commit [`c9918ab`](https://github.com/apache/spark/commit/c9918ab29d570eb098aba4104c22945e29944a01).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mridulm edited a comment on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

mridulm edited a comment on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-776442655

This does not necessarily solve the issue that @zsxwing detailed - the issue here is `registerAccumulator` should not be called in `readObject` before subclasses have completed readObject.

One possible solution would be to introduce two methods.

This will ensure AccumulatorV2 and subclasses will register only after state has been initialized.
(Rough sketch, please change logic/names/etc as relevant).

Note, there are other accumulators with local state; we should do this for all.
Thoughts ?

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gerashegalov commented on a change in pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

gerashegalov commented on a change in pull request #31540:
URL: https://github.com/apache/spark/pull/31540#discussion_r578782284



##########
File path: core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala
##########
@@ -449,39 +449,46 @@ class DoubleAccumulator extends AccumulatorV2[jl.Double, jl.Double] {
  * @since 2.0.0
  */
 class CollectionAccumulator[T] extends AccumulatorV2[T, java.util.List[T]] {
-  private val _list: java.util.List[T] = Collections.synchronizedList(new ArrayList[T]())
+  private var _list: java.util.List[T] = _

Review comment:
       `lazy val` was the first thing I tried before embarking on this PR. It exacerbates the situation by adding another `final` field for checking set/not set as another source of NPE that I saw with my repro.
   
   We can use the code from the [filed modifiers java tutorial](https://docs.oracle.com/javase/tutorial/reflect/member/fieldModifiers.html) to demonstrate the issue:
   
   Before the PR one final field:
   ```bash
   $ ./bin/spark-submit ~/gits/FieldModifierSpy/fieldmodifier.jar org.apache.spark.util.CollectionAccumulator final
   Fields in Class 'org.apache.spark.util.CollectionAccumulator' containing modifiers:  final
   _list    [ synthetic=false enum_constant=false ]
   ```
   
   If we replace _list to be a `lazy val` we'll have 
   ```
   Fields in Class 'org.apache.spark.util.CollectionAccumulator' containing modifiers:  
   _list    [ synthetic=false enum_constant=false ]
   bitmap$0 [ synthetic=false enum_constant=false ]
   ```
   
   In the PR branch there are no final fields:
   ```
   $ ./bin/spark-submit ~/gits/FieldModifierSpy/fieldmodifier.jar org.apache.spark.util.CollectionAccumulator final
   Fields in Class 'org.apache.spark.util.CollectionAccumulator' containing modifiers:  final
   No matching fields
   ```
    




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-782716352


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/39894/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] mridulm commented on pull request #31540: [SPARK-20977][CORE] Use a non-final field for the state of CollectionAccumulator

Posted by GitBox <gi...@apache.org>.

mridulm commented on pull request #31540:
URL: https://github.com/apache/spark/pull/31540#issuecomment-776442655

This does not necessarily solve the issue that @zsxwing detailed - the issue here is `registerAccumulator` should not be called in `readObject` before subclasses have completed readObject.

One possible solution would be to introduce two methods.

a) A protected method `doHandleDriverSideAccumulator()` - which has all the code after `defaultReadObject` in readObject.
b) Call `handleDriverSideAccumulator` after `defaultReadObject` in `AccumulatorV2`. In `AccumulatorV2`, this protected method will simply delegate to `doHandleDriverSideAccumulator`.
c) In subclasses with local state, override `doHandleDriverSideAccumulator` to make it do nothing - and after readObject in subclass is done, invoke `doHandleDriverSideAccumulator`

This will ensure AccumulatorV2 and subclasses will register only after state has been initialized.
(Rough sketch, please change logic/names/etc as relevant).

Thoughts ?

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org