Posted to reviews@spark.apache.org by nfergu <gi...@git.apache.org> on 2014/09/18 00:10:04 UTC
[GitHub] spark pull request: [SPARK-3051] Support looking-up named accumula...
GitHub user nfergu opened a pull request:
https://github.com/apache/spark/pull/2438
[SPARK-3051] Support looking-up named accumulators in a registry
This proposal builds on SPARK-2380 (Support displaying accumulator values in the web UI) to allow
named accumulables to be looked up in a "registry", rather than having to be passed to every
method that needs to access them. See the JIRA ticket (SPARK-3051) for more details, including
an example use case.
An AccumulableRegistry object is provided to allow accumulables to be looked up by name at task
execution time. This requires named accumulables to be broadcast to the executors before the task
is executed. This is taken care of in the DAGScheduler in much the same way as for Tasks.
Accumulables were already stored in thread-local variables in the Accumulators object, so exposing
these in the registry was simply a matter of wrapping this object, and keying the accumulables by
name (they were previously keyed only by ID).
Note that Accumulables cannot be looked up from the registry in the driver program; they can only be
obtained while an operation is being performed on an RDD (a task is executing).
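To make the intended usage concrete, a sketch of how a task might look up a named accumulator by name (the `AccumulableRegistry.get` call below is assumed from this proposal, not part of released Spark, and the exact signature may differ):

```scala
// Sketch only: assumes the AccumulableRegistry proposed in this PR.
// Driver side: create a named accumulator as usual.
val errorCount = sc.accumulator(0, "errorCount")

rdd.foreach { record =>
  // Executor side: look the accumulable up by name instead of
  // closing over it. Only valid while a task is executing;
  // in the driver program the lookup would fail.
  val acc = AccumulableRegistry.get("errorCount")
  // ... update acc as needed ...
}
```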
One important implementation detail: we must deserialize any named accumulators that have been
broadcast before those that are explicitly passed with the task, because we want the
explicitly-passed ones to override the broadcast ones (otherwise the explicitly-passed ones would
not work). This may be a little brittle, but there are tests (in AccumulatorSuite) that will break
if this is ever changed.
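The ordering requirement boils down to last-registration-wins semantics when keying accumulables by name. A minimal, Spark-free sketch of why the broadcast copies must be registered first (the map values here are just stand-ins for accumulable instances):

```scala
// Registering accumulables by name into a map: a later entry for the same
// name replaces an earlier one. Deserializing (and registering) the
// broadcast copies first therefore lets the explicitly-passed copies,
// registered second, take precedence.
val fromBroadcast = Map("counter" -> "broadcast copy")
val fromTask      = Map("counter" -> "task copy")

// In Scala, `++` keeps the right-hand operand's entry on a key collision.
val registry = fromBroadcast ++ fromTask
println(registry("counter")) // task copy
```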
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nfergu/spark accumreg
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2438.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2438
----
commit 7fae6af05d0ca91ac5444161a1e16b716ac3af83
Author: Neil Ferguson <ne...@nice.com>
Date: 2014-08-14T22:37:31Z
First attempt at allowing named accumulators to be looked-up in an AccumulableRegistry. Needs clean-up, testing, and documentation.
Conflicts:
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
commit 7ae16d69c30baad4609c226c79625958170198be
Author: Neil Ferguson <ne...@nice.com>
Date: 2014-09-16T21:47:26Z
Added documentation, testing, and some fixes for functionality to allow named accumulators to be looked-up in an AccumulableRegistry.
Conflicts:
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
commit a89c51daeb3f172d2a3116afff48dc034e4d50c9
Author: Neil Ferguson <ne...@nice.com>
Date: 2014-09-17T21:53:13Z
Small documentation tweaks for AccumulableRegistry functionality
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-3051] Support looking-up named accumula...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2438#issuecomment-55968236
Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-3051] Support looking-up named accumula...
Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/2438#issuecomment-75152454
Hi @pwendell @nfergu this seems to be a fairly old PR that hasn't had any attention for a while. What is the state of the changes here? Are they still relevant in master?
[GitHub] spark pull request: [SPARK-3051] Support looking-up named accumula...
Posted by nfergu <gi...@git.apache.org>.
Github user nfergu commented on the pull request:
https://github.com/apache/spark/pull/2438#issuecomment-62641279
Hi @pwendell
I can't think of any obvious way to implement this in user-space without wrapping each RDD operation. In fact, this is the approach that I'm taking at the moment. I'm wrapping each RDD operation in something like the following:
```scala
def myMap[U: ClassTag](f: T => U): RDD[U] = {
  val registry = Metrics.Accumulable.value
  self.map((t: T) => {
    // Get the value that we obtained from thread-local storage in the driver,
    // and put it back into thread-local storage while executing this function
    Metrics.Accumulable.withValue(registry) {
      f(t)
    }
  })
}
```
Where `Metrics.Accumulable` is a DynamicVariable (ThreadLocal).
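For reference, `scala.util.DynamicVariable` wraps an `InheritableThreadLocal` and rebinds its value for the dynamic extent of a block. A minimal sketch of the pattern (the `Metrics` object here is hypothetical, mirroring the one mentioned above):

```scala
import scala.util.DynamicVariable

object Metrics {
  // Hypothetical holder mirroring Metrics.Accumulable above; here the
  // registry is just a Map from accumulator name to value.
  val Accumulable = new DynamicVariable[Map[String, Int]](Map.empty)
}

// withValue rebinds the variable only while the block executes,
// restoring the previous binding afterwards.
Metrics.Accumulable.withValue(Map("records" -> 1)) {
  println(Metrics.Accumulable.value("records")) // 1
}
println(Metrics.Accumulable.value.isEmpty) // true: old binding restored
```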
The main problem with this is, of course, that it involves wrapping every RDD operation. Can you think of a nicer way to implement this in user-space?
There are a couple of other minor problems with this approach: (1) it messes up the call site, but that would be easily solved by a small modification to `Utils.getCallSite`; (2) it doesn't work for operations like `saveAsNewAPIHadoopFile`, where no function gets passed.
One option, if we want to go ahead with this patch, would be to add a flag to Accumulables that determines whether they get broadcast. Having said that, my need for this patch is not as critical as it was, now that I'm taking the approach of wrapping each RDD operation as described above.
[GitHub] spark pull request: [SPARK-3051] Support looking-up named accumula...
Posted by nfergu <gi...@git.apache.org>.
Github user nfergu closed the pull request at:
https://github.com/apache/spark/pull/2438
[GitHub] spark pull request: [SPARK-3051] Support looking-up named accumula...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/2438#issuecomment-62332004
Hey @nfergu - I was looking for any older PRs that have fallen through the cracks and came across this. This is a very well written patch - kudos! When I suggested this registry concept initially, I was actually envisioning it happening in user-space rather than in Spark itself. I think automatically broadcasting all named accumulators is not going to work, because some applications create thousands of accumulators (e.g. streaming applications), and it could cause an unexpected performance regression.
For some applications this might be acceptable though. How hard would it be for a user-space library to implement this rather than having it be inside of Spark proper?
[GitHub] spark pull request: [SPARK-3051] Support looking-up named accumula...
Posted by nfergu <gi...@git.apache.org>.
Github user nfergu commented on the pull request:
https://github.com/apache/spark/pull/2438#issuecomment-75333522
@andrewor14 - I did this originally to enable instrumentation to be added to Spark, but I've ended up taking a different approach. The only thing this PR would help with at this point is instrumenting loading from files (which I'm unable to do with my current approach), but it's probably not worth making such a big change for that. So it's probably best if I close this PR.