Posted to reviews@spark.apache.org by witgo <gi...@git.apache.org> on 2014/12/19 14:53:32 UTC
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
GitHub user witgo opened a pull request:
https://github.com/apache/spark/pull/3744
[SPARK-4902][CORE] gap-sampling performance optimization
jira: [SPARK-4902](https://issues.apache.org/jira/browse/SPARK-4902)
cc @mengxr
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/witgo/spark SPARK-4902
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3744.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3744
----
commit f9dfad2410f0efd1d3b1d360da6bbe1f31507dae
Author: GuoQiang Li <wi...@qq.com>
Date: 2014-12-19T13:50:08Z
gap-sampling performance optimization
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-95059225
[Test build #30736 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30736/consoleFull) for PR 3744 at commit [`d18f877`](https://github.com/apache/spark/commit/d18f877d3757c78eede30b685d749729ac228960).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-95255789
[Test build #30767 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30767/consoleFull) for PR 3744 at commit [`c66115c`](https://github.com/apache/spark/commit/c66115cc5ff0a5210f6e08a050e4a6c58495fd02).
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-67651107
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24644/
Test PASSed.
[GitHub] spark pull request: [WIP][SPARK-4902][CORE] gap-sampling performan...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-75561628
@witgo is this still live and have you followed up on Xiangrui's comment?
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-67651092
[Test build #24644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24644/consoleFull) for PR 3744 at commit [`f9dfad2`](https://github.com/apache/spark/commit/f9dfad2410f0efd1d3b1d360da6bbe1f31507dae).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-95059230
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30736/
Test FAILed.
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-168112410
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-76143785
```scala
test("bernoulli sampling benchmark") {
  // Times one full pass of BernoulliSampler over the iterator produced by `items`.
  class BernoulliSamplerBenchmark(val fraction: Double, items: () => Iterator[Int])
    extends scala.testing.Benchmark {
    override def run(): Unit = {
      val sampler = new BernoulliSampler[Int](fraction)
      // Calling size forces the sampled iterator to be fully consumed.
      sampler.sample(items()).size
    }
  }

  val context = new org.apache.spark.TaskContextImpl(0, 0, 0, 0)
  val len = 1e6.toInt
  val noTimes = 1000
  val array = (1 to len).toArray

  // Hand-written counting iterator that yields 1 to len.
  val plainIter: () => Iterator[Int] = () => new Iterator[Int] {
    var i = 0
    override def hasNext: Boolean = i < len
    override def next(): Int = {
      i += 1
      i
    }
  }

  // Array iterator wrapped in an InterruptibleIterator.
  val interruptibleIter: () => Iterator[Int] =
    () => new org.apache.spark.InterruptibleIterator(context, array.iterator)

  // Average time per run over noTimes runs, as reported by scala.testing.Benchmark.
  def averageTime(fraction: Double, items: () => Iterator[Int]): Double =
    new BernoulliSamplerBenchmark(fraction, items).runBenchmark(noTimes).sum.toDouble / noTimes

  for (fraction <- Seq(0.2, 0.05, 0.01)) {
    val generalTime = averageTime(fraction, plainIter)
    println(s"general sampling fraction=$fraction len=$len use time: $generalTime Ms")
    val gapTime = averageTime(fraction, interruptibleIter)
    println(s"gap sampling fraction=$fraction len=$len use time: $gapTime Ms")
  }
}
```
Output:
```
general sampling fraction=0.2 len=1000000 use time: 14.562 Ms
gap sampling fraction=0.2 len=1000000 use time: 16.352 Ms
general sampling fraction=0.05 len=1000000 use time: 5.408 Ms
gap sampling fraction=0.05 len=1000000 use time: 4.251 Ms
general sampling fraction=0.01 len=1000000 use time: 7.528 Ms
gap sampling fraction=0.01 len=1000000 use time: 1.009 Ms
```
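For readers unfamiliar with the technique named in the PR title, here is a minimal sketch of gap sampling next to plain per-element Bernoulli sampling. It is an illustration only, not the code in this patch or in Spark's `BernoulliSampler`: the object `GapSamplingSketch` and the methods `bernoulliSample` and `gapSample` are hypothetical names made up for this example. The idea is that instead of one random draw per input element, the sampler draws the number of elements to skip from a geometric distribution, so the number of draws scales with the number of accepted elements. That would be consistent with the gap path doing better at small fractions, although the numbers above also depend on the iterator type used in each run.

```scala
import scala.util.Random

// Hypothetical names for illustration; not Spark's implementation.
object GapSamplingSketch {

  // Per-element Bernoulli sampling: one random draw for every input element.
  def bernoulliSample[T](items: Iterator[T], fraction: Double, rng: Random): Iterator[T] =
    items.filter(_ => rng.nextDouble() < fraction)

  // Gap sampling: draw how many elements to skip before the next accepted one,
  // so only about fraction * n random draws are needed for n input elements.
  def gapSample[T](items: Iterator[T], fraction: Double, rng: Random): Iterator[T] = {
    require(fraction > 0.0 && fraction < 1.0, "fraction must be in (0, 1)")
    val lnQ = math.log1p(-fraction) // ln(1 - fraction), always negative here

    new Iterator[T] {
      // Skip a geometrically distributed number of rejected elements.
      private def advance(): Unit = {
        val u = 1.0 - rng.nextDouble()        // uniform in (0, 1]
        var gap = (math.log(u) / lnQ).toInt   // P(gap = k) = (1 - f)^k * f
        while (gap > 0 && items.hasNext) {
          items.next()
          gap -= 1
        }
      }

      advance() // position the underlying iterator on the first accepted element

      override def hasNext: Boolean = items.hasNext

      override def next(): T = {
        val item = items.next()
        advance()
        item
      }
    }
  }
}
```

With a seeded generator, `GapSamplingSketch.gapSample((1 to 1000000).iterator, 0.01, new Random(42)).size` should come out close to 10000, the same expected sample size as `bernoulliSample` with the same fraction.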
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-136917681
@mengxr @srowen any updates on this one?
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-95278287
[Test build #30767 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30767/consoleFull) for PR 3744 at commit [`c66115c`](https://github.com/apache/spark/commit/c66115cc5ff0a5210f6e08a050e4a6c58495fd02).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-95278322
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30767/
Test FAILed.
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-95058676
[Test build #30736 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30736/consoleFull) for PR 3744 at commit [`d18f877`](https://github.com/apache/spark/commit/d18f877d3757c78eede30b685d749729ac228960).
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/3744
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-67703632
This only helps us trace one level up, correct? Did you compare the performance?
[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-67640554
[Test build #24644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24644/consoleFull) for PR 3744 at commit [`f9dfad2`](https://github.com/apache/spark/commit/f9dfad2410f0efd1d3b1d360da6bbe1f31507dae).
* This patch merges cleanly.
[GitHub] spark pull request: [WIP][SPARK-4902][CORE] gap-sampling performan...
Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:
https://github.com/apache/spark/pull/3744#issuecomment-76126729
I talked to @mengxr about this by email two months ago. I will post the performance test results.