Posted to reviews@spark.apache.org by witgo <gi...@git.apache.org> on 2014/12/19 14:53:32 UTC

[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

GitHub user witgo opened a pull request:

    https://github.com/apache/spark/pull/3744

    [SPARK-4902][CORE] gap-sampling performance optimization

    jira: [SPARK-4902](https://issues.apache.org/jira/browse/SPARK-4902)
    cc @mengxr

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/witgo/spark SPARK-4902

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3744.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3744
    
----
commit f9dfad2410f0efd1d3b1d360da6bbe1f31507dae
Author: GuoQiang Li <wi...@qq.com>
Date:   2014-12-19T13:50:08Z

    gap-sampling performance optimization

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-95059225
  
      [Test build #30736 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30736/consoleFull) for   PR 3744 at commit [`d18f877`](https://github.com/apache/spark/commit/d18f877d3757c78eede30b685d749729ac228960).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-95255789
  
      [Test build #30767 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30767/consoleFull) for   PR 3744 at commit [`c66115c`](https://github.com/apache/spark/commit/c66115cc5ff0a5210f6e08a050e4a6c58495fd02).




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-67651107
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24644/
    Test PASSed.




[GitHub] spark pull request: [WIP][SPARK-4902][CORE] gap-sampling performan...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-75561628
  
    @witgo is this still live and have you followed up on Xiangrui's comment?




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-67651092
  
      [Test build #24644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24644/consoleFull) for   PR 3744 at commit [`f9dfad2`](https://github.com/apache/spark/commit/f9dfad2410f0efd1d3b1d360da6bbe1f31507dae).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-95059230
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30736/
    Test FAILed.




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-168112410
  
    I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-76143785
  
    ```scala
    test("bernoulli sampling benchmark") {
      // Times one full pass of BernoulliSampler over the iterator returned by `items`.
      // Note: scala.testing.Benchmark is deprecated as of Scala 2.10.
      class BernoulliSamplerBenchmark(val fraction: Double, items: () => Iterator[Int])
        extends scala.testing.Benchmark {
        override def run(): Unit = {
          val sampler = new BernoulliSampler[Int](fraction)
          // Calling size consumes the sampled iterator, so the sampling work is actually done.
          val count = sampler.sample(items()).size
        }
      }

      val context = new org.apache.spark.TaskContextImpl(0, 0, 0, 0)
      val len = 1e6.toInt
      val noTimes = 1000
      val array = (1 to len).toArray

      // Hand-written iterator over 1..len; reported as "general sampling".
      val plainIter: () => Iterator[Int] = () => new Iterator[Int] {
        var i = 0
        override def hasNext = i < len
        override def next = {
          i += 1
          i
        }
      }

      // InterruptibleIterator over an array; reported as "gap sampling".
      val arrayIter: () => Iterator[Int] = () =>
        new org.apache.spark.InterruptibleIterator(context, array.iterator)

      // Runs the benchmark noTimes times and prints the mean time per run.
      def report(label: String, fraction: Double, items: () => Iterator[Int]): Unit = {
        val bench = new BernoulliSamplerBenchmark(fraction, items)
        val time = bench.runBenchmark(noTimes).sum.toDouble / noTimes
        println(s"$label sampling fraction=$fraction len=$len use time: $time Ms")
      }

      for (fraction <- Seq(0.2, 0.05, 0.01)) {
        report("general", fraction, plainIter)
        report("gap", fraction, arrayIter)
      }
    }
    ```
    
    =>
    
    ```
    general sampling fraction=0.2 len=1000000 use time: 14.562 Ms
    gap sampling fraction=0.2 len=1000000 use time: 16.352 Ms
    general sampling fraction=0.05 len=1000000 use time: 5.408 Ms
    gap sampling fraction=0.05 len=1000000 use time: 4.251 Ms
    general sampling fraction=0.01 len=1000000 use time: 7.528 Ms
    gap sampling fraction=0.01 len=1000000 use time: 1.009 Ms
    ```
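
For readers unfamiliar with the technique being benchmarked above, here is a minimal sketch of gap sampling for a Bernoulli sample. It is not the code from this pull request or from Spark; the object name `GapSamplingSketch` and the function `gapSample` are hypothetical, chosen only for illustration. The idea is to draw the number of elements to skip before the next accepted one from a geometric distribution, rather than drawing one random number per element.

```scala
import scala.util.Random

// Illustrative sketch only; names are hypothetical, not Spark APIs.
object GapSamplingSketch {
  /** Keeps each element with probability `fraction` by skipping geometric-length gaps. */
  def gapSample[T](items: Iterator[T], fraction: Double, rng: Random = new Random()): Iterator[T] = {
    require(fraction > 0.0 && fraction < 1.0, "fraction must be in (0, 1)")
    val lnQ = math.log1p(-fraction) // ln(1 - fraction), always negative here

    new Iterator[T] {
      // The number of rejected elements before the next accepted one is geometric:
      // P(gap >= k) = (1 - fraction)^k, so gap = floor(ln(u) / ln(1 - fraction)).
      private def skipGap(): Unit = {
        val u = math.max(rng.nextDouble(), 1e-300) // guard against log(0)
        val gap = (math.log(u) / lnQ).toLong
        var k = 0L
        while (k < gap && items.hasNext) {
          items.next()
          k += 1
        }
      }

      skipGap() // position the underlying iterator on the first accepted element

      override def hasNext: Boolean = items.hasNext

      override def next(): T = {
        val item = items.next()
        skipGap()
        item
      }
    }
  }
}
```

With this approach the number of random draws is proportional to the number of accepted elements, which fits the pattern in the results above: the run labelled gap sampling is slower at fraction 0.2 (16.352 vs 14.562) and clearly faster at fraction 0.01 (1.009 vs 7.528).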




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-136917681
  
    @mengxr @srowen any updates on this one?




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-95278287
  
      [Test build #30767 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30767/consoleFull) for   PR 3744 at commit [`c66115c`](https://github.com/apache/spark/commit/c66115cc5ff0a5210f6e08a050e4a6c58495fd02).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-95278322
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30767/
    Test FAILed.




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-95058676
  
      [Test build #30736 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30736/consoleFull) for   PR 3744 at commit [`d18f877`](https://github.com/apache/spark/commit/d18f877d3757c78eede30b685d749729ac228960).




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3744




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-67703632
  
    This only helps us trace one level up, correct? Did you compare the performance?




[GitHub] spark pull request: [SPARK-4902][CORE] gap-sampling performance op...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-67640554
  
      [Test build #24644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24644/consoleFull) for   PR 3744 at commit [`f9dfad2`](https://github.com/apache/spark/commit/f9dfad2410f0efd1d3b1d360da6bbe1f31507dae).
     * This patch merges cleanly.




[GitHub] spark pull request: [WIP][SPARK-4902][CORE] gap-sampling performan...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/3744#issuecomment-76126729
  
    I talked to @mengxr by email two months ago. I will post the performance test results.

