You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by shenh062326 <gi...@git.apache.org> on 2015/04/21 10:55:29 UTC

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

GitHub user shenh062326 opened a pull request:

    https://github.com/apache/spark/pull/5608

    [SPARK-6738] [CORE] Improve estimate the size of a large array

    Currently, SizeEstimator.visitArray is not correct in the follow case, 
    array size > 200, 
    elem has the share object
    
    when I add a debug log in SizeTracker.scala:
     logInfo(s"numUpdates:$numUpdates, size:$ts, bytesPerUpdate:$bytesPerUpdate, cost time:$b")
    
    I got the following log:
     numUpdates:1, size:262448, bytesPerUpdate:0.0, cost time:35
     numUpdates:2, size:420698, bytesPerUpdate:158250.0, cost time:35
     numUpdates:4, size:420754, bytesPerUpdate:28.0, cost time:32
     numUpdates:7, size:420754, bytesPerUpdate:0.0, cost time:27
     numUpdates:12, size:420754, bytesPerUpdate:0.0, cost time:28
     numUpdates:20, size:420754, bytesPerUpdate:0.0, cost time:25
     numUpdates:32, size:420754, bytesPerUpdate:0.0, cost time:21
     numUpdates:52, size:420754, bytesPerUpdate:0.0, cost time:20
     numUpdates:84, size:420754, bytesPerUpdate:0.0, cost time:20
     numUpdates:135, size:420754, bytesPerUpdate:0.0, cost time:20
     numUpdates:216, size:420754, bytesPerUpdate:0.0, cost time:11
     numUpdates:346, size:420754, bytesPerUpdate:0.0, cost time:6
     numUpdates:554, size:488911, bytesPerUpdate:327.67788461538464, cost time:8
     numUpdates:887, size:2312259426, bytesPerUpdate:6942253.798798799, cost time:198
    15/04/21 14:27:26 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 3.0 GB to disk (1 time so far)
    15/04/21 14:27:26 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1426746631567_11745/spark-local-20150421142719-c001/30/temp_local_066af981-c2fc-4b70-a00e-110e23006fbc
    
    But in fact the file size is only 162K:
    $ ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1426746631567_11745/spark-local-20150421142719-c001/30/temp_local_066af981-c2fc-4b70-a00e-110e23006fbc
    -rw-r----- 1 spark users 162K Apr 21 14:27 /data11/yarnenv/local/usercache/spark/appcache/application_1426746631567_11745/spark-local-20150421142719-c001/30/temp_local_066af981-c2fc-4b70-a00e-110e23006fbc


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shenh062326/spark my_change5

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5608.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5608
    
----
commit 4c28e36f94cef254655d76ee9e290483943e4ba8
Author: Hong Shen <ho...@tencent.com>
Date:   2015-04-21T08:40:29Z

    Improve estimate the size of a large array

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-96455128
  
      [Test build #715 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/715/consoleFull) for   PR 5608 at commit [`5506bae`](https://github.com/apache/spark/commit/5506baed18705f26fbd59aff95db567d2008aba3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94710817
  
      [Test build #30658 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30658/consoleFull) for   PR 5608 at commit [`4c28e36`](https://github.com/apache/spark/commit/4c28e36f94cef254655d76ee9e290483943e4ba8).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95926049
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30932/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95901582
  
      [Test build #30931 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30931/consoleFull) for   PR 5608 at commit [`a9fca84`](https://github.com/apache/spark/commit/a9fca8444d7a8591032383a7d6ced84ee1f66a56).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-96767275
  
      [Test build #30996 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30996/consoleFull) for   PR 5608 at commit [`5506bae`](https://github.com/apache/spark/commit/5506baed18705f26fbd59aff95db567d2008aba3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-96285979
  
      [Test build #702 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/702/consoleFull) for   PR 5608 at commit [`fe202a2`](https://github.com/apache/spark/commit/fe202a2462248a94596517c85db8d97010473ab7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by shenh062326 <gi...@git.apache.org>.

Github user shenh062326 commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94754171
  
    No, the change has no matter with the check for null.
     If the arraySize > 200, and elem has the share object, the SizeEstimator.visitArray is not correct.
    for example,  arraySize=20000,  all the array elem has a share object with 1000000B and not share object 50B,  the truely size is 20000*50B+1000000B=2*10^B  , but currently, SizeEstimator.visitArray will return (100* 50B　+ 1000000B ) * 20000/100=201*10^6B, It's more than 100 times than the truely size. Furthermore is the array is greater, the greater the multiplier.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-96289979
  
      [Test build #702 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/702/consoleFull) for   PR 5608 at commit [`fe202a2`](https://github.com/apache/spark/commit/fe202a2462248a94596517c85db8d97010473ab7).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch **removes the following dependencies:**
       * `RoaringBitmap-0.4.5.jar`
       * `activation-1.1.jar`
       * `akka-actor_2.10-2.3.4-spark.jar`
       * `akka-remote_2.10-2.3.4-spark.jar`
       * `akka-slf4j_2.10-2.3.4-spark.jar`
       * `aopalliance-1.0.jar`
       * `arpack_combined_all-0.1.jar`
       * `avro-1.7.7.jar`
       * `breeze-macros_2.10-0.11.2.jar`
       * `breeze_2.10-0.11.2.jar`
       * `chill-java-0.5.0.jar`
       * `chill_2.10-0.5.0.jar`
       * `commons-beanutils-1.7.0.jar`
       * `commons-beanutils-core-1.8.0.jar`
       * `commons-cli-1.2.jar`
       * `commons-codec-1.10.jar`
       * `commons-collections-3.2.1.jar`
       * `commons-compress-1.4.1.jar`
       * `commons-configuration-1.6.jar`
       * `commons-digester-1.8.jar`
       * `commons-httpclient-3.1.jar`
       * `commons-io-2.1.jar`
       * `commons-lang-2.5.jar`
       * `commons-lang3-3.3.2.jar`
       * `commons-math-2.1.jar`
       * `commons-math3-3.4.1.jar`
       * `commons-net-2.2.jar`
       * `compress-lzf-1.0.0.jar`
       * `config-1.2.1.jar`
       * `core-1.1.2.jar`
       * `curator-client-2.4.0.jar`
       * `curator-framework-2.4.0.jar`
       * `curator-recipes-2.4.0.jar`
       * `gmbal-api-only-3.0.0-b023.jar`
       * `grizzly-framework-2.1.2.jar`
       * `grizzly-http-2.1.2.jar`
       * `grizzly-http-server-2.1.2.jar`
       * `grizzly-http-servlet-2.1.2.jar`
       * `grizzly-rcm-2.1.2.jar`
       * `groovy-all-2.3.7.jar`
       * `guava-14.0.1.jar`
       * `guice-3.0.jar`
       * `hadoop-annotations-2.2.0.jar`
       * `hadoop-auth-2.2.0.jar`
       * `hadoop-client-2.2.0.jar`
       * `hadoop-common-2.2.0.jar`
       * `hadoop-hdfs-2.2.0.jar`
       * `hadoop-mapreduce-client-app-2.2.0.jar`
       * `hadoop-mapreduce-client-common-2.2.0.jar`
       * `hadoop-mapreduce-client-core-2.2.0.jar`
       * `hadoop-mapreduce-client-jobclient-2.2.0.jar`
       * `hadoop-mapreduce-client-shuffle-2.2.0.jar`
       * `hadoop-yarn-api-2.2.0.jar`
       * `hadoop-yarn-client-2.2.0.jar`
       * `hadoop-yarn-common-2.2.0.jar`
       * `hadoop-yarn-server-common-2.2.0.jar`
       * `ivy-2.4.0.jar`
       * `jackson-annotations-2.4.0.jar`
       * `jackson-core-2.4.4.jar`
       * `jackson-core-asl-1.8.8.jar`
       * `jackson-databind-2.4.4.jar`
       * `jackson-jaxrs-1.8.8.jar`
       * `jackson-mapper-asl-1.8.8.jar`
       * `jackson-module-scala_2.10-2.4.4.jar`
       * `jackson-xc-1.8.8.jar`
       * `jansi-1.4.jar`
       * `javax.inject-1.jar`
       * `javax.servlet-3.0.0.v201112011016.jar`
       * `javax.servlet-3.1.jar`
       * `javax.servlet-api-3.0.1.jar`
       * `jaxb-api-2.2.2.jar`
       * `jaxb-impl-2.2.3-1.jar`
       * `jcl-over-slf4j-1.7.10.jar`
       * `jersey-client-1.9.jar`
       * `jersey-core-1.9.jar`
       * `jersey-grizzly2-1.9.jar`
       * `jersey-guice-1.9.jar`
       * `jersey-json-1.9.jar`
       * `jersey-server-1.9.jar`
       * `jersey-test-framework-core-1.9.jar`
       * `jersey-test-framework-grizzly2-1.9.jar`
       * `jets3t-0.7.1.jar`
       * `jettison-1.1.jar`
       * `jetty-util-6.1.26.jar`
       * `jline-0.9.94.jar`
       * `jline-2.10.4.jar`
       * `jodd-core-3.6.3.jar`
       * `json4s-ast_2.10-3.2.10.jar`
       * `json4s-core_2.10-3.2.10.jar`
       * `json4s-jackson_2.10-3.2.10.jar`
       * `jsr305-1.3.9.jar`
       * `jtransforms-2.4.0.jar`
       * `jul-to-slf4j-1.7.10.jar`
       * `kryo-2.21.jar`
       * `log4j-1.2.17.jar`
       * `lz4-1.2.0.jar`
       * `management-api-3.0.0-b012.jar`
       * `mesos-0.21.0-shaded-protobuf.jar`
       * `metrics-core-3.1.0.jar`
       * `metrics-graphite-3.1.0.jar`
       * `metrics-json-3.1.0.jar`
       * `metrics-jvm-3.1.0.jar`
       * `minlog-1.2.jar`
       * `netty-3.8.0.Final.jar`
       * `netty-all-4.0.23.Final.jar`
       * `objenesis-1.2.jar`
       * `opencsv-2.3.jar`
       * `oro-2.0.8.jar`
       * `paranamer-2.6.jar`
       * `parquet-column-1.6.0rc3.jar`
       * `parquet-common-1.6.0rc3.jar`
       * `parquet-encoding-1.6.0rc3.jar`
       * `parquet-format-2.2.0-rc1.jar`
       * `parquet-generator-1.6.0rc3.jar`
       * `parquet-hadoop-1.6.0rc3.jar`
       * `parquet-jackson-1.6.0rc3.jar`
       * `protobuf-java-2.4.1.jar`
       * `protobuf-java-2.5.0-spark.jar`
       * `py4j-0.8.2.1.jar`
       * `pyrolite-2.0.1.jar`
       * `quasiquotes_2.10-2.0.1.jar`
       * `reflectasm-1.07-shaded.jar`
       * `scala-compiler-2.10.4.jar`
       * `scala-library-2.10.4.jar`
       * `scala-reflect-2.10.4.jar`
       * `scalap-2.10.4.jar`
       * `scalatest_2.10-2.2.1.jar`
       * `slf4j-api-1.7.10.jar`
       * `slf4j-log4j12-1.7.10.jar`
       * `snappy-java-1.1.1.7.jar`
       * `spark-bagel_2.10-1.4.0-SNAPSHOT.jar`
       * `spark-catalyst_2.10-1.4.0-SNAPSHOT.jar`
       * `spark-core_2.10-1.4.0-SNAPSHOT.jar`
       * `spark-graphx_2.10-1.4.0-SNAPSHOT.jar`
       * `spark-launcher_2.10-1.4.0-SNAPSHOT.jar`
       * `spark-mllib_2.10-1.4.0-SNAPSHOT.jar`
       * `spark-network-common_2.10-1.4.0-SNAPSHOT.jar`
       * `spark-network-shuffle_2.10-1.4.0-SNAPSHOT.jar`
       * `spark-repl_2.10-1.4.0-SNAPSHOT.jar`
       * `spark-sql_2.10-1.4.0-SNAPSHOT.jar`
       * `spark-streaming_2.10-1.4.0-SNAPSHOT.jar`
       * `spire-macros_2.10-0.7.4.jar`
       * `spire_2.10-0.7.4.jar`
       * `stax-api-1.0.1.jar`
       * `stream-2.7.0.jar`
       * `tachyon-0.6.4.jar`
       * `tachyon-client-0.6.4.jar`
       * `uncommons-maths-1.2.2a.jar`
       * `unused-1.0.0.jar`
       * `xmlenc-0.52.jar`
       * `xz-1.0.jar`
       * `zookeeper-3.4.5.jar`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by shenh062326 <gi...@git.apache.org>.

Github user shenh062326 commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-96454990
  
    I don't know why it has not start build automaticly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94737759
  
      [Test build #30662 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30662/consoleFull) for   PR 5608 at commit [`a2ea7ac`](https://github.com/apache/spark/commit/a2ea7acb04c8fb42a8a285ff9415a7e608214254).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95002775
  
    I see, it looks like you changed the last line to fix this. Can you also add some unit tests, and fix the code to match the Spark code style guide (https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by shenh062326 <gi...@git.apache.org>.

Github user shenh062326 commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95388189
  
    @srowen  
    At first, I also want to exclude shared objects by discarding the first non-null sample, but not always work, since not all the objects links to the shared objects.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-96618613
  
    Jenkins had some trouble posting the results, but this passed. There are still some very minor style issues here but maybe I can clean them up on merge.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by shenh062326 <gi...@git.apache.org>.

Github user shenh062326 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5608#discussion_r29097662
  
    --- Diff: core/src/main/scala/org/apache/spark/util/SizeEstimator.scala ---
    @@ -204,25 +204,36 @@ private[spark] object SizeEstimator extends Logging {
             }
           } else {
             // Estimate the size of a large array by sampling elements without replacement.
    -        var size = 0.0
    +        // To exclude the shared objects that the array elements may link, sample twice
    +        // and use the min one to caculate array size.
             val rand = new Random(42)
    -        val drawn = new OpenHashSet[Int](ARRAY_SAMPLE_SIZE)
    -        var numElementsDrawn = 0
    -        while (numElementsDrawn < ARRAY_SAMPLE_SIZE) {
    -          var index = 0
    -          do {
    -            index = rand.nextInt(length)
    -          } while (drawn.contains(index))
    -          drawn.add(index)
    -          val elem = ScalaRunTime.array_apply(array, index).asInstanceOf[AnyRef]
    -          size += SizeEstimator.estimate(elem, state.visited)
    -          numElementsDrawn += 1
    -        }
    -        state.size += ((length / (ARRAY_SAMPLE_SIZE * 1.0)) * size).toLong
    +        val drawn = new OpenHashSet[Int](2 * ARRAY_SAMPLE_SIZE)
    --- End diff --
    
    If the array size >= 400, we only have to sample 100 distinct elements from the array, twice.
    +    for (i <- 0 until ARRAY_SAMPLE_SIZE) {
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by shenh062326 <gi...@git.apache.org>.

Github user shenh062326 commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95908120
  
    Sampling strategy not always works, but sampling twice are more effective then only discarding the first non-null sample. And sampling 200 times  will not cause performance issues. 
    If you think the code shouldn't written like that, I aggree, I will change it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/5608#issuecomment-95104653

I understand this a bit better. I see that `estimate` already accounts for shared objects since it calls `enqueue` so my comment above is wrong.

However, wouldn't the shared data structures be counted when estimating the first non-null object? Isn't it sufficient to just discard the first sample? I understand that this only catches the case where all of the non-null objects have this reference to this state, but this approach can't always find such a thing no matter what if it's sampling. The two-loops approach can be written more simply, but even that complexity doesn't seem to buy much.

If the purpose is to spill when there is memory pressure, then why ignore a shared data structure? it's still taking up memory. It doesn't get written to disk, likely, but that's not the question. Or is the purpose to spill when the spilled size is likely to be large enough, and memory usage is just a proxy for that?

I'd like to see if merely discarding the first non-null sample fixes most of this more simply.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95913396
  
      [Test build #30932 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30932/consoleFull) for   PR 5608 at commit [`fe202a2`](https://github.com/apache/spark/commit/fe202a2462248a94596517c85db8d97010473ab7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/5608#issuecomment-94771988

Are you talking about a "shared" object? like, an array containing many references to the same object?

There does seem to be a difference in behavior in the sampled and non-sampled cases. In the non-sampled case, it accounts for shared data structures and counts them once. In the sampled case, it does not and counts them each time.

It seems like they should behave the same way. The former better reflects the size in memory; the latter might better reflect the size as serialized to disk. But we can't really know the on-disk size this way, so I think that maybe it is supposed to use the same mechanism in both cases. That is, yes, it shouldn't double-count shared data structures in the sampled case. (CC @mateiz as the author of that bit in case he has comments.)

However, that's not what you are changing at all, and I don't understand the intent of these changes. Why is 200 special, such that it needs to be 400? and what does the second loop accomplish? Why not use `enqueue` like the other branch does? the rest of this doesn't look like a correct change.

It's not going to change the spilled size of data on disk. I think what it might do is *not* cause it to spill so early. This is highly dependent on the nature of the data being serialized -- whether there are shared data structures and how big the serialized form is. Since the question here is when to spill under memory pressure, I do think that making the logic similar for the sampled and non-sampled cases sounds correct.

CC @sryza and @rxin as this might be a good catch regarding shuffle spills in certain cases, but I'm not entirely sure.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95905420
  
    @shenh062326 yes, but you constructed it that way. I can construct a case that works and doesn't work for any sampling strategy. The question is, what is the common case? I'm pretty sure it's that all N objects share some common data structure, which sampling just 1 would catch.
    
    However if you want to go this way, at least generalize it. There is nothing magic about 2 samples, so it shouldn't be written that way with a redundant loop.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by shenh062326 <gi...@git.apache.org>.

Github user shenh062326 commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-96870004
  
    @srowen  @mateiz  
    Thanks for you review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-96365828
  
    @shenh062326 This still doesn't compile though, see the test output.
    ```
    [error] /home/jenkins/workspace/NewSparkPullRequestBuilder/core/src/test/scala/org/apache/spark/util/SizeEstimatorSuite.scala:104: not found: type ArrayBuffer
    [error]     val buf = new ArrayBuffer[DummyString]()
    [error]                   ^
    [info] Done updating.
    [error] one error found
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5608#discussion_r29046278
  
    --- Diff: core/src/main/scala/org/apache/spark/util/SizeEstimator.scala ---
    @@ -204,25 +204,36 @@ private[spark] object SizeEstimator extends Logging {
             }
           } else {
             // Estimate the size of a large array by sampling elements without replacement.
    -        var size = 0.0
    +        // To exclude the shared objects that the array elements may link, sample twice
    +        // and use the min one to caculate array size.
             val rand = new Random(42)
    -        val drawn = new OpenHashSet[Int](ARRAY_SAMPLE_SIZE)
    -        var numElementsDrawn = 0
    -        while (numElementsDrawn < ARRAY_SAMPLE_SIZE) {
    -          var index = 0
    -          do {
    -            index = rand.nextInt(length)
    -          } while (drawn.contains(index))
    -          drawn.add(index)
    -          val elem = ScalaRunTime.array_apply(array, index).asInstanceOf[AnyRef]
    -          size += SizeEstimator.estimate(elem, state.visited)
    -          numElementsDrawn += 1
    -        }
    -        state.size += ((length / (ARRAY_SAMPLE_SIZE * 1.0)) * size).toLong
    +        val drawn = new OpenHashSet[Int](2 * ARRAY_SAMPLE_SIZE)
    --- End diff --
    
    Yes that looks better. We could even generalize to sampling n times easily but that could be overkill. I think we have a potential problem here, that we sample if the array size is >= 400, but then want at least 400 distinct elements from the array, twice. This will enter an infinite loop if the array has between 400 and 800 elements, and will be very slow if it's just a bit larger than 800.
    
    You could sample with replacement, or, only draw `ARRAY_SAMPLE_SIZE/2` elements ( `ARRAY_SAMPLE_SIZE/n` in general. For simplicity, and to avoid slow-downs, I'd say sample with replacement.
    
    You can put the sample threshold back to 200 then, too. I don't know if that needs to change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94922337
  
    In the sampling case, it considers the size of each element directly, so I think it would re-count shared data structures? I think that would explain the behavior here. Spark thinks memory is really full, so spills, but the spilled data is tiny since there wasn't that much data and memory wasn't nearly as full as it looked. So would it make sense to have both of these paths use the `enqueue` method (don't entirely know how that works)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95926017
  
      [Test build #30932 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30932/consoleFull) for   PR 5608 at commit [`fe202a2`](https://github.com/apache/spark/commit/fe202a2462248a94596517c85db8d97010473ab7).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by shenh062326 <gi...@git.apache.org>.

Github user shenh062326 commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-96369283
  
    Thanks, I will fix it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95937710
  
    **[Test build #30931 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30931/consoleFull)**     for PR 5608 at commit [`a9fca84`](https://github.com/apache/spark/commit/a9fca8444d7a8591032383a7d6ced84ee1f66a56)     after a configured wait of `150m`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94993203
  
      [Test build #30717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30717/consoleFull) for   PR 5608 at commit [`4877eee`](https://github.com/apache/spark/commit/4877eeee2ec6271be0160d1aadbd9d72905e1359).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95010247
  
      [Test build #30717 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30717/consoleFull) for   PR 5608 at commit [`4877eee`](https://github.com/apache/spark/commit/4877eeee2ec6271be0160d1aadbd9d72905e1359).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94741778
  
    I don't understand why you copied and pasted the loop? or why the sampling size would matter to this degree for correctness. 
    
    Is the essence of the change just the check for null? that you don't factor in the 0 size of null elements into the average? that could make a lot of sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by shenh062326 <gi...@git.apache.org>.

Github user shenh062326 commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94995027
  
    @mateiz 
    In most case,  the first sampling size is contain the shared objects, the second will not. But if the arrray is large, and is only has a few not null objects, it can be the second sampling contain shared objects, but  the first will not.  
    
    In order to exclude the shared objects, I use the min size to caculate the result. It will not going wrong if there are no shared objects, I have test it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95904145
  
    Sampling 1 might work for some cases; 100 for others; some may take 1000. There's no way to know. 
    
    This change as it stands is needlessly complex because it duplicates the loop among other things. You just want to take n samples of the array, and use the largest as your base, and smallest as your multiplier. That would be OK, and make n some small number like 2 or 3. At least it would be less hard-coded, and would make for a better change, along with some comments about why you are doing this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by shenh062326 <gi...@git.apache.org>.

Github user shenh062326 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5608#discussion_r29107048
  
    --- Diff: core/src/main/scala/org/apache/spark/util/SizeEstimator.scala ---
    @@ -204,25 +204,36 @@ private[spark] object SizeEstimator extends Logging {
             }
           } else {
             // Estimate the size of a large array by sampling elements without replacement.
    -        var size = 0.0
    +        // To exclude the shared objects that the array elements may link, sample twice
    +        // and use the min one to caculate array size.
             val rand = new Random(42)
    -        val drawn = new OpenHashSet[Int](ARRAY_SAMPLE_SIZE)
    -        var numElementsDrawn = 0
    -        while (numElementsDrawn < ARRAY_SAMPLE_SIZE) {
    -          var index = 0
    -          do {
    -            index = rand.nextInt(length)
    -          } while (drawn.contains(index))
    -          drawn.add(index)
    -          val elem = ScalaRunTime.array_apply(array, index).asInstanceOf[AnyRef]
    -          size += SizeEstimator.estimate(elem, state.visited)
    -          numElementsDrawn += 1
    -        }
    -        state.size += ((length / (ARRAY_SAMPLE_SIZE * 1.0)) * size).toLong
    +        val drawn = new OpenHashSet[Int](2 * ARRAY_SAMPLE_SIZE)
    +        val s1 = sampleArray(array, state, rand, drawn, length)
    +        val s2 = sampleArray(array, state, rand, drawn, length)
    +        val size = math.min(s1, s2)
    +        state.size += math.max(s1, s2) + 
    +          (size * ((length - ARRAY_SAMPLE_SIZE) / (ARRAY_SAMPLE_SIZE))).toLong
           }
         }
       }
     
    +  private def sampleArray(array: AnyRef, state: SearchState, 
    --- End diff --
    
    OK, thanks for review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94711386
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30658/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94737773
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30662/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5608#discussion_r29100309
  
    --- Diff: core/src/main/scala/org/apache/spark/util/SizeEstimator.scala ---
    @@ -204,25 +204,36 @@ private[spark] object SizeEstimator extends Logging {
             }
           } else {
             // Estimate the size of a large array by sampling elements without replacement.
    -        var size = 0.0
    +        // To exclude the shared objects that the array elements may link, sample twice
    +        // and use the min one to caculate array size.
             val rand = new Random(42)
    -        val drawn = new OpenHashSet[Int](ARRAY_SAMPLE_SIZE)
    -        var numElementsDrawn = 0
    -        while (numElementsDrawn < ARRAY_SAMPLE_SIZE) {
    -          var index = 0
    -          do {
    -            index = rand.nextInt(length)
    -          } while (drawn.contains(index))
    -          drawn.add(index)
    -          val elem = ScalaRunTime.array_apply(array, index).asInstanceOf[AnyRef]
    -          size += SizeEstimator.estimate(elem, state.visited)
    -          numElementsDrawn += 1
    -        }
    -        state.size += ((length / (ARRAY_SAMPLE_SIZE * 1.0)) * size).toLong
    +        val drawn = new OpenHashSet[Int](2 * ARRAY_SAMPLE_SIZE)
    +        val s1 = sampleArray(array, state, rand, drawn, length)
    +        val s2 = sampleArray(array, state, rand, drawn, length)
    +        val size = math.min(s1, s2)
    +        state.size += math.max(s1, s2) + 
    +          (size * ((length - ARRAY_SAMPLE_SIZE) / (ARRAY_SAMPLE_SIZE))).toLong
           }
         }
       }
     
    +  private def sampleArray(array: AnyRef, state: SearchState, 
    --- End diff --
    
    Small style thing: I think the args line needs to be wrapped like the rest of the code (continue on new line and indent each an additional level.
    
    Also I think `if(null != object` -> `if (obj != null)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94717148
  
      [Test build #30662 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30662/consoleFull) for   PR 5608 at commit [`a2ea7ac`](https://github.com/apache/spark/commit/a2ea7acb04c8fb42a8a285ff9415a7e608214254).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95524133
  
    Yes, but this method will not always work either. It's a heuristic, and I think the much simpler one probably gives 95% of the benefit. I think this is overly complicated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95010254
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30717/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94711383
  
      [Test build #30658 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30658/consoleFull) for   PR 5608 at commit [`4c28e36`](https://github.com/apache/spark/commit/4c28e36f94cef254655d76ee9e290483943e4ba8).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5608#discussion_r29100295
  
    --- Diff: core/src/main/scala/org/apache/spark/util/SizeEstimator.scala ---
    @@ -204,25 +204,36 @@ private[spark] object SizeEstimator extends Logging {
             }
           } else {
             // Estimate the size of a large array by sampling elements without replacement.
    -        var size = 0.0
    +        // To exclude the shared objects that the array elements may link, sample twice
    +        // and use the min one to caculate array size.
             val rand = new Random(42)
    -        val drawn = new OpenHashSet[Int](ARRAY_SAMPLE_SIZE)
    -        var numElementsDrawn = 0
    -        while (numElementsDrawn < ARRAY_SAMPLE_SIZE) {
    -          var index = 0
    -          do {
    -            index = rand.nextInt(length)
    -          } while (drawn.contains(index))
    -          drawn.add(index)
    -          val elem = ScalaRunTime.array_apply(array, index).asInstanceOf[AnyRef]
    -          size += SizeEstimator.estimate(elem, state.visited)
    -          numElementsDrawn += 1
    -        }
    -        state.size += ((length / (ARRAY_SAMPLE_SIZE * 1.0)) * size).toLong
    +        val drawn = new OpenHashSet[Int](2 * ARRAY_SAMPLE_SIZE)
    --- End diff --
    
    Oh right, I am mixing up these two constants. It's fine. I still suspect this could be parameterized to sample N times, and simpler with replacement, but it's not essential.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95937721
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30931/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by shenh062326 <gi...@git.apache.org>.

Github user shenh062326 commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95904531
  
    @srowen 
    The last assertResult I have add in the testcase is the case that can't only discarding the first non-null sample, because half of the array elems are not link to the shared object, if the first non-null sample (which generate by random) is not link to the shared object, we can't exclude the shared object. But if we sampling twice, even if the twice has not exclude the shared object, it can also work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5608


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by shenh062326 <gi...@git.apache.org>.

Github user shenh062326 commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-95767900
  
    It seems always work in my cluster, at least I have not find a case not work. But if I change to the simpler one, sometimes it doesn't work. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by sryza <gi...@git.apache.org>.

Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94910837
  
    @srowen I haven't had a chance to look at this in detail yet, but, with SizeEstimator, we're purely trying to understand the size of the object in memory - the disk footprint is just not what we're trying to estimate.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-6738] [CORE] Improve estimate the size ...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/5608#issuecomment-94958006
  
    @shenh062326 can you explain why this particular change is likely to give the right result? I understand the problem with shared objects (it was a known problem when we designed SizeEstimator), but I'm not sure why sampling twice and using the min and max sizes this way is likely to be good. I also don't see why that particular formula at the end makes sense (for example, isn't it going to be wrong now if there are no shared objects?).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org