Posted to reviews@spark.apache.org by srowen <gi...@git.apache.org> on 2014/03/28 22:57:27 UTC

[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

GitHub user srowen opened a pull request:

    https://github.com/apache/spark/pull/266

    SPARK-1057 (alternative) Remove fastutil

    (This is for discussion at this point -- I'm not suggesting this should be committed.)
    
    This is what removing fastutil looks like. Much of it is straightforward, like using `java.io` buffered stream classes, and Guava for murmurhash3.
    
    Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case though do I think the change to use `java.io` actually entails an extra array copy.
    
    The rest is using `OpenHashMap` and `OpenHashSet`. These are now written in terms of more Scala-like operations.
    
    `OpenHashMap` is where I made three non-trivial changes to make it work, and they need review:
    
    - It is no longer private
    - The key type must have a `ClassTag` available
    - Unless a lot of other code changes, the key type can't enforce being a supertype of `Null`
    
    It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective.
    
    But what about those last changes? 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-1057-alternate

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/266.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #266
    
----
commit e4c8adcfb4152141ca7046fdfe08778ecbcf58c5
Author: Sean Owen <so...@cloudera.com>
Date:   2014-03-28T21:50:20Z

    Remove use of fastutil and replace with use of java.io, spark.util and Guava classes

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39651426
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39646423
  
    Hey @srowen does `FastBufferedOutputStream` offer any performance advantage over `BufferedOutputStream` that might cause regressions? We use these in a few places during the shuffle that are fairly performance sensitive (though maybe since it all ends up going through the OS/page cache the bottleneck isn't in Spark anyways). Just wondering if you had any knowledge of specific differences.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39040273
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13587/



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/266#discussion_r11325597
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala ---
    @@ -26,8 +26,7 @@ import scala.reflect.ClassTag
      *
      * Under the hood, it uses our OpenHashSet implementation.
      */
    -private[spark]
    -class OpenHashMap[K >: Null : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
    +class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
    --- End diff --
    
    To elaborate a bit, `RDD.countByValue()` uses `OpenHashMap` in this patch, and it would only compile if `RDD`'s type parameter also was bounded by `>: Null`. And that alone makes a whole, whole lot of code require the same bound.
    
    I am new to Scala, but, does dropping the bound actually hurt? It means that one could use value types as keys, which seems conceptually OK. The class supports mapping `null` to a value with the `nullValue` field, and that is used only where a method gets a `null` key. For value types, the key will simply never be `null`. Even lines like `null.asInstanceOf[V]` work fine for value types; for `Int` you get 0 for example.
    
    So my read is that the class permits `null` (i.e. reference types) but does not require them, really? If I'm right I think the simplest thing is to just permit value types as keys?
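
    A minimal sketch of the value-type behavior described above, using a hypothetical `Defaults` helper that is not part of Spark. It assumes only standard Scala semantics: a `null` cast to an unbounded type parameter unboxes to the zero value for a primitive type and stays `null` for a reference type.

    ```scala
    object Defaults {
      // Hypothetical helper mirroring the `null.asInstanceOf[V]` idiom; not Spark code.
      def nullValue[V]: V = null.asInstanceOf[V]

      def main(args: Array[String]): Unit = {
        val i: Int = nullValue[Int]        // unboxes to 0
        val d: Double = nullValue[Double]  // unboxes to 0.0
        val s: String = nullValue[String]  // stays null
        println((i, d, s))
      }
    }
    ```

    Because no `>: Null` bound appears on `V` here, the same definition serves both value and reference types, which is the property the comment relies on.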



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39038528
  
     Merged build triggered. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39643704
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/266#discussion_r11322086
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala ---
    @@ -26,8 +26,7 @@ import scala.reflect.ClassTag
      *
      * Under the hood, it uses our OpenHashSet implementation.
      */
    -private[spark]
    -class OpenHashMap[K >: Null : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
    +class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
    --- End diff --
    
    Yeah, it works OK as `private[spark]`. I'm not sure why I thought it had been a problem before. I'll restore that. The `>: Null` bound was requiring bounds in other common classes like `RDD` to be `>: Null`, which entailed a lot of knock-on changes. (See below for a snippet.) Should I go for it?
    
    I'll commit a change that addresses everything else, to start, and rebases.
    
    ```
    [error] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/partial/GroupedCountEvaluator.scala:35: type arguments [T,Long] do not conform to class OpenHashMap's type parameter bounds [K >: Null,V]
    [error]   extends ApproximateEvaluator[OpenHashMap[T,Long], Map[T, BoundedDouble]] {
    [error]           ^
    ...
    [error] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/rdd/RDD.scala:806: type arguments [T,Long] do not conform to class OpenHashMap's type parameter bounds [K >: Null,V]
    [error]     def countPartition(iter: Iterator[T]): Iterator[OpenHashMap[T,Long]] = {
    [error]                                            ^
    ...
    [error] 20 errors found
    ```



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/266#discussion_r11319330
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/util/RawTextHelper.scala ---
    @@ -19,40 +19,22 @@ package org.apache.spark.streaming.util
     
     import org.apache.spark.SparkContext
     import org.apache.spark.SparkContext._
    -import it.unimi.dsi.fastutil.objects.{Object2LongOpenHashMap => OLMap}
    +import org.apache.spark.util.collection.OpenHashMap
     import scala.collection.JavaConversions.mapAsScalaMap
     
     private[streaming]
     object RawTextHelper {
     
       /** 
    -   * Splits lines and counts the words in them using specialized object-to-long hashmap 
    -   * (to avoid boxing-unboxing overhead of Long in java/scala HashMap)
    +   * Splits lines and counts the words.
        */
       def splitAndCountPartitions(iter: Iterator[String]): Iterator[(String, Long)] = {
    -    val map = new OLMap[String]
    -    var i = 0
    -    var j = 0
    -    while (iter.hasNext) {
    -      val s = iter.next()
    -      i = 0
    -      while (i < s.length) {
    -        j = i
    -        while (j < s.length && s.charAt(j) != ' ') {
    -          j += 1
    -        }
    -        if (j > i) {
    -          val w = s.substring(i, j)
    -          val c = map.getLong(w)
    -          map.put(w, c + 1)
    -        }
    -        i = j
    -        while (i < s.length && s.charAt(i) == ' ') {
    -          i += 1
    -        }
    -      }
    +    val map = new OpenHashMap[String,Long]
    +    val tokenized = iter.flatMap(_.split(" ").filterNot(_.isEmpty))
    +    tokenized.foreach{ s =>
    +      map.changeValue(s, 1L, _ + 1L)
         }
    -    map.toIterator.map{case (k, v) => (k, v)}
    +    map.iterator
    --- End diff --
    
    (I mean the while stuff above in particular, not the map.iterator)



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40271485
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14071/



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40273850
  
    I will submit one soon.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/266#discussion_r11319323
  
    --- Diff: core/src/main/scala/org/apache/spark/partial/GroupedCountEvaluator.scala ---
    @@ -22,36 +22,33 @@ import java.util.{HashMap => JHashMap}
     import scala.collection.JavaConversions.mapAsScalaMap
     import scala.collection.Map
     import scala.collection.mutable.HashMap
    +import scala.reflect.ClassTag
     
     import cern.jet.stat.Probability
    -import it.unimi.dsi.fastutil.objects.{Object2LongOpenHashMap => OLMap}
    +
    +import org.apache.spark.util.collection.OpenHashMap
     
     /**
      * An ApproximateEvaluator for counts by key. Returns a map of key to confidence interval.
      */
    -private[spark] class GroupedCountEvaluator[T](totalOutputs: Int, confidence: Double)
    -  extends ApproximateEvaluator[OLMap[T], Map[T, BoundedDouble]] {
    +private[spark] class GroupedCountEvaluator[T : ClassTag](totalOutputs: Int, confidence: Double)
    +  extends ApproximateEvaluator[OpenHashMap[T,Long], Map[T, BoundedDouble]] {
     
       var outputsMerged = 0
    -  var sums = new OLMap[T]   // Sum of counts for each key
    +  var sums = new OpenHashMap[T,Long]()   // Sum of counts for each key
     
    -  override def merge(outputId: Int, taskResult: OLMap[T]) {
    +  override def merge(outputId: Int, taskResult: OpenHashMap[T,Long]) {
         outputsMerged += 1
    -    val iter = taskResult.object2LongEntrySet.fastIterator()
    -    while (iter.hasNext) {
    -      val entry = iter.next()
    -      sums.put(entry.getKey, sums.getLong(entry.getKey) + entry.getLongValue)
    +    taskResult.foreach{ case (key,value) =>
    +      sums.changeValue(key, value, _ + value)
         }
       }
     
       override def currentResult(): Map[T, BoundedDouble] = {
         if (outputsMerged == totalOutputs) {
           val result = new JHashMap[T, BoundedDouble](sums.size)
    -      val iter = sums.object2LongEntrySet.fastIterator()
    -      while (iter.hasNext) {
    -        val entry = iter.next()
    -        val sum = entry.getLongValue()
    -        result(entry.getKey) = new BoundedDouble(sum, 1.0, sum, sum)
    +      sums.foreach{ case (key,sum) =>
    --- End diff --
    
    Put a space before the `{` and after the `,`



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39645327
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/266#discussion_r11319340
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala ---
    @@ -18,6 +18,7 @@
     package org.apache.spark.util.collection
     
     import java.util.{Arrays, Comparator}
    +import com.google.common.hash.Hashing
    --- End diff --
    
    Put a blank line before this



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/266#discussion_r11319296
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/util/RawTextHelper.scala ---
    @@ -19,40 +19,22 @@ package org.apache.spark.streaming.util
     
     import org.apache.spark.SparkContext
     import org.apache.spark.SparkContext._
    -import it.unimi.dsi.fastutil.objects.{Object2LongOpenHashMap => OLMap}
    +import org.apache.spark.util.collection.OpenHashMap
     import scala.collection.JavaConversions.mapAsScalaMap
     
     private[streaming]
     object RawTextHelper {
     
       /** 
    -   * Splits lines and counts the words in them using specialized object-to-long hashmap 
    -   * (to avoid boxing-unboxing overhead of Long in java/scala HashMap)
    +   * Splits lines and counts the words.
        */
       def splitAndCountPartitions(iter: Iterator[String]): Iterator[(String, Long)] = {
    -    val map = new OLMap[String]
    -    var i = 0
    -    var j = 0
    -    while (iter.hasNext) {
    -      val s = iter.next()
    -      i = 0
    -      while (i < s.length) {
    -        j = i
    -        while (j < s.length && s.charAt(j) != ' ') {
    -          j += 1
    -        }
    -        if (j > i) {
    -          val w = s.substring(i, j)
    -          val c = map.getLong(w)
    -          map.put(w, c + 1)
    -        }
    -        i = j
    -        while (i < s.length && s.charAt(i) == ' ') {
    -          i += 1
    -        }
    -      }
    +    val map = new OpenHashMap[String,Long]
    +    val tokenized = iter.flatMap(_.split(" ").filterNot(_.isEmpty))
    +    tokenized.foreach{ s =>
    +      map.changeValue(s, 1L, _ + 1L)
         }
    -    map.toIterator.map{case (k, v) => (k, v)}
    +    map.iterator
    --- End diff --
    
    Keep this code the way it was before. I think it was there for some stress tests that passed in lots of data, to make sure the parsing is not the bottleneck. Just switch the map over.
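
    As a rough sketch of what "just switch the map over" could look like, the original hand-written tokenizing loop is kept below and only the counting line changes. `OpenHashMap` is `private[spark]`, so a `scala.collection.mutable.HashMap` stands in for it here; the object and method names are hypothetical.

    ```scala
    import scala.collection.mutable

    object WordCountSketch {
      // mutable.HashMap is a stand-in for Spark's OpenHashMap; only the loop shape matters.
      def splitAndCount(iter: Iterator[String]): Iterator[(String, Long)] = {
        val map = mutable.HashMap.empty[String, Long]
        var i = 0
        var j = 0
        while (iter.hasNext) {
          val s = iter.next()
          i = 0
          while (i < s.length) {
            j = i
            // Advance j to the end of the current word.
            while (j < s.length && s.charAt(j) != ' ') {
              j += 1
            }
            if (j > i) {
              val w = s.substring(i, j)
              // With OpenHashMap this would be: map.changeValue(w, 1L, _ + 1L)
              map(w) = map.getOrElse(w, 0L) + 1L
            }
            i = j
            // Skip the run of spaces before the next word.
            while (i < s.length && s.charAt(i) == ' ') {
              i += 1
            }
          }
        }
        map.iterator
      }
    }
    ```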



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by mridulm <gi...@git.apache.org>.

Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40273651
  
    I think we can replace it with a custom implementation in which we decide that it is OK to "waste" some memory, within some threshold, when the copy is much more expensive, particularly given that most of this is used almost immediately and then thrown away.
    For example, skip the compaction if size > X bytes and wastage (capacity - size) < Y% of capacity.
    In that case, what we save by reallocating and compacting is not worth the effort.
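
    One way to picture the threshold idea is the sketch below. The names `TrimPolicy`, `MinSizeBytes` and `MaxWasteFraction` are hypothetical, and the concrete values chosen for X and Y are illustrative assumptions, not figures from this thread.

    ```scala
    import java.nio.ByteBuffer
    import java.util.Arrays

    object TrimPolicy {
      // Hypothetical thresholds standing in for X and Y; the values are illustrative only.
      val MinSizeBytes: Int = 64 * 1024
      val MaxWasteFraction: Double = 0.10

      /** True when the buffer is large and nearly full, so the trimming copy is skipped. */
      def keepAsIs(size: Int, capacity: Int): Boolean = {
        val wastage = capacity - size
        size > MinSizeBytes && wastage < MaxWasteFraction * capacity
      }

      /** Wraps the backing array directly, or a compacted copy, depending on the policy. */
      def finish(buf: Array[Byte], size: Int): ByteBuffer =
        if (keepAsIs(size, buf.length)) ByteBuffer.wrap(buf, 0, size)
        else ByteBuffer.wrap(Arrays.copyOf(buf, size))
    }
    ```

    Small or mostly empty buffers are still compacted, so the memory held by a retained buffer stays bounded by the chosen fraction.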



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/266#discussion_r11319321
  
    --- Diff: core/src/main/scala/org/apache/spark/partial/GroupedCountEvaluator.scala ---
    @@ -22,36 +22,33 @@ import java.util.{HashMap => JHashMap}
     import scala.collection.JavaConversions.mapAsScalaMap
     import scala.collection.Map
     import scala.collection.mutable.HashMap
    +import scala.reflect.ClassTag
     
     import cern.jet.stat.Probability
    -import it.unimi.dsi.fastutil.objects.{Object2LongOpenHashMap => OLMap}
    +
    +import org.apache.spark.util.collection.OpenHashMap
     
     /**
      * An ApproximateEvaluator for counts by key. Returns a map of key to confidence interval.
      */
    -private[spark] class GroupedCountEvaluator[T](totalOutputs: Int, confidence: Double)
    -  extends ApproximateEvaluator[OLMap[T], Map[T, BoundedDouble]] {
    +private[spark] class GroupedCountEvaluator[T : ClassTag](totalOutputs: Int, confidence: Double)
    +  extends ApproximateEvaluator[OpenHashMap[T,Long], Map[T, BoundedDouble]] {
     
       var outputsMerged = 0
    -  var sums = new OLMap[T]   // Sum of counts for each key
    +  var sums = new OpenHashMap[T,Long]()   // Sum of counts for each key
     
    -  override def merge(outputId: Int, taskResult: OLMap[T]) {
    +  override def merge(outputId: Int, taskResult: OpenHashMap[T,Long]) {
         outputsMerged += 1
    -    val iter = taskResult.object2LongEntrySet.fastIterator()
    -    while (iter.hasNext) {
    -      val entry = iter.next()
    -      sums.put(entry.getKey, sums.getLong(entry.getKey) + entry.getLongValue)
    +    taskResult.foreach{ case (key,value) =>
    +      sums.changeValue(key, value, _ + value)
         }
    --- End diff --
    
    Put a space before the `{`. Same thing applies elsewhere in the file.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40160833
  
    Build started. 



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40270660
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40256748
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14065/



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/266#discussion_r11320767
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/util/RawTextSender.scala ---
    @@ -45,16 +43,15 @@ object RawTextSender extends Logging {
     
         // Repeat the input data multiple times to fill in a buffer
         val lines = Source.fromFile(file).getLines().toArray
    -    val bufferStream = new FastByteArrayOutputStream(blockSize + 1000)
    +    val bufferStream = new ByteArrayOutputStream(blockSize + 1000)
         val ser = new KryoSerializer(new SparkConf()).newInstance()
         val serStream = ser.serializeStream(bufferStream)
         var i = 0
    -    while (bufferStream.position < blockSize) {
    +    while (bufferStream.size < blockSize) {
    --- End diff --
    
    Yes, it's the count of bytes written:
    ```
    val stream = new java.io.ByteArrayOutputStream(1000)
    stream.write(-1)
    stream.size
    res1: Int = 1
    ```



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/266



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/266#discussion_r11558754
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
    @@ -125,8 +123,7 @@ private[spark] object Task {
         dataOut.flush()
         val taskBytes = serializer.serialize(task).array()
         out.write(taskBytes)
    -    out.trim()
    -    ByteBuffer.wrap(out.array)
    +    ByteBuffer.wrap(out.toByteArray)
    --- End diff --
    
    This does seem pretty bad ....



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40270663
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-38976599
  
    Build is starting -or- tests failed to complete.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13561/



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39677970
  
    For the OpenHashMap null lower bound, it should be fine to drop the lower bound (based on my 30 sec check). 
    
    The original intention was that if the key is primitive (non-null), PrimitiveKeyOpenHashMap.scala should be used. Maybe we can have a factory method to help users choose.
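
    A factory along these lines might dispatch on the key's `ClassTag`. The sketch below is hypothetical: `MapChooser` does not exist in Spark, it only returns the name of the class it would pick rather than constructing one, and it assumes every primitive key type should go to the primitive-key map.

    ```scala
    import scala.reflect.{ClassTag, classTag}

    object MapChooser {
      // Returns the name of the class a real factory might instantiate.
      // Assumption: any primitive key is routed to PrimitiveKeyOpenHashMap.
      def implementationFor[K: ClassTag]: String =
        if (classTag[K].runtimeClass.isPrimitive) "PrimitiveKeyOpenHashMap"
        else "OpenHashMap"
    }
    ```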



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-38973054
  
    Merged build started. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40270616
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40160827
  
     Build triggered. 



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39645328
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13796/



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39646924
  
    @pwendell As I saw it, the reason it was used was for the ability to access the internal `byte[]` buffer directly rather than a copy. However in 2 of 3 cases, it ended up copying anyway (the call to `trim()`). The third case was in `Serializer.scala`, in `SerializerInstance.serializeMany`. But even there, I think the copying of the `byte[]` is offset by the fact that it can be merely `wrap()`ed by a `ByteBuffer` and avoids `ByteBuffer.allocate()` -- another copy in disguise. From my read, that was the potential difference, and I don't think there ends up being a meaningful difference. I can't say I'm 100% certain, but feel pretty sure.
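As a rough illustration of the trade-off described above (this is not the code in `Serializer.scala`), the sketch below writes into a plain `java.io.ByteArrayOutputStream` and hands the result to `ByteBuffer.wrap()`. The object and method names are hypothetical. `toByteArray` makes one right-sized copy, and `wrap()` then reuses that array rather than allocating and filling a second buffer.

```scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

// Hypothetical helper, for illustration only.
object WrapSketch {
  def toBuffer(payload: Array[Byte]): ByteBuffer = {
    val out = new ByteArrayOutputStream()
    out.write(payload, 0, payload.length)
    // toByteArray returns a copy trimmed to the bytes actually written.
    val bytes = out.toByteArray
    // wrap() shares `bytes` with the returned buffer; no further copy is made.
    ByteBuffer.wrap(bytes)
  }
}
```

Under these assumptions the only copy is the one inside `toByteArray`, which is the cost being weighed against the earlier `trim()` and `ByteBuffer.allocate()` calls.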



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40271482
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39038533
  
    Merged build started. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40272577
  
    Thanks - I've merged this! Decided to pull it into 1.0 as well.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40232019
  
    Okay this looks good, thanks @srowen!



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39649767
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by aarondav <gi...@git.apache.org>.

Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/266#discussion_r11320447
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/util/RawTextSender.scala ---
    @@ -45,16 +43,15 @@ object RawTextSender extends Logging {
     
         // Repeat the input data multiple times to fill in a buffer
         val lines = Source.fromFile(file).getLines().toArray
    -    val bufferStream = new FastByteArrayOutputStream(blockSize + 1000)
    +    val bufferStream = new ByteArrayOutputStream(blockSize + 1000)
         val ser = new KryoSerializer(new SparkConf()).newInstance()
         val serStream = ser.serializeStream(bufferStream)
         var i = 0
    -    while (bufferStream.position < blockSize) {
    +    while (bufferStream.size < blockSize) {
    --- End diff --
    
    The [docs](http://docs.oracle.com/javase/6/docs/api/java/io/ByteArrayOutputStream.html#size()) seem to indicate that it returns the number of valid bytes in the stream (which is what we want), and the [source](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/io/ByteArrayOutputStream.java#160) seems to verify this claim.
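A quick way to check this locally is a sketch like the following, which gives the stream a large initial capacity and writes only a few bytes. The object and method names are made up for the example; only `ByteArrayOutputStream` and its `write` and `size` methods come from the JDK.

```scala
import java.io.ByteArrayOutputStream

// Hypothetical helper, for illustration only.
object SizeSketch {
  // Reports size() after writing `n` bytes into a stream whose
  // initial capacity (1024) is larger than `n`.
  def sizeAfterWriting(n: Int): Int = {
    val out = new ByteArrayOutputStream(1024)
    out.write(new Array[Byte](n), 0, n)
    out.size()
  }
}
```

If `size()` tracked the capacity of the internal buffer, the result would be 1024 regardless of `n`; instead it follows the number of bytes written, which is what the `while` loop in the diff relies on.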



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40256653
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-38973044
  
     Merged build triggered. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40232250
  
    Ah looks like this is not merging cleanly. Any chance you could bring it up to date?



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39651427
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13800/



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/266#discussion_r11319358
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala ---
    @@ -26,8 +26,7 @@ import scala.reflect.ClassTag
      *
      * Under the hood, it uses our OpenHashSet implementation.
      */
    -private[spark]
    -class OpenHashMap[K >: Null : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
    +class OpenHashMap[K : ClassTag, @specialized(Long, Int, Double) V: ClassTag](
    --- End diff --
    
    You should keep this `private[spark]`, we can still use it in sub packages. Also, why did you need to remove the `>: Null`? I'll CC @rxin to see if this can cause any problems. It might mean we don't get a reasonable "no value" indicator here, so perhaps we should keep the `>: Null` and change the case where this was not working.
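The concern about losing a "no value" indicator can be seen in a small sketch of general Scala behaviour. This is not the `OpenHashMap` code, and both method names are hypothetical. Without the bound, a `null` cast to a primitive key type silently becomes that type's default value; with `K >: Null`, primitive key types are rejected at compile time.

```scala
// Hypothetical helpers, for illustration only.
object NullSentinelSketch {
  // Unbounded: compiles for any K, including primitives.
  def noValue[K]: K = null.asInstanceOf[K]

  // Bounded: only types that admit null are accepted, so
  // boundedNoValue[Int] does not compile.
  def boundedNoValue[K >: Null]: K = null
}
```

For `String` the unbounded version yields `null`, but for `Int` it yields `0`, which cannot be told apart from a real key of `0`.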



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40163330
  
    Build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40256670
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40256747
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39040272
  
    Merged build finished. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39038387
  
    Jenkins, retest this please.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39649772
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-38976597
  
    Merged build finished. Build is starting -or- tests failed to complete.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40163331
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14033/



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39643701
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40160684
  
    Jenkins, test this please



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-39625714
  
    Hey Sean, I think this would be good to include. Made a few comments throughout it.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by mridulm <gi...@git.apache.org>.

Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40273037
  
    I did not notice this earlier.
    The `toByteArray` method is insanely expensive for anything nontrivial.
    If we don't want to use fastutil, a better solution would be to replace `ByteArrayOutputStream` with an in-house variant that gives us direct access to the `byte[]`.
    
    We are already hitting cases of the byte array output stream failing due to the 2G limit.
    This PR will make us create two copies of the same data; the performance implication of this is terrible.
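One possible shape for such an in-house variant is sketched below, assuming a thin subclass is acceptable. The class name and its three accessors are hypothetical; the only JDK facts relied on are that `ByteArrayOutputStream` keeps its data in the protected fields `buf` and `count`, and that `ByteBuffer.wrap(array, offset, length)` creates a view without copying.

```scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

// Hypothetical class name; not an existing Spark or JDK class.
class ExposedByteArrayOutputStream(capacity: Int)
    extends ByteArrayOutputStream(capacity) {
  // The internal array; it may be longer than the data written so far.
  def underlying: Array[Byte] = buf
  // Number of valid bytes at the start of `underlying`.
  def validLength: Int = count
  // View over the written bytes only, sharing the internal array (no copy).
  def toByteBuffer: ByteBuffer = ByteBuffer.wrap(buf, 0, count)
}
```

Callers would need to take the view only after writing has finished, since a later write can reallocate the internal array. The stream is still backed by a single array, so this sketch does not address the 2G limit mentioned above.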



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40275465
  
    Hold up a sec -- the array copy is not new. It was merely hidden in the call to `trim()` before, or to `ByteBuffer.allocate()`. Yes, it's better to avoid it if possible. The problem is not just about getting at a `byte[]` but one of the right size. So at least I think this is no worse than it was before, no emergency just yet. Might be best to really fix this up per Mridul's suggestion to use `Seq[ByteBuffer]`.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/266#issuecomment-40160680
  
    Cool, this looks good. I'll rerun the tests because Jenkins had some false positives in the past few days.



[GitHub] spark pull request: SPARK-1057 (alternative) Remove fastutil

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/266#discussion_r11319306
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/util/RawTextSender.scala ---
    @@ -45,16 +43,15 @@ object RawTextSender extends Logging {
     
         // Repeat the input data multiple times to fill in a buffer
         val lines = Source.fromFile(file).getLines().toArray
    -    val bufferStream = new FastByteArrayOutputStream(blockSize + 1000)
    +    val bufferStream = new ByteArrayOutputStream(blockSize + 1000)
         val ser = new KryoSerializer(new SparkConf()).newInstance()
         val serStream = ser.serializeStream(bufferStream)
         var i = 0
    -    while (bufferStream.position < blockSize) {
    +    while (bufferStream.size < blockSize) {
    --- End diff --
    
    I believe "size" returns the size of the current buffer, not necessarily the size written. Can you test it? We may need to keep a separate counter to measure the amount of data read.

