You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by mateiz <gi...@git.apache.org> on 2014/07/24 02:06:58 UTC

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

GitHub user mateiz opened a pull request:

    https://github.com/apache/spark/pull/1555

    SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy & cogroup

    JIRA: https://issues.apache.org/jira/browse/SPARK-2657
    
    Our current code uses ArrayBuffers for each group of values in groupBy, as well as for the key's elements in CoGroupedRDD. ArrayBuffers have a lot of overhead if there are few values in them, which is likely to happen in cases such as join. In particular, they have a pointer to an Object[] of size 16 by default, which is 24 bytes for the array header + 128 for the pointers in there, plus at least 32 for the ArrayBuffer data structure. This patch replaces the per-group buffers with a CompactBuffer class that can store up to 2 elements more efficiently (in fields of itself) and acts like an ArrayBuffer beyond that. For a key's elements in CoGroupedRDD, we use an Array of CompactBuffers instead of an ArrayBuffer of ArrayBuffers.
    
    There are some changes throughout the code to deal with CoGroupedRDD returning Array instead. We can also decide not to do that but CoGroupedRDD is a `@DeveloperAPI` so I think it's okay to change it here.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mateiz/spark compact-groupby

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1555.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1555
    
----
commit 10f0de1ee86563b5bec6c8f1270a8198d6449393
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T22:36:45Z

    A CompactBuffer that's more memory-efficient than ArrayBuffer for small buffers

commit ed577ab3fa50de0ed1bd21eae43013ffa6dac51c
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T22:37:31Z

    Use CompactBuffer in groupByKey

commit 9b4c6e811159857c075528dab02f6c4db7688dde
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T23:05:14Z

    Use CompactBuffer in CoGroupedRDD

commit 775110fa6124e090c0aeed6baf7a408be3f30f9a
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T23:17:12Z

    Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers
    
    CoGroupedRDD is a @DeveloperApi but this seemed worthwhile.

commit 197cde8dccb4c7dee1c9e6e9460b221988083d9b
Author: Matei Zaharia <ma...@databricks.com>
Date:   2014-07-23T23:41:27Z

    Make CompactBuffer extend Seq to make its toSeq more efficient

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-49953487
  
    QA tests have started for PR 1555. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17070/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-49984593
  
    QA results for PR 1555:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17115/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1555#discussion_r15361316
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/CompactBuffer.scala ---
    @@ -0,0 +1,155 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.util.collection
    +
    +/**
    + * An append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.
    + * ArrayBuffer always allocates an Object array to store the data, with 16 entries by default,
    + * so it has about 80-100 bytes of overhead. In contrast, CompactBuffer can keep up to two
    + * elements in fields of the main object, and only allocates an Array[AnyRef] if there are more
    + * entries than that. This makes it more efficient for operations like groupBy where we expect
    + * some keys to have very few elements.
    + */
    +private[spark] class CompactBuffer[T] extends Seq[T] with Serializable {
    +  // First two elements
    +  private var element0: T = _
    +  private var element1: T = _
    +
    +  // Number of elements, including our two in the main object
    +  private var curSize = 0
    +
    +  // Array for extra elements
    +  private var otherElements: Array[AnyRef] = null
    +
    +  def apply(position: Int): T = {
    +    if (position < 0 || position >= curSize) {
    +      throw new IndexOutOfBoundsException
    +    }
    +    if (position == 0) {
    +      element0
    +    } else if (position == 1) {
    +      element1
    +    } else {
    +      otherElements(position - 2).asInstanceOf[T]
    +    }
    +  }
    +
    +  def update(position: Int, value: T): Unit = {
    +    if (position < 0 || position >= curSize) {
    +      throw new IndexOutOfBoundsException
    +    }
    +    if (position == 0) {
    +      element0 = value
    +    } else if (position == 1) {
    +      element1 = value
    +    } else {
    +      otherElements(position - 2) = value.asInstanceOf[AnyRef]
    +    }
    +  }
    +
    +  def += (value: T): CompactBuffer[T] = {
    +    val newIndex = curSize
    +    if (newIndex == 0) {
    +      element0 = value
    +      curSize = 1
    +    } else if (newIndex == 1) {
    +      element1 = value
    +      curSize = 2
    +    } else {
    +      growToSize(curSize + 1)
    +      otherElements(newIndex - 2) = value.asInstanceOf[AnyRef]
    +    }
    +    this
    +  }
    +
    +  def ++= (values: TraversableOnce[T]): CompactBuffer[T] = {
    +    values match {
    +      case compactBuf: CompactBuffer[T] =>
    --- End diff --
    
    Nope, it somehow works


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-49980532
  
    QA tests have started for PR 1555. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17115/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-50105999
  
    QA tests have started for PR 1555. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17159/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-50117105
  
    Merged this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1555#discussion_r15334913
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/CompactBuffer.scala ---
    @@ -0,0 +1,155 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.util.collection
    +
    +/**
    + * An append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.
    + * ArrayBuffer always allocates an Object array to store the data, with 16 entries by default,
    + * so it has about 80-100 bytes of overhead. In contrast, CompactBuffer can keep up to two
    + * elements in fields of the main object, and only allocates an Array[AnyRef] if there are more
    + * entries than that. This makes it more efficient for operations like groupBy where we expect
    + * some keys to have very few elements.
    + */
    +private[spark] class CompactBuffer[T] extends Seq[T] with Serializable {
    +  // First two elements
    +  private var element0: T = _
    +  private var element1: T = _
    +
    +  // Number of elements, including our two in the main object
    +  private var curSize = 0
    +
    +  // Array for extra elements
    +  private var otherElements: Array[AnyRef] = null
    +
    +  def apply(position: Int): T = {
    +    if (position < 0 || position >= curSize) {
    +      throw new IndexOutOfBoundsException
    +    }
    +    if (position == 0) {
    +      element0
    +    } else if (position == 1) {
    +      element1
    +    } else {
    +      otherElements(position - 2).asInstanceOf[T]
    +    }
    +  }
    +
    +  def update(position: Int, value: T): Unit = {
    --- End diff --
    
    i guess internally. maybe private?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-49952884
  
    QA results for PR 1555:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17068/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-49955064
  
    QA tests have started for PR 1555. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17075/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-50114262
  
    QA results for PR 1555:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17171/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-49959572
  
    QA results for PR 1555:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17070/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-50111308
  
    QA tests have started for PR 1555. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17171/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-50106558
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-49952838
  
    QA tests have started for PR 1555. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17068/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1555


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-49960831
  
    QA results for PR 1555:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17075/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-50107549
  
    QA results for PR 1555:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17159/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1555#discussion_r15335029
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/CompactBuffer.scala ---
    @@ -0,0 +1,155 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.util.collection
    +
    +/**
    + * An append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.
    + * ArrayBuffer always allocates an Object array to store the data, with 16 entries by default,
    + * so it has about 80-100 bytes of overhead. In contrast, CompactBuffer can keep up to two
    + * elements in fields of the main object, and only allocates an Array[AnyRef] if there are more
    + * entries than that. This makes it more efficient for operations like groupBy where we expect
    + * some keys to have very few elements.
    + */
    +private[spark] class CompactBuffer[T] extends Seq[T] with Serializable {
    +  // First two elements
    +  private var element0: T = _
    +  private var element1: T = _
    +
    +  // Number of elements, including our two in the main object
    +  private var curSize = 0
    +
    +  // Array for extra elements
    +  private var otherElements: Array[AnyRef] = null
    +
    +  def apply(position: Int): T = {
    +    if (position < 0 || position >= curSize) {
    +      throw new IndexOutOfBoundsException
    +    }
    +    if (position == 0) {
    +      element0
    +    } else if (position == 1) {
    +      element1
    +    } else {
    +      otherElements(position - 2).asInstanceOf[T]
    +    }
    +  }
    +
    +  def update(position: Int, value: T): Unit = {
    +    if (position < 0 || position >= curSize) {
    +      throw new IndexOutOfBoundsException
    +    }
    +    if (position == 0) {
    +      element0 = value
    +    } else if (position == 1) {
    +      element1 = value
    +    } else {
    +      otherElements(position - 2) = value.asInstanceOf[AnyRef]
    +    }
    +  }
    +
    +  def += (value: T): CompactBuffer[T] = {
    +    val newIndex = curSize
    +    if (newIndex == 0) {
    +      element0 = value
    +      curSize = 1
    +    } else if (newIndex == 1) {
    +      element1 = value
    +      curSize = 2
    +    } else {
    +      growToSize(curSize + 1)
    +      otherElements(newIndex - 2) = value.asInstanceOf[AnyRef]
    +    }
    +    this
    +  }
    +
    +  def ++= (values: TraversableOnce[T]): CompactBuffer[T] = {
    +    values match {
    +      case compactBuf: CompactBuffer[T] =>
    +        val oldSize = curSize
    +        // Copy the other buffer's size and elements to local variables in case it is equal to us
    +        val itsSize = compactBuf.curSize
    +        val itsElements = compactBuf.otherElements
    +        growToSize(curSize + itsSize)
    +        if (itsSize > 0) {
    --- End diff --
    
    it might make sense to slightly re-arrange the conditions to reduce the number of branches. Right now it's a constant (3 branches for every case). If you do 
    ```
    if (itsSize > 2) {
    
    } else if (itsSize > 1) {
    
    } else {
    }
    ```
    
    you can reduce it to 1 branch for larger arrays, 2 branches for 2 element array, and 3 branches for 1 element array.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1555#discussion_r15334976
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/CompactBuffer.scala ---
    @@ -0,0 +1,155 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.util.collection
    +
    +/**
    + * An append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.
    + * ArrayBuffer always allocates an Object array to store the data, with 16 entries by default,
    + * so it has about 80-100 bytes of overhead. In contrast, CompactBuffer can keep up to two
    + * elements in fields of the main object, and only allocates an Array[AnyRef] if there are more
    + * entries than that. This makes it more efficient for operations like groupBy where we expect
    + * some keys to have very few elements.
    + */
    +private[spark] class CompactBuffer[T] extends Seq[T] with Serializable {
    +  // First two elements
    +  private var element0: T = _
    +  private var element1: T = _
    +
    +  // Number of elements, including our two in the main object
    +  private var curSize = 0
    +
    +  // Array for extra elements
    +  private var otherElements: Array[AnyRef] = null
    +
    +  def apply(position: Int): T = {
    +    if (position < 0 || position >= curSize) {
    +      throw new IndexOutOfBoundsException
    +    }
    +    if (position == 0) {
    +      element0
    +    } else if (position == 1) {
    +      element1
    +    } else {
    +      otherElements(position - 2).asInstanceOf[T]
    +    }
    +  }
    +
    +  def update(position: Int, value: T): Unit = {
    +    if (position < 0 || position >= curSize) {
    +      throw new IndexOutOfBoundsException
    +    }
    +    if (position == 0) {
    +      element0 = value
    +    } else if (position == 1) {
    +      element1 = value
    +    } else {
    +      otherElements(position - 2) = value.asInstanceOf[AnyRef]
    +    }
    +  }
    +
    +  def += (value: T): CompactBuffer[T] = {
    +    val newIndex = curSize
    +    if (newIndex == 0) {
    +      element0 = value
    +      curSize = 1
    +    } else if (newIndex == 1) {
    +      element1 = value
    +      curSize = 2
    +    } else {
    +      growToSize(curSize + 1)
    +      otherElements(newIndex - 2) = value.asInstanceOf[AnyRef]
    +    }
    +    this
    +  }
    +
    +  def ++= (values: TraversableOnce[T]): CompactBuffer[T] = {
    +    values match {
    +      case compactBuf: CompactBuffer[T] =>
    --- End diff --
    
    does CompactBuffer[T] here trigger a compile time warning due to erasure? Maybe CompactBuffer[_] ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1555#discussion_r15334884
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/CompactBuffer.scala ---
    @@ -0,0 +1,155 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.util.collection
    +
    +/**
    + * An append-only buffer similar to ArrayBuffer, but more memory-efficient for small buffers.
    + * ArrayBuffer always allocates an Object array to store the data, with 16 entries by default,
    + * so it has about 80-100 bytes of overhead. In contrast, CompactBuffer can keep up to two
    + * elements in fields of the main object, and only allocates an Array[AnyRef] if there are more
    + * entries than that. This makes it more efficient for operations like groupBy where we expect
    + * some keys to have very few elements.
    + */
    +private[spark] class CompactBuffer[T] extends Seq[T] with Serializable {
    +  // First two elements
    +  private var element0: T = _
    +  private var element1: T = _
    +
    +  // Number of elements, including our two in the main object
    +  private var curSize = 0
    +
    +  // Array for extra elements
    +  private var otherElements: Array[AnyRef] = null
    +
    +  def apply(position: Int): T = {
    +    if (position < 0 || position >= curSize) {
    +      throw new IndexOutOfBoundsException
    +    }
    +    if (position == 0) {
    +      element0
    +    } else if (position == 1) {
    +      element1
    +    } else {
    +      otherElements(position - 2).asInstanceOf[T]
    +    }
    +  }
    +
    +  def update(position: Int, value: T): Unit = {
    --- End diff --
    
    is this used at all?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by aarondav <gi...@git.apache.org>.

Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1555#discussion_r15324126
  
    --- Diff: core/src/main/scala/org/apache/spark/util/collection/CompactBuffer.scala ---
    @@ -0,0 +1,153 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.util.collection
    +
    +/**
    + * An append-only buffer similar to ArrayBuffer, but more efficient for small buffers.
    + * ArrayBuffer always allocates an Object array to store the data, with 16 entries by default,
    + * so it has about 80-100 bytes of overhead. In contrast, CompactBuffer can keep up to two
    + * elements in fields of the main object, and only allocates an Array[AnyRef] if there are more
    + * entries than that. This makes it more efficient for operations like groupBy where we expect
    + * some keys to have very few elements.
    + */
    +private[spark] class CompactBuffer[T] extends Seq[T] with Serializable {
    +  // First two elements
    +  private var element0: T = _
    +  private var element1: T = _
    +
    +  // Number of elements, including our two in the main object
    +  private var curSize = 0
    +
    +  // Array for extra elements
    +  private var otherElements: Array[AnyRef] = null
    +
    +  def apply(position: Int): T = {
    +    if (position < 0 || position >= curSize) {
    +      throw new IndexOutOfBoundsException
    +    }
    +    if (position == 0) {
    +      element0
    +    } else if (position == 1) {
    +      element1
    +    } else {
    +      otherElements(position - 2).asInstanceOf[T]
    +    }
    +  }
    +
    +  def update(position: Int, value: T): Unit = {
    +    if (position < 0 || position >= curSize) {
    +      throw new IndexOutOfBoundsException
    +    }
    +    if (position == 0) {
    +      element0 = value
    +    } else if (position == 1) {
    +      element1 = value
    +    } else {
    +      otherElements(position - 2) = value.asInstanceOf[AnyRef]
    +    }
    +  }
    +
    +  def += (value: T): CompactBuffer[T] = {
    +    growToSize(curSize + 1)
    --- End diff --
    
    Maybe do this only in the else case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

Posted by mateiz <gi...@git.apache.org>.

Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1555#issuecomment-50110354
  
    Let me make one other change actually, I'll decrease the initial size of CompactBuffer's array to 8.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---