You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2016/01/25 03:58:29 UTC

[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/10893

    [SPARK-12934][SQL] Count-min sketch serialization

    This PR adds serialization support for `CountMinSketch`.
    
    A version number is added to version the serialized binary format.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark cms-serialization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10893.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10893
    
----
commit 692f93e98477fe6fc5d25a0f19eb1880683dfdf2
Author: Cheng Lian <li...@databricks.com>
Date:   2016-01-25T02:50:40Z

    CountMinSketch serialization

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174707729
  
    **[Test build #50020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50020/consoleFull)** for PR 10893 at commit [`4abc4e0`](https://github.com/apache/spark/commit/4abc4e06fc7077e10acad442ed2d81e5b156118f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50668201
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -51,6 +53,53 @@ public CountMinSketchImpl(double eps, double confidence, int seed) {
         initTablesWith(depth, width, seed);
       }
     
    +  public CountMinSketchImpl(int depth, int width, long totalCount, long hashA[], long table[][]) {
    +    this.depth = depth;
    +    this.width = width;
    +    this.eps = 2.0 / width;
    +    this.confidence = 1 - 1 / Math.pow(2, depth);
    +    this.hashA = hashA;
    +    this.table = table;
    +    this.totalCount = totalCount;
    +  }
    +
    +  @Override
    +  public boolean equals(Object other) {
    +    if (other == this) {
    +      return true;
    +    }
    +
    +    if (!(other instanceof CountMinSketchImpl)) {
    +      return false;
    +    }
    +
    +    CountMinSketchImpl that = (CountMinSketchImpl) other;
    +
    +    if (this.depth == that.depth &&
    +        this.width == that.width &&
    +        this.totalCount == that.totalCount) {
    +      for (int i = 0; i < depth; ++i) {
    +        if (this.hashA[i] != that.hashA[i]) {
    +          return false;
    +        }
    +
    +        for (int j = 0; j < width; ++j) {
    --- End diff --
    
    can we use array equality test instead of writing a loop?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174683458
  
    **[Test build #50020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50020/consoleFull)** for PR 10893 at commit [`4abc4e0`](https://github.com/apache/spark/commit/4abc4e06fc7077e10acad442ed2d81e5b156118f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50738340
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchMergeException.java ---
    @@ -0,0 +1,24 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.util.sketch;
    +
    +public class CountMinSketchMergeException extends Exception {
    --- End diff --
    
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174708046
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174708818
  
    LGTM overall


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174701361
  
    **[Test build #50025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50025/consoleFull)** for PR 10893 at commit [`4636af1`](https://github.com/apache/spark/commit/4636af1b3e37c9be30d4e50e125a16005f8d0fab).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174685049
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50668071
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java ---
    @@ -99,19 +120,44 @@
        *
        * Note that only Count-Min sketches with the same {@code depth}, {@code width}, and random seed
        * can be merged.
    +   *
    +   * @exception CountMinSketchMergeException if the {@code other} {@link CountMinSketch} has
    +   *            incompatible depth, width, relative-error, confidence, or random seed.
        */
    -  public abstract CountMinSketch mergeInPlace(CountMinSketch other);
    +  public abstract CountMinSketch mergeInPlace(CountMinSketch other)
    +      throws CountMinSketchMergeException;
     
       /**
        * Writes out this {@link CountMinSketch} to an output stream in binary format.
        */
    -  public abstract void writeTo(OutputStream out);
    +  public abstract void writeTo(OutputStream out) throws IOException;
     
       /**
        * Reads in a {@link CountMinSketch} from an input stream.
        */
    -  public static CountMinSketch readFrom(InputStream in) {
    -    throw new UnsupportedOperationException("Not implemented yet");
    +  public static CountMinSketch readFrom(InputStream in) throws IOException {
    +    DataInputStream dis = new DataInputStream(in);
    --- End diff --
    
    let's put the implementation in CountMinSketchImpl, and simply delegate to a static method there


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by tedyu <gi...@git.apache.org>.
Github user tedyu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50774861
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -256,13 +324,48 @@ public CountMinSketch mergeInPlace(CountMinSketch other) {
       }
     
       @Override
    -  public void writeTo(OutputStream out) {
    -    throw new UnsupportedOperationException("Not implemented yet");
    +  public void writeTo(OutputStream out) throws IOException {
    +    DataOutputStream dos = new DataOutputStream(out);
    +
    +    dos.writeInt(version().getVersionNumber());
    +
    +    dos.writeLong(this.totalCount);
    +    dos.writeInt(this.depth);
    +    dos.writeInt(this.width);
    +
    +    for (int i = 0; i < this.depth; ++i) {
    +      dos.writeLong(this.hashA[i]);
    +    }
    +
    +    for (int i = 0; i < this.depth; ++i) {
    +      for (int j = 0; j < this.width; ++j) {
    +        dos.writeLong(table[i][j]);
    +      }
    +    }
       }
     
    -  protected static class CMSMergeException extends RuntimeException {
    -    public CMSMergeException(String message) {
    -      super(message);
    +  public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
    +    DataInputStream dis = new DataInputStream(in);
    --- End diff --
    
    This should be closed before returning from the method, right ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50726946
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -51,6 +53,53 @@ public CountMinSketchImpl(double eps, double confidence, int seed) {
         initTablesWith(depth, width, seed);
       }
     
    +  public CountMinSketchImpl(int depth, int width, long totalCount, long hashA[], long table[][]) {
    +    this.depth = depth;
    +    this.width = width;
    +    this.eps = 2.0 / width;
    +    this.confidence = 1 - 1 / Math.pow(2, depth);
    +    this.hashA = hashA;
    +    this.table = table;
    +    this.totalCount = totalCount;
    +  }
    +
    +  @Override
    +  public boolean equals(Object other) {
    --- End diff --
    
    Also override hashcode?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174708050
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50020/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50769459
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -17,11 +17,30 @@
     
     package org.apache.spark.util.sketch;
     
    +import java.io.DataInputStream;
    +import java.io.DataOutputStream;
    +import java.io.IOException;
    +import java.io.InputStream;
     import java.io.OutputStream;
     import java.io.UnsupportedEncodingException;
     import java.util.Arrays;
     import java.util.Random;
     
    +/*
    + * Binary format of a serialized CountMinSketchImpl, version 1 (all values written in big-endian
    + * order):
    + *
    + * - Version number, always 1 (32 bit)
    + * - Total count of added items (64 bit)
    + * - Depth (32 bit)
    + * - Width (32 bit)
    + * - Hash functions (depth * 64 bit)
    + * - Count table
    + *   - Row 0 (width * 64 bit)
    + *   - Row 1 (width * 64 bit)
    + *   - ...
    + *   - Row depth - 1 (width * 64 bit)
    + */
    --- End diff --
    
    should we move this comment near `writeTo`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174688313
  
    **[Test build #50023 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50023/consoleFull)** for PR 10893 at commit [`9acfe33`](https://github.com/apache/spark/commit/9acfe33037cbe9c1c8b5f38e2fdc45159b523ed9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50668080
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java ---
    @@ -99,19 +120,44 @@
        *
        * Note that only Count-Min sketches with the same {@code depth}, {@code width}, and random seed
        * can be merged.
    +   *
    +   * @exception CountMinSketchMergeException if the {@code other} {@link CountMinSketch} has
    +   *            incompatible depth, width, relative-error, confidence, or random seed.
        */
    -  public abstract CountMinSketch mergeInPlace(CountMinSketch other);
    +  public abstract CountMinSketch mergeInPlace(CountMinSketch other)
    +      throws CountMinSketchMergeException;
     
       /**
        * Writes out this {@link CountMinSketch} to an output stream in binary format.
        */
    -  public abstract void writeTo(OutputStream out);
    +  public abstract void writeTo(OutputStream out) throws IOException;
     
       /**
        * Reads in a {@link CountMinSketch} from an input stream.
        */
    -  public static CountMinSketch readFrom(InputStream in) {
    -    throw new UnsupportedOperationException("Not implemented yet");
    +  public static CountMinSketch readFrom(InputStream in) throws IOException {
    +    DataInputStream dis = new DataInputStream(in);
    --- End diff --
    
    also somewhere in CountMinSketchImpl we should document the serialization format.
     


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174685053
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50019/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50771577
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -256,13 +324,48 @@ public CountMinSketch mergeInPlace(CountMinSketch other) {
       }
     
       @Override
    -  public void writeTo(OutputStream out) {
    -    throw new UnsupportedOperationException("Not implemented yet");
    +  public void writeTo(OutputStream out) throws IOException {
    +    DataOutputStream dos = new DataOutputStream(out);
    +
    +    dos.writeInt(version().getVersionNumber());
    +
    +    dos.writeLong(this.totalCount);
    +    dos.writeInt(this.depth);
    +    dos.writeInt(this.width);
    +
    +    for (int i = 0; i < this.depth; ++i) {
    +      dos.writeLong(this.hashA[i]);
    +    }
    +
    +    for (int i = 0; i < this.depth; ++i) {
    +      for (int j = 0; j < this.width; ++j) {
    +        dos.writeLong(table[i][j]);
    +      }
    +    }
       }
     
    -  protected static class CMSMergeException extends RuntimeException {
    -    public CMSMergeException(String message) {
    -      super(message);
    +  public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
    +    DataInputStream dis = new DataInputStream(in);
    +
    +    // Ignores version number
    +    dis.readInt();
    --- End diff --
    
    should add some check here to throw an exception if the version number is not 1.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174693131
  
    **[Test build #50025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50025/consoleFull)** for PR 10893 at commit [`4636af1`](https://github.com/apache/spark/commit/4636af1b3e37c9be30d4e50e125a16005f8d0fab).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50786004
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -256,13 +324,48 @@ public CountMinSketch mergeInPlace(CountMinSketch other) {
       }
     
       @Override
    -  public void writeTo(OutputStream out) {
    -    throw new UnsupportedOperationException("Not implemented yet");
    +  public void writeTo(OutputStream out) throws IOException {
    +    DataOutputStream dos = new DataOutputStream(out);
    +
    +    dos.writeInt(version().getVersionNumber());
    +
    +    dos.writeLong(this.totalCount);
    +    dos.writeInt(this.depth);
    +    dos.writeInt(this.width);
    +
    +    for (int i = 0; i < this.depth; ++i) {
    +      dos.writeLong(this.hashA[i]);
    +    }
    +
    +    for (int i = 0; i < this.depth; ++i) {
    +      for (int j = 0; j < this.width; ++j) {
    +        dos.writeLong(table[i][j]);
    +      }
    +    }
       }
     
    -  protected static class CMSMergeException extends RuntimeException {
    -    public CMSMergeException(String message) {
    -      super(message);
    +  public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
    +    DataInputStream dis = new DataInputStream(in);
    --- End diff --
    
    But `DataInputStream.close()` also closes the underlying input stream, which isn't our expected behavior, is it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174701484
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50769946
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -17,11 +17,30 @@
     
     package org.apache.spark.util.sketch;
     
    +import java.io.DataInputStream;
    +import java.io.DataOutputStream;
    +import java.io.IOException;
    +import java.io.InputStream;
     import java.io.OutputStream;
     import java.io.UnsupportedEncodingException;
     import java.util.Arrays;
     import java.util.Random;
     
    +/*
    + * Binary format of a serialized CountMinSketchImpl, version 1 (all values written in big-endian
    + * order):
    + *
    + * - Version number, always 1 (32 bit)
    + * - Total count of added items (64 bit)
    + * - Depth (32 bit)
    + * - Width (32 bit)
    + * - Hash functions (depth * 64 bit)
    + * - Count table
    + *   - Row 0 (width * 64 bit)
    + *   - Row 1 (width * 64 bit)
    + *   - ...
    + *   - Row depth - 1 (width * 64 bit)
    + */
    --- End diff --
    
    Actually I couldn't decide to put it before `writeTo` or `readFrom`, and then ended up here...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174385244
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174697669
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50760286
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -51,6 +53,53 @@ public CountMinSketchImpl(double eps, double confidence, int seed) {
         initTablesWith(depth, width, seed);
       }
     
    +  public CountMinSketchImpl(int depth, int width, long totalCount, long hashA[], long table[][]) {
    +    this.depth = depth;
    +    this.width = width;
    +    this.eps = 2.0 / width;
    +    this.confidence = 1 - 1 / Math.pow(2, depth);
    +    this.hashA = hashA;
    +    this.table = table;
    +    this.totalCount = totalCount;
    +  }
    +
    +  @Override
    +  public boolean equals(Object other) {
    --- End diff --
    
    Added, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174385245
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49967/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174697675
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50023/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174719324
  
    Going to merge this. Thanks. Please address my (one) feedback in your next pr.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174385201
  
    **[Test build #49967 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49967/consoleFull)** for PR 10893 at commit [`e97d7f9`](https://github.com/apache/spark/commit/e97d7f92ad4cb075234772c24f246bf51eff6cc7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `      throw new CountMinSketchMergeException(\"Cannot merge estimator of class \" + other.getClass().getName());`
      * `public class CountMinSketchMergeException extends Exception `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10893


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174382920
  
    **[Test build #49967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49967/consoleFull)** for PR 10893 at commit [`e97d7f9`](https://github.com/apache/spark/commit/e97d7f92ad4cb075234772c24f246bf51eff6cc7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174701488
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50025/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50738797
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java ---
    @@ -55,6 +57,25 @@
      */
     abstract public class CountMinSketch {
       /**
    +   * Version number of the serialized binary format.
    +   */
    +  public enum Version {
    --- End diff --
    
    This should be count min sketch specific, because the two binary protocols can and should evolve separately.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50668355
  
    --- Diff: common/sketch/src/test/scala/org/apache/spark/util/sketch/CountMinSketchSuite.scala ---
    @@ -29,6 +31,16 @@ class CountMinSketchSuite extends FunSuite { // scalastyle:ignore funsuite
     
       private val seed = 42
     
    +  private def checkSerDe(sketch: CountMinSketch): Unit = {
    --- End diff --
    
    add an inline documentation explaining what this function does 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10893#issuecomment-174697161
  
    **[Test build #50023 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50023/consoleFull)** for PR 10893 at commit [`9acfe33`](https://github.com/apache/spark/commit/9acfe33037cbe9c1c8b5f38e2fdc45159b523ed9).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50668251
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchMergeException.java ---
    @@ -0,0 +1,24 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.util.sketch;
    +
    +public class CountMinSketchMergeException extends Exception {
    --- End diff --
    
    maybe we should make this slightly more general and reuse it in both bloom filter and count-min sketch.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12934][SQL] Count-min sketch serializat...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10893#discussion_r50738486
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java ---
    @@ -55,6 +57,25 @@
      */
     abstract public class CountMinSketch {
       /**
    +   * Version number of the serialized binary format.
    +   */
    +  public enum Version {
    --- End diff --
    
    Can we move this out so that we can also use it in bloom filter?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org