You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by liancheng <gi...@git.apache.org> on 2016/01/26 02:18:02 UTC

[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/10911

    [SPARK-12935][SQL] DataFrame API for Count-Min Sketch

    This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark cms-df-api

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10911.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10911
    
----
commit 4200636b435b686ac6fe2e979544f1aca4058209
Author: Cheng Lian <li...@databricks.com>
Date:   2016-01-26T01:15:17Z

    DataFrame API for Count-Min Sketch

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50925316
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -325,27 +321,43 @@ public void writeTo(OutputStream out) throws IOException {
       }
     
       public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
    +    CountMinSketchImpl sketch = new CountMinSketchImpl();
    +    sketch.readFrom0(in);
    +    return sketch;
    --- End diff --
    
    how about we move these codes into `CountMinSketch.readFrom`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50786602
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -309,4 +311,88 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
         sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
       }
    +
    +  /**
    +   * Builds a Count-Min sketch over a specified column.
    --- End diff --
    
    "Count-min Sketch"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175305090
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175254946
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50117/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50925607
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -325,27 +321,43 @@ public void writeTo(OutputStream out) throws IOException {
       }
     
       public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
    +    CountMinSketchImpl sketch = new CountMinSketchImpl();
    +    sketch.readFrom0(in);
    +    return sketch;
    +  }
    +
    +  private void readFrom0(InputStream in) throws IOException {
    --- End diff --
    
    this is actually a common naming style in java - to have the private version named xxx0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50786449
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -52,6 +57,10 @@
       private double eps;
       private double confidence;
     
    +  public CountMinSketchImpl() {
    --- End diff --
    
    private?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50786975
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -309,4 +311,88 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
         sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
       }
    +
    +  /**
    +   * Builds a Count-Min sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @see [[http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf]]
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch = {
    +    countMinSketch(Column(colName), depth, width, seed)
    +  }
    +
    +  /**
    +   * Builds a Count-Min sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param eps relative error of the sketch
    +   * @param confidence confidence of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @see [[http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf]]
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(
    +      colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
    +    countMinSketch(Column(colName), eps, confidence, seed)
    +  }
    +
    +  /**
    +   * Builds a Count-Min sketch over a specified column.
    +   *
    +   * @param col the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @see [[http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf]]
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch = {
    +    countMinSketch(col, CountMinSketch.create(depth, width, seed))
    +  }
    +
    +  /**
    +   * Builds a Count-Min sketch over a specified column.
    +   *
    +   * @param col the column over which the sketch is built
    +   * @param eps relative error of the sketch
    +   * @param confidence confidence of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @see [[http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf]]
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
    +    countMinSketch(col, CountMinSketch.create(eps, confidence, seed))
    +  }
    +
    +  private def countMinSketch(col: Column, zero: CountMinSketch): CountMinSketch = {
    +    val singleCol = df.select(col)
    +    val colType = singleCol.schema.head.dataType
    +
    +    require({
    +      val supportedTypes: Set[DataType] =
    +        Set(ByteType, ShortType, IntegerType, LongType, StringType)
    +      supportedTypes.contains(colType)
    +    }, s"Count-Min Sketch doesn't support data type $colType yet.")
    --- End diff --
    
    "Count-min Sketch only supports byte, short, int, long, string type and does not support xxx type."
    
    
    Don't say yet because that sounds like we already have plan to fix all of them.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50919784
  
    --- Diff: sql/core/pom.xml ---
    @@ -44,6 +44,11 @@
         </dependency>
         <dependency>
           <groupId>org.apache.spark</groupId>
    +      <artifactId>spark-sketch_2.10</artifactId>
    --- End diff --
    
    @rxin told this. I'm not quite sure about the details though :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174756777
  
    cc @cloud-fan @rxin @yhuai 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50878077
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -309,4 +311,84 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
         sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
       }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch = {
    +    countMinSketch(Column(colName), depth, width, seed)
    +  }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param eps relative error of the sketch
    +   * @param confidence confidence of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(
    +      colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
    +    countMinSketch(Column(colName), eps, confidence, seed)
    +  }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param col the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch = {
    +    countMinSketch(col, CountMinSketch.create(depth, width, seed))
    +  }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param col the column over which the sketch is built
    +   * @param eps relative error of the sketch
    +   * @param confidence confidence of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
    +    countMinSketch(col, CountMinSketch.create(eps, confidence, seed))
    +  }
    +
    +  private def countMinSketch(col: Column, zero: CountMinSketch): CountMinSketch = {
    +    val singleCol = df.select(col)
    +    val colType = singleCol.schema.head.dataType
    +    val supportedTypes: Set[DataType] = Set(ByteType, ShortType, IntegerType, LongType, StringType)
    +
    +    require(
    +      supportedTypes.contains(colType),
    --- End diff --
    
    Actually after thinking about it - let's avoid doing that and list the explicit types. It is plausible in the future we introduce an int96 or int128 data type, and I bet we won't remember this is one place we need to update it.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175324264
  
    **[Test build #50126 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50126/consoleFull)** for PR 10911 at commit [`4a40802`](https://github.com/apache/spark/commit/4a4080289d025848a66f288429652c713c004ae3).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50893099
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -348,4 +374,30 @@ public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
     
         return new CountMinSketchImpl(depth, width, totalCount, hashA, table);
       }
    +
    +  @Override
    +  public void writeExternal(ObjectOutput out) throws IOException {
    +    try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
    +      this.writeTo(bos);
    +      byte[] bytes = bos.toByteArray();
    +      out.writeObject(bytes);
    --- End diff --
    
    So we copy the data twice here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50907571
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -348,4 +374,30 @@ public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
     
         return new CountMinSketchImpl(depth, width, totalCount, hashA, table);
       }
    +
    +  @Override
    +  public void writeExternal(ObjectOutput out) throws IOException {
    +    try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
    +      this.writeTo(bos);
    +      byte[] bytes = bos.toByteArray();
    +      out.writeObject(bytes);
    --- End diff --
    
    bloom filters can get pretty big when cardinality is high (e.g. 500MB). It's best if we can avoid the extra copying.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175322730
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50144/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50786480
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -347,8 +356,10 @@ public void writeTo(OutputStream out) throws IOException {
       public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
         DataInputStream dis = new DataInputStream(in);
     
    -    // Ignores version number
    -    dis.readInt();
    +    int version = dis.readInt();
    +    if (version != Version.V1.getVersionNumber()) {
    +      throw new IOException("Unexpected Count-Min Sketch version number " + version);
    --- End diff --
    
    might be better to put it in parentheses, i.e. "version number (15)".



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175305092
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50138/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50906771
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -17,17 +17,37 @@
     
     package org.apache.spark.util.sketch;
     
    +import java.io.ByteArrayInputStream;
    +import java.io.ByteArrayOutputStream;
     import java.io.DataInputStream;
     import java.io.DataOutputStream;
    +import java.io.Externalizable;
     import java.io.IOException;
     import java.io.InputStream;
    +import java.io.ObjectInput;
    +import java.io.ObjectOutput;
     import java.io.OutputStream;
     import java.io.UnsupportedEncodingException;
     import java.util.Arrays;
     import java.util.Random;
     
    -class CountMinSketchImpl extends CountMinSketch {
    -  public static final long PRIME_MODULUS = (1L << 31) - 1;
    +/*
    + * Binary format of a serialized CountMinSketchImpl, version 1 (all values written in big-endian
    --- End diff --
    
    Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175362482
  
    **[Test build #50146 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50146/consoleFull)** for PR 10911 at commit [`3ff902a`](https://github.com/apache/spark/commit/3ff902af15538386a9d3020d82efa76b8098d947).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174847279
  
    **[Test build #50061 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50061/consoleFull)** for PR 10911 at commit [`32a9860`](https://github.com/apache/spark/commit/32a98600705c951b202b0a060eaca536b1477713).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50880387
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -17,17 +17,37 @@
     
     package org.apache.spark.util.sketch;
     
    +import java.io.ByteArrayInputStream;
    +import java.io.ByteArrayOutputStream;
     import java.io.DataInputStream;
     import java.io.DataOutputStream;
    +import java.io.Externalizable;
     import java.io.IOException;
     import java.io.InputStream;
    +import java.io.ObjectInput;
    +import java.io.ObjectOutput;
     import java.io.OutputStream;
     import java.io.UnsupportedEncodingException;
     import java.util.Arrays;
     import java.util.Random;
     
    -class CountMinSketchImpl extends CountMinSketch {
    -  public static final long PRIME_MODULUS = (1L << 31) - 1;
    +/*
    + * Binary format of a serialized CountMinSketchImpl, version 1 (all values written in big-endian
    --- End diff --
    
    This comment has been moved to `CountMinSketch.Version` as @rxin suggested in https://github.com/apache/spark/pull/10920#discussion_r50802512


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50925566
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -1270,4 +1270,37 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           Seq(1 -> "a").toDF("i", "j").filter($"i".cast(StringType) === "1"),
           Row(1, "a"))
       }
    +
    +  // This test case only verifies that `DataFrame.countMinSketch()` methods do return
    +  // `CountMinSketch`es that meed required specs.  Test cases for `CountMinSketch` can be found in
    --- End diff --
    
    how about move these tests to `DataFrameStatSuite`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50914282
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -348,4 +374,30 @@ public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
     
         return new CountMinSketchImpl(depth, width, totalCount, hashA, table);
       }
    +
    +  @Override
    +  public void writeExternal(ObjectOutput out) throws IOException {
    +    try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
    +      this.writeTo(bos);
    +      byte[] bytes = bos.toByteArray();
    +      out.writeObject(bytes);
    --- End diff --
    
    Oh actually we can just use extend `Serializable` and use private `writeObject` and `readObject` override serialization. There we have `ObjectInputStream`/`ObjectOutputStream` instead of `ObjectInput`/`ObjectOutput`.
    
    cc @cloud-fan


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50925449
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -1270,4 +1270,37 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           Seq(1 -> "a").toDF("i", "j").filter($"i".cast(StringType) === "1"),
           Row(1, "a"))
       }
    +
    +  // This test case only verifies that `DataFrame.countMinSketch()` methods do return
    +  // `CountMinSketch`es that meed required specs.  Test cases for `CountMinSketch` can be found in
    --- End diff --
    
    `meed` => `meet`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50918206
  
    --- Diff: sql/core/pom.xml ---
    @@ -44,6 +44,11 @@
         </dependency>
         <dependency>
           <groupId>org.apache.spark</groupId>
    +      <artifactId>spark-sketch_2.10</artifactId>
    --- End diff --
    
    use `scala.binary.version`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175324830
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175319396
  
    **[Test build #50146 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50146/consoleFull)** for PR 10911 at commit [`3ff902a`](https://github.com/apache/spark/commit/3ff902af15538386a9d3020d82efa76b8098d947).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50926992
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketch.java ---
    @@ -59,16 +59,17 @@
       public enum Version {
         /**
          * {@code CountMinSketch} binary format version 1 (all values written in big-endian order):
    -     * - Version number, always 1 (32 bit)
    -     * - Total count of added items (64 bit)
    -     * - Depth (32 bit)
    -     * - Width (32 bit)
    -     * - Hash functions (depth * 64 bit)
    -     * - Count table
    -     *   - Row 0 (width * 64 bit)
    -     *   - Row 1 (width * 64 bit)
    -     *   - ...
    -     *   - Row depth - 1 (width * 64 bit)
    +     *
    +     *  - Version number, always 1 (32 bit)
    +     *  - Total count of added items (64 bit)
    +     *  - Depth (32 bit)
    +     *  - Width (32 bit)
    +     *  - Hash functions (depth * 64 bit)
    +     *  - Count table
    +     *    - Row 0 (width * 64 bit)
    +     *    - Row 1 (width * 64 bit)
    +     *    - ...
    +     *    - Row depth - 1 (width * 64 bit)
    --- End diff --
    
    I just realized that this is now in a Javadoc block. Should reformat this using HTML tags. Same thing applies to the bloom filter format description.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174761419
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50787971
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -52,6 +57,10 @@
       private double eps;
       private double confidence;
     
    +  public CountMinSketchImpl() {
    --- End diff --
    
    This is required by `Externalizable`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175362791
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50146/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50786852
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -309,4 +311,88 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
         sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
       }
    +
    +  /**
    +   * Builds a Count-Min sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @see [[http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf]]
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch = {
    +    countMinSketch(Column(colName), depth, width, seed)
    +  }
    +
    +  /**
    +   * Builds a Count-Min sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param eps relative error of the sketch
    +   * @param confidence confidence of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @see [[http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf]]
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(
    +      colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
    +    countMinSketch(Column(colName), eps, confidence, seed)
    +  }
    +
    +  /**
    +   * Builds a Count-Min sketch over a specified column.
    +   *
    +   * @param col the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @see [[http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf]]
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch = {
    +    countMinSketch(col, CountMinSketch.create(depth, width, seed))
    +  }
    +
    +  /**
    +   * Builds a Count-Min sketch over a specified column.
    +   *
    +   * @param col the column over which the sketch is built
    +   * @param eps relative error of the sketch
    +   * @param confidence confidence of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @see [[http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf]]
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
    +    countMinSketch(col, CountMinSketch.create(eps, confidence, seed))
    +  }
    +
    +  private def countMinSketch(col: Column, zero: CountMinSketch): CountMinSketch = {
    +    val singleCol = df.select(col)
    +    val colType = singleCol.schema.head.dataType
    +
    +    require({
    +      val supportedTypes: Set[DataType] =
    --- End diff --
    
    this seems too convoluted. just put supportedTypes outside the require so it is more obvious what is being required.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174785197
  
    **[Test build #50061 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50061/consoleFull)** for PR 10911 at commit [`32a9860`](https://github.com/apache/spark/commit/32a98600705c951b202b0a060eaca536b1477713).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50784455
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -45,7 +48,6 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
        *    df.stat.cov("rand1", "rand2")
        *    res1: Double = 0.065...
        * }}}
    -   *
    --- End diff --
    
    Weird, I didn't make these empty comment line changes. Reverting them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50926614
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -1270,4 +1270,37 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           Seq(1 -> "a").toDF("i", "j").filter($"i".cast(StringType) === "1"),
           Row(1, "a"))
       }
    +
    +  // This test case only verifies that `DataFrame.countMinSketch()` methods do return
    +  // `CountMinSketch`es that meed required specs.  Test cases for `CountMinSketch` can be found in
    --- End diff --
    
    Thanks, moved.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50925247
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -325,27 +321,43 @@ public void writeTo(OutputStream out) throws IOException {
       }
     
       public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
    +    CountMinSketchImpl sketch = new CountMinSketchImpl();
    +    sketch.readFrom0(in);
    +    return sketch;
    +  }
    +
    +  private void readFrom0(InputStream in) throws IOException {
    --- End diff --
    
    this name is quite weird...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175381894
  
    I'm going to merge this. Thanks.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50786646
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -309,4 +311,88 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
         sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
       }
    +
    +  /**
    +   * Builds a Count-Min sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @see [[http://www.eecs.harvard.edu/~michaelm/CS222/countmin.pdf]]
    --- End diff --
    
    remove this reference (the link actually doesn't work).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175256282
  
    **[Test build #50126 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50126/consoleFull)** for PR 10911 at commit [`4a40802`](https://github.com/apache/spark/commit/4a4080289d025848a66f288429652c713c004ae3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175322727
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174816049
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50055/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10911


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174783482
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50057/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174847369
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50061/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175254410
  
    **[Test build #50117 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50117/consoleFull)** for PR 10911 at commit [`fb23a24`](https://github.com/apache/spark/commit/fb23a24d8ccef486e241b0eb0d9fece44a89e338).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174783481
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174760913
  
    **[Test build #50055 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50055/consoleFull)** for PR 10911 at commit [`4e5d1af`](https://github.com/apache/spark/commit/4e5d1afb8949b4b82398faffc8380a6d7230464e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50907599
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -348,4 +374,30 @@ public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
     
         return new CountMinSketchImpl(depth, width, totalCount, hashA, table);
       }
    +
    +  @Override
    +  public void writeExternal(ObjectOutput out) throws IOException {
    +    try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
    +      this.writeTo(bos);
    +      byte[] bytes = bos.toByteArray();
    +      out.writeObject(bytes);
    --- End diff --
    
    btw the same applies to count-min sketch.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50797694
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -368,4 +379,30 @@ public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
     
         return new CountMinSketchImpl(depth, width, totalCount, hashA, table);
       }
    +
    +  @Override
    +  public void writeExternal(ObjectOutput out) throws IOException {
    +    try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
    +      this.writeTo(bos);
    +      byte[] bytes = bos.toByteArray();
    +      out.writeObject(bytes);
    +    }
    +  }
    +
    +  @Override
    +  public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
    +    byte[] bytes = (byte[]) in.readObject();
    +
    +    try (ByteArrayInputStream bis = new ByteArrayInputStream(bytes)) {
    +      CountMinSketchImpl sketch = CountMinSketchImpl.readFrom(bis);
    +
    +      this.depth = sketch.depth;
    --- End diff --
    
    it'd be good to refactor this so we don't need to assign the variables. one way is to take the serialization/deserialization code out of readFrom into a function.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50786439
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -41,7 +46,7 @@
      *   - ...
      *   - Row depth - 1 (width * 64 bit)
      */
    -class CountMinSketchImpl extends CountMinSketch {
    +class CountMinSketchImpl extends CountMinSketch implements Externalizable {
       public static final long PRIME_MODULUS = (1L << 31) - 1;
    --- End diff --
    
    why is this public?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174816044
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174847368
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174848875
  
    cc @JoshRosen is the python tests broken?
    
    ```
    Running PySpark tests. Output is in /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log
    Error: unrecognized module 'root'. Supported modules: pyspark-mllib, pyspark-core, pyspark-ml, pyspark-sql, pyspark-streaming
    [error] running /home/jenkins/workspace/SparkPullRequestBuilder/python/run-tests --modules=pyspark-mllib,pyspark-ml,pyspark-sql,root --parallelism=4 ; received return code 255
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175275779
  
    Josh is looking into the PySpark test failure.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50877569
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -309,4 +311,84 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
         sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
       }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch = {
    +    countMinSketch(Column(colName), depth, width, seed)
    +  }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param eps relative error of the sketch
    +   * @param confidence confidence of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(
    +      colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
    +    countMinSketch(Column(colName), eps, confidence, seed)
    +  }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param col the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch = {
    +    countMinSketch(col, CountMinSketch.create(depth, width, seed))
    +  }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param col the column over which the sketch is built
    +   * @param eps relative error of the sketch
    +   * @param confidence confidence of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
    +    countMinSketch(col, CountMinSketch.create(eps, confidence, seed))
    +  }
    +
    +  private def countMinSketch(col: Column, zero: CountMinSketch): CountMinSketch = {
    +    val singleCol = df.select(col)
    +    val colType = singleCol.schema.head.dataType
    +    val supportedTypes: Set[DataType] = Set(ByteType, ShortType, IntegerType, LongType, StringType)
    +
    +    require(
    +      supportedTypes.contains(colType),
    --- End diff --
    
    Good point


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174761421
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50053/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175176766
  
    **[Test build #50117 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50117/consoleFull)** for PR 10911 at commit [`fb23a24`](https://github.com/apache/spark/commit/fb23a24d8ccef486e241b0eb0d9fece44a89e338).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175254943
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175362788
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-174815506
  
    **[Test build #50055 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50055/consoleFull)** for PR 10911 at commit [`4e5d1af`](https://github.com/apache/spark/commit/4e5d1afb8949b4b82398faffc8380a6d7230464e).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class CountMinSketchImpl extends CountMinSketch implements Externalizable `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50919715
  
    --- Diff: sql/core/pom.xml ---
    @@ -44,6 +44,11 @@
         </dependency>
         <dependency>
           <groupId>org.apache.spark</groupId>
    +      <artifactId>spark-sketch_2.10</artifactId>
    --- End diff --
    
    Actually this is always hard coded as `_2.10` to make publishing easier.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50906324
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -348,4 +374,30 @@ public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
     
         return new CountMinSketchImpl(depth, width, totalCount, hashA, table);
       }
    +
    +  @Override
    +  public void writeExternal(ObjectOutput out) throws IOException {
    +    try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
    +      this.writeTo(bos);
    +      byte[] bytes = bos.toByteArray();
    +      out.writeObject(bytes);
    --- End diff --
    
    Yeah, essentially we can't retrieve the underlying output stream from an `ObjectOutput`. On the other hand, for typical use cases `df.countMinSketch()` this method isn't on performance critical path.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50926598
  
    --- Diff: common/sketch/src/main/java/org/apache/spark/util/sketch/CountMinSketchImpl.java ---
    @@ -325,27 +321,43 @@ public void writeTo(OutputStream out) throws IOException {
       }
     
       public static CountMinSketchImpl readFrom(InputStream in) throws IOException {
    +    CountMinSketchImpl sketch = new CountMinSketchImpl();
    +    sketch.readFrom0(in);
    +    return sketch;
    --- End diff --
    
    Hm, I prefer not, because then we'll have to make the no-arg constructor package private instead of private.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10911#issuecomment-175324834
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50126/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50802122
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -309,4 +311,84 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
         sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
       }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch = {
    +    countMinSketch(Column(colName), depth, width, seed)
    +  }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param eps relative error of the sketch
    +   * @param confidence confidence of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(
    +      colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
    +    countMinSketch(Column(colName), eps, confidence, seed)
    +  }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param col the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch = {
    +    countMinSketch(col, CountMinSketch.create(depth, width, seed))
    +  }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param col the column over which the sketch is built
    +   * @param eps relative error of the sketch
    +   * @param confidence confidence of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
    +    countMinSketch(col, CountMinSketch.create(eps, confidence, seed))
    +  }
    +
    +  private def countMinSketch(col: Column, zero: CountMinSketch): CountMinSketch = {
    +    val singleCol = df.select(col)
    +    val colType = singleCol.schema.head.dataType
    +    val supportedTypes: Set[DataType] = Set(ByteType, ShortType, IntegerType, LongType, StringType)
    +
    +    require(
    +      supportedTypes.contains(colType),
    +      s"Count-min Sketch only supports string type and integral types, " +
    +        s"and does not support type $colType."
    +    )
    +
    +    singleCol.rdd.aggregate(zero)(
    --- End diff --
    
    Maybe we can improve it by UDAF in the future.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50801910
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -309,4 +311,84 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
       def sampleBy[T](col: String, fractions: ju.Map[T, jl.Double], seed: Long): DataFrame = {
         sampleBy(col, fractions.asScala.toMap.asInstanceOf[Map[T, Double]], seed)
       }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch = {
    +    countMinSketch(Column(colName), depth, width, seed)
    +  }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param colName name of the column over which the sketch is built
    +   * @param eps relative error of the sketch
    +   * @param confidence confidence of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(
    +      colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
    +    countMinSketch(Column(colName), eps, confidence, seed)
    +  }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param col the column over which the sketch is built
    +   * @param depth depth of the sketch
    +   * @param width width of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch = {
    +    countMinSketch(col, CountMinSketch.create(depth, width, seed))
    +  }
    +
    +  /**
    +   * Builds a Count-min Sketch over a specified column.
    +   *
    +   * @param col the column over which the sketch is built
    +   * @param eps relative error of the sketch
    +   * @param confidence confidence of the sketch
    +   * @param seed random seed
    +   * @return a [[CountMinSketch]] over column `colName`
    +   * @since 2.0.0
    +   */
    +  def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch = {
    +    countMinSketch(col, CountMinSketch.create(eps, confidence, seed))
    +  }
    +
    +  private def countMinSketch(col: Column, zero: CountMinSketch): CountMinSketch = {
    +    val singleCol = df.select(col)
    +    val colType = singleCol.schema.head.dataType
    +    val supportedTypes: Set[DataType] = Set(ByteType, ShortType, IntegerType, LongType, StringType)
    +
    +    require(
    +      supportedTypes.contains(colType),
    --- End diff --
    
    how about `colType == StringType || colType.isInstanceOf[IntegralType]`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12935][SQL] DataFrame API for Count-Min...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10911#discussion_r50925871
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ---
    @@ -1270,4 +1270,37 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
           Seq(1 -> "a").toDF("i", "j").filter($"i".cast(StringType) === "1"),
           Row(1, "a"))
       }
    +
    +  // This test case only verifies that `DataFrame.countMinSketch()` methods do return
    +  // `CountMinSketch`es that meed required specs.  Test cases for `CountMinSketch` can be found in
    --- End diff --
    
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org