
Posted to reviews@spark.apache.org by brkyvz <gi...@git.apache.org> on 2015/04/30 07:33:18 UTC

[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

GitHub user brkyvz opened a pull request:

    https://github.com/apache/spark/pull/5799

    [SPARK-7242][SQL][MLLIB] Frequent items for DataFrames

    Finds frequent items, possibly with false positives, using the algorithm described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
    The public API is:
    ```
    df.stat.freqItems(cols: Array[String], support: Double = 0.001): DataFrame
    ```
    
    The output is a local DataFrame whose column names are the input column names with `-freqItems` appended. This is a single-pass algorithm that may return false positives but no false negatives.
    
    cc @mengxr @rxin 
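
The single-pass behaviour described above (false positives possible, false negatives not) can be illustrated on a local collection. This is a minimal sketch of the textbook counter-based scheme, not the code in this pull request; the object name `FrequentItemsSketch`, the method `candidates`, and the choice of `ceil(1 / support)` counters are assumptions made for the example.

```scala
import scala.collection.mutable

object FrequentItemsSketch {
  /**
   * Single pass over `items`. Every item whose relative frequency exceeds
   * `support` ends up in the result; some less frequent items may as well.
   */
  def candidates[A](items: Iterator[A], support: Double): Set[A] = {
    require(support > 0.0 && support <= 1.0, s"support ($support) must be in (0, 1].")
    // Assumed budget: one counter per 1/support slots, rounded up.
    val capacity = math.ceil(1.0 / support).toInt
    val counts = mutable.Map.empty[A, Long]
    items.foreach { item =>
      if (counts.contains(item)) {
        counts(item) += 1L
      } else if (counts.size < capacity) {
        counts(item) = 1L
      } else {
        // No free slot: decrement every counter and drop those that reach zero.
        counts.keys.toList.foreach { k =>
          val c = counts(k) - 1L
          if (c == 0L) counts.remove(k) else counts(k) = c
        }
      }
    }
    counts.keySet.toSet
  }
}
```

With `support = 0.5`, the input `a, a, a, b` yields both `a` and `b`. Here `b` is a false positive, which a second exact-counting pass over the candidates could filter out.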

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark freq-items

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5799.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5799
    
----
commit 3d82168544a29e7e1ae1326ab933db7e78a72dcc
Author: Burak Yavuz <br...@gmail.com>
Date:   2015-04-29T23:07:48Z

    made base implementation
    
    implemented frequent items

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97830647
  
      [Test build #31426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31426/consoleFull) for PR 5799 at commit [`39b1bba`](https://github.com/apache/spark/commit/39b1bbae0f2e1e0bd661a709c2ffb08fc492aedc).



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29405251
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala ---
    @@ -0,0 +1,124 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.ml
    +
    +
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{StructType, ArrayType, StructField}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Row, DataFrame, functions}
    +
    +private[sql] object FrequentItems extends Logging {
    +
    +  /**
    +   * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in
    +   * any emptied slots with the most frequent of `otherMap`.
    +   * @param baseMap The map containing the global counts
    +   * @param otherMap The map containing the counts for that partition
    +   * @param maxSize The maximum number of counts to keep in memory
    +   */
    +  private def mergeCounts[A](
    +      baseMap: MutableMap[A, Long],
    +      otherMap: MutableMap[A, Long],
    +      maxSize: Int): Unit = {
    +    val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) =>
    +      if (!baseMap.contains(k)) sum + v else sum
    +    }
    +    baseMap.retain((k, v) => v > otherSum)
    +    // sort in decreasing order, so that we will add the most frequent items first
    +    val sorted = otherMap.toSeq.sortBy(-_._2)
    +    var i = 0
    +    val otherSize = sorted.length
    +    while (i < otherSize && baseMap.size < maxSize) {
    +      val keyVal = sorted(i)
    +      baseMap += keyVal._1 -> keyVal._2
    +      i += 1
    +    }
    +  }
    +  
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the algorithm 
    +   * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
    +   * For Internal use only.
    +   *
    +   * @param df The input DataFrame
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  private[sql] def singlePassFreqItems(
    +      df: DataFrame, 
    +      cols: Array[String], 
    --- End diff --
    
    If multiple columns are provided, shall we search the combination of them instead of each individually? For example, if I call
    
    ~~~scala
    freqItems(Array("gender", "title"), 0.01)
    ~~~
    
    I'm expecting the frequent combinations instead of the frequent items of each column. The current implementation is more flexible because users can create a struct from multiple columns, and it allows finding frequent items on multiple columns in parallel. But I'm a little worried about what users expect when they call `freqItems(Array("gender", "title"))`. @rxin
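
The difference between the two readings can be shown with exact counts on a small local collection. This is an illustrative sketch only: it uses plain Scala collections rather than the DataFrame API, and `Person`, `frequent`, and the sample rows are hypothetical names and data introduced for the example.

```scala
object ColumnsVsCombinations {
  final case class Person(gender: String, title: String)

  /** Exact (non-streaming) frequent values: relative frequency of at least `support`. */
  def frequent[A](values: Seq[A], support: Double): Set[A] = {
    val n = values.size.toDouble
    values.groupBy(identity).collect {
      case (value, group) if group.size / n >= support => value
    }.toSet
  }

  val people: Seq[Person] = Seq(
    Person("F", "Dr"), Person("F", "Ms"), Person("M", "Dr"), Person("M", "Mr")
  )

  // Per-column reading: each column is searched on its own.
  val frequentGenders: Set[String] = frequent(people.map(_.gender), 0.5)
  val frequentTitles: Set[String] = frequent(people.map(_.title), 0.5)

  // Combination reading: each (gender, title) pair is treated as one item.
  val frequentPairs: Set[(String, String)] =
    frequent(people.map(p => (p.gender, p.title)), 0.5)
}
```

On this data the per-column reading reports both genders and the title `Dr`, while no single (gender, title) pair reaches the 0.5 threshold, so the two readings can give quite different answers for the same call.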



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29406968
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---
    @@ -0,0 +1,127 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.execution.stat
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Column, DataFrame, Row}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{ArrayType, StructField, StructType}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    +
    +private[sql] object FrequentItems extends Logging {
    +
    +  /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */
    +  private class FreqItemCounter(size: Int) extends Serializable {
    +    val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]
    +
    +    /**
    +     * Add a new example to the counts if it exists, otherwise deduct the count
    +     * from existing items.
    +     */
    +    def add(key: Any, count: Long): this.type = {
    +      if (baseMap.contains(key))  {
    +        baseMap(key) += count
    +      } else {
    +        if (baseMap.size < size) {
    +          baseMap += key -> count
    +        } else {
    +          // TODO: Make this more efficient... A flatMap?
    +          baseMap.retain((k, v) => v > count)
    +          baseMap.transform((k, v) => v - count)
    +        }
    +      }
    +      this
    +    }
    +
    +    /**
    +     * Merge two maps of counts.
    +     * @param other The map containing the counts for that partition
    +     */
    +    def merge(other: FreqItemCounter): this.type = {
    +      other.toSeq.foreach { case (k, v) =>
    +        add(k, v)
    +      }
    +      this
    +    }
    +    
    +    def toSeq: Seq[(Any, Long)] = baseMap.toSeq
    +    
    +    def foldLeft[A, B](start: A)(f: (A, (Any, Long)) => A): A = baseMap.foldLeft(start)(f)
    +    
    +    def freqItems: Seq[Any] = baseMap.keys.toSeq
    +  }
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the 
    +   * frequent element count algorithm described in
    +   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
    +   * For Internal use only.
    +   *
    +   * @param df The input DataFrame
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  private[sql] def singlePassFreqItems(
    +      df: DataFrame, 
    +      cols: Seq[String],
    +      support: Double): DataFrame = {
    +    if (support < 1e-6) {
    --- End diff --
    
    ```scala
    require(support >= 1e-6, s"support ($support) must be at least 1e-6.")
    ```
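
One reason a lower bound on `support` matters is memory: counter-based schemes usually keep on the order of `1 / support` counters. The sketch below assumes that relationship, which this excerpt does not show, and uses a hypothetical bound `MinSupport` (the thread mentions 1e-6 in one revision and 1e-4 in another).

```scala
object SupportBounds {
  // Hypothetical lower bound chosen for this example only.
  val MinSupport: Double = 1e-4

  /** Number of counters to keep, assuming a budget of ceil(1 / support). */
  def counterBudget(support: Double): Int = {
    // `require` throws IllegalArgumentException when the condition is false.
    require(support >= MinSupport && support <= 1.0,
      s"support ($support) must be in [$MinSupport, 1.0].")
    math.ceil(1.0 / support).toInt
  }
}
```

Under this assumption, halving `support` doubles the number of counters held per column, so a very small `support` can make the summary as large as the data it is meant to summarize.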



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97679097
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97681263
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29406755
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -0,0 +1,55 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.sql.execution.stat.FrequentItems
    +
    +/**
    + * :: Experimental ::
    + * Statistic functions for [[DataFrame]]s.
    + */
    +@Experimental
    +final class DataFrameStatFunctions private[sql](df: DataFrame) {
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the
    +   * frequent element count algorithm described in
    +   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
    +   *
    --- End diff --
    
    Make sure you document the allowed range of `support`.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29470505
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -0,0 +1,68 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.sql.execution.stat.FrequentItems
    +
    +/**
    + * :: Experimental ::
    + * Statistic functions for [[DataFrame]]s.
    + */
    +@Experimental
    +final class DataFrameStatFunctions private[sql](df: DataFrame) {
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the
    +   * frequent element count algorithm described in
    +   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
    +   * The `support` should be greater than 1e-4.
    +   *
    +   * @param cols the names of the columns to search frequent items in.
    +   * @param support The minimum frequency for an item to be considered `frequent`. Should be greater
    +   *                than 1e-4.
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  def freqItems(cols: Array[String], support: Double): DataFrame = {
    +    FrequentItems.singlePassFreqItems(df, cols, support)
    +  }
    +
    +  /**
    +   * Runs `freqItems` with a default `support` of 1%.
    +   *
    +   * @param cols the names of the columns to search frequent items in.
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  def freqItems(cols: Array[String]): DataFrame = {
    +    FrequentItems.singlePassFreqItems(df, cols, 0.01)
    +  }
    +
    +  /**
    +   * Python friendly implementation for `freqItems`
    +   */
    +  def freqItems(cols: List[String], support: Double): DataFrame = {
    --- End diff --
    
    I think we can just use Seq here, since Python has helper functions that can convert List into Seq.
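
On the Scala side, a single `Seq[String]` parameter already accepts a `List` without an extra overload. The sketch below shows this with a hypothetical stand-in method `describe`; it is not the `freqItems` signature from the diff.

```scala
object SeqParameter {
  /** Hypothetical stand-in for a method that takes column names as a `Seq[String]`. */
  def describe(cols: Seq[String]): String = cols.mkString(", ")

  // List and Vector are both Seq implementations, so they are accepted as-is.
  val fromList: String = describe(List("gender", "title"))
  val fromVector: String = describe(Vector("gender", "title"))

  // An Array is not a Seq; converting explicitly avoids relying on implicit wrapping.
  val fromArray: String = describe(Array("gender", "title").toSeq)
}
```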




[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29404815
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1414,4 +1415,37 @@ class DataFrame private[sql](
         val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD()
         SerDeUtil.javaToPython(jrdd)
       }
    +
    +  /////////////////////////////////////////////////////////////////////////////
    +  // Statistic functions
    +  /////////////////////////////////////////////////////////////////////////////
    +
    +  // scalastyle:off
    +  object stat {
    +  // scalastyle:on
    +
    +    /**
    +     * Finding frequent items for columns, possibly with false positives. Using the algorithm
    +     * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
    +     *
    +     * @param cols the names of the columns to search frequent items in
    +     * @param support The minimum frequency for an item to be considered `frequent`
    +     * @return A Local DataFrame with the Array of frequent items for each column.
    +     */
    +    def freqItems(cols: Array[String], support: Double): DataFrame = {
    --- End diff --
    
    In df we usually support List[String] and Seq[String]. This is one reason why we are using a separate namespace.




[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97822247
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97830338
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by brkyvz <gi...@git.apache.org>.

Github user brkyvz commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-98000923
  
    @rxin, can you merge this in, please? I can address the comments in the follow-up PR where I'll add Python support. This is blocking `df.stat.cov`. Thanks!



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97691943
  
      [Test build #31386 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31386/consoleFull) for PR 5799 at commit [`8279d4d`](https://github.com/apache/spark/commit/8279d4d4cb09f78e2f8f83f9a3738101b940ed40).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97681270
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31391/
    Test FAILed.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29406784
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---
    @@ -0,0 +1,127 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.execution.stat
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Column, DataFrame, Row}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{ArrayType, StructField, StructType}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    +
    +private[sql] object FrequentItems extends Logging {
    +
    +  /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */
    +  private class FreqItemCounter(size: Int) extends Serializable {
    +    val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]
    +
    +    /**
    +     * Add a new example to the counts if it exists, otherwise deduct the count
    +     * from existing items.
    +     */
    +    def add(key: Any, count: Long): this.type = {
    +      if (baseMap.contains(key))  {
    +        baseMap(key) += count
    +      } else {
    +        if (baseMap.size < size) {
    +          baseMap += key -> count
    +        } else {
    +          // TODO: Make this more efficient... A flatMap?
    +          baseMap.retain((k, v) => v > count)
    +          baseMap.transform((k, v) => v - count)
    +        }
    +      }
    +      this
    +    }
    +
    +    /**
    +     * Merge two maps of counts.
    +     * @param other The map containing the counts for that partition
    +     */
    +    def merge(other: FreqItemCounter): this.type = {
    +      other.toSeq.foreach { case (k, v) =>
    +        add(k, v)
    +      }
    +      this
    +    }
    +    
    +    def toSeq: Seq[(Any, Long)] = baseMap.toSeq
    --- End diff --
    
    You don't need this, do you? You can just operate on the map directly. I'm asking because I'm not sure whether `baseMap.toSeq` materializes a whole Seq, which might be unnecessary.
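
Operating on the map directly could look like the sketch below. It walks one mutable map with `foreach` and folds its entries into another, so no intermediate sequence is built. The names `DirectIteration` and `addAll` are hypothetical, and the plain summation is a simplification; it is not the counter's `add` logic from the diff.

```scala
import scala.collection.mutable

object DirectIteration {
  /** Adds every count from `other` into `base` by walking `other` in place. */
  def addAll(base: mutable.Map[String, Long], other: mutable.Map[String, Long]): Unit = {
    // `foreach` visits the entries directly; no copy of `other` is created.
    other.foreach { case (key, count) =>
      base(key) = base.getOrElse(key, 0L) + count
    }
  }
}
```

As far as I know, `toSeq` on a mutable map copies the entries into a new collection, so iterating directly avoids one allocation per merge. The copy is only needed when the map being read is also the one being modified.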



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29405622
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala ---
    @@ -0,0 +1,124 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.ml
    +
    +
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{StructType, ArrayType, StructField}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Row, DataFrame, functions}
    +
    +private[sql] object FrequentItems extends Logging {
    +
    +  /**
    +   * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in
    +   * any emptied slots with the most frequent of `otherMap`.
    +   * @param baseMap The map containing the global counts
    +   * @param otherMap The map containing the counts for that partition
    +   * @param maxSize The maximum number of counts to keep in memory
    +   */
    +  private def mergeCounts[A](
    --- End diff --
    
    I think the implementation could be cleaner if we wrap `MutableMap[A, Long]` with a utility class:
    
    ~~~scala
    // Interface sketch only: `add`, `items` and `toSeq` are left abstract.
    abstract class FreqItemCounter(size: Int) {
      def add(any: Any, count: Long = 1L): this.type
      def merge(other: FreqItemCounter): this.type = {
        other.toSeq.foreach { case (k, c) =>
          add(k, c)
        }
        this
      }
      def items: Array[Any]
      def toSeq: Seq[(Any, Long)]
    }
    ~~~
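
For illustration, here is one way the wrapper suggested above could be filled in. This is a sketch under assumptions, not the code that was merged: the class name and method names follow the reviewer's outline, while the private `counts` field, the `require` checks, and the weighted decrement rule (subtract the smaller of the incoming count and the current minimum counter) are choices made for this example.

```scala
import scala.collection.mutable

// Hypothetical implementation of the reviewer's outline; the bodies are assumptions.
class FreqItemCounter(size: Int) {
  require(size > 0, s"size must be positive, got $size")

  // At most `size` keys are tracked at any time.
  private val counts = mutable.Map.empty[Any, Long]

  /** Adds `count` occurrences of `key`, evicting counters when the map is full. */
  def add(key: Any, count: Long = 1L): this.type = {
    require(count > 0L, s"count must be positive, got $count")
    if (counts.contains(key)) {
      counts(key) += count
    } else if (counts.size < size) {
      counts(key) = count
    } else {
      // Map is full: decrement every counter and drop those that reach zero.
      val decrement = math.min(count, counts.values.min)
      val kept = counts.toSeq.collect { case (k, c) if c > decrement => (k, c - decrement) }
      counts.clear()
      counts ++= kept
      // If the new key outweighs the smallest counter, it takes the freed slot.
      if (count > decrement) counts(key) = count - decrement
    }
    this
  }

  /** Folds another counter into this one by replaying its (key, count) pairs. */
  def merge(other: FreqItemCounter): this.type = {
    other.toSeq.foreach { case (k, c) => add(k, c) }
    this
  }

  def items: Array[Any] = counts.keys.toArray

  def toSeq: Seq[(Any, Long)] = counts.toSeq
}
```

With unit counts, the full-map branch behaves like the per-row step in the quoted `seqOp`: counters equal to 1 are dropped and the rest are decremented by 1. Whether replaying counts in `merge` gives the same guarantees as the `mergeCounts` function in the diff is not established here.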



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97830370
  
    Merged build started.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97709046
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31392/



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by brkyvz <gi...@git.apache.org>.

Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29404981
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1414,4 +1415,25 @@ class DataFrame private[sql](
         val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD()
         SerDeUtil.javaToPython(jrdd)
       }
    +
    +  /////////////////////////////////////////////////////////////////////////////
    +  // Statistic functions
    +  /////////////////////////////////////////////////////////////////////////////
    +
    +  // scalastyle:off
    +  object stat {
    --- End diff --
    
    From Java, I think the call looks like `df.stat$.MODULE$.freqItems()`. I don't know how else we can make it `df.stat.freqItems` in Scala.
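
One alternative to a nested `object stat` is a plain accessor method that returns a small helper class. The sketch below uses toy stand-ins: `Frame` and `StatFunctions` are hypothetical names invented for this example, and the body of `freqItems` is a placeholder that only filters column names. It shows the call shape, not the real implementation.

```scala
// Hypothetical helper holding the statistic functions for one frame.
final class StatFunctions(frame: Frame) {
  // Placeholder body: returns the requested columns that exist in the frame.
  def freqItems(cols: Array[String], support: Double): Array[String] =
    cols.filter(c => frame.columns.contains(c))
}

// Toy stand-in for a DataFrame-like class.
class Frame(val columns: Array[String]) {
  // A parameterless method instead of a nested object:
  // Scala callers write `frame.stat.freqItems(...)`,
  // Java callers write `frame.stat().freqItems(...)`.
  def stat: StatFunctions = new StatFunctions(this)
}
```

Because `stat` is an ordinary method, Java callers see an ordinary getter-style call rather than the `$.MODULE$` form that a Scala `object` exposes.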



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97865895
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97679655
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97991940
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97703350
  
    Merged build started.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29406886
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -0,0 +1,43 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.test.TestSQLContext
    +import org.apache.spark.sql.types._
    +import org.scalatest.FunSuite
    +
    +class DataFrameStatSuite extends FunSuite  {
    +
    +  val sqlCtx = TestSQLContext
    +
    +  test("Frequent Items") {
    +    def toLetter(i: Int): String = (i + 96).toChar.toString
    +    val rows = Array.tabulate(1000)(i => if (i % 3 == 0) (1, toLetter(1)) else (i, toLetter(i)))
    +    val rowRdd = sqlCtx.sparkContext.parallelize(rows.map(v => Row(v._1, v._2)))
    +    val schema = StructType(StructField("numbers", IntegerType, false) ::
    +                            StructField("letters", StringType, false) :: Nil)
    +    val df = sqlCtx.createDataFrame(rowRdd, schema)
    +
    +    val results = df.stat.freqItems(Array("numbers", "letters"), 0.1)
    +    val items = results.collect().head
    +    assert(items.getSeq(0).contains(1),
    --- End diff --
    
    Use ScalaTest's `should` matchers here so that a failure prints a better error message. (You won't need the custom error message anymore.)
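
A minimal sketch of the matcher style being suggested, assuming ScalaTest 2.x, where the trait is `org.scalatest.Matchers` (later releases move it to `org.scalatest.matchers.should.Matchers`). The object and method names are hypothetical; in the suite under review the trait would be mixed into the `FunSuite` class instead.

```scala
import org.scalatest.Matchers

// Hypothetical wrapper so the matcher can be shown outside a test suite.
object ShouldMatcherSketch extends Matchers {
  // Fails with a message generated by ScalaTest if `items` does not contain 1,
  // so no hand-written error string is needed.
  def checkContainsOne(items: Seq[Int]): Unit = {
    items should contain (1)
  }
}
```

In the test from the diff, the equivalent line would apply the same matcher to the sequence read from the result row.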



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97709044
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29405254
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala ---
    @@ -0,0 +1,124 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.ml
    +
    +
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{StructType, ArrayType, StructField}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Row, DataFrame, functions}
    +
    +private[sql] object FrequentItems extends Logging {
    +
    +  /**
    +   * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in
    +   * any emptied slots with the most frequent of `otherMap`.
    +   * @param baseMap The map containing the global counts
    +   * @param otherMap The map containing the counts for that partition
    +   * @param maxSize The maximum number of counts to keep in memory
    +   */
    +  private def mergeCounts[A](
    +      baseMap: MutableMap[A, Long],
    +      otherMap: MutableMap[A, Long],
    +      maxSize: Int): Unit = {
    +    val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) =>
    +      if (!baseMap.contains(k)) sum + v else sum
    +    }
    +    baseMap.retain((k, v) => v > otherSum)
    +    // sort in decreasing order, so that we will add the most frequent items first
    +    val sorted = otherMap.toSeq.sortBy(-_._2)
    +    var i = 0
    +    val otherSize = sorted.length
    +    while (i < otherSize && baseMap.size < maxSize) {
    +      val keyVal = sorted(i)
    +      baseMap += keyVal._1 -> keyVal._2
    +      i += 1
    +    }
    +  }
    +  
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the algorithm 
    +   * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
    +   * For Internal use only.
    +   *
    +   * @param df The input DataFrame
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  private[sql] def singlePassFreqItems(
    +      df: DataFrame, 
    +      cols: Array[String], 
    +      support: Double): DataFrame = {
    +    val numCols = cols.length
    --- End diff --
    
    Check the range of `support`. Warn if it is too small (e.g., 1e-6).
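
A sketch of what such a check might look like. The object name, the method name, the `(0, 1]` range, and the `1e-4` warning threshold are assumptions made for this example (the review comment only mentions 1e-6 as an example of "too small"). The return value mirrors the `math.floor(1 / support).toInt` expression in the diff.

```scala
// Hypothetical helper; not part of the patch under review.
object SupportCheck {
  // Assumed warning threshold: below this, the number of counters kept grows large.
  val MinRecommendedSupport = 1e-4

  /** Validates `support` and returns the number of counters to keep, floor(1 / support). */
  def counterBudget(support: Double): Int = {
    require(support > 0.0 && support <= 1.0,
      s"support must be in (0, 1], got $support")
    if (support < MinRecommendedSupport) {
      // A real implementation would presumably use the project's logging facility.
      Console.err.println(
        s"Warning: support $support is below $MinRecommendedSupport; " +
          "memory use grows with 1 / support.")
    }
    math.floor(1.0 / support).toInt
  }
}
```

Rejecting values outside the range up front also avoids a division by zero or a negative map size further down.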



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97679720
  
      [Test build #31392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31392/consoleFull) for   PR 5799 at commit [`482e741`](https://github.com/apache/spark/commit/482e74180445d30d0b5a769cd5f9bd0e94abfd17).



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29405043
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1414,4 +1415,37 @@ class DataFrame private[sql](
         val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD()
         SerDeUtil.javaToPython(jrdd)
       }
    +
    +  /////////////////////////////////////////////////////////////////////////////
    +  // Statistic functions
    +  /////////////////////////////////////////////////////////////////////////////
    +
    +  // scalastyle:off
    +  object stat {
    +  // scalastyle:on
    +
    +    /**
    +     * Finding frequent items for columns, possibly with false positives. Using the algorithm
    +     * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
    --- End diff --
    
    Use DOI link: http://dx.doi.org/10.1145/762471.762473


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97697639
  
    I think it's better to have `freqItems` operate on a per-column basis; I can then add a struct expression to DataFrame so users can easily create composite columns to run `freqItems` on.




[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29405470
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala ---
    @@ -0,0 +1,124 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.ml
    +
    +
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{StructType, ArrayType, StructField}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Row, DataFrame, functions}
    +
    +private[sql] object FrequentItems extends Logging {
    +
    +  /**
    +   * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in
    +   * any emptied slots with the most frequent of `otherMap`.
    +   * @param baseMap The map containing the global counts
    +   * @param otherMap The map containing the counts for that partition
    +   * @param maxSize The maximum number of counts to keep in memory
    +   */
    +  private def mergeCounts[A](
    +      baseMap: MutableMap[A, Long],
    +      otherMap: MutableMap[A, Long],
    +      maxSize: Int): Unit = {
    +    val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) =>
    +      if (!baseMap.contains(k)) sum + v else sum
    +    }
    +    baseMap.retain((k, v) => v > otherSum)
    +    // sort in decreasing order, so that we will add the most frequent items first
    +    val sorted = otherMap.toSeq.sortBy(-_._2)
    +    var i = 0
    +    val otherSize = sorted.length
    +    while (i < otherSize && baseMap.size < maxSize) {
    +      val keyVal = sorted(i)
    +      baseMap += keyVal._1 -> keyVal._2
    +      i += 1
    +    }
    +  }
    +  
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the algorithm 
    +   * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
    +   * For Internal use only.
    +   *
    +   * @param df The input DataFrame
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  private[sql] def singlePassFreqItems(
    +      df: DataFrame, 
    +      cols: Array[String], 
    +      support: Double): DataFrame = {
    +    val numCols = cols.length
    +    // number of max items to keep counts for
    +    val sizeOfMap = math.floor(1 / support).toInt
    +    val countMaps = Array.tabulate(numCols)(i => MutableMap.empty[Any, Long])
    +    val originalSchema = df.schema
    +    val colInfo = cols.map { name =>
    +      val index = originalSchema.fieldIndex(name)
    +      val dataType = originalSchema.fields(index)
    +      (index, dataType.dataType)
    +    }
    +    val colIndices = colInfo.map(_._1)
    +    
    +    val freqItems: Array[MutableMap[Any, Long]] = df.rdd.aggregate(countMaps)(
    +      seqOp = (counts, row) => {
    +        var i = 0
    +        colIndices.foreach { index =>
    +          val thisMap = counts(i)
    +          val key = row.get(index)
    +          if (thisMap.contains(key))  {
    +            thisMap(key) += 1
    +          } else {
    +            if (thisMap.size < sizeOfMap) {
    +              thisMap += key -> 1
    +            } else {
    +              // TODO: Make this more efficient... A flatMap?
    +              thisMap.retain((k, v) => v > 1)
    +              thisMap.transform((k, v) => v - 1)
    +            }
    +          }
    +          i += 1
    +        }
    +        counts
    +      },
    +      combOp = (baseCounts, counts) => {
    +        var i = 0
    +        while (i < numCols) {
    +          mergeCounts(baseCounts(i), counts(i), sizeOfMap)
    +          i += 1
    +        }
    +        baseCounts
    +      }
    +    )
    +    //
    +    val justItems = freqItems.map(m => m.keys.toSeq)
    +    val resultRow = Row(justItems:_*)
    +    // append frequent Items to the column name for easy debugging
    +    val outputCols = cols.zip(colInfo).map{ v =>
    +      StructField(v._1 + "-freqItems", ArrayType(v._2._2, false))
    --- End diff --
    
    `-freqItems` -> `_freqItems`



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97703273
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97965384
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97679106
  
    Merged build started.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29470549
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -0,0 +1,68 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.sql.execution.stat.FrequentItems
    +
    +/**
    + * :: Experimental ::
    + * Statistic functions for [[DataFrame]]s.
    + */
    +@Experimental
    +final class DataFrameStatFunctions private[sql](df: DataFrame) {
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the
    +   * frequent element count algorithm described in
    +   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
    +   * The `support` should be greater than 1e-4.
    +   *
    +   * @param cols the names of the columns to search frequent items in.
    +   * @param support The minimum frequency for an item to be considered `frequent`. Should be greater
    +   *                than 1e-4.
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  def freqItems(cols: Array[String], support: Double): DataFrame = {
    +    FrequentItems.singlePassFreqItems(df, cols, support)
    +  }
    +
    +  /**
    +   * Runs `freqItems` with a default `support` of 1%.
    --- End diff --
    
    It's better to repeat the same Javadoc here and state that `support = 0.01`, rather than saying "Runs `freqItems`...".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97876748
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31426/



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97876720
  
      [Test build #31426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31426/consoleFull) for   PR 5799 at commit [`39b1bba`](https://github.com/apache/spark/commit/39b1bbae0f2e1e0bd661a709c2ffb08fc492aedc).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97704110
  
      [Test build #31404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31404/consoleFull) for   PR 5799 at commit [`3a5c177`](https://github.com/apache/spark/commit/3a5c177e247ddb44a38e4ee4211c57ec3cad58eb).



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29449439
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---
    @@ -0,0 +1,121 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.execution.stat
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Column, DataFrame, Row}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{ArrayType, StructField, StructType}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    --- End diff --
    
    organize imports



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29452212
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -0,0 +1,90 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql
    +
    +import java.lang.{String => JavaString}
    --- End diff --
    
    `JavaString` is the same as Scala's `String`, so we don't need this import.
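
A small sketch of why the aliased import is redundant: Scala's `Predef` declares `type String = java.lang.String`, so the two names refer to the same class. `StringAliasCheck` and `shout` are hypothetical names used only for illustration.

```scala
object StringAliasCheck {
  // Predef defines `type String = java.lang.String`, so both names denote one class.
  val sameClass: Boolean = classOf[String] == classOf[java.lang.String]

  // A parameter typed with one name accepts and returns the other, with no conversion.
  def shout(s: java.lang.String): String = s.toUpperCase
}
```

Because the types are identical, a method declared with `String` is already callable from Java with a `java.lang.String`.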




[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97991944
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31453/
    Test PASSed.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29405009
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1414,4 +1415,25 @@ class DataFrame private[sql](
         val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD()
         SerDeUtil.javaToPython(jrdd)
       }
    +
    +  /////////////////////////////////////////////////////////////////////////////
    +  // Statistic functions
    +  /////////////////////////////////////////////////////////////////////////////
    +
    +  // scalastyle:off
    +  object stat {
    --- End diff --
    
    Take a look at how we implemented `na`.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97822356
  
    Merged build started.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97865900
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31423/
    Test PASSed.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97691955
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31386/
    Test PASSed.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29404754
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala ---
    @@ -0,0 +1,124 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.ml
    --- End diff --
    
    Let's put this in `execution.stat`?
    
    It's annoying to add a top-level package because we have rules to specifically exclude existing packages.




[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97991927
  
      [Test build #31453 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31453/consoleFull) for   PR 5799 at commit [`a6ec82c`](https://github.com/apache/spark/commit/a6ec82cef528c22eff00cf8294d92798c6d2aa9d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29404717
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1414,4 +1415,25 @@ class DataFrame private[sql](
         val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD()
         SerDeUtil.javaToPython(jrdd)
       }
    +
    +  /////////////////////////////////////////////////////////////////////////////
    +  // Statistic functions
    +  /////////////////////////////////////////////////////////////////////////////
    +
    +  // scalastyle:off
    +  object stat {
    --- End diff --
    
    Does this work in Java?



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97731695
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29406677
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -0,0 +1,55 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.sql.execution.stat.FrequentItems
    +
    +/**
    + * :: Experimental ::
    + * Statistic functions for [[DataFrame]]s.
    + */
    +@Experimental
    +final class DataFrameStatFunctions private[sql](df: DataFrame) {
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the
    +   * frequent element count algorithm described in
    +   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
    +   *
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  def freqItems(cols: Seq[String], support: Double): DataFrame = {
    --- End diff --
    
    Don't forget to add the `java.util.List` overloads.
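
One way such an overload could look, as a sketch only: `StatFunctionsSketch` is a hypothetical stand-in for the class in the diff, and its methods return their inputs instead of a `DataFrame` so the example runs without Spark. It uses `scala.collection.JavaConverters`, which was current when this thread was written and is deprecated in newer Scala versions.

```scala
import scala.collection.JavaConverters._

// Hypothetical stand-in for the stat-functions class in the diff. The methods
// return the arguments they would use, so the delegation is visible.
final class StatFunctionsSketch {
  // Scala-facing signature, mirroring the one quoted in the diff.
  def freqItems(cols: Seq[String], support: Double): (Seq[String], Double) =
    (cols, support)

  // Java-friendly overload: convert the list, then delegate to the Seq version.
  def freqItems(cols: java.util.List[String], support: Double): (Seq[String], Double) =
    freqItems(cols.asScala.toList, support)
}
```

Keeping the Java overload as a thin wrapper means the logic lives in one place and Java callers do not have to build a Scala `Seq` by hand.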



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97667923
  
    I'm going to let @mengxr comment on the actual algorithm implementation.




[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29405307
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala ---
    @@ -0,0 +1,124 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.ml
    +
    +
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{StructType, ArrayType, StructField}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Row, DataFrame, functions}
    +
    +private[sql] object FrequentItems extends Logging {
    +
    +  /**
    +   * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in
    +   * any emptied slots with the most frequent of `otherMap`.
    +   * @param baseMap The map containing the global counts
    +   * @param otherMap The map containing the counts for that partition
    +   * @param maxSize The maximum number of counts to keep in memory
    +   */
    +  private def mergeCounts[A](
    +      baseMap: MutableMap[A, Long],
    +      otherMap: MutableMap[A, Long],
    +      maxSize: Int): Unit = {
    +    val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) =>
    +      if (!baseMap.contains(k)) sum + v else sum
    +    }
    +    baseMap.retain((k, v) => v > otherSum)
    +    // sort in decreasing order, so that we will add the most frequent items first
    +    val sorted = otherMap.toSeq.sortBy(-_._2)
    +    var i = 0
    +    val otherSize = sorted.length
    +    while (i < otherSize && baseMap.size < maxSize) {
    +      val keyVal = sorted(i)
    +      baseMap += keyVal._1 -> keyVal._2
    +      i += 1
    +    }
    +  }
    +  
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the algorithm 
    +   * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
    +   * For Internal use only.
    +   *
    +   * @param df The input DataFrame
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  private[sql] def singlePassFreqItems(
    +      df: DataFrame, 
    +      cols: Array[String], 
    +      support: Double): DataFrame = {
    +    val numCols = cols.length
    +    // number of max items to keep counts for
    +    val sizeOfMap = math.floor(1 / support).toInt
    +    val countMaps = Array.tabulate(numCols)(i => MutableMap.empty[Any, Long])
    +    val originalSchema = df.schema
    +    val colInfo = cols.map { name =>
    +      val index = originalSchema.fieldIndex(name)
    +      val dataType = originalSchema.fields(index)
    +      (index, dataType.dataType)
    +    }
    +    val colIndices = colInfo.map(_._1)
    +    
    +    val freqItems: Array[MutableMap[Any, Long]] = df.rdd.aggregate(countMaps)(
    --- End diff --
    
    `df.select(cols).rdd.aggregate` (then you don't need to skip elements)
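
The idea behind this suggestion can be sketched without Spark: project each row down to the requested columns first, so the aggregation can pair accumulators with values positionally and never has to skip unrelated fields. Everything below is an assumption for illustration: plain Scala collections stand in for the DataFrame and RDD, and `project`, `countPerColumn`, `seqOp`, and `combOp` are hypothetical names that only mirror the shape of an aggregate call.

```scala
object SelectThenAggregate {
  type Row = Vector[Any]

  // Stand-in for selecting columns: keep only the requested positions.
  def project(rows: Seq[Row], indices: Seq[Int]): Seq[Row] =
    rows.map(row => indices.map(i => row(i)).toVector)

  // Stand-in for an aggregate over partitions: one count map per column.
  def countPerColumn(partitions: Seq[Seq[Row]], numCols: Int): Vector[Map[Any, Long]] = {
    val zero = Vector.fill(numCols)(Map.empty[Any, Long])

    // Fold one already-projected row into the per-column counts.
    def seqOp(acc: Vector[Map[Any, Long]], row: Row): Vector[Map[Any, Long]] =
      acc.zip(row).map { case (m, v) => m.updated(v, m.getOrElse(v, 0L) + 1L) }

    // Merge the per-column counts of two partitions.
    def combOp(a: Vector[Map[Any, Long]], b: Vector[Map[Any, Long]]): Vector[Map[Any, Long]] =
      a.zip(b).map { case (x, y) =>
        y.foldLeft(x) { case (m, (k, v)) => m.updated(k, m.getOrElse(k, 0L) + v) }
      }

    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)
  }
}
```

This sketch keeps exact counts for simplicity; the PR instead bounds each map's size, which is what makes the result approximate.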



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97965407
  
    Merged build started.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29406990
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---
    @@ -0,0 +1,127 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.execution.stat
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Column, DataFrame, Row}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{ArrayType, StructField, StructType}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    +
    +private[sql] object FrequentItems extends Logging {
    +
    +  /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */
    +  private class FreqItemCounter(size: Int) extends Serializable {
    +    val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]
    +
    +    /**
    +     * Add a new example to the counts if it exists, otherwise deduct the count
    +     * from existing items.
    +     */
    +    def add(key: Any, count: Long): this.type = {
    +      if (baseMap.contains(key))  {
    +        baseMap(key) += count
    +      } else {
    +        if (baseMap.size < size) {
    +          baseMap += key -> count
    +        } else {
    +          // TODO: Make this more efficient... A flatMap?
    +          baseMap.retain((k, v) => v > count)
    +          baseMap.transform((k, v) => v - count)
    +        }
    +      }
    +      this
    +    }
    +
    +    /**
    +     * Merge two maps of counts.
    +     * @param other The map containing the counts for that partition
    +     */
    +    def merge(other: FreqItemCounter): this.type = {
    +      other.toSeq.foreach { case (k, v) =>
    +        add(k, v)
    +      }
    +      this
    +    }
    +    
    +    def toSeq: Seq[(Any, Long)] = baseMap.toSeq
    +    
    +    def foldLeft[A, B](start: A)(f: (A, (Any, Long)) => A): A = baseMap.foldLeft(start)(f)
    +    
    +    def freqItems: Seq[Any] = baseMap.keys.toSeq
    +  }
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the 
    +   * frequent element count algorithm described in
    +   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
    +   * For Internal use only.
    +   *
    +   * @param df The input DataFrame
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  private[sql] def singlePassFreqItems(
    +      df: DataFrame, 
    +      cols: Seq[String],
    +      support: Double): DataFrame = {
    +    if (support < 1e-6) {
    --- End diff --
    
    Might as well use 1e-5, since 1e-6 can make the map super large and isn't very practical.
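
For context, here is a minimal sketch of the frequent-elements technique the diff cites (Karp, Schenker, and Papadimitriou). It is the textbook one-item-at-a-time variant, not the PR's `FreqItemCounter`; `FrequentCounter`, `capacity`, and `candidatesOf` are hypothetical names. The number of counters plays the role of roughly `1 / support`, which is why a very small support makes the map large.

```scala
import scala.collection.mutable

// Textbook variant, assumed for illustration: at most `capacity` counters are kept.
final class FrequentCounter[A](capacity: Int) {
  require(capacity > 0, "capacity must be positive")
  private val counts = mutable.Map.empty[A, Long]

  def add(item: A): this.type = {
    if (counts.contains(item)) {
      counts(item) += 1L
    } else if (counts.size < capacity) {
      counts(item) = 1L
    } else {
      // Map is full: decrement every counter and drop those that reach zero.
      for (k <- counts.keys.toList) {
        val next = counts(k) - 1L
        if (next == 0L) counts.remove(k) else counts(k) = next
      }
    }
    this
  }

  // Surviving keys are candidates; some may be false positives.
  def candidates: Set[A] = counts.keySet.toSet
}

object FrequentCounterDemo {
  def candidatesOf[A](items: Seq[A], capacity: Int): Set[A] = {
    val counter = new FrequentCounter[A](capacity)
    items.foreach(item => counter.add(item))
    counter.candidates
  }
}
```

With `k` counters over `n` items, any item that occurs more than `n / (k + 1)` times is guaranteed to survive, which matches the "false positives but no false negatives" behavior described for this PR.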




[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29407007
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---
    @@ -0,0 +1,127 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.execution.stat
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Column, DataFrame, Row}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{ArrayType, StructField, StructType}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    +
    +private[sql] object FrequentItems extends Logging {
    +
    +  /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */
    +  private class FreqItemCounter(size: Int) extends Serializable {
    +    val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]
    +
    +    /**
    +     * Add a new example to the counts if it exists, otherwise deduct the count
    +     * from existing items.
    +     */
    +    def add(key: Any, count: Long): this.type = {
    +      if (baseMap.contains(key))  {
    +        baseMap(key) += count
    +      } else {
    +        if (baseMap.size < size) {
    +          baseMap += key -> count
    +        } else {
    +          // TODO: Make this more efficient... A flatMap?
    +          baseMap.retain((k, v) => v > count)
    +          baseMap.transform((k, v) => v - count)
    +        }
    +      }
    +      this
    +    }
    +
    +    /**
    +     * Merge two maps of counts.
    +     * @param other The map containing the counts for that partition
    +     */
    +    def merge(other: FreqItemCounter): this.type = {
    +      other.toSeq.foreach { case (k, v) =>
    +        add(k, v)
    +      }
    +      this
    +    }
    +    
    +    def toSeq: Seq[(Any, Long)] = baseMap.toSeq
    +    
    +    def foldLeft[A, B](start: A)(f: (A, (Any, Long)) => A): A = baseMap.foldLeft(start)(f)
    +    
    +    def freqItems: Seq[Any] = baseMap.keys.toSeq
    +  }
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the 
    +   * frequent element count algorithm described in
    +   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
    +   * For Internal use only.
    +   *
    +   * @param df The input DataFrame
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  private[sql] def singlePassFreqItems(
    +      df: DataFrame, 
    +      cols: Seq[String],
    +      support: Double): DataFrame = {
    +    if (support < 1e-6) {
    --- End diff --
    
    I talked more with @mengxr. Let's default to 1e-2 and bound it at 1e-4 initially.
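
To make the effect of the support bound concrete, here is a minimal standalone sketch of a counter-based single-pass frequent-items pass. It is a simplified illustration in the spirit of the `FreqItemCounter` in the diff, not the PR's code: the names `FreqCounterSketch`, `MinSupport`, `sizeOfMap` and `candidates` are hypothetical, it works on a plain `Seq` rather than a DataFrame, and it decrements by one per unseen item. The point is that the number of counters kept is `1 / support`, so a lower bound on `support` caps memory per column.

```scala
import scala.collection.mutable

object FreqCounterSketch {
  // Lower bound on support, taken from the review discussion (assumed to mean 1e-4).
  val MinSupport = 1e-4

  /** Number of counters kept for a given support (hypothetical helper). */
  def sizeOfMap(support: Double): Int = {
    require(support >= MinSupport, s"support ($support) must be at least $MinSupport.")
    (1 / support).toInt
  }

  /** Single pass over `items`; the result may contain false positives. */
  def candidates[A](items: Seq[A], support: Double): Set[A] = {
    val size = sizeOfMap(support)
    val counts = mutable.Map.empty[A, Long]
    items.foreach { item =>
      if (counts.contains(item)) {
        counts(item) += 1L
      } else if (counts.size < size) {
        counts(item) = 1L
      } else {
        // Map is full: decrement every counter and drop the ones that reach zero.
        val decremented = counts.toSeq.collect { case (k, v) if v > 1L => k -> (v - 1L) }
        counts.clear()
        counts ++= decremented
      }
    }
    counts.keySet.toSet
  }
}
```

With data shaped like the PR's test (every third row mapped to the same value), `FreqCounterSketch.candidates(Seq.tabulate(1000)(i => if (i % 3 == 0) 1 else i), 0.3)` keeps at most three candidates and retains `1`, while a support below the bound is rejected by `require`.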




[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97884920
  
    LGTM except minor inline comments.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29406904
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -0,0 +1,55 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.sql.execution.stat.FrequentItems
    +
    +/**
    + * :: Experimental ::
    + * Statistic functions for [[DataFrame]]s.
    + */
    +@Experimental
    +final class DataFrameStatFunctions private[sql](df: DataFrame) {
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the
    +   * frequent element count algorithm described in
    +   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
    +   *
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  def freqItems(cols: Seq[String], support: Double): DataFrame = {
    --- End diff --
    
    also make sure you add a test to the JavaDataFrameSuite



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29449175
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala ---
    @@ -0,0 +1,121 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.execution.stat
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Column, DataFrame, Row}
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{ArrayType, StructField, StructType}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    +
    +private[sql] object FrequentItems extends Logging {
    +
    +  /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */
    +  private class FreqItemCounter(size: Int) extends Serializable {
    +    val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]
    +
    +    /**
    +     * Add a new example to the counts if it exists, otherwise deduct the count
    +     * from existing items.
    +     */
    +    def add(key: Any, count: Long): this.type = {
    +      if (baseMap.contains(key))  {
    +        baseMap(key) += count
    +      } else {
    +        if (baseMap.size < size) {
    +          baseMap += key -> count
    +        } else {
    +          // TODO: Make this more efficient... A flatMap?
    +          baseMap.retain((k, v) => v > count)
    +          baseMap.transform((k, v) => v - count)
    +        }
    +      }
    +      this
    +    }
    +
    +    /**
    +     * Merge two maps of counts.
    +     * @param other The map containing the counts for that partition
    +     */
    +    def merge(other: FreqItemCounter): this.type = {
    +      other.baseMap.foreach { case (k, v) =>
    +        add(k, v)
    +      }
    +      this
    +    }
    +  }
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the 
    +   * frequent element count algorithm described in
    +   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
    +   * The `support` should be greater than 1e-4.
    +   * For Internal use only.
    +   *
    +   * @param df The input DataFrame
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`. Should be greater
    +   *                than 1e-4.
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  private[sql] def singlePassFreqItems(
    +      df: DataFrame, 
    +      cols: Seq[String],
    +      support: Double): DataFrame = {
    +    require(support >= 1e-4, s"support ($support) must be greater than 1e-4.")
    +    val numCols = cols.length
    +    // number of max items to keep counts for
    +    val sizeOfMap = (1 / support).toInt
    +    val countMaps = Seq.tabulate(numCols)(i => new FreqItemCounter(sizeOfMap))
    +    val originalSchema = df.schema
    +    val colInfo = cols.map { name =>
    +      val index = originalSchema.fieldIndex(name)
    +      (name, originalSchema.fields(index).dataType)
    +    }
    +    
    +    val freqItems = df.select(cols.map(Column(_)):_*).rdd.aggregate(countMaps)(
    +      seqOp = (counts, row) => {
    +        var i = 0
    +        while (i < numCols) {
    +          val thisMap = counts(i)
    +          val key = row.get(i)
    +          thisMap.add(key, 1L)
    +          i += 1
    +        }
    +        counts
    +      },
    +      combOp = (baseCounts, counts) => {
    +        var i = 0
    +        while (i < numCols) {
    +          baseCounts(i).merge(counts(i))
    +          i += 1
    +        }
    +        baseCounts
    +      }
    +    )
    +    val justItems = freqItems.map(m => m.baseMap.keys.toSeq)
    +    val resultRow = Row(justItems:_*)
    +    // append frequent Items to the column name for easy debugging
    +    val outputCols = colInfo.map{ v =>
    --- End diff --
    
    space before `{`



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97667532
  
    Merged build started.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97865861
  
      [Test build #31423 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31423/consoleFull) for   PR 5799 at commit [`0915e23`](https://github.com/apache/spark/commit/0915e23e78f7efb3c90b0a140a6cd1300113acd6).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29406864
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala ---
    @@ -0,0 +1,43 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql
    +
    +import org.apache.spark.sql.test.TestSQLContext
    +import org.apache.spark.sql.types._
    +import org.scalatest.FunSuite
    +
    +class DataFrameStatSuite extends FunSuite  {
    +
    +  val sqlCtx = TestSQLContext
    +
    +  test("Frequent Items") {
    +    def toLetter(i: Int): String = (i + 96).toChar.toString
    +    val rows = Array.tabulate(1000)(i => if (i % 3 == 0) (1, toLetter(1)) else (i, toLetter(i)))
    +    val rowRdd = sqlCtx.sparkContext.parallelize(rows.map(v => Row(v._1, v._2)))
    --- End diff --
    
    this can be a lot simpler.
    ```scala
    val df = sqlCtx.sparkContext.parallelize(rows.map(v => (v._1, v._2))).toDF("numbers", "letters")
    ```
    
    remove the struct type and stuff.




[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97667521
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97667662
  
      [Test build #31386 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31386/consoleFull) for   PR 5799 at commit [`8279d4d`](https://github.com/apache/spark/commit/8279d4d4cb09f78e2f8f83f9a3738101b940ed40).



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29404786
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1414,4 +1415,37 @@ class DataFrame private[sql](
         val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD()
         SerDeUtil.javaToPython(jrdd)
       }
    +
    +  /////////////////////////////////////////////////////////////////////////////
    +  // Statistic functions
    +  /////////////////////////////////////////////////////////////////////////////
    +
    +  // scalastyle:off
    +  object stat {
    +  // scalastyle:on
    +
    +    /**
    +     * Finding frequent items for columns, possibly with false positives. Using the algorithm
    +     * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
    --- End diff --
    
    Should use a proper citation rather than a URL, since this URL might disappear.




[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97731696
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31404/



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-98001284
  
    Alright - merging in master.




[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29449746
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala ---
    @@ -0,0 +1,90 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql
    +
    +import java.lang.{String => JavaString}
    +import java.util.{List => JavaList}
    +
    +import scala.collection.JavaConversions._
    +
    +import org.apache.spark.annotation.Experimental
    +import org.apache.spark.sql.execution.stat.FrequentItems
    +
    +/**
    + * :: Experimental ::
    + * Statistic functions for [[DataFrame]]s.
    + */
    +@Experimental
    +final class DataFrameStatFunctions private[sql](df: DataFrame) {
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the
    +   * frequent element count algorithm described in
    +   * [[http://dx.doi.org/10.1145/762471.762473, proposed by Karp, Schenker, and Papadimitriou]].
    +   * The `support` should be greater than 1e-4.
    +   *
    +   * @param cols the names of the columns to search frequent items in.
    +   * @param support The minimum frequency for an item to be considered `frequent`. Should be greater
    +   *                than 1e-4.
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  def freqItems(cols: Seq[String], support: Double): DataFrame = {
    --- End diff --
    
    mention `struct` when #5802 is merged.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97731686
  
      [Test build #31404 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31404/consoleFull) for   PR 5799 at commit [`3a5c177`](https://github.com/apache/spark/commit/3a5c177e247ddb44a38e4ee4211c57ec3cad58eb).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch **adds the following new dependencies:**
       * `jaxb-api-2.2.7.jar`
       * `jaxb-core-2.2.7.jar`
       * `jaxb-impl-2.2.7.jar`
       * `pmml-agent-1.1.15.jar`
       * `pmml-model-1.1.15.jar`
       * `pmml-schema-1.1.15.jar`
    
     * This patch **removes the following dependencies:**
       * `activation-1.1.jar`
       * `jaxb-api-2.2.2.jar`
       * `jaxb-impl-2.2.3-1.jar`




[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97876744
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29405257
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala ---
    @@ -0,0 +1,124 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.ml
    +
    +
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{StructType, ArrayType, StructField}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Row, DataFrame, functions}
    +
    +private[sql] object FrequentItems extends Logging {
    +
    +  /**
    +   * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in
    +   * any emptied slots with the most frequent of `otherMap`.
    +   * @param baseMap The map containing the global counts
    +   * @param otherMap The map containing the counts for that partition
    +   * @param maxSize The maximum number of counts to keep in memory
    +   */
    +  private def mergeCounts[A](
    +      baseMap: MutableMap[A, Long],
    +      otherMap: MutableMap[A, Long],
    +      maxSize: Int): Unit = {
    +    val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) =>
    +      if (!baseMap.contains(k)) sum + v else sum
    +    }
    +    baseMap.retain((k, v) => v > otherSum)
    +    // sort in decreasing order, so that we will add the most frequent items first
    +    val sorted = otherMap.toSeq.sortBy(-_._2)
    +    var i = 0
    +    val otherSize = sorted.length
    +    while (i < otherSize && baseMap.size < maxSize) {
    +      val keyVal = sorted(i)
    +      baseMap += keyVal._1 -> keyVal._2
    +      i += 1
    +    }
    +  }
    +  
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the algorithm 
    +   * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
    +   * For Internal use only.
    +   *
    +   * @param df The input DataFrame
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  private[sql] def singlePassFreqItems(
    +      df: DataFrame, 
    +      cols: Array[String], 
    +      support: Double): DataFrame = {
    +    val numCols = cols.length
    +    // number of max items to keep counts for
    +    val sizeOfMap = math.floor(1 / support).toInt
    --- End diff --
    
    `math.floor` is not necessary: `(1.0 / support).toInt`
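
A small sketch of why the two forms agree here: `Double.toInt` truncates toward zero, which coincides with `math.floor` for the positive quotients that a positive `support` produces. The object and method names below are hypothetical and exist only for this comparison.

```scala
object TruncationSketch {
  // The form used in the diff under review.
  def sizeViaFloor(support: Double): Int = math.floor(1 / support).toInt

  // The form suggested in the comment above.
  def sizeViaToInt(support: Double): Int = (1.0 / support).toInt

  def main(args: Array[String]): Unit = {
    // For positive support values both forms yield the same map size.
    Seq(0.3, 0.25, 0.07, 1e-4).foreach { s =>
      assert(sizeViaFloor(s) == sizeViaToInt(s))
    }
  }
}
```

The two would differ only for negative quotients, where `math.floor` rounds down and `toInt` rounds toward zero; a negative `support` is not a meaningful input in this context.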



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5799



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by brkyvz <gi...@git.apache.org>.

Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29405486
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
    @@ -1414,4 +1415,25 @@ class DataFrame private[sql](
         val jrdd = rdd.map(EvaluatePython.rowToArray(_, fieldTypes)).toJavaRDD()
         SerDeUtil.javaToPython(jrdd)
       }
    +
    +  /////////////////////////////////////////////////////////////////////////////
    +  // Statistic functions
    +  /////////////////////////////////////////////////////////////////////////////
    +
    +  // scalastyle:off
    +  object stat {
    --- End diff --
    
    aha! I like it



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by rxin <gi...@git.apache.org>.

GitHub user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29404857
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/ml/FrequentItemsSuite.scala ---
    @@ -0,0 +1,45 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.ml
    +
    +import org.apache.spark.sql.Row
    +import org.apache.spark.sql.types._
    +import org.scalatest.FunSuite
    +
    +import org.apache.spark.sql.test.TestSQLContext
    +
    +class FrequentItemsSuite extends FunSuite  {
    --- End diff --
    
    move this to .sql package, and call it DataFrameStatSuite?


[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97679663
  
    Merged build started.


[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by SparkQA <gi...@git.apache.org>.

GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97709011
  
    [Test build #31392 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31392/consoleFull) for PR 5799 at commit [`482e741`](https://github.com/apache/spark/commit/482e74180445d30d0b5a769cd5f9bd0e94abfd17).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch **adds the following new dependencies:**
       * `jaxb-api-2.2.7.jar`
       * `jaxb-core-2.2.7.jar`
       * `jaxb-impl-2.2.7.jar`
       * `pmml-agent-1.1.15.jar`
       * `pmml-model-1.1.15.jar`
       * `pmml-schema-1.1.15.jar`
     * This patch **removes the following dependencies:**
       * `activation-1.1.jar`
       * `jaxb-api-2.2.2.jar`
       * `jaxb-impl-2.2.3-1.jar`



[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5799#discussion_r29405344
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ml/FrequentItems.scala ---
    @@ -0,0 +1,124 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to You under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +*    http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +
    +package org.apache.spark.sql.ml
    +
    +
    +import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
    +import org.apache.spark.sql.types.{StructType, ArrayType, StructField}
    +
    +import scala.collection.mutable.{Map => MutableMap}
    +
    +import org.apache.spark.Logging
    +import org.apache.spark.sql.{Row, DataFrame, functions}
    +
    +private[sql] object FrequentItems extends Logging {
    +
    +  /**
    +   * Merge two maps of counts. Subtracts the sum of `otherMap` from `baseMap`, and fills in
    +   * any emptied slots with the most frequent of `otherMap`.
    +   * @param baseMap The map containing the global counts
    +   * @param otherMap The map containing the counts for that partition
    +   * @param maxSize The maximum number of counts to keep in memory
    +   */
    +  private def mergeCounts[A](
    +      baseMap: MutableMap[A, Long],
    +      otherMap: MutableMap[A, Long],
    +      maxSize: Int): Unit = {
    +    val otherSum = otherMap.foldLeft(0L) { case (sum, (k, v)) =>
    +      if (!baseMap.contains(k)) sum + v else sum
    +    }
    +    baseMap.retain((k, v) => v > otherSum)
    +    // sort in decreasing order, so that we will add the most frequent items first
    +    val sorted = otherMap.toSeq.sortBy(-_._2)
    +    var i = 0
    +    val otherSize = sorted.length
    +    while (i < otherSize && baseMap.size < maxSize) {
    +      val keyVal = sorted(i)
    +      baseMap += keyVal._1 -> keyVal._2
    +      i += 1
    +    }
    +  }
    +  
    +
    +  /**
    +   * Finding frequent items for columns, possibly with false positives. Using the algorithm 
    +   * described in `http://www.cs.umd.edu/~samir/498/karp.pdf`.
    +   * For Internal use only.
    +   *
    +   * @param df The input DataFrame
    +   * @param cols the names of the columns to search frequent items in
    +   * @param support The minimum frequency for an item to be considered `frequent`
    +   * @return A Local DataFrame with the Array of frequent items for each column.
    +   */
    +  private[sql] def singlePassFreqItems(
    +      df: DataFrame, 
    +      cols: Array[String], 
    +      support: Double): DataFrame = {
    +    val numCols = cols.length
    +    // number of max items to keep counts for
    +    val sizeOfMap = math.floor(1 / support).toInt
    +    val countMaps = Array.tabulate(numCols)(i => MutableMap.empty[Any, Long])
    +    val originalSchema = df.schema
    +    val colInfo = cols.map { name =>
    +      val index = originalSchema.fieldIndex(name)
    +      val dataType = originalSchema.fields(index)
    +      (index, dataType.dataType)
    +    }
    +    val colIndices = colInfo.map(_._1)
    +    
    +    val freqItems: Array[MutableMap[Any, Long]] = df.rdd.aggregate(countMaps)(
    +      seqOp = (counts, row) => {
    +        var i = 0
    +        colIndices.foreach { index =>
    +          val thisMap = counts(i)
    +          val key = row.get(index)
    +          if (thisMap.contains(key))  {
    +            thisMap(key) += 1
    --- End diff --
    
    `1` -> `1L`
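
For readers following the counting step this comment refers to, here is a minimal, self-contained sketch of the bounded-counter idea from the paper cited in the diff's Scaladoc. It is not the code under review: the object `FrequentItemsSketch` and the method `frequentCandidates` are hypothetical names chosen for illustration, and the eviction step is the textbook "decrement every counter" variant rather than the patch's `mergeCounts` logic. The counts are `Long`, so the increment uses `1L` as suggested above; an `Int` literal would also compile through numeric widening, and the `Long` literal only makes the type explicit.

```scala
import scala.collection.mutable

// Hypothetical illustration, not part of the patch under review.
object FrequentItemsSketch {

  /**
   * Keeps at most `maxSize` counters while scanning `items` once.
   * The returned keys are candidates for frequent items and may
   * include false positives.
   */
  def frequentCandidates[A](items: Iterator[A], maxSize: Int): mutable.Map[A, Long] = {
    val counts = mutable.Map.empty[A, Long]
    items.foreach { item =>
      if (counts.contains(item)) {
        // Long literal keeps the arithmetic in the map's value type.
        counts(item) += 1L
      } else if (counts.size < maxSize) {
        counts(item) = 1L
      } else {
        // No free slot: decrement every counter and drop those reaching zero.
        // Iterate over a snapshot of the keys so the map can be mutated safely.
        counts.keys.toList.foreach { k =>
          val c = counts(k) - 1L
          if (c == 0L) counts.remove(k) else counts(k) = c
        }
      }
    }
    counts
  }
}
```

Calling `FrequentItemsSketch.frequentCandidates(Iterator("a", "b", "a", "c", "a", "a", "d", "a"), 2)` keeps a counter for `"a"`, the only item that makes up more than half of the input, alongside at most one other candidate. The retained counts are lower bounds rather than exact frequencies, which is why results of this kind need a second check if exact support matters.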


[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by SparkQA <gi...@git.apache.org>.

GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97965887
  
    [Test build #31453 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31453/consoleFull) for PR 5799 at commit [`a6ec82c`](https://github.com/apache/spark/commit/a6ec82cef528c22eff00cf8294d92798c6d2aa9d).


[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by SparkQA <gi...@git.apache.org>.

GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97823146
  
    [Test build #31423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31423/consoleFull) for PR 5799 at commit [`0915e23`](https://github.com/apache/spark/commit/0915e23e78f7efb3c90b0a140a6cd1300113acd6).


[GitHub] spark pull request: [SPARK-7242][SQL][MLLIB] Frequent items for Da...

Posted by AmplabJenkins <gi...@git.apache.org>.

GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5799#issuecomment-97691954
  
    Merged build finished. Test PASSed.

