You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by zhengruifeng <gi...@git.apache.org> on 2016/05/01 07:52:03 UTC

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

GitHub user zhengruifeng opened a pull request:

    https://github.com/apache/spark/pull/12819

    [DO-NOT-MERGE][TEST] Refactor NaiveBayes to support weighted instances

    ## What changes were proposed in this pull request?
    1,support weighted data
    2,use dataframe instead of rdd
    
    ## How was this patch tested?
    local manual tests in spark-shell
    
    I will link this PR to the JIRA when all case tests pass

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark weighted_nb

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12819.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12819
    
----
commit 510ff2480ef782c6f4bb2401d5a7c3f49ac6ed62
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2016-05-01T02:54:27Z

    refactor nb to support weight

commit a7c1d8e94ef128c7519380260bb20128fdfdb06b
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2016-05-01T07:43:43Z

    add test case

commit 2dc4b23f5b5e840a182c1c076c5adbb5f7511fe5
Author: Zheng RuiFeng <ru...@foxmail.com>
Date:   2016-05-01T07:50:51Z

    reorder imports

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216032436
  
    **[Test build #57477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57477/consoleFull)** for PR 12819 at commit [`b8625cc`](https://github.com/apache/spark/commit/b8625ccc1dec1ac484452ad41d17b11a16bbe015).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79603681
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +119,88 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(_ >= 0.0)) {
    +        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    +      }
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    +        throw new SparkException(
    +          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    +      }
    +    }
    +
    +    val requireValues: Vector => Unit = {
    +      $(modelType) match {
    +        case Multinomial =>
    +          requireNonnegativeValues
    +        case Bernoulli =>
    +          requireZeroOneBernoulliValues
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +    }
    +
    +    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    +
    +    val aggregated = dataset.select(col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd
    +      .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
    +      }.aggregateByKey[(Double, DenseVector)]((0.0, Vectors.zeros(numFeatures).toDense))(
    +      seqOp = {
    +         case (agg, (weight, features)) =>
    +           requireValues(features)
    +           BLAS.axpy(weight, features, agg._2)
    +           (agg._1 + weight, agg._2)
    +      },
    +      combOp = {
    +         case (agg1, agg2) =>
    +           BLAS.axpy(1.0, agg2._2, agg1._2)
    +           (agg1._1 + agg2._1, agg1._2)
    +      }).collect().sortBy(_._1)
    +
    +    val numLabels = aggregated.length
    +    val numDocuments = aggregated.map(_._2._1).sum
    +
    +    val pi = Array.fill[Double](numLabels)(0.0)
    +    val theta = Array.fill[Double](numLabels, numFeatures)(0.0)
    +
    +    val lambda = $(smoothing)
    +    val piLogDenom = math.log(numDocuments + numLabels * lambda)
    +    var i = 0
    +    aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    +      pi(i) = math.log(n + lambda) - piLogDenom
    +      val thetaLogDenom = $(modelType) match {
    +        case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
    +        case Bernoulli => math.log(n + 2.0 * lambda)
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +      var j = 0
    +      while (j < numFeatures) {
    +        theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
    +        j += 1
    +      }
    +      i += 1
    +    }
    +
    +    val uid = Identifiable.randomUID("nb")
    +    val piVector = Vectors.dense(pi)
    +    val thetaMatrix = new DenseMatrix(numLabels, theta(0).length, theta.flatten, true)
    --- End diff --
    
    agreed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79799464
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
    @@ -150,6 +150,75 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
         validateProbabilities(featureAndProbabilities, model, "multinomial")
       }
     
    +  test("Naive Bayes Multinomial with weighted samples") {
    +    val (dataset, weightedDataset) = {
    +      val nPoints = 1000
    +      val piArray = Array(0.5, 0.1, 0.4).map(math.log)
    +      val thetaArray = Array(
    +        Array(0.70, 0.10, 0.10, 0.10), // label 0
    +        Array(0.10, 0.70, 0.10, 0.10), // label 1
    +        Array(0.10, 0.10, 0.70, 0.10) // label 2
    +      ).map(_.map(math.log))
    +      val pi = Vectors.dense(piArray)
    +      val theta = new DenseMatrix(3, 4, thetaArray.flatten, true)
    +
    +      val testData = generateNaiveBayesInput(piArray, thetaArray, nPoints, 42, "multinomial")
    +
    +      // Let's over-sample the label-1 samples twice, label-2 samples triple.
    +      val data1 = testData.flatMap { case labeledPoint: LabeledPoint =>
    +        labeledPoint.label match {
    +          case 0.0 => Iterator(labeledPoint)
    +          case 1.0 => Iterator(labeledPoint, labeledPoint)
    +          case 2.0 => Iterator(labeledPoint, labeledPoint, labeledPoint)
    +        }
    +      }
    +
    +      val rnd = new Random(8392)
    +      val data2 = testData.flatMap { case LabeledPoint(label: Double, features: Vector) =>
    --- End diff --
    
    I think it need some time before I can harness the new framework. What about let it alone for now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79889093
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
    @@ -150,6 +150,75 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
         validateProbabilities(featureAndProbabilities, model, "multinomial")
       }
     
    +  test("Naive Bayes Multinomial with weighted samples") {
    +    val (dataset, weightedDataset) = {
    +      val nPoints = 1000
    +      val piArray = Array(0.5, 0.1, 0.4).map(math.log)
    +      val thetaArray = Array(
    +        Array(0.70, 0.10, 0.10, 0.10), // label 0
    +        Array(0.10, 0.70, 0.10, 0.10), // label 1
    +        Array(0.10, 0.10, 0.70, 0.10) // label 2
    +      ).map(_.map(math.log))
    +      val pi = Vectors.dense(piArray)
    +      val theta = new DenseMatrix(3, 4, thetaArray.flatten, true)
    +
    +      val testData = generateNaiveBayesInput(piArray, thetaArray, nPoints, 42, "multinomial")
    +
    +      // Let's over-sample the label-1 samples twice, label-2 samples triple.
    +      val data1 = testData.flatMap { case labeledPoint: LabeledPoint =>
    +        labeledPoint.label match {
    +          case 0.0 => Iterator(labeledPoint)
    +          case 1.0 => Iterator(labeledPoint, labeledPoint)
    +          case 2.0 => Iterator(labeledPoint, labeledPoint, labeledPoint)
    +        }
    +      }
    +
    +      val rnd = new Random(8392)
    +      val data2 = testData.flatMap { case LabeledPoint(label: Double, features: Vector) =>
    --- End diff --
    
    I submitted a pr to your pr, with the weighted tests. (Hopefully I've done that correctly). Actually, I also think it is nice to test a case where the majority of the samples are outliers, but have small weights so they should not affect the predictions. This is semi-automated in MLUtils, but since NaiveBayes requires a certain type of features (0/1 in some cases) I don't think it integrates nicely yet. I think we should create a JIRA to automate weighted testing where we can think about this all together. For now, this test should be sufficient.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79569671
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
    @@ -355,79 +357,32 @@ class NaiveBayes private (
        */
       @Since("0.9.0")
       def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
    -    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(_ >= 0.0)) {
    -        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    -      }
    -    }
    +    val spark = SparkSession
    +      .builder()
    +      .getOrCreate()
     
    -    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    -        throw new SparkException(
    -          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    -      }
    -    }
    +    import spark.implicits._
     
    -    // Aggregates term frequencies per label.
    -    // TODO: Calling combineByKey and collect creates two stages, we can implement something
    -    // TODO: similar to reduceByKeyLocally to save one stage.
    -    val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
    -      createCombiner = (v: Vector) => {
    -        if (modelType == Bernoulli) {
    -          requireZeroOneBernoulliValues(v)
    -        } else {
    -          requireNonnegativeValues(v)
    -        }
    -        (1L, v.copy.toDense)
    -      },
    -      mergeValue = (c: (Long, DenseVector), v: Vector) => {
    -        requireNonnegativeValues(v)
    -        BLAS.axpy(1.0, v, c._2)
    -        (c._1 + 1L, c._2)
    -      },
    -      mergeCombiners = (c1: (Long, DenseVector), c2: (Long, DenseVector)) => {
    -        BLAS.axpy(1.0, c2._2, c1._2)
    -        (c1._1 + c2._1, c1._2)
    -      }
    -    ).collect().sortBy(_._1)
    +    val nb = new NewNaiveBayes()
    +      .setModelType(modelType)
    +      .setSmoothing(lambda)
     
    -    val numLabels = aggregated.length
    -    var numDocuments = 0L
    -    aggregated.foreach { case (_, (n, _)) =>
    -      numDocuments += n
    -    }
    -    val numFeatures = aggregated.head match { case (_, (_, v)) => v.size }
    -
    -    val labels = new Array[Double](numLabels)
    -    val pi = new Array[Double](numLabels)
    -    val theta = Array.fill(numLabels)(new Array[Double](numFeatures))
    -
    -    val piLogDenom = math.log(numDocuments + numLabels * lambda)
    -    var i = 0
    -    aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    -      labels(i) = label
    -      pi(i) = math.log(n + lambda) - piLogDenom
    -      val thetaLogDenom = modelType match {
    -        case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
    -        case Bernoulli => math.log(n + 2.0 * lambda)
    -        case _ =>
    -          // This should never happen.
    -          throw new UnknownError(s"Invalid modelType: $modelType.")
    -      }
    -      var j = 0
    -      while (j < numFeatures) {
    -        theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
    -        j += 1
    -      }
    -      i += 1
    +    val labels = data.map(_.label).distinct().collect().sorted
    +
    +    // Input labels for [[org.apache.spark.ml.classification.NaiveBayes]] must be
    +    // in range [0, numClasses).
    +    val dataset = data.map {
    +      case LabeledPoint(label, features) =>
    +        (labels.indexOf(label).toDouble, features.asML)
    +    }.toDF("label", "features")
    --- End diff --
    
    It's better to compare the performance between ```toDF``` and ```createDataFrame```, and the regression performance is also necessary.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216141557
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65289 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65289/consoleFull)** for PR 12819 at commit [`5cf7dd7`](https://github.com/apache/spark/commit/5cf7dd7a4a8373f1e2a887b373891eb893ef5047).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63264 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63264/consoleFull)** for PR 12819 at commit [`beb9243`](https://github.com/apache/spark/commit/beb9243d8eb6f24899d0816819749aa11adf4e40).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63251/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79566824
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +119,88 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(_ >= 0.0)) {
    +        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    +      }
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    +        throw new SparkException(
    +          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    +      }
    +    }
    +
    +    val requireValues: Vector => Unit = {
    +      $(modelType) match {
    +        case Multinomial =>
    +          requireNonnegativeValues
    +        case Bernoulli =>
    +          requireZeroOneBernoulliValues
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +    }
    +
    +    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    +
    +    val aggregated = dataset.select(col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd
    +      .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
    +      }.aggregateByKey[(Double, DenseVector)]((0.0, Vectors.zeros(numFeatures).toDense))(
    +      seqOp = {
    +         case (agg, (weight, features)) =>
    +           requireValues(features)
    +           BLAS.axpy(weight, features, agg._2)
    +           (agg._1 + weight, agg._2)
    +      },
    +      combOp = {
    +         case (agg1, agg2) =>
    +           BLAS.axpy(1.0, agg2._2, agg1._2)
    +           (agg1._1 + agg2._1, agg1._2)
    +      }).collect().sortBy(_._1)
    +
    +    val numLabels = aggregated.length
    +    val numDocuments = aggregated.map(_._2._1).sum
    +
    +    val pi = Array.fill[Double](numLabels)(0.0)
    +    val theta = Array.fill[Double](numLabels, numFeatures)(0.0)
    +
    +    val lambda = $(smoothing)
    +    val piLogDenom = math.log(numDocuments + numLabels * lambda)
    +    var i = 0
    +    aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    +      pi(i) = math.log(n + lambda) - piLogDenom
    +      val thetaLogDenom = $(modelType) match {
    +        case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
    +        case Bernoulli => math.log(n + 2.0 * lambda)
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +      var j = 0
    +      while (j < numFeatures) {
    +        theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
    +        j += 1
    +      }
    +      i += 1
    +    }
    +
    +    val uid = Identifiable.randomUID("nb")
    --- End diff --
    
    ```uid``` has been generated at ```def this() = this(Identifiable.randomUID("nb"))```, you should use it directly rather than generate a new one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63250 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63250/consoleFull)** for PR 12819 at commit [`41e97ca`](https://github.com/apache/spark/commit/41e97cadc6dc37fce1a33c2d20e45c59c5fc676a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    @zhengruifeng Please create JIRAs for the follow-up works:
    * Parity check between the ml and mllib test suites, and complement missing test cases for ml.
    * Investigate how to handle ```-1, +1``` or similar labels efficiently.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65869/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65599 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65599/consoleFull)** for PR 12819 at commit [`4311f88`](https://github.com/apache/spark/commit/4311f886da4bd4b66f32f9c891f003af55bc2b6b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81128334
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
    @@ -355,79 +356,33 @@ class NaiveBayes private (
        */
       @Since("0.9.0")
       def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
    -    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(_ >= 0.0)) {
    -        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    -      }
    -    }
    +    val spark = SparkSession
    +      .builder()
    +      .sparkContext(data.context)
    +      .getOrCreate()
     
    -    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    -        throw new SparkException(
    -          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    -      }
    -    }
    +    import spark.implicits._
     
    -    // Aggregates term frequencies per label.
    -    // TODO: Calling combineByKey and collect creates two stages, we can implement something
    -    // TODO: similar to reduceByKeyLocally to save one stage.
    -    val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
    -      createCombiner = (v: Vector) => {
    -        if (modelType == Bernoulli) {
    -          requireZeroOneBernoulliValues(v)
    -        } else {
    -          requireNonnegativeValues(v)
    -        }
    -        (1L, v.copy.toDense)
    -      },
    -      mergeValue = (c: (Long, DenseVector), v: Vector) => {
    -        requireNonnegativeValues(v)
    -        BLAS.axpy(1.0, v, c._2)
    -        (c._1 + 1L, c._2)
    -      },
    -      mergeCombiners = (c1: (Long, DenseVector), c2: (Long, DenseVector)) => {
    -        BLAS.axpy(1.0, c2._2, c1._2)
    -        (c1._1 + c2._1, c1._2)
    -      }
    -    ).collect().sortBy(_._1)
    +    val nb = new NewNaiveBayes()
    +      .setModelType(modelType)
    +      .setSmoothing(lambda)
     
    -    val numLabels = aggregated.length
    -    var numDocuments = 0L
    -    aggregated.foreach { case (_, (n, _)) =>
    -      numDocuments += n
    -    }
    -    val numFeatures = aggregated.head match { case (_, (_, v)) => v.size }
    -
    -    val labels = new Array[Double](numLabels)
    -    val pi = new Array[Double](numLabels)
    -    val theta = Array.fill(numLabels)(new Array[Double](numFeatures))
    -
    -    val piLogDenom = math.log(numDocuments + numLabels * lambda)
    -    var i = 0
    -    aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    -      labels(i) = label
    -      pi(i) = math.log(n + lambda) - piLogDenom
    -      val thetaLogDenom = modelType match {
    -        case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
    -        case Bernoulli => math.log(n + 2.0 * lambda)
    -        case _ =>
    -          // This should never happen.
    -          throw new UnknownError(s"Invalid modelType: $modelType.")
    -      }
    -      var j = 0
    -      while (j < numFeatures) {
    -        theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
    -        j += 1
    -      }
    -      i += 1
    +    val labels = data.map(_.label).distinct().collect().sorted
    --- End diff --
    
    Labels in mllib's NB implementation are not guaranteed to be in range [0, numClasses), and the following code with label set to `{-1,+1}` run successfully
    ```
    import org.apache.spark.mllib.linalg._
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.classification.NaiveBayes
    
    val points = Seq(LabeledPoint(-1.0, Vectors.dense(Array(1.0,2.0))), LabeledPoint(+1.0, Vectors.dense(Array(1.0,2.0))))
    val rdd = sc.parallelize(points)
    val nbm = NaiveBayes.train(rdd)
    
    ```
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79602209
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +119,88 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(_ >= 0.0)) {
    +        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    +      }
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    +        throw new SparkException(
    +          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    +      }
    +    }
    +
    +    val requireValues: Vector => Unit = {
    +      $(modelType) match {
    +        case Multinomial =>
    +          requireNonnegativeValues
    +        case Bernoulli =>
    +          requireZeroOneBernoulliValues
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +    }
    +
    +    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    +
    +    val aggregated = dataset.select(col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd
    --- End diff --
    
    ok, but this TODO and comments should be added in ml's implement.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r87029278
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +118,89 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +
    +      require(values.forall(_ >= 0.0),
    +        s"Naive Bayes requires nonnegative feature values but found $v.")
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +
    +      require(values.forall(v => v == 0.0 || v == 1.0),
    +        s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    +    }
    +
    +    val requireValues: Vector => Unit = {
    +      $(modelType) match {
    --- End diff --
    
    Looking at the decompiled bytecode, the situation is a bit more complicated: the pattern match is inlined and is not wrapped in a closure, so it would work as you would expect. However, because this relies on a compiler optimization, we should still evaluate the model type before outside `requireValues`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65712/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66100/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81106353
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
    @@ -150,6 +150,54 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
         validateProbabilities(featureAndProbabilities, model, "multinomial")
       }
     
    +  test("Naive Bayes Multinomial with weighted samples") {
    +    val nPoints = 1000
    +    val piArray = Array(0.5, 0.1, 0.4).map(math.log)
    +    val thetaArray = Array(
    +      Array(0.70, 0.10, 0.10, 0.10), // label 0
    +      Array(0.10, 0.70, 0.10, 0.10), // label 1
    +      Array(0.10, 0.10, 0.70, 0.10) // label 2
    +    ).map(_.map(math.log))
    +
    +    val testData = spark.createDataFrame(
    +      generateNaiveBayesInput(piArray, thetaArray, nPoints, 42, "multinomial"))
    +    val (overSampledData, weightedData) =
    +      MLTestingUtils.genEquivalentOversampledAndWeightedInstances(testData,
    +        "label", "features", 42L)
    +    val nb = new NaiveBayes().setModelType("multinomial")
    +    val unweightedModel = nb.fit(weightedData)
    +    val overSampledModel = nb.fit(overSampledData)
    +    val weightedModel = nb.setWeightCol("weight").fit(weightedData)
    +    assert(weightedModel.theta ~== overSampledModel.theta relTol 0.001)
    +    assert(weightedModel.pi ~== overSampledModel.pi relTol 0.001)
    +    assert(unweightedModel.theta !~= overSampledModel.theta relTol 0.001)
    +    assert(unweightedModel.pi !~= overSampledModel.pi relTol 0.001)
    +  }
    +
    +  test("Naive Bayes Bernoulli with weighted samples") {
    +    val nPoints = 10000
    +    val piArray = Array(0.5, 0.3, 0.2).map(math.log)
    +    val thetaArray = Array(
    +      Array(0.50, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.40), // label 0
    +      Array(0.02, 0.70, 0.10, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02), // label 1
    +      Array(0.02, 0.02, 0.60, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.30)  // label 2
    +    ).map(_.map(math.log))
    +
    +    val testData = spark.createDataFrame(
    +      generateNaiveBayesInput(piArray, thetaArray, nPoints, 42, "bernoulli"))
    --- End diff --
    
    Ditto.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63264 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63264/consoleFull)** for PR 12819 at commit [`beb9243`](https://github.com/apache/spark/commit/beb9243d8eb6f24899d0816819749aa11adf4e40).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r80884362
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
    @@ -150,6 +150,75 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
         validateProbabilities(featureAndProbabilities, model, "multinomial")
       }
     
    +  test("Naive Bayes Multinomial with weighted samples") {
    +    val (dataset, weightedDataset) = {
    +      val nPoints = 1000
    +      val piArray = Array(0.5, 0.1, 0.4).map(math.log)
    +      val thetaArray = Array(
    +        Array(0.70, 0.10, 0.10, 0.10), // label 0
    +        Array(0.10, 0.70, 0.10, 0.10), // label 1
    +        Array(0.10, 0.10, 0.70, 0.10) // label 2
    +      ).map(_.map(math.log))
    +      val pi = Vectors.dense(piArray)
    +      val theta = new DenseMatrix(3, 4, thetaArray.flatten, true)
    +
    +      val testData = generateNaiveBayesInput(piArray, thetaArray, nPoints, 42, "multinomial")
    +
    +      // Let's over-sample the label-1 samples twice, label-2 samples triple.
    +      val data1 = testData.flatMap { case labeledPoint: LabeledPoint =>
    +        labeledPoint.label match {
    +          case 0.0 => Iterator(labeledPoint)
    +          case 1.0 => Iterator(labeledPoint, labeledPoint)
    +          case 2.0 => Iterator(labeledPoint, labeledPoint, labeledPoint)
    +        }
    +      }
    +
    +      val rnd = new Random(8392)
    +      val data2 = testData.flatMap { case LabeledPoint(label: Double, features: Vector) =>
    --- End diff --
    
    @sethah Thanks a lot!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81105041
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
    @@ -27,11 +27,14 @@ import org.json4s.jackson.JsonMethods._
     import org.apache.spark.{SparkContext, SparkException}
     import org.apache.spark.annotation.Since
     import org.apache.spark.internal.Logging
    -import org.apache.spark.mllib.linalg.{BLAS, DenseMatrix, DenseVector, SparseVector, Vector}
    +import org.apache.spark.ml.classification.{NaiveBayes => NewNaiveBayes}
    +import org.apache.spark.ml.linalg.VectorUDT
    --- End diff --
    
    Remove unused import.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65869/consoleFull)** for PR 12819 at commit [`8742be3`](https://github.com/apache/spark/commit/8742be32eb8100a0de9c0b6a419749b095e52c0c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63341 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63341/consoleFull)** for PR 12819 at commit [`3d71543`](https://github.com/apache/spark/commit/3d71543b76729cf90580d8535ddbe43d75670fa7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63250 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63250/consoleFull)** for PR 12819 at commit [`41e97ca`](https://github.com/apache/spark/commit/41e97cadc6dc37fce1a33c2d20e45c59c5fc676a).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81256496
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
    @@ -355,79 +356,33 @@ class NaiveBayes private (
        */
       @Since("0.9.0")
       def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
    -    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(_ >= 0.0)) {
    -        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    -      }
    -    }
    +    val spark = SparkSession
    +      .builder()
    +      .sparkContext(data.context)
    +      .getOrCreate()
     
    -    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    -        throw new SparkException(
    -          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    -      }
    -    }
    +    import spark.implicits._
     
    -    // Aggregates term frequencies per label.
    -    // TODO: Calling combineByKey and collect creates two stages, we can implement something
    -    // TODO: similar to reduceByKeyLocally to save one stage.
    -    val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
    -      createCombiner = (v: Vector) => {
    -        if (modelType == Bernoulli) {
    -          requireZeroOneBernoulliValues(v)
    -        } else {
    -          requireNonnegativeValues(v)
    -        }
    -        (1L, v.copy.toDense)
    -      },
    -      mergeValue = (c: (Long, DenseVector), v: Vector) => {
    -        requireNonnegativeValues(v)
    -        BLAS.axpy(1.0, v, c._2)
    -        (c._1 + 1L, c._2)
    -      },
    -      mergeCombiners = (c1: (Long, DenseVector), c2: (Long, DenseVector)) => {
    -        BLAS.axpy(1.0, c2._2, c1._2)
    -        (c1._1 + c2._1, c1._2)
    -      }
    -    ).collect().sortBy(_._1)
    +    val nb = new NewNaiveBayes()
    +      .setModelType(modelType)
    +      .setSmoothing(lambda)
     
    -    val numLabels = aggregated.length
    -    var numDocuments = 0L
    -    aggregated.foreach { case (_, (n, _)) =>
    -      numDocuments += n
    -    }
    -    val numFeatures = aggregated.head match { case (_, (_, v)) => v.size }
    -
    -    val labels = new Array[Double](numLabels)
    -    val pi = new Array[Double](numLabels)
    -    val theta = Array.fill(numLabels)(new Array[Double](numFeatures))
    -
    -    val piLogDenom = math.log(numDocuments + numLabels * lambda)
    -    var i = 0
    -    aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    -      labels(i) = label
    -      pi(i) = math.log(n + lambda) - piLogDenom
    -      val thetaLogDenom = modelType match {
    -        case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
    -        case Bernoulli => math.log(n + 2.0 * lambda)
    -        case _ =>
    -          // This should never happen.
    -          throw new UnknownError(s"Invalid modelType: $modelType.")
    -      }
    -      var j = 0
    -      while (j < numFeatures) {
    -        theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
    -        j += 1
    -      }
    -      i += 1
    +    val labels = data.map(_.label).distinct().collect().sorted
    +
    +    // Input labels for [[org.apache.spark.ml.classification.NaiveBayes]] must be
    +    // in range [0, numClasses).
    +    val dataset = data.map {
    --- End diff --
    
    `data.toDF()` dont work, because there are two types of `VectorUDT`.
    This change will cause MLlib's test failures, so I revert it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63252 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63252/consoleFull)** for PR 12819 at commit [`a75dccd`](https://github.com/apache/spark/commit/a75dccdbbdf9b31dbce5c196f8f46e90916681d0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81106187
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
    @@ -150,6 +150,54 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
         validateProbabilities(featureAndProbabilities, model, "multinomial")
       }
     
    +  test("Naive Bayes Multinomial with weighted samples") {
    +    val nPoints = 1000
    +    val piArray = Array(0.5, 0.1, 0.4).map(math.log)
    +    val thetaArray = Array(
    +      Array(0.70, 0.10, 0.10, 0.10), // label 0
    +      Array(0.10, 0.70, 0.10, 0.10), // label 1
    +      Array(0.10, 0.10, 0.70, 0.10) // label 2
    +    ).map(_.map(math.log))
    +
    +    val testData = spark.createDataFrame(
    +      generateNaiveBayesInput(piArray, thetaArray, nPoints, 42, "multinomial"))
    --- End diff --
    
    ```val testData = generateNaiveBayesInput(piArray, thetaArray, nPoints, 42, "multinomial").toDF()```.
    We promote ```toDF()``` which is more succinct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    I will make a pass soon. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216141409
  
    **[Test build #57525 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57525/consoleFull)** for PR 12819 at commit [`552c785`](https://github.com/apache/spark/commit/552c7858511aa6cb43922219f701898c82538603).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65257/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63338/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #66105 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66105/consoleFull)** for PR 12819 at commit [`3dd6333`](https://github.com/apache/spark/commit/3dd6333064d1f2418d5457d9a5a6bf71b56b3138).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63341/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216161515
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57526/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65289/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66138/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81105501
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
    @@ -355,79 +356,33 @@ class NaiveBayes private (
        */
       @Since("0.9.0")
       def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
    -    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(_ >= 0.0)) {
    -        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    -      }
    -    }
    +    val spark = SparkSession
    +      .builder()
    +      .sparkContext(data.context)
    +      .getOrCreate()
     
    -    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    -        throw new SparkException(
    -          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    -      }
    -    }
    +    import spark.implicits._
     
    -    // Aggregates term frequencies per label.
    -    // TODO: Calling combineByKey and collect creates two stages, we can implement something
    -    // TODO: similar to reduceByKeyLocally to save one stage.
    -    val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
    -      createCombiner = (v: Vector) => {
    -        if (modelType == Bernoulli) {
    -          requireZeroOneBernoulliValues(v)
    -        } else {
    -          requireNonnegativeValues(v)
    -        }
    -        (1L, v.copy.toDense)
    -      },
    -      mergeValue = (c: (Long, DenseVector), v: Vector) => {
    -        requireNonnegativeValues(v)
    -        BLAS.axpy(1.0, v, c._2)
    -        (c._1 + 1L, c._2)
    -      },
    -      mergeCombiners = (c1: (Long, DenseVector), c2: (Long, DenseVector)) => {
    -        BLAS.axpy(1.0, c2._2, c1._2)
    -        (c1._1 + c2._1, c1._2)
    -      }
    -    ).collect().sortBy(_._1)
    +    val nb = new NewNaiveBayes()
    +      .setModelType(modelType)
    +      .setSmoothing(lambda)
     
    -    val numLabels = aggregated.length
    -    var numDocuments = 0L
    -    aggregated.foreach { case (_, (n, _)) =>
    -      numDocuments += n
    -    }
    -    val numFeatures = aggregated.head match { case (_, (_, v)) => v.size }
    -
    -    val labels = new Array[Double](numLabels)
    -    val pi = new Array[Double](numLabels)
    -    val theta = Array.fill(numLabels)(new Array[Double](numFeatures))
    -
    -    val piLogDenom = math.log(numDocuments + numLabels * lambda)
    -    var i = 0
    -    aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    -      labels(i) = label
    -      pi(i) = math.log(n + lambda) - piLogDenom
    -      val thetaLogDenom = modelType match {
    -        case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
    -        case Bernoulli => math.log(n + 2.0 * lambda)
    -        case _ =>
    -          // This should never happen.
    -          throw new UnknownError(s"Invalid modelType: $modelType.")
    -      }
    -      var j = 0
    -      while (j < numFeatures) {
    -        theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
    -        j += 1
    -      }
    -      i += 1
    +    val labels = data.map(_.label).distinct().collect().sorted
    --- End diff --
    
    It's not necessary to get labels by RDD operation which is expensive. Since we sort the labels and guarantee it in range [0, numClass), we can use ```val labels = pi.indices.map(_.toDouble).toArray``` directly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66099/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r87025073
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +118,89 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    --- End diff --
    
    @zhengruifeng (minor) this method is static and does not depend on the state of the object. It is preferable to put it in the companion object. Also, because people do not have a functional programing background, it is better to define it with `def` rather than a `val`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65259 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65259/consoleFull)** for PR 12819 at commit [`b310fe0`](https://github.com/apache/spark/commit/b310fe00a6ef7294995e358064508986b9956a93).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63252/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79603503
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +119,88 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(_ >= 0.0)) {
    +        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    +      }
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    +        throw new SparkException(
    +          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    +      }
    +    }
    +
    +    val requireValues: Vector => Unit = {
    +      $(modelType) match {
    +        case Multinomial =>
    +          requireNonnegativeValues
    +        case Bernoulli =>
    +          requireZeroOneBernoulliValues
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +    }
    +
    +    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    +
    +    val aggregated = dataset.select(col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd
    +      .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
    +      }.aggregateByKey[(Double, DenseVector)]((0.0, Vectors.zeros(numFeatures).toDense))(
    +      seqOp = {
    +         case (agg, (weight, features)) =>
    --- End diff --
    
    ok.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65706/consoleFull)** for PR 12819 at commit [`a9922b1`](https://github.com/apache/spark/commit/a9922b1a5f0a7fa98accf844add1edcb42a0e1cd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65591 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65591/consoleFull)** for PR 12819 at commit [`9b0ebc3`](https://github.com/apache/spark/commit/9b0ebc343db8a6ba29f0b71cbdf55e2bc6b90985).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81107309
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
    @@ -355,79 +356,33 @@ class NaiveBayes private (
        */
       @Since("0.9.0")
       def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
    -    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(_ >= 0.0)) {
    -        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    -      }
    -    }
    +    val spark = SparkSession
    +      .builder()
    +      .sparkContext(data.context)
    +      .getOrCreate()
     
    -    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    -        throw new SparkException(
    -          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    -      }
    -    }
    +    import spark.implicits._
     
    -    // Aggregates term frequencies per label.
    -    // TODO: Calling combineByKey and collect creates two stages, we can implement something
    -    // TODO: similar to reduceByKeyLocally to save one stage.
    -    val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
    -      createCombiner = (v: Vector) => {
    -        if (modelType == Bernoulli) {
    -          requireZeroOneBernoulliValues(v)
    -        } else {
    -          requireNonnegativeValues(v)
    -        }
    -        (1L, v.copy.toDense)
    -      },
    -      mergeValue = (c: (Long, DenseVector), v: Vector) => {
    -        requireNonnegativeValues(v)
    -        BLAS.axpy(1.0, v, c._2)
    -        (c._1 + 1L, c._2)
    -      },
    -      mergeCombiners = (c1: (Long, DenseVector), c2: (Long, DenseVector)) => {
    -        BLAS.axpy(1.0, c2._2, c1._2)
    -        (c1._1 + c2._1, c1._2)
    -      }
    -    ).collect().sortBy(_._1)
    +    val nb = new NewNaiveBayes()
    +      .setModelType(modelType)
    +      .setSmoothing(lambda)
     
    -    val numLabels = aggregated.length
    -    var numDocuments = 0L
    -    aggregated.foreach { case (_, (n, _)) =>
    -      numDocuments += n
    -    }
    -    val numFeatures = aggregated.head match { case (_, (_, v)) => v.size }
    -
    -    val labels = new Array[Double](numLabels)
    -    val pi = new Array[Double](numLabels)
    -    val theta = Array.fill(numLabels)(new Array[Double](numFeatures))
    -
    -    val piLogDenom = math.log(numDocuments + numLabels * lambda)
    -    var i = 0
    -    aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    -      labels(i) = label
    -      pi(i) = math.log(n + lambda) - piLogDenom
    -      val thetaLogDenom = modelType match {
    -        case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
    -        case Bernoulli => math.log(n + 2.0 * lambda)
    -        case _ =>
    -          // This should never happen.
    -          throw new UnknownError(s"Invalid modelType: $modelType.")
    -      }
    -      var j = 0
    -      while (j < numFeatures) {
    -        theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
    -        j += 1
    -      }
    -      i += 1
    +    val labels = data.map(_.label).distinct().collect().sorted
    +
    +    // Input labels for [[org.apache.spark.ml.classification.NaiveBayes]] must be
    +    // in range [0, numClasses).
    +    val dataset = data.map {
    --- End diff --
    
    ```val dataset = data.toDF()```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216032578
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65653 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65653/consoleFull)** for PR 12819 at commit [`d503682`](https://github.com/apache/spark/commit/d503682b94582a33066bbba452c7b7919c684ca9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81127329
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +119,90 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(_ >= 0.0)) {
    +        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    +      }
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    --- End diff --
    
    ditto.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #66099 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66099/consoleFull)** for PR 12819 at commit [`af53267`](https://github.com/apache/spark/commit/af53267c4cb64501a75e8a91eb276e513c48e0bb).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r87025222
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +118,89 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    --- End diff --
    
    Also, this will help reducing the length of this method, which is very long.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216138377
  
    **[Test build #57526 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57526/consoleFull)** for PR 12819 at commit [`0d3462a`](https://github.com/apache/spark/commit/0d3462a4e4aeea26183a4a8cedd6c64a92dcf0a2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65653/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #66099 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66099/consoleFull)** for PR 12819 at commit [`af53267`](https://github.com/apache/spark/commit/af53267c4cb64501a75e8a91eb276e513c48e0bb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #66138 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66138/consoleFull)** for PR 12819 at commit [`c469942`](https://github.com/apache/spark/commit/c46994262197a6144a49840bb170f15a18565bea).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65869/consoleFull)** for PR 12819 at commit [`8742be3`](https://github.com/apache/spark/commit/8742be32eb8100a0de9c0b6a419749b095e52c0c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79632172
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
    @@ -150,6 +150,75 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
         validateProbabilities(featureAndProbabilities, model, "multinomial")
       }
     
    +  test("Naive Bayes Multinomial with weighted samples") {
    +    val (dataset, weightedDataset) = {
    +      val nPoints = 1000
    +      val piArray = Array(0.5, 0.1, 0.4).map(math.log)
    +      val thetaArray = Array(
    +        Array(0.70, 0.10, 0.10, 0.10), // label 0
    +        Array(0.10, 0.70, 0.10, 0.10), // label 1
    +        Array(0.10, 0.10, 0.70, 0.10) // label 2
    +      ).map(_.map(math.log))
    +      val pi = Vectors.dense(piArray)
    +      val theta = new DenseMatrix(3, 4, thetaArray.flatten, true)
    +
    +      val testData = generateNaiveBayesInput(piArray, thetaArray, nPoints, 42, "multinomial")
    +
    +      // Let's over-sample the label-1 samples twice, label-2 samples triple.
    +      val data1 = testData.flatMap { case labeledPoint: LabeledPoint =>
    +        labeledPoint.label match {
    +          case 0.0 => Iterator(labeledPoint)
    +          case 1.0 => Iterator(labeledPoint, labeledPoint)
    +          case 2.0 => Iterator(labeledPoint, labeledPoint, labeledPoint)
    +        }
    +      }
    +
    +      val rnd = new Random(8392)
    +      val data2 = testData.flatMap { case LabeledPoint(label: Double, features: Vector) =>
    --- End diff --
    
    So, we are adding more and more algorithms in MLlib that are accepting weights, and I think we need to take care to create standardized unit tests that we can reuse. Could you take a look at the test for `LogisticRegression` and reuse that framework here? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #66138 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66138/consoleFull)** for PR 12819 at commit [`c469942`](https://github.com/apache/spark/commit/c46994262197a6144a49840bb170f15a18565bea).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66105/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79569085
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
    @@ -355,79 +357,32 @@ class NaiveBayes private (
        */
       @Since("0.9.0")
       def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
    -    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(_ >= 0.0)) {
    -        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    -      }
    -    }
    +    val spark = SparkSession
    +      .builder()
    +      .getOrCreate()
    --- End diff --
    
    ```val spark = SparkSession.builder().sparkContext(data.context).getOrCreate()```
    We should guarantee SparkSession was generated based on the existing data RDD context.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63165/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65259 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65259/consoleFull)** for PR 12819 at commit [`b310fe0`](https://github.com/apache/spark/commit/b310fe00a6ef7294995e358064508986b9956a93).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65598 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65598/consoleFull)** for PR 12819 at commit [`33bd790`](https://github.com/apache/spark/commit/33bd790c260605cc14e6e7f0ed6df54b69370de0).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63341 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63341/consoleFull)** for PR 12819 at commit [`3d71543`](https://github.com/apache/spark/commit/3d71543b76729cf90580d8535ddbe43d75670fa7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216141564
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57525/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216025013
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65289 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65289/consoleFull)** for PR 12819 at commit [`5cf7dd7`](https://github.com/apache/spark/commit/5cf7dd7a4a8373f1e2a887b373891eb893ef5047).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63165 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63165/consoleFull)** for PR 12819 at commit [`8da481c`](https://github.com/apache/spark/commit/8da481c45e42afb10bb334a8702f596364e91c5c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216032580
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57477/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63338 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63338/consoleFull)** for PR 12819 at commit [`4230896`](https://github.com/apache/spark/commit/4230896a146e7239f2b4e5f30dcf514fbe50429c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65653 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65653/consoleFull)** for PR 12819 at commit [`d503682`](https://github.com/apache/spark/commit/d503682b94582a33066bbba452c7b7919c684ca9).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81285828
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
    @@ -355,79 +356,33 @@ class NaiveBayes private (
        */
       @Since("0.9.0")
       def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
    -    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(_ >= 0.0)) {
    -        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    -      }
    -    }
    +    val spark = SparkSession
    +      .builder()
    +      .sparkContext(data.context)
    +      .getOrCreate()
     
    -    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    -        throw new SparkException(
    -          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    -      }
    -    }
    +    import spark.implicits._
     
    -    // Aggregates term frequencies per label.
    -    // TODO: Calling combineByKey and collect creates two stages, we can implement something
    -    // TODO: similar to reduceByKeyLocally to save one stage.
    -    val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
    -      createCombiner = (v: Vector) => {
    -        if (modelType == Bernoulli) {
    -          requireZeroOneBernoulliValues(v)
    -        } else {
    -          requireNonnegativeValues(v)
    -        }
    -        (1L, v.copy.toDense)
    -      },
    -      mergeValue = (c: (Long, DenseVector), v: Vector) => {
    -        requireNonnegativeValues(v)
    -        BLAS.axpy(1.0, v, c._2)
    -        (c._1 + 1L, c._2)
    -      },
    -      mergeCombiners = (c1: (Long, DenseVector), c2: (Long, DenseVector)) => {
    -        BLAS.axpy(1.0, c2._2, c1._2)
    -        (c1._1 + c2._1, c1._2)
    -      }
    -    ).collect().sortBy(_._1)
    +    val nb = new NewNaiveBayes()
    +      .setModelType(modelType)
    +      .setSmoothing(lambda)
     
    -    val numLabels = aggregated.length
    -    var numDocuments = 0L
    -    aggregated.foreach { case (_, (n, _)) =>
    -      numDocuments += n
    -    }
    -    val numFeatures = aggregated.head match { case (_, (_, v)) => v.size }
    -
    -    val labels = new Array[Double](numLabels)
    -    val pi = new Array[Double](numLabels)
    -    val theta = Array.fill(numLabels)(new Array[Double](numFeatures))
    -
    -    val piLogDenom = math.log(numDocuments + numLabels * lambda)
    -    var i = 0
    -    aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    -      labels(i) = label
    -      pi(i) = math.log(n + lambda) - piLogDenom
    -      val thetaLogDenom = modelType match {
    -        case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
    -        case Bernoulli => math.log(n + 2.0 * lambda)
    -        case _ =>
    -          // This should never happen.
    -          throw new UnknownError(s"Invalid modelType: $modelType.")
    -      }
    -      var j = 0
    -      while (j < numFeatures) {
    -        theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
    -        j += 1
    -      }
    -      i += 1
    +    val labels = data.map(_.label).distinct().collect().sorted
    --- End diff --
    
    Yeah, untouching old mllib implementation is one of the candidates, but not an elegant solution. I'm ambivalent about this issue. Let's leave this issue as follow-up work. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81225025
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
    @@ -355,79 +356,33 @@ class NaiveBayes private (
        */
       @Since("0.9.0")
       def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
    -    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(_ >= 0.0)) {
    -        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    -      }
    -    }
    +    val spark = SparkSession
    +      .builder()
    +      .sparkContext(data.context)
    +      .getOrCreate()
     
    -    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    -        throw new SparkException(
    -          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    -      }
    -    }
    +    import spark.implicits._
     
    -    // Aggregates term frequencies per label.
    -    // TODO: Calling combineByKey and collect creates two stages, we can implement something
    -    // TODO: similar to reduceByKeyLocally to save one stage.
    -    val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
    -      createCombiner = (v: Vector) => {
    -        if (modelType == Bernoulli) {
    -          requireZeroOneBernoulliValues(v)
    -        } else {
    -          requireNonnegativeValues(v)
    -        }
    -        (1L, v.copy.toDense)
    -      },
    -      mergeValue = (c: (Long, DenseVector), v: Vector) => {
    -        requireNonnegativeValues(v)
    -        BLAS.axpy(1.0, v, c._2)
    -        (c._1 + 1L, c._2)
    -      },
    -      mergeCombiners = (c1: (Long, DenseVector), c2: (Long, DenseVector)) => {
    -        BLAS.axpy(1.0, c2._2, c1._2)
    -        (c1._1 + c2._1, c1._2)
    -      }
    -    ).collect().sortBy(_._1)
    +    val nb = new NewNaiveBayes()
    +      .setModelType(modelType)
    +      .setSmoothing(lambda)
     
    -    val numLabels = aggregated.length
    -    var numDocuments = 0L
    -    aggregated.foreach { case (_, (n, _)) =>
    -      numDocuments += n
    -    }
    -    val numFeatures = aggregated.head match { case (_, (_, v)) => v.size }
    -
    -    val labels = new Array[Double](numLabels)
    -    val pi = new Array[Double](numLabels)
    -    val theta = Array.fill(numLabels)(new Array[Double](numFeatures))
    -
    -    val piLogDenom = math.log(numDocuments + numLabels * lambda)
    -    var i = 0
    -    aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    -      labels(i) = label
    -      pi(i) = math.log(n + lambda) - piLogDenom
    -      val thetaLogDenom = modelType match {
    -        case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
    -        case Bernoulli => math.log(n + 2.0 * lambda)
    -        case _ =>
    -          // This should never happen.
    -          throw new UnknownError(s"Invalid modelType: $modelType.")
    -      }
    -      var j = 0
    -      while (j < numFeatures) {
    -        theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
    -        j += 1
    -      }
    -      i += 1
    +    val labels = data.map(_.label).distinct().collect().sorted
    --- End diff --
    
    Yeah, so I think this is going to introduce a performance regression into the mllib version. Before, you could get the labels from the aggregates, but if we make mllib call into the ml implementation we need an extra pass through the data. One way to avoid this is leave the mllib implementation untouched, which has been done in the past with GBT and RF. It's not a great solution, I'm open to other suggestions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #66100 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66100/consoleFull)** for PR 12819 at commit [`9c3eb7f`](https://github.com/apache/spark/commit/9c3eb7ffe8cfcdedc3dbac637b34a256927e20da).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65591 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65591/consoleFull)** for PR 12819 at commit [`9b0ebc3`](https://github.com/apache/spark/commit/9b0ebc343db8a6ba29f0b71cbdf55e2bc6b90985).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79567686
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +119,88 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(_ >= 0.0)) {
    +        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    +      }
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    +        throw new SparkException(
    +          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    +      }
    +    }
    +
    +    val requireValues: Vector => Unit = {
    +      $(modelType) match {
    +        case Multinomial =>
    +          requireNonnegativeValues
    +        case Bernoulli =>
    +          requireZeroOneBernoulliValues
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +    }
    +
    +    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    +
    +    val aggregated = dataset.select(col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd
    +      .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
    +      }.aggregateByKey[(Double, DenseVector)]((0.0, Vectors.zeros(numFeatures).toDense))(
    +      seqOp = {
    +         case (agg, (weight, features)) =>
    +           requireValues(features)
    +           BLAS.axpy(weight, features, agg._2)
    +           (agg._1 + weight, agg._2)
    +      },
    +      combOp = {
    +         case (agg1, agg2) =>
    +           BLAS.axpy(1.0, agg2._2, agg1._2)
    +           (agg1._1 + agg2._1, agg1._2)
    +      }).collect().sortBy(_._1)
    +
    +    val numLabels = aggregated.length
    +    val numDocuments = aggregated.map(_._2._1).sum
    +
    +    val pi = Array.fill[Double](numLabels)(0.0)
    +    val theta = Array.fill[Double](numLabels, numFeatures)(0.0)
    +
    +    val lambda = $(smoothing)
    +    val piLogDenom = math.log(numDocuments + numLabels * lambda)
    +    var i = 0
    +    aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    +      pi(i) = math.log(n + lambda) - piLogDenom
    +      val thetaLogDenom = $(modelType) match {
    +        case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
    +        case Bernoulli => math.log(n + 2.0 * lambda)
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +      var j = 0
    +      while (j < numFeatures) {
    +        theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
    +        j += 1
    +      }
    +      i += 1
    +    }
    +
    +    val uid = Identifiable.randomUID("nb")
    +    val piVector = Vectors.dense(pi)
    +    val thetaMatrix = new DenseMatrix(numLabels, theta(0).length, theta.flatten, true)
    --- End diff --
    
    ```pi -> piArray, piVector -> pi, theta -> thetaArrays, thetaMatrix -> theta``` should be better?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216161197
  
    **[Test build #57526 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57526/consoleFull)** for PR 12819 at commit [`0d3462a`](https://github.com/apache/spark/commit/0d3462a4e4aeea26183a4a8cedd6c64a92dcf0a2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65257 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65257/consoleFull)** for PR 12819 at commit [`9d53987`](https://github.com/apache/spark/commit/9d5398796201204a24d552ea9b92d6b9fca738ea).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    @zhengruifeng Only left some minor comments, otherwise, looks good. I think we should also make parity check between the ml and mllib test suites, and complement missing test cases for ml since we may remove mllib package in later release. But I think this can be done in a separate follow-up work.
    @sethah Any more comments about this? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216161513
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216023539
  
    **[Test build #57473 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57473/consoleFull)** for PR 12819 at commit [`2dc4b23`](https://github.com/apache/spark/commit/2dc4b23f5b5e840a182c1c076c5adbb5f7511fe5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65599/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65598 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65598/consoleFull)** for PR 12819 at commit [`33bd790`](https://github.com/apache/spark/commit/33bd790c260605cc14e6e7f0ed6df54b69370de0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #66100 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66100/consoleFull)** for PR 12819 at commit [`9c3eb7f`](https://github.com/apache/spark/commit/9c3eb7ffe8cfcdedc3dbac637b34a256927e20da).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81127281
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +119,90 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(_ >= 0.0)) {
    --- End diff --
    
    let's just use `require` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79564439
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +119,88 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(_ >= 0.0)) {
    +        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    +      }
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    +        throw new SparkException(
    +          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    +      }
    +    }
    +
    +    val requireValues: Vector => Unit = {
    +      $(modelType) match {
    +        case Multinomial =>
    +          requireNonnegativeValues
    +        case Bernoulli =>
    +          requireZeroOneBernoulliValues
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +    }
    +
    +    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    +
    +    val aggregated = dataset.select(col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd
    +      .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
    +      }.aggregateByKey[(Double, DenseVector)]((0.0, Vectors.zeros(numFeatures).toDense))(
    +      seqOp = {
    +         case (agg, (weight, features)) =>
    --- End diff --
    
    ```agg -> (weightSum: Double, featureSum: Vector)```, we should avoid use ```._1``` and it's better to mark type explicitly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216032577
  
    **[Test build #57477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57477/consoleFull)** for PR 12819 at commit [`b8625cc`](https://github.com/apache/spark/commit/b8625ccc1dec1ac484452ad41d17b11a16bbe015).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63251/consoleFull)** for PR 12819 at commit [`a75dccd`](https://github.com/apache/spark/commit/a75dccdbbdf9b31dbce5c196f8f46e90916681d0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63250/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79752969
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala ---
    @@ -150,6 +150,75 @@ class NaiveBayesSuite extends SparkFunSuite with MLlibTestSparkContext with Defa
         validateProbabilities(featureAndProbabilities, model, "multinomial")
       }
     
    +  test("Naive Bayes Multinomial with weighted samples") {
    +    val (dataset, weightedDataset) = {
    +      val nPoints = 1000
    +      val piArray = Array(0.5, 0.1, 0.4).map(math.log)
    +      val thetaArray = Array(
    +        Array(0.70, 0.10, 0.10, 0.10), // label 0
    +        Array(0.10, 0.70, 0.10, 0.10), // label 1
    +        Array(0.10, 0.10, 0.70, 0.10) // label 2
    +      ).map(_.map(math.log))
    +      val pi = Vectors.dense(piArray)
    +      val theta = new DenseMatrix(3, 4, thetaArray.flatten, true)
    +
    +      val testData = generateNaiveBayesInput(piArray, thetaArray, nPoints, 42, "multinomial")
    +
    +      // Let's over-sample the label-1 samples twice, label-2 samples triple.
    +      val data1 = testData.flatMap { case labeledPoint: LabeledPoint =>
    +        labeledPoint.label match {
    +          case 0.0 => Iterator(labeledPoint)
    +          case 1.0 => Iterator(labeledPoint, labeledPoint)
    +          case 2.0 => Iterator(labeledPoint, labeledPoint, labeledPoint)
    +        }
    +      }
    +
    +      val rnd = new Random(8392)
    +      val data2 = testData.flatMap { case LabeledPoint(label: Double, features: Vector) =>
    --- End diff --
    
    Good point. Of course, I will update this testsuite keep in line with other algorithms.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65706/consoleFull)** for PR 12819 at commit [`a9922b1`](https://github.com/apache/spark/commit/a9922b1a5f0a7fa98accf844add1edcb42a0e1cd).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r78325579
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -98,7 +99,17 @@ class NaiveBayes @Since("1.5.0") (
        */
       @Since("1.5.0")
       def setModelType(value: String): this.type = set(modelType, value)
    -  setDefault(modelType -> OldNaiveBayes.Multinomial)
    +  setDefault(modelType -> NaiveBayes.Multinomial)
    +
    +  /**
    +   * Whether to over-/under-sample training instances according to the given weights in weightCol.
    +   * If empty, all instances are treated equally (weight 1.0).
    +   * Default is empty, so all instances have weight one.
    +   * @group setParam
    +   */
    +  @Since("2.1.0")
    +  def setWeightCol(value: String): this.type = set(weightCol, value)
    +  setDefault(weightCol -> "")
    --- End diff --
    
    It's not necessary to set default for ```weightCol```. You can refer other place where used ```weightCol```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    @yanboliang I will make usage of `weightCol` keeping in line with other algorithms, and I will do some performance test. Thanks for reviewing this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r87151780
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +118,89 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +
    +      require(values.forall(_ >= 0.0),
    +        s"Naive Bayes requires nonnegative feature values but found $v.")
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +
    +      require(values.forall(v => v == 0.0 || v == 1.0),
    +        s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    +    }
    +
    +    val requireValues: Vector => Unit = {
    +      $(modelType) match {
    --- End diff --
    
    @thunterdb Good catch! I updated at #15826. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    I test on two datasets: a9a and mnist.scale. Download here http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
    The test codes are following:
    ```
    import org.apache.spark.ml.classification.NaiveBayes
    val path = "/Users/zrf/Dev/Data/mnist.scale"
    val data = spark.read.format("libsvm").load(path).persist()
    println(data.count)
    val t0 = System.currentTimeMillis()
    for(i <- 0 until 100) { println(i); val model = new NaiveBayes().fit(data) }
    val t1 = System.currentTimeMillis()
    println(t1-t0)
    ```
    
    Result duration:
    `a9a`:   Spark2.0  7944;   This PR  6877
    `mnist.scale`: Spark2.0  11881;  This PR 13287



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    @yanboliang What went into the decision to use RDD based aggregation? Just curious, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65599 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65599/consoleFull)** for PR 12819 at commit [`4311f88`](https://github.com/apache/spark/commit/4311f886da4bd4b66f32f9c891f003af55bc2b6b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63338 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63338/consoleFull)** for PR 12819 at commit [`4230896`](https://github.com/apache/spark/commit/4230896a146e7239f2b4e5f30dcf514fbe50429c).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65706/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81132746
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -280,15 +366,14 @@ object NaiveBayesModel extends MLReadable[NaiveBayesModel] {
           val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
     
           val dataPath = new Path(path, "data").toString
    -      val data = sparkSession.read.parquet(dataPath)
    -      val vecConverted = MLUtils.convertVectorColumnsToML(data, "pi")
    -      val Row(pi: Vector, theta: Matrix) = MLUtils.convertMatrixColumnsToML(vecConverted, "theta")
    -        .select("pi", "theta")
    -        .head()
    +      val data = sparkSession.read.parquet(dataPath).select("pi", "theta").head()
    +      val pi = data.getAs[Vector](0)
    +      val theta = data.getAs[Matrix](1)
           val model = new NaiveBayesModel(metadata.uid, pi, theta)
     
           DefaultParamsReader.getAndSetParams(model, metadata)
           model
         }
       }
    +
    --- End diff --
    
    remove


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216025006
  
    **[Test build #57473 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57473/consoleFull)** for PR 12819 at commit [`2dc4b23`](https://github.com/apache/spark/commit/2dc4b23f5b5e840a182c1c076c5adbb5f7511fe5).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79603468
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +119,88 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(_ >= 0.0)) {
    +        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    +      }
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    +        throw new SparkException(
    +          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    +      }
    +    }
    +
    +    val requireValues: Vector => Unit = {
    +      $(modelType) match {
    +        case Multinomial =>
    +          requireNonnegativeValues
    +        case Bernoulli =>
    +          requireZeroOneBernoulliValues
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +    }
    +
    +    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    +
    +    val aggregated = dataset.select(col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd
    +      .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
    +      }.aggregateByKey[(Double, DenseVector)]((0.0, Vectors.zeros(numFeatures).toDense))(
    +      seqOp = {
    +         case (agg, (weight, features)) =>
    +           requireValues(features)
    +           BLAS.axpy(weight, features, agg._2)
    +           (agg._1 + weight, agg._2)
    +      },
    +      combOp = {
    +         case (agg1, agg2) =>
    +           BLAS.axpy(1.0, agg2._2, agg1._2)
    +           (agg1._1 + agg2._1, agg1._2)
    +      }).collect().sortBy(_._1)
    +
    +    val numLabels = aggregated.length
    +    val numDocuments = aggregated.map(_._2._1).sum
    +
    +    val pi = Array.fill[Double](numLabels)(0.0)
    +    val theta = Array.fill[Double](numLabels, numFeatures)(0.0)
    +
    +    val lambda = $(smoothing)
    +    val piLogDenom = math.log(numDocuments + numLabels * lambda)
    +    var i = 0
    +    aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    +      pi(i) = math.log(n + lambda) - piLogDenom
    +      val thetaLogDenom = $(modelType) match {
    +        case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
    +        case Bernoulli => math.log(n + 2.0 * lambda)
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +      var j = 0
    +      while (j < numFeatures) {
    +        theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
    +        j += 1
    +      }
    +      i += 1
    +    }
    +
    +    val uid = Identifiable.randomUID("nb")
    --- End diff --
    
    ok, I will remove this line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    @sethah It's a good questions, and we have some concerns:
    * The Dataset based implementation does not bring much performance improvement from the test result(even worse in some case).
    * I'm not sure the UDAF API is stable enough and have confidential performance.
    * This PR mainly focus on moving NB implementation from mllib to ml package and refactor it to support weighted instance. Let's focus on what it should be. We can change it to use dataset based UDAF in a separate work later if it's necessary. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    @zhengruifeng I saw your implementation switch the training process from RDD operation to Dataset operation with UDAF. I think we should do some performance test to verify there is no performance degradation. Otherwise, we can still use the RDD operation in this PR and update it to use Dataset later. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63264/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79565555
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +119,88 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(_ >= 0.0)) {
    +        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    +      }
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    +        throw new SparkException(
    +          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    +      }
    +    }
    +
    +    val requireValues: Vector => Unit = {
    +      $(modelType) match {
    +        case Multinomial =>
    +          requireNonnegativeValues
    +        case Bernoulli =>
    +          requireZeroOneBernoulliValues
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +    }
    +
    +    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    +
    +    val aggregated = dataset.select(col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd
    +      .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
    +      }.aggregateByKey[(Double, DenseVector)]((0.0, Vectors.zeros(numFeatures).toDense))(
    +      seqOp = {
    +         case (agg, (weight, features)) =>
    +           requireValues(features)
    +           BLAS.axpy(weight, features, agg._2)
    +           (agg._1 + weight, agg._2)
    +      },
    +      combOp = {
    +         case (agg1, agg2) =>
    --- End diff --
    
    Ditto: ```(agg1, agg2) -> ((weightSum1: Double, featureSum1: Vector), (weightSum2: Double, featureSum2: Vetor))```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65257 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65257/consoleFull)** for PR 12819 at commit [`9d53987`](https://github.com/apache/spark/commit/9d5398796201204a24d552ea9b92d6b9fca738ea).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63165 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63165/consoleFull)** for PR 12819 at commit [`8da481c`](https://github.com/apache/spark/commit/8da481c45e42afb10bb334a8702f596364e91c5c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65712/consoleFull)** for PR 12819 at commit [`8f4d671`](https://github.com/apache/spark/commit/8f4d671674c2a911ebacfc1722d163be0ca23ebc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65259/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r81105095
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
    @@ -27,11 +27,14 @@ import org.json4s.jackson.JsonMethods._
     import org.apache.spark.{SparkContext, SparkException}
     import org.apache.spark.annotation.Since
     import org.apache.spark.internal.Logging
    -import org.apache.spark.mllib.linalg.{BLAS, DenseMatrix, DenseVector, SparseVector, Vector}
    +import org.apache.spark.ml.classification.{NaiveBayes => NewNaiveBayes}
    +import org.apache.spark.ml.linalg.VectorUDT
    +import org.apache.spark.mllib.linalg.{BLAS, DenseMatrix, DenseVector, Vector}
     import org.apache.spark.mllib.regression.LabeledPoint
     import org.apache.spark.mllib.util.{Loader, Saveable}
     import org.apache.spark.rdd.RDD
    -import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.{Row, SparkSession}
    +import org.apache.spark.sql.types._
    --- End diff --
    
    Remove unused import.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79795778
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
    @@ -355,79 +357,32 @@ class NaiveBayes private (
        */
       @Since("0.9.0")
       def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
    -    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(_ >= 0.0)) {
    -        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    -      }
    -    }
    +    val spark = SparkSession
    +      .builder()
    +      .getOrCreate()
     
    -    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    -      val values = v match {
    -        case sv: SparseVector => sv.values
    -        case dv: DenseVector => dv.values
    -      }
    -      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    -        throw new SparkException(
    -          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    -      }
    -    }
    +    import spark.implicits._
     
    -    // Aggregates term frequencies per label.
    -    // TODO: Calling combineByKey and collect creates two stages, we can implement something
    -    // TODO: similar to reduceByKeyLocally to save one stage.
    -    val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
    -      createCombiner = (v: Vector) => {
    -        if (modelType == Bernoulli) {
    -          requireZeroOneBernoulliValues(v)
    -        } else {
    -          requireNonnegativeValues(v)
    -        }
    -        (1L, v.copy.toDense)
    -      },
    -      mergeValue = (c: (Long, DenseVector), v: Vector) => {
    -        requireNonnegativeValues(v)
    -        BLAS.axpy(1.0, v, c._2)
    -        (c._1 + 1L, c._2)
    -      },
    -      mergeCombiners = (c1: (Long, DenseVector), c2: (Long, DenseVector)) => {
    -        BLAS.axpy(1.0, c2._2, c1._2)
    -        (c1._1 + c2._1, c1._2)
    -      }
    -    ).collect().sortBy(_._1)
    +    val nb = new NewNaiveBayes()
    +      .setModelType(modelType)
    +      .setSmoothing(lambda)
     
    -    val numLabels = aggregated.length
    -    var numDocuments = 0L
    -    aggregated.foreach { case (_, (n, _)) =>
    -      numDocuments += n
    -    }
    -    val numFeatures = aggregated.head match { case (_, (_, v)) => v.size }
    -
    -    val labels = new Array[Double](numLabels)
    -    val pi = new Array[Double](numLabels)
    -    val theta = Array.fill(numLabels)(new Array[Double](numFeatures))
    -
    -    val piLogDenom = math.log(numDocuments + numLabels * lambda)
    -    var i = 0
    -    aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    -      labels(i) = label
    -      pi(i) = math.log(n + lambda) - piLogDenom
    -      val thetaLogDenom = modelType match {
    -        case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
    -        case Bernoulli => math.log(n + 2.0 * lambda)
    -        case _ =>
    -          // This should never happen.
    -          throw new UnknownError(s"Invalid modelType: $modelType.")
    -      }
    -      var j = 0
    -      while (j < numFeatures) {
    -        theta(i)(j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
    -        j += 1
    -      }
    -      i += 1
    +    val labels = data.map(_.label).distinct().collect().sorted
    +
    +    // Input labels for [[org.apache.spark.ml.classification.NaiveBayes]] must be
    +    // in range [0, numClasses).
    +    val dataset = data.map {
    +      case LabeledPoint(label, features) =>
    +        (labels.indexOf(label).toDouble, features.asML)
    +    }.toDF("label", "features")
    --- End diff --
    
    @yanboliang  I test `toDF` and `createDF`, and the implement with `toDF` is faster:
    dataset:  mnist.scale, numFeatures : 780, numSamples 60000
    run 1000 times 
    Duration:  `toDF` 85488, `createDF` 97122.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65598/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    @zhengruifeng As we discussed offline, we would like to still use RDD operation underneath rather than UDAF. Please update the PR accordingly. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65591/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r78325688
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +120,51 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val wvsum = new WeightedVectorSum($(modelType), numFeatures)
    +
    +    val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    +
    +    val aggregated =
    +      dataset.select(col($(labelCol)).cast(DoubleType).as("label"), w.as("weight"),
    +        col($(featuresCol)).as("features"))
    +        .groupBy(col($(labelCol)))
    +        .agg(sum(col("weight")), wvsum(col("weight"), col("features")))
    +        .collect().map { row =>
    +        (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2).toDense))
    +      }.sortBy(_._1)
    --- End diff --
    
    Do you have some performance test for switching to Dataset based operation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #66105 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66105/consoleFull)** for PR 12819 at commit [`3dd6333`](https://github.com/apache/spark/commit/3dd6333064d1f2418d5457d9a5a6bf71b56b3138).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216129168
  
    **[Test build #57525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57525/consoleFull)** for PR 12819 at commit [`552c785`](https://github.com/apache/spark/commit/552c7858511aa6cb43922219f701898c82538603).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14077][ML] Refactor NaiveBayes to suppo...

Posted by zhengruifeng <gi...@git.apache.org>.

Github user zhengruifeng commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216167147
  
    cc @mengxr @yanboliang 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by yanboliang <gi...@git.apache.org>.

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r79561754
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +119,88 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(_ >= 0.0)) {
    +        throw new SparkException(s"Naive Bayes requires nonnegative feature values but found $v.")
    +      }
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +      if (!values.forall(v => v == 0.0 || v == 1.0)) {
    +        throw new SparkException(
    +          s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    +      }
    +    }
    +
    +    val requireValues: Vector => Unit = {
    +      $(modelType) match {
    +        case Multinomial =>
    +          requireNonnegativeValues
    +        case Bernoulli =>
    +          requireZeroOneBernoulliValues
    +        case _ =>
    +          // This should never happen.
    +          throw new UnknownError(s"Invalid modelType: ${$(modelType)}.")
    +      }
    +    }
    +
    +    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    +
    +    val aggregated = dataset.select(col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd
    --- End diff --
    
    It's better to keep the original comments and TODOs here, and it can help developers or users to improve the code continuously.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #65712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65712/consoleFull)** for PR 12819 at commit [`8f4d671`](https://github.com/apache/spark/commit/8f4d671674c2a911ebacfc1722d163be0ca23ebc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63251 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63251/consoleFull)** for PR 12819 at commit [`a75dccd`](https://github.com/apache/spark/commit/a75dccdbbdf9b31dbce5c196f8f46e90916681d0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12819


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [DO-NOT-MERGE][TEST] Refactor NaiveBayes to su...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12819#issuecomment-216025014
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57473/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    **[Test build #63252 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63252/consoleFull)** for PR 12819 at commit [`a75dccd`](https://github.com/apache/spark/commit/a75dccdbbdf9b31dbce5c196f8f46e90916681d0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #12819: [SPARK-14077][ML] Refactor NaiveBayes to support weighte...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/12819
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #12819: [SPARK-14077][ML] Refactor NaiveBayes to support ...

Posted by thunterdb <gi...@git.apache.org>.

Github user thunterdb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12819#discussion_r87025609
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala ---
    @@ -109,10 +118,89 @@ class NaiveBayes @Since("1.5.0") (
             s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
         }
     
    -    val oldDataset: RDD[OldLabeledPoint] =
    -      extractLabeledPoints(dataset).map(OldLabeledPoint.fromML)
    -    val oldModel = OldNaiveBayes.train(oldDataset, $(smoothing), $(modelType))
    -    NaiveBayesModel.fromOld(oldModel, this)
    +    val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
    +
    +    val requireNonnegativeValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +
    +      require(values.forall(_ >= 0.0),
    +        s"Naive Bayes requires nonnegative feature values but found $v.")
    +    }
    +
    +    val requireZeroOneBernoulliValues: Vector => Unit = (v: Vector) => {
    +      val values = v match {
    +        case sv: SparseVector => sv.values
    +        case dv: DenseVector => dv.values
    +      }
    +
    +      require(values.forall(v => v == 0.0 || v == 1.0),
    +        s"Bernoulli naive Bayes requires 0 or 1 feature values but found $v.")
    +    }
    +
    +    val requireValues: Vector => Unit = {
    +      $(modelType) match {
    --- End diff --
    
    @zhengruifeng (major) you are capturing the outer object, through `$` and `modelType`. You should make a `val` outside the method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org