Posted to issues@spark.apache.org by "Barry Becker (JIRA)" <ji...@apache.org> on 2016/08/01 22:34:20 UTC

[jira] [Created] (SPARK-16840) Please save the aggregate term frequencies as part of the NaiveBayesModel

Barry Becker created SPARK-16840:
------------------------------------

             Summary: Please save the aggregate term frequencies as part of the NaiveBayesModel
                 Key: SPARK-16840
                 URL: https://issues.apache.org/jira/browse/SPARK-16840
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.0.0, 1.6.2
            Reporter: Barry Becker


I would like to visualize the structure of the NaiveBayes model to get additional insight into the patterns in the data. To do that, I need the frequencies of each feature value per label.

This exact information is computed in the NaiveBayes.run method (see the "aggregated" variable) but is then discarded when the model is created. Pi and theta are computed from the aggregated frequency counts, but, surprisingly, the counts themselves are not needed to apply the model. Including the aggregated counts would not add much to the model size, and they could be very useful for some applications of the model.

{code}
  def run(data: RDD[LabeledPoint]): NaiveBayesModel = {
     :
    // Aggregates term frequencies per label.
    val aggregated = data.map(p => (p.label, p.features)).combineByKey[(Long, DenseVector)](
      createCombiner = (v: Vector) => {
        :
      },
    :
    new NaiveBayesModel(labels, pi, theta, modelType) // <- please include "aggregated" here.
  }
{code}
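
To illustrate why the counts are lost, here is a minimal sketch in plain Scala (no Spark dependency) of how pi and theta can be derived from per-label counts in the multinomial case with additive smoothing. The names LabelCounts, Params, AggregatedCountsSketch and fit are hypothetical, made up for this example, and the formulas reflect my understanding of the usual multinomial smoothing rather than a copy of the Spark source.

```scala
// Hypothetical container for one label: number of documents seen and the
// summed term frequencies per feature (what "aggregated" holds per label).
final case class LabelCounts(label: Double, numDocs: Long, termFreqs: Array[Double])

// Hypothetical result type: log priors (pi) and log conditionals (theta).
final case class Params(labels: Array[Double], pi: Array[Double], theta: Array[Array[Double]])

object AggregatedCountsSketch {
  // Multinomial Naive Bayes parameters with additive smoothing `lambda`.
  def fit(aggregated: Seq[LabelCounts], lambda: Double = 1.0): Params = {
    require(aggregated.nonEmpty, "need at least one label")
    val numLabels = aggregated.length
    val numFeatures = aggregated.head.termFreqs.length
    val numDocs = aggregated.map(_.numDocs).sum
    val piLogDenom = math.log(numDocs + numLabels * lambda)

    val labels = aggregated.map(_.label).toArray
    // Log prior per label: log((n + lambda) / (N + numLabels * lambda)).
    val pi = aggregated.map(c => math.log(c.numDocs + lambda) - piLogDenom).toArray
    // Log conditional per label and feature; the raw counts are normalized away here.
    val theta = aggregated.map { c =>
      val thetaLogDenom = math.log(c.termFreqs.sum + numFeatures * lambda)
      c.termFreqs.map(f => math.log(f + lambda) - thetaLogDenom)
    }.toArray
    Params(labels, pi, theta)
  }
}
```

Because theta is normalized per label, the original frequencies cannot be recovered from pi and theta alone, which is why keeping "aggregated" alongside them would help.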


