Posted to reviews@spark.apache.org by yinxusen <gi...@git.apache.org> on 2016/07/16 01:59:38 UTC

[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

GitHub user yinxusen opened a pull request:

    https://github.com/apache/spark/pull/14229

    [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

    ## What changes were proposed in this pull request?
    
    Add an LDA wrapper in SparkR with the following interfaces (a short usage sketch follows the list):
    
    - spark.lda(data, ...)
    
    - spark.posterior(object, newData, ...)
    
    - spark.perplexity(object, ...)
    
    - summary(object)
    
    - write.ml(object)
    
    - read.ml(path)
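
    A minimal usage sketch, assuming an active SparkR session and a string-format
    "features" column (the data and save path below are hypothetical):

        df <- createDataFrame(data.frame(features = c(
          "spark mllib supports topic modeling",
          "latent dirichlet allocation discovers topics")))
        model <- spark.lda(df, k = 2, maxIter = 20, optimizer = "online")
        summary(model)                            # effective alpha/beta, log likelihood, topics
        posterior <- spark.posterior(model, df)   # adds a "topicDistribution" column
        spark.perplexity(model, df)               # log perplexity of the given data
        write.ml(model, "/tmp/lda-model")
        model2 <- read.ml("/tmp/lda-model")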
    
    ## How was this patch tested?
    
    Tested with SparkR unit tests.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yinxusen/spark SPARK-16447

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14229.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14229
    
----
commit 4db86c16768cafff2a3091520a282764ce69bf84
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-09T23:30:23Z

    a runnable version

commit 7f8650dae796f85c66600c85dfb7ced26ed3e29e
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-10T02:53:46Z

    runnable version with complex args

commit 1487dcc9de85e95af3e2865d3a9068c9bc395928
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-10T03:59:45Z

    add summary without new dictionary

commit bdc38191f41e8ecb3b2c8caa46671f33db6576fc
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-11T21:25:07Z

    add test for spark.lda

commit 324871f3519465cd10ec85d670ddb3459416569e
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-12T22:18:33Z

    add new functions

commit 3be7105a4b6661b2c74e8e6f0a3936ca50c2c414
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-12T23:36:18Z

    merge with master

commit 7f3fcc63197ddc8f49ed2ee21956c2527d18a542
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-14T20:10:49Z

    add raw text input support

commit 27fa94b9c74814459a54286212e11b6426635ef1
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-14T22:21:18Z

    add vocabulary

commit 4f6aa1ecd6ae50212b456b7ae1c1ef83d10b8bbd
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-15T22:31:34Z

    add index to term dict

commit db61624ca9de39700c2fab83eedcb026a13bf8d7
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-15T22:49:50Z

    change likelihood to log one

commit 00e4e07a4fcbfe5b093511d1e5dfe09b6880a45f
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-15T23:29:35Z

    refine R docs

commit 89f0ae4b23a2d20d62c851db522718aa08c12514
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-16T01:10:39Z

    update docs and more tests

commit 02a7719f08dbc6c985b9c2768a444bbb4995ca28
Author: Xusen Yin <yi...@gmail.com>
Date:   2016-07-16T01:57:44Z

    fix interface

----




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #62401 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62401/consoleFull)** for PR 14229 at commit [`fa87794`](https://github.com/apache/spark/commit/fa87794311d69ea4a9eb9019c581aea21de2b006).
     * This patch **fails some tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by wangmiao1981 <gi...@git.apache.org>.
Github user wangmiao1981 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74542484
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
    @@ -0,0 +1,210 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.r
    +
    +import scala.collection.mutable
    +
    +import org.apache.hadoop.fs.Path
    +import org.json4s._
    +import org.json4s.JsonDSL._
    +import org.json4s.jackson.JsonMethods._
    +
    +import org.apache.spark.SparkException
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
    +import org.apache.spark.ml.clustering.{LDA, LDAModel}
    +import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover}
    +import org.apache.spark.ml.linalg.VectorUDT
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.StringType
    +
    +
    +private[r] class LDAWrapper private (
    +    val pipeline: PipelineModel,
    +    val logLikelihood: Double,
    +    val logPerplexity: Double,
    +    val vocabulary: Array[String]) extends MLWritable {
    +
    +  import LDAWrapper._
    +
    +  private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
    +  private val preprocessor: PipelineModel =
    +    new PipelineModel(s"${Identifiable.randomUID(pipeline.uid)}", pipeline.stages.dropRight(1))
    +
    +  def transform(data: Dataset[_]): DataFrame = {
    +    pipeline.transform(data).drop(TOKENIZER_COL, STOPWORDS_REMOVER_COL, COUNT_VECTOR_COL)
    +  }
    +
    +  def computeLogPerplexity(data: Dataset[_]): Double = {
    +    lda.logPerplexity(preprocessor.transform(data))
    +  }
    +
    +  lazy val topicIndices: DataFrame = lda.describeTopics(10)
    +
    +  lazy val topics = if (vocabulary.isEmpty || vocabulary.length < vocabSize) {
    +    topicIndices
    +  } else {
    +    val index2term = udf { indices: mutable.WrappedArray[Int] => indices.map(i => vocabulary(i)) }
    +    topicIndices.select(col("topic"), index2term(col("termIndices")).as("term"), col("termWeights"))
    +  }
    +
    +  lazy val isDistributed: Boolean = lda.isDistributed
    +  lazy val vocabSize: Int = lda.vocabSize
    +  lazy val docConcentration: Array[Double] = lda.getEffectiveDocConcentration
    +  lazy val topicConcentration: Double = lda.getEffectiveTopicConcentration
    +
    +  override def write: MLWriter = new LDAWrapper.LDAWrapperWriter(this)
    +}
    +
    +private[r] object LDAWrapper extends MLReadable[LDAWrapper] with Logging {
    +
    +  val TOKENIZER_COL = s"${Identifiable.randomUID("rawTokens")}"
    +  val STOPWORDS_REMOVER_COL = s"${Identifiable.randomUID("tokens")}"
    +  val COUNT_VECTOR_COL = s"${Identifiable.randomUID("features")}"
    +
    +  private def getPreStages(
    +      features: String,
    +      customizedStopWords: Array[String],
    +      maxVocabSize: Int): Array[PipelineStage] = {
    +    val tokenizer = new RegexTokenizer()
    +      .setInputCol(features)
    +      .setOutputCol(TOKENIZER_COL)
    +    val stopWordsRemover = new StopWordsRemover()
    +      .setInputCol(TOKENIZER_COL)
    +      .setOutputCol(STOPWORDS_REMOVER_COL)
    +    stopWordsRemover.setStopWords(stopWordsRemover.getStopWords ++ customizedStopWords)
    +    val countVectorizer = new CountVectorizer()
    +      .setVocabSize(maxVocabSize)
    +      .setInputCol(STOPWORDS_REMOVER_COL)
    +      .setOutputCol(COUNT_VECTOR_COL)
    +
    +    Array(tokenizer, stopWordsRemover, countVectorizer)
    +  }
    +
    +  def fit(
    +      data: DataFrame,
    +      features: String,
    +      k: Int,
    +      maxIter: Int,
    +      optimizer: String,
    +      subsamplingRate: Double,
    +      topicConcentration: Double,
    +      docConcentration: Array[Double],
    +      customizedStopWords: Array[String],
    +      maxVocabSize: Int): LDAWrapper = {
    +
    +    val lda = new LDA()
    +      .setK(k)
    +      .setMaxIter(maxIter)
    +      .setSubsamplingRate(subsamplingRate)
    +
    +    val featureSchema = data.schema(features)
    +    val stages = featureSchema.dataType match {
    +      case d: StringType =>
    +        logDebug(s"Feature ($features) schema is StringType, use the built-in preprocessor.")
    --- End diff --
    
    Remove this debug message? Other wrappers do not have logDebug messages.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14229




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74852090
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    +#'        format column are accepted.
    +#' @param k Number of topics, default 10
    +#' @param maxIter Maximum iterations, default 20
    +#' @param optimizer Optimizer to train an LDA model, "online" or "em", default "online"
    +#' @param subsamplingRate (For online optimizer) Fraction of the corpus to be sampled and used in
    +#'        each iteration of mini-batch gradient descent, in range (0, 1], default 0.05
    +#' @param topicConcentration concentration parameter (commonly named \code{beta} or \code{eta}) for
    +#'        the prior placed on topic distributions over terms, default -1 to set automatically on the
    +#'        Spark side. Use \code{summary} to retrieve the effective topicConcentration.
    +#' @param docConcentration concentration parameter (commonly named \code{alpha}) for the
    +#'        prior placed on documents distributions over topics (\code{theta}), default -1 to set
    +#'        automatically on the Spark side. Use \code{summary} to retrieve the effective
    +#'        docConcentration.
    +#' @param customizedStopWords stopwords that need to be removed from the given corpus. Only effected
    +#'        given training data with string format column.
    --- End diff --
    
    If the user uses a string-format column as features, e.g.
    
        column_str
        "this is the first document"
        "this is another one"
    
    then they can use `customizedStopWords` to filter out stop words.
    
    If they choose a vector-format column, this parameter has no effect.
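
    A sketch of that string-format case (the data and the extra stop word are made up):

        corpus <- createDataFrame(data.frame(features = c(
          "this is the first document",
          "this is another one")))
        model <- spark.lda(corpus, k = 2, customizedStopWords = c("document"))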




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r75034333
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +306,94 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.posterior,LDAModel,SparkDataFrame-method
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}.
    +#' @param maxTermsPerTopic Maximum number of terms to collect for each topic. Default value of 10.
    +#' @return \code{summary} returns a list containing
    +#'         \item{\code{docConcentration}}{concentration parameter commonly named \code{alpha} for
    +#'               the prior placed on documents distributions over topics \code{theta}}
    +#'         \item{\code{topicConcentration}}{concentration parameter commonly named \code{beta} or
    +#'               \code{eta} for the prior placed on topic distributions over terms}
    +#'         \item{\code{logLikelihood}}{log likelihood of the entire corpus}
    +#'         \item{\code{logPerplexity}}{log perplexity}
    +#'         \item{\code{isDistributed}}{TRUE for distributed model while FALSE for local model}
    +#'         \item{\code{vocabSize}}{number of terms in the corpus}
    +#'         \item{\code{topics}}{top 10 terms and their weights of all topics}
    +#'         \item{\code{vocabulary}}{whole terms of the training corpus, NULL if libsvm format file
    +#'               used as training set}
    +#' @rdname spark.lda
    +#' @aliases summary,LDAModel-method
    +#' @export
    +#' @note summary(LDAModel) since 2.1.0
    +setMethod("summary", signature(object = "LDAModel"),
    +          function(object, maxTermsPerTopic, ...) {
    --- End diff --
    
    Not useful here, but the other `summary` methods also have the vararg.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #62400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62400/consoleFull)** for PR 14229 at commit [`02a7719`](https://github.com/apache/spark/commit/02a7719f08dbc6c985b9c2768a444bbb4995ca28).
     * This patch **fails some tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r75120183
  
    --- Diff: R/pkg/R/generics.R ---
    @@ -1279,6 +1279,19 @@ setGeneric("spark.naiveBayes", function(data, formula, ...) { standardGeneric("s
     #' @export
     setGeneric("spark.survreg", function(data, formula, ...) { standardGeneric("spark.survreg") })
     
    +#' @rdname spark.lda
    +#' @export
    +setGeneric("spark.lda", function(data, ...) { standardGeneric("spark.lda") })
    --- End diff --
    
    That's WIP - add that in generics.R, i.e. here.
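
    For reference, the missing generics would look roughly like the `spark.lda` one
    above (the signatures are my assumption, mirroring the setMethod calls in mllib.R):

        #' @rdname spark.lda
        #' @export
        setGeneric("spark.posterior", function(object, newData) { standardGeneric("spark.posterior") })

        #' @rdname spark.lda
        #' @export
        setGeneric("spark.perplexity", function(object, ...) { standardGeneric("spark.perplexity") })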




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    @felixcheung Yes. Sorry I missed the email.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62439/
    Test FAILed.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74865027
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}
    +#' @return \code{summary} returns a list containing
    +#'         \code{docConcentration}, concentration parameter commonly named \code{alpha} for the
    +#'         prior placed on documents distributions over topics \code{theta};
    +#'         \code{topicConcentration}, concentration parameter commonly named \code{beta} or
    +#'         \code{eta} for the prior placed on topic distributions over terms;
    +#'         \code{logLikelihood}, log likelihood of the entire corpus;
    +#'         \code{logPerplexity}, log perplexity;
    +#'         \code{isDistributed}, TRUE for distributed model while FALSE for local model;
    +#'         \code{vocabSize}, number of terms in the corpus;
    +#'         \code{topics}, top 10 terms and their weights of all topics;
    +#'         \code{vocabulary}, whole terms of the training corpus, NULL if libsvm format file used as
    +#'         training set.
    +#' @rdname spark.lda
    +#' @aliases summary,spark.lda,LDAModel-method
    --- End diff --
    
    I think this should be
    `@aliases summary,LDAModel-method`




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74869399
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}
    +#' @return \code{summary} returns a list containing
    +#'         \code{docConcentration}, concentration parameter commonly named \code{alpha} for the
    +#'         prior placed on documents distributions over topics \code{theta};
    +#'         \code{topicConcentration}, concentration parameter commonly named \code{beta} or
    +#'         \code{eta} for the prior placed on topic distributions over terms;
    +#'         \code{logLikelihood}, log likelihood of the entire corpus;
    +#'         \code{logPerplexity}, log perplexity;
    +#'         \code{isDistributed}, TRUE for distributed model while FALSE for local model;
    +#'         \code{vocabSize}, number of terms in the corpus;
    +#'         \code{topics}, top 10 terms and their weights of all topics;
    +#'         \code{vocabulary}, whole terms of the training corpus, NULL if libsvm format file used as
    +#'         training set.
    +#' @rdname spark.lda
    +#' @aliases summary,spark.lda,LDAModel-method
    +#' @export
    +#' @note summary(LDAModel) since 2.1.0
    +setMethod("summary", signature(object = "LDAModel"),
    +          function(object, ...) {
    +            jobj <- object@jobj
    +            docConcentration <- callJMethod(jobj, "docConcentration")
    +            topicConcentration <- callJMethod(jobj, "topicConcentration")
    +            logLikelihood <- callJMethod(jobj, "logLikelihood")
    +            logPerplexity <- callJMethod(jobj, "logPerplexity")
    +            isDistributed <- callJMethod(jobj, "isDistributed")
    +            vocabSize <- callJMethod(jobj, "vocabSize")
    +            topics <- dataFrame(callJMethod(jobj, "topics"))
    +            vocabulary <- callJMethod(jobj, "vocabulary")
    +            return(list(docConcentration = unlist(docConcentration),
    +                        topicConcentration = topicConcentration,
    +                        logLikelihood = logLikelihood, logPerplexity = logPerplexity,
    +                        isDistributed = isDistributed, vocabSize = vocabSize,
    +                        topics = topics,
    +                        vocabulary = unlist(vocabulary)))
    +          })
    +
    +# Returns the log perplexity of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @return \code{spark.perplexity} returns the log perplexity of given SparkDataFrame, or the log
    --- End diff --
    
    add @param
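
    Roughly what the requested tag could look like (the wording is a sketch; the
    parameter name `data` is assumed from the method signature):

        #' @param data A SparkDataFrame whose log perplexity is computed; if missing,
        #'        the log perplexity of the training data is returned.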




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63956 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63956/consoleFull)** for PR 14229 at commit [`41249d7`](https://github.com/apache/spark/commit/41249d76ecd2e89ace3a30212d6e5a74f1376117).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63661 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63661/consoleFull)** for PR 14229 at commit [`ca5ea9e`](https://github.com/apache/spark/commit/ca5ea9e10b53d2d1dc6ff6b350301eb79a944eb8).




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by junyangq <gi...@git.apache.org>.
Github user junyangq commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r75036372
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +702,70 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either libSVM-format column or
    +#'        character-format column are valid.
    --- End diff --
    
    is valid?




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    @felixcheung Merged with master.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74676054
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}
    +#' @return \code{summary} returns a list containing
    +#'         \code{docConcentration}, concentration parameter commonly named \code{alpha} for the
    --- End diff --
    
    Roxygen trims whitespace, so a list like this will be hard to read in the generated doc. Try \item or \cr to format the entries.
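
    For instance, the first two entries could be rewritten with \item (a sketch;
    a later revision of this PR adopts exactly this form):

        #' @return \code{summary} returns a list containing
        #'         \item{\code{docConcentration}}{concentration parameter commonly named \code{alpha} for
        #'               the prior placed on documents distributions over topics \code{theta}}
        #'         \item{\code{topicConcentration}}{concentration parameter commonly named \code{beta} or
        #'               \code{eta} for the prior placed on topic distributions over terms}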




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74662435
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
    @@ -0,0 +1,210 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.r
    +
    +import scala.collection.mutable
    +
    +import org.apache.hadoop.fs.Path
    +import org.json4s._
    +import org.json4s.JsonDSL._
    +import org.json4s.jackson.JsonMethods._
    +
    +import org.apache.spark.SparkException
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
    +import org.apache.spark.ml.clustering.{LDA, LDAModel}
    +import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover}
    +import org.apache.spark.ml.linalg.VectorUDT
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.StringType
    +
    +
    +private[r] class LDAWrapper private (
    +    val pipeline: PipelineModel,
    +    val logLikelihood: Double,
    +    val logPerplexity: Double,
    +    val vocabulary: Array[String]) extends MLWritable {
    +
    +  import LDAWrapper._
    +
    +  private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
    +  private val preprocessor: PipelineModel =
    +    new PipelineModel(s"${Identifiable.randomUID(pipeline.uid)}", pipeline.stages.dropRight(1))
    +
    +  def transform(data: Dataset[_]): DataFrame = {
    +    pipeline.transform(data).drop(TOKENIZER_COL, STOPWORDS_REMOVER_COL, COUNT_VECTOR_COL)
    +  }
    +
    +  def computeLogPerplexity(data: Dataset[_]): Double = {
    +    lda.logPerplexity(preprocessor.transform(data))
    +  }
    +
    +  lazy val topicIndices: DataFrame = lda.describeTopics(10)
    +
    +  lazy val topics = if (vocabulary.isEmpty || vocabulary.length < vocabSize) {
    +    topicIndices
    +  } else {
    +    val index2term = udf { indices: mutable.WrappedArray[Int] => indices.map(i => vocabulary(i)) }
    +    topicIndices.select(col("topic"), index2term(col("termIndices")).as("term"), col("termWeights"))
    +  }
    +
    +  lazy val isDistributed: Boolean = lda.isDistributed
    +  lazy val vocabSize: Int = lda.vocabSize
    +  lazy val docConcentration: Array[Double] = lda.getEffectiveDocConcentration
    +  lazy val topicConcentration: Double = lda.getEffectiveTopicConcentration
    +
    +  override def write: MLWriter = new LDAWrapper.LDAWrapperWriter(this)
    +}
    +
    +private[r] object LDAWrapper extends MLReadable[LDAWrapper] with Logging {
    +
    +  val TOKENIZER_COL = s"${Identifiable.randomUID("rawTokens")}"
    +  val STOPWORDS_REMOVER_COL = s"${Identifiable.randomUID("tokens")}"
    +  val COUNT_VECTOR_COL = s"${Identifiable.randomUID("features")}"
    +
    +  private def getPreStages(
    +      features: String,
    +      customizedStopWords: Array[String],
    +      maxVocabSize: Int): Array[PipelineStage] = {
    +    val tokenizer = new RegexTokenizer()
    +      .setInputCol(features)
    +      .setOutputCol(TOKENIZER_COL)
    +    val stopWordsRemover = new StopWordsRemover()
    +      .setInputCol(TOKENIZER_COL)
    +      .setOutputCol(STOPWORDS_REMOVER_COL)
    +    stopWordsRemover.setStopWords(stopWordsRemover.getStopWords ++ customizedStopWords)
    +    val countVectorizer = new CountVectorizer()
    +      .setVocabSize(maxVocabSize)
    +      .setInputCol(STOPWORDS_REMOVER_COL)
    +      .setOutputCol(COUNT_VECTOR_COL)
    +
    +    Array(tokenizer, stopWordsRemover, countVectorizer)
    +  }
    +
    +  def fit(
    +      data: DataFrame,
    +      features: String,
    +      k: Int,
    +      maxIter: Int,
    +      optimizer: String,
    +      subsamplingRate: Double,
    +      topicConcentration: Double,
    +      docConcentration: Array[Double],
    +      customizedStopWords: Array[String],
    +      maxVocabSize: Int): LDAWrapper = {
    +
    +    val lda = new LDA()
    +      .setK(k)
    +      .setMaxIter(maxIter)
    +      .setSubsamplingRate(subsamplingRate)
    +
    +    val featureSchema = data.schema(features)
    +    val stages = featureSchema.dataType match {
    +      case d: StringType =>
    +        logDebug(s"Feature ($features) schema is StringType, use the built-in preprocessor.")
    --- End diff --
    
    Removed.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74851793
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    --- End diff --
    
    Here I want to say the features column should be either String format, e.g.
    
        column_str
        "this is the first document"
        "this is another one"
    
    or libSVM format, which is represented as an ml.Vector column, e.g.
    
        column_vec
        ml.Vector(...)
        ml.Vector(...)
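
    A sketch of loading each layout (the libsvm path points at Spark's bundled
    sample data; the other names are illustrative):

        strDF <- createDataFrame(data.frame(features = c(
          "this is the first document",
          "this is another one")))
        strModel <- spark.lda(strDF)             # String column: built-in preprocessing runs

        vecDF <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
        vecModel <- spark.lda(vecDF, features = "features")   # Vector column: used as-is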





[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74868095
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    --- End diff --
    
    `@aliases spark.posterior,LDAModel,SparkDataFrame-method`




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74841214
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
    @@ -0,0 +1,207 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.r
    +
    +import scala.collection.mutable
    +
    +import org.apache.hadoop.fs.Path
    +import org.json4s._
    +import org.json4s.JsonDSL._
    +import org.json4s.jackson.JsonMethods._
    +
    +import org.apache.spark.SparkException
    +import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
    +import org.apache.spark.ml.clustering.{LDA, LDAModel}
    +import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover}
    +import org.apache.spark.ml.linalg.VectorUDT
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.StringType
    +
    +
    +private[r] class LDAWrapper private (
    +    val pipeline: PipelineModel,
    +    val logLikelihood: Double,
    +    val logPerplexity: Double,
    +    val vocabulary: Array[String]) extends MLWritable {
    +
    +  import LDAWrapper._
    +
    +  private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
    +  private val preprocessor: PipelineModel =
    +    new PipelineModel(s"${Identifiable.randomUID(pipeline.uid)}", pipeline.stages.dropRight(1))
    +
    +  def transform(data: Dataset[_]): DataFrame = {
    +    pipeline.transform(data).drop(TOKENIZER_COL, STOPWORDS_REMOVER_COL, COUNT_VECTOR_COL)
    +  }
    +
    +  def computeLogPerplexity(data: Dataset[_]): Double = {
    +    lda.logPerplexity(preprocessor.transform(data))
    +  }
    +
    +  lazy val topicIndices: DataFrame = lda.describeTopics(10)
    --- End diff --
    
    This value is used in `summary(LDAModel)`. Usually `summary` only accepts one parameter, i.e. the LDAModel.
    
    Also, there is a default parameter value for `describeTopics` on the Scala side, which is 10.
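
    A sketch of that knob as it appears in a later revision of this PR, where
    `summary` gained a `maxTermsPerTopic` argument (default 10):

        s <- summary(model, maxTermsPerTopic = 10)
        head(s$topics)   # top terms and their weights per topic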




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r72880758
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -596,6 +688,68 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    +#'        format column are accepted.
    +#' @param k Number of topics, default 10
    +#' @param maxIter Maximum iterations, default 20
    +#' @param optimizer Optimizer to train an LDA model, "online" or "em", default "online"
    +#' @param subsamplingRate (For online optimizer) Fraction of the corpus to be sampled and used in
    +#'        each iteration of mini-batch gradient descent, in range (0, 1], default 0.05
    +#' @param topicConcentration concentration parameter (commonly named \code{beta} or \code{eta}) for
    +#'        the prior placed on topic distributions over terms, default -1 to set automatically on the
    +#'        Spark side. Use \code{summary} to retrieve the effective topicConcentration.
    +#' @param docConcentration concentration parameter (commonly named \code{alpha}) for the
    +#'        prior placed on documents distributions over topics (\code{theta}), default -1 to set
    +#'        automatically on the Spark side. Use \code{summary} to retrieve the effective
    +#'        docConcentration.
    +#' @param customizedStopWords stopwords that need to be removed from the given corpus. Only effected
    +#'        given training data with string format column.
    +#' @param maxVocabSize maximum vocabulary size, default 1 << 18
    +#' @return \code{spark.lda} returns a fitted Latent Dirichlet Allocation model
    +#' @rdname spark.lda
    +#' @seealso survival: \url{https://cran.r-project.org/web/packages/topicmodels/}
    --- End diff --
    
    is this the right link?




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74675514
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -313,6 +313,7 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
     #' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
     #'         vectors named "topicDistribution"
     #' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    --- End diff --
    
    `-method` should be at the end




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #62442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62442/consoleFull)** for PR 14229 at commit [`90dad9d`](https://github.com/apache/spark/commit/90dad9d199e7540d455f99db4d028686c36f9f99).




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74864994
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}
    +#' @return \code{summary} returns a list containing
    +#'         \code{docConcentration}, concentration parameter commonly named \code{alpha} for the
    +#'         prior placed on documents distributions over topics \code{theta};
    +#'         \code{topicConcentration}, concentration parameter commonly named \code{beta} or
    +#'         \code{eta} for the prior placed on topic distributions over terms;
    +#'         \code{logLikelihood}, log likelihood of the entire corpus;
    +#'         \code{logPerplexity}, log perplexity;
    +#'         \code{isDistributed}, TRUE for distribuetd model while FALSE for local model;
    +#'         \code{vocabSize}, number of terms in the corpus;
    +#'         \code{topics}, top 10 terms and their weights of all topics;
    +#'         \code{vocabulary}, whole terms of the training corpus, NULL if libsvm format file used as
    +#'         training set.
    +#' @rdname spark.lda
    +#' @aliases summary,spark.lda,LDAModel-method
    +#' @export
    +#' @note summary(LDAModel) since 2.1.0
    +setMethod("summary", signature(object = "LDAModel"),
    +          function(object, ...) {
    +            jobj <- object@jobj
    +            docConcentration <- callJMethod(jobj, "docConcentration")
    +            topicConcentration <- callJMethod(jobj, "topicConcentration")
    +            logLikelihood <- callJMethod(jobj, "logLikelihood")
    +            logPerplexity <- callJMethod(jobj, "logPerplexity")
    +            isDistributed <- callJMethod(jobj, "isDistributed")
    +            vocabSize <- callJMethod(jobj, "vocabSize")
    +            topics <- dataFrame(callJMethod(jobj, "topics"))
    +            vocabulary <- callJMethod(jobj, "vocabulary")
    +            return(list(docConcentration = unlist(docConcentration),
    +                        topicConcentration = topicConcentration,
    +                        logLikelihood = logLikelihood, logPerplexity = logPerplexity,
    +                        isDistributed = isDistributed, vocabSize = vocabSize,
    +                        topics = topics,
    +                        vocabulary = unlist(vocabulary)))
    +          })
    +
    +# Returns the log perplexity of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @return \code{spark.perplexity} returns the log perplexity of given SparkDataFrame, or the log
    +#'         perplexity of the training data if missing argument "data".
    +#' @rdname spark.lda
    +#" @aliases spark.perplexity,spark.lda,LDAModel-method
    +#' @export
    +#' @note summary(LDAModel) since 2.1.0
    --- End diff --
    
    please fix the function name in @note
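    That is, it should read something like:
    
        #' @note spark.perplexity(LDAModel) since 2.1.0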




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Merged build finished. Test FAILed.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74990526
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}
    +#' @return \code{summary} returns a list containing
    +#'         \code{docConcentration}, concentration parameter commonly named \code{alpha} for the
    +#'         prior placed on documents distributions over topics \code{theta};
    +#'         \code{topicConcentration}, concentration parameter commonly named \code{beta} or
    +#'         \code{eta} for the prior placed on topic distributions over terms;
    +#'         \code{logLikelihood}, log likelihood of the entire corpus;
    +#'         \code{logPerplexity}, log perplexity;
    +#'         \code{isDistributed}, TRUE for distribuetd model while FALSE for local model;
    +#'         \code{vocabSize}, number of terms in the corpus;
    +#'         \code{topics}, top 10 terms and their weights of all topics;
    +#'         \code{vocabulary}, whole terms of the training corpus, NULL if libsvm format file used as
    +#'         training set.
    +#' @rdname spark.lda
    +#' @aliases summary,spark.lda,LDAModel-method
    +#' @export
    +#' @note summary(LDAModel) since 2.1.0
    +setMethod("summary", signature(object = "LDAModel"),
    +          function(object, ...) {
    +            jobj <- object@jobj
    +            docConcentration <- callJMethod(jobj, "docConcentration")
    +            topicConcentration <- callJMethod(jobj, "topicConcentration")
    +            logLikelihood <- callJMethod(jobj, "logLikelihood")
    +            logPerplexity <- callJMethod(jobj, "logPerplexity")
    +            isDistributed <- callJMethod(jobj, "isDistributed")
    +            vocabSize <- callJMethod(jobj, "vocabSize")
    +            topics <- dataFrame(callJMethod(jobj, "topics"))
    +            vocabulary <- callJMethod(jobj, "vocabulary")
    +            return(list(docConcentration = unlist(docConcentration),
    +                        topicConcentration = topicConcentration,
    +                        logLikelihood = logLikelihood, logPerplexity = logPerplexity,
    +                        isDistributed = isDistributed, vocabSize = vocabSize,
    +                        topics = topics,
    +                        vocabulary = unlist(vocabulary)))
    +          })
    +
    +# Returns the log perplexity of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @return \code{spark.perplexity} returns the log perplexity of given SparkDataFrame, or the log
    --- End diff --
    
    I removed the `@param` intentionally, because this method, `summary`, and `spark.posterior` all take the `object` param. If `@param object` were added to all of them, the generated Rd file `spark.lda.Rd` would contain three duplicate entries for that param.
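    A minimal sketch of that layout (hypothetical skeletons, assuming the generics and the `LDAModel` class are defined elsewhere in the package; both methods share `@rdname spark.lda`, so they land on one Rd page and `object` only needs to be documented once):
    
        #' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}
        #' @rdname spark.lda
        #' @aliases summary,LDAModel-method
        setMethod("summary", signature(object = "LDAModel"),
                  function(object, ...) {
                    # ... build and return the summary list ...
                  })
    
        # No @param object here: it is already documented on the shared spark.lda page.
        #' @rdname spark.lda
        #' @aliases spark.perplexity,LDAModel,SparkDataFrame-method
        setMethod("spark.perplexity", signature(object = "LDAModel", data = "SparkDataFrame"),
                  function(object, data) {
                    # ... compute and return the log perplexity ...
                  })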




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    LGTM merging.





[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74675709
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    --- End diff --
    
    `-method` should be at the end




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63890/
    Test PASSed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Hi @yinxusen, would you be able to continue working on this?




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62442/
    Test PASSed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #62400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62400/consoleFull)** for PR 14229 at commit [`02a7719`](https://github.com/apache/spark/commit/02a7719f08dbc6c985b9c2768a444bbb4995ca28).




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #62401 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62401/consoleFull)** for PR 14229 at commit [`fa87794`](https://github.com/apache/spark/commit/fa87794311d69ea4a9eb9019c581aea21de2b006).




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63661/
    Test PASSed.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74869802
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
    @@ -0,0 +1,207 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.r
    +
    +import scala.collection.mutable
    +
    +import org.apache.hadoop.fs.Path
    +import org.json4s._
    +import org.json4s.JsonDSL._
    +import org.json4s.jackson.JsonMethods._
    +
    +import org.apache.spark.SparkException
    +import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
    +import org.apache.spark.ml.clustering.{LDA, LDAModel}
    +import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover}
    +import org.apache.spark.ml.linalg.VectorUDT
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.StringType
    +
    +
    +private[r] class LDAWrapper private (
    +    val pipeline: PipelineModel,
    +    val logLikelihood: Double,
    +    val logPerplexity: Double,
    +    val vocabulary: Array[String]) extends MLWritable {
    +
    +  import LDAWrapper._
    +
    +  private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
    +  private val preprocessor: PipelineModel =
    +    new PipelineModel(s"${Identifiable.randomUID(pipeline.uid)}", pipeline.stages.dropRight(1))
    +
    +  def transform(data: Dataset[_]): DataFrame = {
    +    pipeline.transform(data).drop(TOKENIZER_COL, STOPWORDS_REMOVER_COL, COUNT_VECTOR_COL)
    +  }
    +
    +  def computeLogPerplexity(data: Dataset[_]): Double = {
    +    lda.logPerplexity(preprocessor.transform(data))
    +  }
    +
    +  lazy val topicIndices: DataFrame = lda.describeTopics(10)
    --- End diff --
    
    I think we could add an additional parameter if it would be useful.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74869578
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    +#'        format column are accepted.
    +#' @param k Number of topics, default 10
    +#' @param maxIter Maximum iterations, default 20
    +#' @param optimizer Optimizer to train an LDA model, "online" or "em", default "online"
    +#' @param subsamplingRate (For online optimizer) Fraction of the corpus to be sampled and used in
    +#         each iteration of mini-batch gradient descent, in range (0, 1], default 0.05
    +#' @param topicConcentration concentration parameter (commonly named \code{beta} or \code{eta}) for
    +#'        the prior placed on topic distributions over terms, default -1 to set automatically on the
    +#'        Spark side. Use \code{summary} to retrieve the effective topicConcentration.
    +#' @param docConcentration concentration parameter (commonly named \code{alpha}) for the
    +#'        prior placed on documents distributions over topics (\code{theta}), default -1 to set
    +#'        automatically on the Spark side. Use \code{summary} to retrieve the effective
    +#'        docConcentration.
    +#' @param customizedStopWords stopwords that need to be removed from the given corpus. Only effected
    +#'        given training data with string format column.
    --- End diff --
    
    right, I think "affects" is the right word to use
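    So the doc line would become something like:
    
        #' @param customizedStopWords stopwords that need to be removed from the given corpus. Only
        #'        affects training data with a string format column.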




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by lapanda <gi...@git.apache.org>.
Github user lapanda commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    thanks!




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74676118
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}
    +#' @return \code{summary} returns a list containing
    +#'         \code{docConcentration}, concentration parameter commonly named \code{alpha} for the
    +#'         prior placed on documents distributions over topics \code{theta};
    +#'         \code{topicConcentration}, concentration parameter commonly named \code{beta} or
    +#'         \code{eta} for the prior placed on topic distributions over terms;
    +#'         \code{logLikelihood}, log likelihood of the entire corpus;
    +#'         \code{logPerplexity}, log perplexity;
    +#'         \code{isDistributed}, TRUE for distribuetd model while FALSE for local model;
    --- End diff --
    
    `distributed`




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #62439 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62439/consoleFull)** for PR 14229 at commit [`1886c1d`](https://github.com/apache/spark/commit/1886c1dc1005128b868ac849542d08d069fb5bcc).




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74864948
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}
    +#' @return \code{summary} returns a list containing
    +#'         \code{docConcentration}, concentration parameter commonly named \code{alpha} for the
    +#'         prior placed on documents distributions over topics \code{theta};
    +#'         \code{topicConcentration}, concentration parameter commonly named \code{beta} or
    +#'         \code{eta} for the prior placed on topic distributions over terms;
    +#'         \code{logLikelihood}, log likelihood of the entire corpus;
    +#'         \code{logPerplexity}, log perplexity;
    +#'         \code{isDistributed}, TRUE for distribuetd model while FALSE for local model;
    +#'         \code{vocabSize}, number of terms in the corpus;
    +#'         \code{topics}, top 10 terms and their weights of all topics;
    +#'         \code{vocabulary}, whole terms of the training corpus, NULL if libsvm format file used as
    +#'         training set.
    +#' @rdname spark.lda
    +#' @aliases summary,spark.lda,LDAModel-method
    +#' @export
    +#' @note summary(LDAModel) since 2.1.0
    +setMethod("summary", signature(object = "LDAModel"),
    +          function(object, ...) {
    +            jobj <- object@jobj
    +            docConcentration <- callJMethod(jobj, "docConcentration")
    +            topicConcentration <- callJMethod(jobj, "topicConcentration")
    +            logLikelihood <- callJMethod(jobj, "logLikelihood")
    +            logPerplexity <- callJMethod(jobj, "logPerplexity")
    +            isDistributed <- callJMethod(jobj, "isDistributed")
    +            vocabSize <- callJMethod(jobj, "vocabSize")
    +            topics <- dataFrame(callJMethod(jobj, "topics"))
    +            vocabulary <- callJMethod(jobj, "vocabulary")
    +            return(list(docConcentration = unlist(docConcentration),
    +                        topicConcentration = topicConcentration,
    +                        logLikelihood = logLikelihood, logPerplexity = logPerplexity,
    +                        isDistributed = isDistributed, vocabSize = vocabSize,
    +                        topics = topics,
    +                        vocabulary = unlist(vocabulary)))
    +          })
    +
    +# Returns the log perplexity of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @return \code{spark.perplexity} returns the log perplexity of given SparkDataFrame, or the log
    +#'         perplexity of the training data if missing argument "data".
    +#' @rdname spark.lda
    +#" @aliases spark.perplexity,spark.lda,LDAModel-method
    --- End diff --
    
    I think this should use a single quote (`#'`, which roxygen requires) and the correct method name:
    `#' @aliases spark.perplexity,LDAModel-method`




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62400/
    Test FAILed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63705 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63705/consoleFull)** for PR 14229 at commit [`a254220`](https://github.com/apache/spark/commit/a254220e417fa715c98336246c2347a24b88828a).




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63883/
    Test FAILed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    @felixcheung I added some aliases for the spark.lda related functions. However, I don't quite understand how they work. From [here](https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html) I can see that
    *When you use ?x, help("x") or example("x") R looks for an Rd file containing \alias{x}. It then parses the file, converts it into html and displays it.*
    But when I use `?GroupedData-method`, sparkr-shell cannot find any related topics.
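    For what it's worth, roxygen writes the full signature string as the alias for an S4 method, so the lookup needs the whole alias rather than a bare `-method` suffix, e.g. (assuming such an alias is present in the generated Rd file):
    
        help("spark.posterior,LDAModel,SparkDataFrame-method")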




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63890/consoleFull)** for PR 14229 at commit [`84cc5e7`](https://github.com/apache/spark/commit/84cc5e73523dc9a306c45ca8bc994dc05984424d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63705/
    Test PASSed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63883 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63883/consoleFull)** for PR 14229 at commit [`6781785`](https://github.com/apache/spark/commit/67817853b44b73906370fcf91d7876095ee0067a).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by junyangq <gi...@git.apache.org>.
Github user junyangq commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r75026541
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +306,94 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.posterior,LDAModel,SparkDataFrame-method
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}.
    +#' @param maxTermsPerTopic Maximum number of terms to collect for each topic. Default value of 10.
    +#' @return \code{summary} returns a list containing
    +#'         \item{\code{docConcentration}}{concentration parameter commonly named \code{alpha} for
    +#'               the prior placed on documents distributions over topics \code{theta}}
    +#'         \item{\code{topicConcentration}}{concentration parameter commonly named \code{beta} or
    +#'               \code{eta} for the prior placed on topic distributions over terms}
    +#'         \item{\code{logLikelihood}}{log likelihood of the entire corpus}
    +#'         \item{\code{logPerplexity}}{log perplexity}
    +#'         \item{\code{isDistributed}}{TRUE for distributed model while FALSE for local model}
    +#'         \item{\code{vocabSize}}{number of terms in the corpus}
    +#'         \item{\code{topics}}{top 10 terms and their weights of all topics}
    +#'         \item{\code{vocabulary}}{whole terms of the training corpus, NULL if libsvm format file
    +#'               used as training set}
    +#' @rdname spark.lda
    +#' @aliases summary,LDAModel-method
    +#' @export
    +#' @note summary(LDAModel) since 2.1.0
    +setMethod("summary", signature(object = "LDAModel"),
    +          function(object, maxTermsPerTopic, ...) {
    --- End diff --
    
    Is `...` useful here?




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74676513
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    +#'        format column are accepted.
    +#' @param k Number of topics, default 10
    +#' @param maxIter Maximum iterations, default 20
    +#' @param optimizer Optimizer to train an LDA model, "online" or "em", default "online"
    +#' @param subsamplingRate (For online optimizer) Fraction of the corpus to be sampled and used in
    +#         each iteration of mini-batch gradient descent, in range (0, 1], default 0.05
    +#' @param topicConcentration concentration parameter (commonly named \code{beta} or \code{eta}) for
    +#'        the prior placed on topic distributions over terms, default -1 to set automatically on the
    +#'        Spark side. Use \code{summary} to retrieve the effective topicConcentration.
    +#' @param docConcentration concentration parameter (commonly named \code{alpha}) for the
    +#'        prior placed on documents distributions over topics (\code{theta}), default -1 to set
    +#'        automatically on the Spark side. Use \code{summary} to retrieve the effective
    +#'        docConcentration.
    +#' @param customizedStopWords stopwords that need to be removed from the given corpus. Only effected
    +#'        given training data with string format column.
    +#' @param maxVocabSize maximum vocabulary size, default 1 << 18
    +#' @return \code{spark.lda} returns a fitted Latent Dirichlet Allocation model
    +#' @rdname spark.lda
    +#' @aliases spark.lda,SparkDataFrame
    +#' @seealso topicmodels: \url{https://cran.r-project.org/web/packages/topicmodels/}
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' text <- read.df("path/to/data", source = "libsvm")
    +#' model <- spark.lda(data = text, optimizer = "em")
    +#'
    +#' # get a summary of the model
    +#' summary(model)
    +#'
    +#' # compute posterior probabilities
    +#' posterior <- spark.posterior(model, df)
    +#' showDF(posterior)
    +#'
    +#' # compute perplexity
    +#' perplexity <- spark.perplexity(model, df)
    +#'
    +#' # save and load the model
    +#' path <- "path/to/model"
    +#' write.ml(model, path)
    +#' savedModel <- read.ml(path)
    +#' summary(savedModel)
    +#' }
    +#' @note spark.lda since 2.1.0
    +setMethod("spark.lda", signature(data = "SparkDataFrame"),
    +          function(data, features = "features", k = 10, maxIter = 20, optimizer = c("online", "em"),
    +                   subsamplingRate = 0.05, topicConcentration = -1, docConcentration = -1,
    +                   customizedStopWords = "", maxVocabSize = bitwShiftL(1, 18)) {
    +            optimizer <- match.arg(optimizer)
    +            jobj <- callJStatic("org.apache.spark.ml.r.LDAWrapper", "fit", data@sdf, features,
    +                                as.integer(k), as.integer(maxIter), optimizer, subsamplingRate,
    --- End diff --
    
    change to `as.numeric(subsamplingRate)`




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74675732
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    +#'        format column are accepted.
    +#' @param k Number of topics, default 10
    +#' @param maxIter Maximum iterations, default 20
    +#' @param optimizer Optimizer to train an LDA model, "online" or "em", default "online"
    +#' @param subsamplingRate (For online optimizer) Fraction of the corpus to be sampled and used in
    +#         each iteration of mini-batch gradient descent, in range (0, 1], default 0.05
    +#' @param topicConcentration concentration parameter (commonly named \code{beta} or \code{eta}) for
    +#'        the prior placed on topic distributions over terms, default -1 to set automatically on the
    +#'        Spark side. Use \code{summary} to retrieve the effective topicConcentration.
    +#' @param docConcentration concentration parameter (commonly named \code{alpha}) for the
    +#'        prior placed on documents distributions over topics (\code{theta}), default -1 to set
    +#'        automatically on the Spark side. Use \code{summary} to retrieve the effective
    +#'        docConcentration.
    +#' @param customizedStopWords stopwords that need to be removed from the given corpus. Only effected
    +#'        given training data with string format column.
    +#' @param maxVocabSize maximum vocabulary size, default 1 << 18
    +#' @return \code{spark.lda} returns a fitted Latent Dirichlet Allocation model
    +#' @rdname spark.lda
    +#' @aliases spark.lda,SparkDataFrame
    --- End diff --
    
    add `-method`
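    i.e., a sketch of the fixed tag:
    
        #' @aliases spark.lda,SparkDataFrame-method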




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63863/consoleFull)** for PR 14229 at commit [`3e8678e`](https://github.com/apache/spark/commit/3e8678e27d556a15fd9df075bb7de071d53c01ea).




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62401/
    Test FAILed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63884 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63884/consoleFull)** for PR 14229 at commit [`8280b41`](https://github.com/apache/spark/commit/8280b414443eece95130abb053be16c0c57aa9f0).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r75041418
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    --- End diff --
    
    SparkR doesn't expose the type explicitly. However, you can load a libSVM file with `text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")`.
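
    (For illustration, a minimal end-to-end sketch of that workflow, assuming a
    running SparkR session and the sample file shipped with the Spark source
    tree:)

        # load a libSVM file; the resulting SparkDataFrame has a Vector-typed
        # "features" column that spark.lda accepts directly
        text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
        model <- spark.lda(data = text, k = 10, maxIter = 20)
        summary(model)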


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74676433
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    +#'        format column are accepted.
    +#' @param k Number of topics, default 10
    +#' @param maxIter Maximum iterations, default 20
    +#' @param optimizer Optimizer to train an LDA model, "online" or "em", default "online"
    +#' @param subsamplingRate (For online optimizer) Fraction of the corpus to be sampled and used in
    +#'        each iteration of mini-batch gradient descent, in range (0, 1], default 0.05
    +#' @param topicConcentration concentration parameter (commonly named \code{beta} or \code{eta}) for
    +#'        the prior placed on topic distributions over terms, default -1 to set automatically on the
    +#'        Spark side. Use \code{summary} to retrieve the effective topicConcentration.
    +#' @param docConcentration concentration parameter (commonly named \code{alpha}) for the
    +#'        prior placed on documents distributions over topics (\code{theta}), default -1 to set
    +#'        automatically on the Spark side. Use \code{summary} to retrieve the effective
    +#'        docConcentration.
    +#' @param customizedStopWords stopwords that need to be removed from the given corpus. Only effected
    +#'        given training data with string format column.
    --- End diff --
    
    "Only affects training data with string format column."?


[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74866841
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    --- End diff --
    
    Can you add a @seealso for write.ml (around L63) and for predict, like the other ML wrappers?
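
    (Roughly, the suggested tags could look like the sketch below, mirroring
    the other ML wrappers; exact wording is up to the author:)

        #' @seealso \link{write.ml}, \link{predict}
        #' @rdname spark.lda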


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74990815
  
    --- Diff: R/pkg/R/generics.R ---
    @@ -1279,6 +1279,19 @@ setGeneric("spark.naiveBayes", function(data, formula, ...) { standardGeneric("s
     #' @export
     setGeneric("spark.survreg", function(data, formula, ...) { standardGeneric("spark.survreg") })
     
    +#' @rdname spark.lda
    +#' @export
    +setGeneric("spark.lda", function(data, ...) { standardGeneric("spark.lda") })
    --- End diff --
    
    I'll remove the `...`.


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74676531
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    +#'        format column are accepted.
    +#' @param k Number of topics, default 10
    +#' @param maxIter Maximum iterations, default 20
    +#' @param optimizer Optimizer to train an LDA model, "online" or "em", default "online"
    +#' @param subsamplingRate (For online optimizer) Fraction of the corpus to be sampled and used in
    +#'        each iteration of mini-batch gradient descent, in range (0, 1], default 0.05
    +#' @param topicConcentration concentration parameter (commonly named \code{beta} or \code{eta}) for
    +#'        the prior placed on topic distributions over terms, default -1 to set automatically on the
    +#'        Spark side. Use \code{summary} to retrieve the effective topicConcentration.
    +#' @param docConcentration concentration parameter (commonly named \code{alpha}) for the
    +#'        prior placed on documents distributions over topics (\code{theta}), default -1 to set
    +#'        automatically on the Spark side. Use \code{summary} to retrieve the effective
    +#'        docConcentration.
    +#' @param customizedStopWords stopwords that need to be removed from the given corpus. Only effected
    +#'        given training data with string format column.
    +#' @param maxVocabSize maximum vocabulary size, default 1 << 18
    +#' @return \code{spark.lda} returns a fitted Latent Dirichlet Allocation model
    +#' @rdname spark.lda
    +#' @aliases spark.lda,SparkDataFrame
    +#' @seealso topicmodels: \url{https://cran.r-project.org/web/packages/topicmodels/}
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' text <- read.df("path/to/data", source = "libsvm")
    +#' model <- spark.lda(data = text, optimizer = "em")
    +#'
    +#' # get a summary of the model
    +#' summary(model)
    +#'
    +#' # compute posterior probabilities
    +#' posterior <- spark.posterior(model, df)
    +#' showDF(posterior)
    +#'
    +#' # compute perplexity
    +#' perplexity <- spark.perplexity(model, df)
    +#'
    +#' # save and load the model
    +#' path <- "path/to/model"
    +#' write.ml(model, path)
    +#' savedModel <- read.ml(path)
    +#' summary(savedModel)
    +#' }
    +#' @note spark.lda since 2.1.0
    +setMethod("spark.lda", signature(data = "SparkDataFrame"),
    +          function(data, features = "features", k = 10, maxIter = 20, optimizer = c("online", "em"),
    +                   subsamplingRate = 0.05, topicConcentration = -1, docConcentration = -1,
    +                   customizedStopWords = "", maxVocabSize = bitwShiftL(1, 18)) {
    +            optimizer <- match.arg(optimizer)
    +            jobj <- callJStatic("org.apache.spark.ml.r.LDAWrapper", "fit", data@sdf, features,
    +                                as.integer(k), as.integer(maxIter), optimizer, subsamplingRate,
    +                                topicConcentration, as.array(docConcentration),
    --- End diff --
    
    should there be `as.array` for `topicConcentration`?


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r75121550
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    --- End diff --
    
    Hmm, I think that might not be super straightforward.
    How about a follow-up PR documenting how to use libSVM from R in the Programming Guide, and linking this doc to that?


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74869681
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    --- End diff --
    
    could you link to libSVM's ml.Vector-format?


[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63869/consoleFull)** for PR 14229 at commit [`6bd15cd`](https://github.com/apache/spark/commit/6bd15cda89bb330c538accf48bc8c8e24f889d6a).


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74675990
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    --- End diff --
    
    add test for this?
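
    (A sketch of what such a test could look like, in the testthat style that
    SparkR's suite uses; the data path and `k` are placeholders:)

        test_that("spark.posterior returns topicDistribution vectors", {
          df <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
          model <- spark.lda(df, k = 3)
          posterior <- spark.posterior(model, df)
          expect_true("topicDistribution" %in% colnames(posterior))
        })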


[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63661 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63661/consoleFull)** for PR 14229 at commit [`ca5ea9e`](https://github.com/apache/spark/commit/ca5ea9e10b53d2d1dc6ff6b350301eb79a944eb8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74993602
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
    @@ -0,0 +1,207 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.r
    +
    +import scala.collection.mutable
    +
    +import org.apache.hadoop.fs.Path
    +import org.json4s._
    +import org.json4s.JsonDSL._
    +import org.json4s.jackson.JsonMethods._
    +
    +import org.apache.spark.SparkException
    +import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
    +import org.apache.spark.ml.clustering.{LDA, LDAModel}
    +import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover}
    +import org.apache.spark.ml.linalg.VectorUDT
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.StringType
    +
    +
    +private[r] class LDAWrapper private (
    +    val pipeline: PipelineModel,
    +    val logLikelihood: Double,
    +    val logPerplexity: Double,
    +    val vocabulary: Array[String]) extends MLWritable {
    +
    +  import LDAWrapper._
    +
    +  private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
    +  private val preprocessor: PipelineModel =
    +    new PipelineModel(s"${Identifiable.randomUID(pipeline.uid)}", pipeline.stages.dropRight(1))
    +
    +  def transform(data: Dataset[_]): DataFrame = {
    +    pipeline.transform(data).drop(TOKENIZER_COL, STOPWORDS_REMOVER_COL, COUNT_VECTOR_COL)
    +  }
    +
    +  def computeLogPerplexity(data: Dataset[_]): Double = {
    +    lda.logPerplexity(preprocessor.transform(data))
    +  }
    +
    +  lazy val topicIndices: DataFrame = lda.describeTopics(10)
    --- End diff --
    
    I'll add it


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74675553
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -724,7 +728,8 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
     #' @param maxVocabSize maximum vocabulary size, default 1 << 18
     #' @return \code{spark.lda} returns a fitted Latent Dirichlet Allocation model
     #' @rdname spark.lda
    -#' @seealso survival: \url{https://cran.r-project.org/web/packages/topicmodels/}
    +#' @aliases spark.lda,SparkDataFrame
    --- End diff --
    
    should have `-method`


[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #62442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62442/consoleFull)** for PR 14229 at commit [`90dad9d`](https://github.com/apache/spark/commit/90dad9d199e7540d455f99db4d028686c36f9f99).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74869532
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    +#'        format column are accepted.
    +#' @param k Number of topics, default 10
    +#' @param maxIter Maximum iterations, default 20
    +#' @param optimizer Optimizer to train an LDA model, "online" or "em", default "online"
    +#' @param subsamplingRate (For online optimizer) Fraction of the corpus to be sampled and used in
    +#'        each iteration of mini-batch gradient descent, in range (0, 1], default 0.05
    +#' @param topicConcentration concentration parameter (commonly named \code{beta} or \code{eta}) for
    +#'        the prior placed on topic distributions over terms, default -1 to set automatically on the
    +#'        Spark side. Use \code{summary} to retrieve the effective topicConcentration.
    +#' @param docConcentration concentration parameter (commonly named \code{alpha}) for the
    +#'        prior placed on documents distributions over topics (\code{theta}), default -1 to set
    +#'        automatically on the Spark side. Use \code{summary} to retrieve the effective
    +#'        docConcentration.
    +#' @param customizedStopWords stopwords that need to be removed from the given corpus. Only effected
    +#'        given training data with string format column.
    +#' @param maxVocabSize maximum vocabulary size, default 1 << 18
    +#' @return \code{spark.lda} returns a fitted Latent Dirichlet Allocation model
    +#' @rdname spark.lda
    +#' @aliases spark.lda,SparkDataFrame
    +#' @seealso topicmodels: \url{https://cran.r-project.org/web/packages/topicmodels/}
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' text <- read.df("path/to/data", source = "libsvm")
    +#' model <- spark.lda(data = text, optimizer = "em")
    +#'
    +#' # get a summary of the model
    +#' summary(model)
    +#'
    +#' # compute posterior probabilities
    +#' posterior <- spark.posterior(model, df)
    +#' showDF(posterior)
    +#'
    +#' # compute perplexity
    +#' perplexity <- spark.perplexity(model, df)
    +#'
    +#' # save and load the model
    +#' path <- "path/to/model"
    +#' write.ml(model, path)
    +#' savedModel <- read.ml(path)
    +#' summary(savedModel)
    +#' }
    +#' @note spark.lda since 2.1.0
    +setMethod("spark.lda", signature(data = "SparkDataFrame"),
    +          function(data, features = "features", k = 10, maxIter = 20, optimizer = c("online", "em"),
    +                   subsamplingRate = 0.05, topicConcentration = -1, docConcentration = -1,
    +                   customizedStopWords = "", maxVocabSize = bitwShiftL(1, 18)) {
    +            optimizer <- match.arg(optimizer)
    +            jobj <- callJStatic("org.apache.spark.ml.r.LDAWrapper", "fit", data@sdf, features,
    +                                as.integer(k), as.integer(maxIter), optimizer, subsamplingRate,
    +                                topicConcentration, as.array(docConcentration),
    --- End diff --
    
    could this be reflected in the @param description?


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74676368
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either Vector format column or String
    --- End diff --
    
    "Vector format column or String format column are accepted."
    not quite sure what we are saying here - are we trying to say one or multiple columns are ok? If so, perhaps say "character vector of length one or more".


[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63884/
    Test FAILed.


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74675724
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}
    +#' @return \code{summary} returns a list containing
    +#'         \code{docConcentration}, concentration parameter commonly named \code{alpha} for the
    +#'         prior placed on documents distributions over topics \code{theta};
    +#'         \code{topicConcentration}, concentration parameter commonly named \code{beta} or
    +#'         \code{eta} for the prior placed on topic distributions over terms;
    +#'         \code{logLikelihood}, log likelihood of the entire corpus;
    +#'         \code{logPerplexity}, log perplexity;
    +#'         \code{isDistributed}, TRUE for distributed model while FALSE for local model;
    +#'         \code{vocabSize}, number of terms in the corpus;
    +#'         \code{topics}, top 10 terms and their weights of all topics;
    +#'         \code{vocabulary}, whole terms of the training corpus, NULL if libsvm format file used as
    +#'         training set.
    +#' @rdname spark.lda
    +#' @aliases summary,spark.lda,LDAModel-method
    +#' @export
    +#' @note summary(LDAModel) since 2.1.0
    +setMethod("summary", signature(object = "LDAModel"),
    +          function(object, ...) {
    +            jobj <- object@jobj
    +            docConcentration <- callJMethod(jobj, "docConcentration")
    +            topicConcentration <- callJMethod(jobj, "topicConcentration")
    +            logLikelihood <- callJMethod(jobj, "logLikelihood")
    +            logPerplexity <- callJMethod(jobj, "logPerplexity")
    +            isDistributed <- callJMethod(jobj, "isDistributed")
    +            vocabSize <- callJMethod(jobj, "vocabSize")
    +            topics <- dataFrame(callJMethod(jobj, "topics"))
    +            vocabulary <- callJMethod(jobj, "vocabulary")
    +            return(list(docConcentration = unlist(docConcentration),
    +                        topicConcentration = topicConcentration,
    +                        logLikelihood = logLikelihood, logPerplexity = logPerplexity,
    +                        isDistributed = isDistributed, vocabSize = vocabSize,
    +                        topics = topics,
    +                        vocabulary = unlist(vocabulary)))
    +          })
    +
    +# Returns the log perplexity of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @return \code{spark.perplexity} returns the log perplexity of given SparkDataFrame, or the log
    +#'         perplexity of the training data if missing argument "data".
    +#' @rdname spark.lda
    +#' @aliases spark.perplexity,spark.lda,LDAModel-method
    +#' @export
    +#' @note spark.perplexity(LDAModel) since 2.1.0
    +setMethod("spark.perplexity", signature(object = "LDAModel"),
    +          function(object, newData) {
    +            return(ifelse(missing(newData), callJMethod(object@jobj, "logPerplexity"),
    +                   callJMethod(object@jobj, "computeLogPerplexity", newData@sdf)))
    +         })
    +
    +# Saves the Latent Dirichlet Allocation model to the input path.
    +
    +#' @param path The directory where the model is saved
    +#' @param overwrite Overwrites or not if the output path already exists. Default is FALSE
    +#'                  which means throw exception if the output path exists.
    +#'
    +#' @rdname spark.lda
    +#' @aliases write.ml,LDAModel-method,character-method
    --- End diff --
    
    only one `-method`


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74675925
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
    @@ -0,0 +1,207 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.r
    +
    +import scala.collection.mutable
    +
    +import org.apache.hadoop.fs.Path
    +import org.json4s._
    +import org.json4s.JsonDSL._
    +import org.json4s.jackson.JsonMethods._
    +
    +import org.apache.spark.SparkException
    +import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
    +import org.apache.spark.ml.clustering.{LDA, LDAModel}
    +import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover}
    +import org.apache.spark.ml.linalg.VectorUDT
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.StringType
    +
    +
    +private[r] class LDAWrapper private (
    +    val pipeline: PipelineModel,
    +    val logLikelihood: Double,
    +    val logPerplexity: Double,
    +    val vocabulary: Array[String]) extends MLWritable {
    +
    +  import LDAWrapper._
    +
    +  private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
    +  private val preprocessor: PipelineModel =
    +    new PipelineModel(s"${Identifiable.randomUID(pipeline.uid)}", pipeline.stages.dropRight(1))
    +
    +  def transform(data: Dataset[_]): DataFrame = {
    +    pipeline.transform(data).drop(TOKENIZER_COL, STOPWORDS_REMOVER_COL, COUNT_VECTOR_COL)
    +  }
    +
    +  def computeLogPerplexity(data: Dataset[_]): Double = {
    +    lda.logPerplexity(preprocessor.transform(data))
    +  }
    +
    +  lazy val topicIndices: DataFrame = lda.describeTopics(10)
    --- End diff --
    
    shouldn't `maxTermsPerTopic` be configurable?
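
    (A later revision in this thread exposes this through a maxTermsPerTopic
    argument on summary, defaulting to 10; usage would then look roughly like
    `summary(model, maxTermsPerTopic = 20)`.)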


[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63869/consoleFull)** for PR 14229 at commit [`6bd15cd`](https://github.com/apache/spark/commit/6bd15cda89bb330c538accf48bc8c8e24f889d6a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by wangmiao1981 <gi...@git.apache.org>.
Github user wangmiao1981 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74542500
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/r/LDAWrapper.scala ---
    @@ -0,0 +1,210 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.r
    +
    +import scala.collection.mutable
    +
    +import org.apache.hadoop.fs.Path
    +import org.json4s._
    +import org.json4s.JsonDSL._
    +import org.json4s.jackson.JsonMethods._
    +
    +import org.apache.spark.SparkException
    +import org.apache.spark.internal.Logging
    +import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
    +import org.apache.spark.ml.clustering.{LDA, LDAModel}
    +import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel, RegexTokenizer, StopWordsRemover}
    +import org.apache.spark.ml.linalg.VectorUDT
    +import org.apache.spark.ml.util._
    +import org.apache.spark.sql.{DataFrame, Dataset}
    +import org.apache.spark.sql.functions._
    +import org.apache.spark.sql.types.StringType
    +
    +
    +private[r] class LDAWrapper private (
    +    val pipeline: PipelineModel,
    +    val logLikelihood: Double,
    +    val logPerplexity: Double,
    +    val vocabulary: Array[String]) extends MLWritable {
    +
    +  import LDAWrapper._
    +
    +  private val lda: LDAModel = pipeline.stages.last.asInstanceOf[LDAModel]
    +  private val preprocessor: PipelineModel =
    +    new PipelineModel(s"${Identifiable.randomUID(pipeline.uid)}", pipeline.stages.dropRight(1))
    +
    +  def transform(data: Dataset[_]): DataFrame = {
    +    pipeline.transform(data).drop(TOKENIZER_COL, STOPWORDS_REMOVER_COL, COUNT_VECTOR_COL)
    +  }
    +
    +  def computeLogPerplexity(data: Dataset[_]): Double = {
    +    lda.logPerplexity(preprocessor.transform(data))
    +  }
    +
    +  lazy val topicIndices: DataFrame = lda.describeTopics(10)
    +
    +  lazy val topics = if (vocabulary.isEmpty || vocabulary.length < vocabSize) {
    +    topicIndices
    +  } else {
    +    val index2term = udf { indices: mutable.WrappedArray[Int] => indices.map(i => vocabulary(i)) }
    +    topicIndices.select(col("topic"), index2term(col("termIndices")).as("term"), col("termWeights"))
    +  }
    +
    +  lazy val isDistributed: Boolean = lda.isDistributed
    +  lazy val vocabSize: Int = lda.vocabSize
    +  lazy val docConcentration: Array[Double] = lda.getEffectiveDocConcentration
    +  lazy val topicConcentration: Double = lda.getEffectiveTopicConcentration
    +
    +  override def write: MLWriter = new LDAWrapper.LDAWrapperWriter(this)
    +}
    +
    +private[r] object LDAWrapper extends MLReadable[LDAWrapper] with Logging {
    +
    +  val TOKENIZER_COL = s"${Identifiable.randomUID("rawTokens")}"
    +  val STOPWORDS_REMOVER_COL = s"${Identifiable.randomUID("tokens")}"
    +  val COUNT_VECTOR_COL = s"${Identifiable.randomUID("features")}"
    +
    +  private def getPreStages(
    +      features: String,
    +      customizedStopWords: Array[String],
    +      maxVocabSize: Int): Array[PipelineStage] = {
    +    val tokenizer = new RegexTokenizer()
    +      .setInputCol(features)
    +      .setOutputCol(TOKENIZER_COL)
    +    val stopWordsRemover = new StopWordsRemover()
    +      .setInputCol(TOKENIZER_COL)
    +      .setOutputCol(STOPWORDS_REMOVER_COL)
    +    stopWordsRemover.setStopWords(stopWordsRemover.getStopWords ++ customizedStopWords)
    +    val countVectorizer = new CountVectorizer()
    +      .setVocabSize(maxVocabSize)
    +      .setInputCol(STOPWORDS_REMOVER_COL)
    +      .setOutputCol(COUNT_VECTOR_COL)
    +
    +    Array(tokenizer, stopWordsRemover, countVectorizer)
    +  }
    +
    +  def fit(
    +      data: DataFrame,
    +      features: String,
    +      k: Int,
    +      maxIter: Int,
    +      optimizer: String,
    +      subsamplingRate: Double,
    +      topicConcentration: Double,
    +      docConcentration: Array[Double],
    +      customizedStopWords: Array[String],
    +      maxVocabSize: Int): LDAWrapper = {
    +
    +    val lda = new LDA()
    +      .setK(k)
    +      .setMaxIter(maxIter)
    +      .setSubsamplingRate(subsamplingRate)
    +
    +    val featureSchema = data.schema(features)
    +    val stages = featureSchema.dataType match {
    +      case d: StringType =>
    +        logDebug(s"Feature ($features) schema is StringType, use the built-in preprocessor.")
    +        getPreStages(features, customizedStopWords, maxVocabSize) ++
    +          Array(lda.setFeaturesCol(COUNT_VECTOR_COL))
    +      case d: VectorUDT =>
    +        logDebug(s"Feature ($features) schema is VectorUDT, use the LDA directly.")
    --- End diff --
    
    Same here.


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74868660
  
    --- Diff: R/pkg/R/generics.R ---
    @@ -1279,6 +1279,19 @@ setGeneric("spark.naiveBayes", function(data, formula, ...) { standardGeneric("s
     #' @export
     setGeneric("spark.survreg", function(data, formula, ...) { standardGeneric("spark.survreg") })
     
    +#' @rdname spark.lda
    +#' @export
    +setGeneric("spark.lda", function(data, ...) { standardGeneric("spark.lda") })
    --- End diff --
    
    Consider whether we need to have `...` in the signature. If it's present, we need to add `@param ... description`, otherwise the CRAN check will flag it.
    I think if it is not needed we should remove it.
    
    Same for all the other generics being added.
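
    (Concretely: if the dots stay in the generic's signature, the CRAN check
    expects a matching doc entry, roughly:)

        #' @param ... additional argument(s) passed to the method.
        #' @rdname spark.lda
        #' @export
        setGeneric("spark.lda", function(data, ...) { standardGeneric("spark.lda") })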


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r75120758
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +306,94 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.posterior,LDAModel,SparkDataFrame-method
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}.
    +#' @param maxTermsPerTopic Maximum number of terms to collect for each topic. Default value of 10.
    +#' @return \code{summary} returns a list containing
    +#'         \item{\code{docConcentration}}{concentration parameter commonly named \code{alpha} for
    +#'               the prior placed on documents distributions over topics \code{theta}}
    +#'         \item{\code{topicConcentration}}{concentration parameter commonly named \code{beta} or
    +#'               \code{eta} for the prior placed on topic distributions over terms}
    +#'         \item{\code{logLikelihood}}{log likelihood of the entire corpus}
    +#'         \item{\code{logPerplexity}}{log perplexity}
    +#'         \item{\code{isDistributed}}{TRUE for distributed model while FALSE for local model}
    +#'         \item{\code{vocabSize}}{number of terms in the corpus}
    +#'         \item{\code{topics}}{top 10 terms and their weights of all topics}
    +#'         \item{\code{vocabulary}}{whole terms of the training corpus, NULL if libsvm format file
    +#'               used as training set}
    +#' @rdname spark.lda
    +#' @aliases summary,LDAModel-method
    +#' @export
    +#' @note summary(LDAModel) since 2.1.0
    +setMethod("summary", signature(object = "LDAModel"),
    +          function(object, maxTermsPerTopic, ...) {
    --- End diff --
    
    @junyangq has a point: if it's not used it could be omitted - we would leave the `...` in generics.R (for various reasons) but here we could remove it.
    It's better that way because otherwise we would need to add `@param ...` to "document" it.
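
    (That is, keep `...` in the generic but drop it from the method, roughly:)

        # no `...` in the method signature, so no `@param ...` entry is needed
        setMethod("summary", signature(object = "LDAModel"),
                  function(object, maxTermsPerTopic = 10) {
                    # ... method body as before ...
                  })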


[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63956/
    Test PASSed.


[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    @junyangq Could you help review this PR?


[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r72880737
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -291,6 +299,88 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probabilities
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}
    +#' @return \code{summary} returns a list containing
    +#'         \code{docConcentration}, concentration parameter commonly named \code{alpha} for the
    +#'         prior placed on documents distributions over topics \code{theta};
    +#'         \code{topicConcentration}, concentration parameter commonly named \code{beta} or
    +#'         \code{eta} for the prior placed on topic distributions over terms;
    +#'         \code{logLikelihood}, log likelihood of the entire corpus;
    +#'         \code{logPerplexity}, log perplexity;
    +#'         \code{isDistributed}, TRUE for distributed model while FALSE for local model;
    +#'         \code{vocabSize}, number of terms in the corpus;
    +#'         \code{topics}, top 10 terms and their weights of all topics;
    +#'         \code{vocabulary}, whole terms of the training corpus, NULL if a libsvm format file was
    +#'         used as the training set.
    +#' @rdname spark.lda
    +#' @export
    --- End diff --
    
    Please add @aliases, @examples
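    
    For instance, something like this above the method (a sketch; the example assumes a fitted `model` as in the spark.lda examples):
    
        #' @aliases summary,LDAModel-method
        #' @examples
        #' \dontrun{
        #' summary(model)
        #' }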




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Merged build finished. Test FAILed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    @yinxusen could you rebase to master please?




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74992693
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    --- End diff --
    
    I'll add `\link{spark.lda}` in `write.ml`. 
    
    As for `predict`, LDA uses the name `spark.posterior`. So I'll leave `predict` unchanged.
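    
    For reference, the cross-link would sit in the `write.ml` roxygen block, roughly like this (a sketch following the pattern of the other wrappers in mllib.R, not the exact PR code):
    
        #' @param path The directory where the model is saved
        #' @rdname spark.lda
        #' @seealso \link{spark.lda}, \link{read.ml}
        #' @export
        #' @note write.ml(LDAModel, character) since 2.1.0
        setMethod("write.ml", signature(object = "LDAModel", path = "character"),
                  function(object, path, overwrite = FALSE) {
                    writer <- callJMethod(object@jobj, "write")
                    if (overwrite) {
                      writer <- callJMethod(writer, "overwrite")
                    }
                    invisible(callJMethod(writer, "save", path))
                  })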




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74676001
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +307,92 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probability
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.lda,spark.posterior,LDAModel-method,SparkDataFrame
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}
    +#' @return \code{summary} returns a list containing
    +#'         \code{docConcentration}, concentration parameter commonly named \code{alpha} for the
    +#'         prior placed on documents distributions over topics \code{theta};
    +#'         \code{topicConcentration}, concentration parameter commonly named \code{beta} or
    +#'         \code{eta} for the prior placed on topic distributions over terms;
    +#'         \code{logLikelihood}, log likelihood of the entire corpus;
    +#'         \code{logPerplexity}, log perplexity;
    +#'         \code{isDistributed}, TRUE for distributed model while FALSE for local model;
    +#'         \code{vocabSize}, number of terms in the corpus;
    +#'         \code{topics}, top 10 terms and their weights of all topics;
    +#'         \code{vocabulary}, whole terms of the training corpus, NULL if a libsvm format file was
    +#'         used as the training set.
    +#' @rdname spark.lda
    +#' @aliases summary,spark.lda,LDAModel-method
    +#' @export
    +#' @note summary(LDAModel) since 2.1.0
    +setMethod("summary", signature(object = "LDAModel"),
    +          function(object, ...) {
    +            jobj <- object@jobj
    +            docConcentration <- callJMethod(jobj, "docConcentration")
    +            topicConcentration <- callJMethod(jobj, "topicConcentration")
    +            logLikelihood <- callJMethod(jobj, "logLikelihood")
    +            logPerplexity <- callJMethod(jobj, "logPerplexity")
    +            isDistributed <- callJMethod(jobj, "isDistributed")
    +            vocabSize <- callJMethod(jobj, "vocabSize")
    +            topics <- dataFrame(callJMethod(jobj, "topics"))
    +            vocabulary <- callJMethod(jobj, "vocabulary")
    +            return(list(docConcentration = unlist(docConcentration),
    +                        topicConcentration = topicConcentration,
    +                        logLikelihood = logLikelihood, logPerplexity = logPerplexity,
    +                        isDistributed = isDistributed, vocabSize = vocabSize,
    +                        topics = topics,
    +                        vocabulary = unlist(vocabulary)))
    +          })
    +
    +# Returns the log perplexity of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @return \code{spark.perplexity} returns the log perplexity of the given SparkDataFrame, or the
    +#'         log perplexity of the training data if the argument "data" is missing.
    +#' @rdname spark.lda
    +#" @aliases spark.perplexity,spark.lda,LDAModel-method
    +#' @export
    +#' @note spark.perplexity(LDAModel) since 2.1.0
    +setMethod("spark.perplexity", signature(object = "LDAModel"),
    --- End diff --
    
    add test for this?
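    
    Something along these lines would cover it (a hypothetical sketch in the SparkR testthat style; the data path is a placeholder):
    
        test_that("spark.perplexity", {
          text <- read.df("data/mllib/sample_lda_libsvm_data.txt", source = "libsvm")
          model <- spark.lda(text, optimizer = "em")
          stats <- summary(model)
          # perplexity computed on the training data should agree with the
          # logPerplexity reported by summary()
          expect_equal(spark.perplexity(model, text), stats$logPerplexity)
        })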




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74675520
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -377,6 +380,7 @@ setMethod("spark.perplexity", signature(object = "LDAModel"),
     #'                  which means throw exception if the output path exists.
     #'
     #' @rdname spark.lda
    +#' @aliases write.ml,LDAModel-method,character-method
    --- End diff --
    
    only one `-method`
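    i.e. the whole signature takes a single suffix:
    
        #' @aliases write.ml,LDAModel,character-method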




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    @yinxusen shouldn't you search for `show` for something tagged as `@aliases show,GroupedData-method`?




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74852343
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -605,6 +701,69 @@ setMethod("spark.survreg", signature(data = "SparkDataFrame", formula = "formula
                 return(new("AFTSurvivalRegressionModel", jobj = jobj))
               })
     
    +#' Latent Dirichlet Allocation
    +#'
    +#' \code{spark.lda} fits a Latent Dirichlet Allocation model on a SparkDataFrame. Users can call
    +#' \code{summary} to get a summary of the fitted LDA model, \code{spark.posterior} to compute
    +#' posterior probabilities on new data, \code{spark.perplexity} to compute log perplexity on new
    +#' data and \code{write.ml}/\code{read.ml} to save/load fitted models.
    +#'
    +#' @param data A SparkDataFrame for training
    +#' @param features Features column name, default "features". Either a Vector format column or a
    +#'        String format column is accepted.
    +#' @param k Number of topics, default 10
    +#' @param maxIter Maximum iterations, default 20
    +#' @param optimizer Optimizer to train an LDA model, "online" or "em", default "online"
    +#' @param subsamplingRate (For online optimizer) Fraction of the corpus to be sampled and used in
    +#'        each iteration of mini-batch gradient descent, in range (0, 1], default 0.05
    +#' @param topicConcentration concentration parameter (commonly named \code{beta} or \code{eta}) for
    +#'        the prior placed on topic distributions over terms, default -1 to set automatically on the
    +#'        Spark side. Use \code{summary} to retrieve the effective topicConcentration.
    +#' @param docConcentration concentration parameter (commonly named \code{alpha}) for the
    +#'        prior placed on documents distributions over topics (\code{theta}), default -1 to set
    +#'        automatically on the Spark side. Use \code{summary} to retrieve the effective
    +#'        docConcentration.
    +#' @param customizedStopWords stopwords that need to be removed from the given corpus. Only
    +#'        effective when the training data has a string format column.
    +#' @param maxVocabSize maximum vocabulary size, default 1 << 18
    +#' @return \code{spark.lda} returns a fitted Latent Dirichlet Allocation model
    +#' @rdname spark.lda
    +#' @aliases spark.lda,SparkDataFrame
    +#' @seealso topicmodels: \url{https://cran.r-project.org/web/packages/topicmodels/}
    +#' @export
    +#' @examples
    +#' \dontrun{
    +#' text <- read.df("path/to/data", source = "libsvm")
    +#' model <- spark.lda(data = text, optimizer = "em")
    +#'
    +#' # get a summary of the model
    +#' summary(model)
    +#'
    +#' # compute posterior probabilities
    +#' posterior <- spark.posterior(model, text)
    +#' showDF(posterior)
    +#'
    +#' # compute perplexity
    +#' perplexity <- spark.perplexity(model, text)
    +#'
    +#' # save and load the model
    +#' path <- "path/to/model"
    +#' write.ml(model, path)
    +#' savedModel <- read.ml(path)
    +#' summary(savedModel)
    +#' }
    +#' @note spark.lda since 2.1.0
    +setMethod("spark.lda", signature(data = "SparkDataFrame"),
    +          function(data, features = "features", k = 10, maxIter = 20, optimizer = c("online", "em"),
    +                   subsamplingRate = 0.05, topicConcentration = -1, docConcentration = -1,
    +                   customizedStopWords = "", maxVocabSize = bitwShiftL(1, 18)) {
    +            optimizer <- match.arg(optimizer)
    +            jobj <- callJStatic("org.apache.spark.ml.r.LDAWrapper", "fit", data@sdf, features,
    +                                as.integer(k), as.integer(maxIter), optimizer, subsamplingRate,
    +                                topicConcentration, as.array(docConcentration),
    --- End diff --
    
    Unlike `docConcentration`, `topicConcentration` is a single Double as defined on the Scala side.
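    
    So only the vector-valued parameter is wrapped before it crosses the JVM boundary (a tiny sketch with placeholder values):
    
        docConcentration <- c(2.1, 2.1, 2.1)    # one alpha per topic -> Array[Double] in Scala
        topicConcentration <- 1.1               # single beta/eta -> Double in Scala
        wrapped <- as.array(docConcentration)   # only the vector needs as.array()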




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #62439 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62439/consoleFull)** for PR 14229 at commit [`1886c1d`](https://github.com/apache/spark/commit/1886c1dc1005128b868ac849542d08d069fb5bcc).
     * This patch **fails R style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63956 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63956/consoleFull)** for PR 14229 at commit [`41249d7`](https://github.com/apache/spark/commit/41249d76ecd2e89ace3a30212d6e5a74f1376117).




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63705 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63705/consoleFull)** for PR 14229 at commit [`a254220`](https://github.com/apache/spark/commit/a254220e417fa715c98336246c2347a24b88828a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63863/consoleFull)** for PR 14229 at commit [`3e8678e`](https://github.com/apache/spark/commit/3e8678e27d556a15fd9df075bb7de071d53c01ea).
     * This patch **fails some tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by wangmiao1981 <gi...@git.apache.org>.
Github user wangmiao1981 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r74844277
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -39,6 +39,14 @@ setClass("GeneralizedLinearRegressionModel", representation(jobj = "jobj"))
     #' @note NaiveBayesModel since 2.0.0
     setClass("NaiveBayesModel", representation(jobj = "jobj"))
     
    +#' S4 class that represents an LDAModel
    +#'
    +#' @param jobj a Java object reference to the backing Scala LDAWrapper
    +#' @export
    +#' @note LDAModel since 2.1.0
    +setClass("LDAModel", representation(jobj = "jobj"))
    +
    +
    --- End diff --
    
    Extra blank line




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63869/
    Test PASSed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63884/consoleFull)** for PR 14229 at commit [`8280b41`](https://github.com/apache/spark/commit/8280b414443eece95130abb053be16c0c57aa9f0).




[GitHub] spark pull request #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by junyangq <gi...@git.apache.org>.
Github user junyangq commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14229#discussion_r75026748
  
    --- Diff: R/pkg/R/mllib.R ---
    @@ -299,6 +306,94 @@ setMethod("summary", signature(object = "NaiveBayesModel"),
                 return(list(apriori = apriori, tables = tables))
               })
     
    +# Returns posterior probabilities from a Latent Dirichlet Allocation model produced by spark.lda()
    +
    +#' @param newData A SparkDataFrame for testing
    +#' @return \code{spark.posterior} returns a SparkDataFrame containing posterior probability
    +#'         vectors named "topicDistribution"
    +#' @rdname spark.lda
    +#' @aliases spark.posterior,LDAModel,SparkDataFrame-method
    +#' @export
    +#' @note spark.posterior(LDAModel) since 2.1.0
    +setMethod("spark.posterior", signature(object = "LDAModel", newData = "SparkDataFrame"),
    +          function(object, newData) {
    +            return(dataFrame(callJMethod(object@jobj, "transform", newData@sdf)))
    +          })
    +
    +# Returns the summary of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @param object A Latent Dirichlet Allocation model fitted by \code{spark.lda}.
    +#' @param maxTermsPerTopic Maximum number of terms to collect for each topic. Default value of 10.
    +#' @return \code{summary} returns a list containing
    +#'         \item{\code{docConcentration}}{concentration parameter commonly named \code{alpha} for
    +#'               the prior placed on documents distributions over topics \code{theta}}
    +#'         \item{\code{topicConcentration}}{concentration parameter commonly named \code{beta} or
    +#'               \code{eta} for the prior placed on topic distributions over terms}
    +#'         \item{\code{logLikelihood}}{log likelihood of the entire corpus}
    +#'         \item{\code{logPerplexity}}{log perplexity}
    +#'         \item{\code{isDistributed}}{TRUE for distributed model while FALSE for local model}
    +#'         \item{\code{vocabSize}}{number of terms in the corpus}
    +#'         \item{\code{topics}}{top 10 terms and their weights of all topics}
    +#'         \item{\code{vocabulary}}{whole terms of the training corpus, NULL if a libsvm format
    +#'               file was used as the training set}
    +#' @rdname spark.lda
    +#' @aliases summary,LDAModel-method
    +#' @export
    +#' @note summary(LDAModel) since 2.1.0
    +setMethod("summary", signature(object = "LDAModel"),
    +          function(object, maxTermsPerTopic, ...) {
    +            maxTermsPerTopic <- as.integer(ifelse(missing(maxTermsPerTopic), 10, maxTermsPerTopic))
    +            jobj <- object@jobj
    +            docConcentration <- callJMethod(jobj, "docConcentration")
    +            topicConcentration <- callJMethod(jobj, "topicConcentration")
    +            logLikelihood <- callJMethod(jobj, "logLikelihood")
    +            logPerplexity <- callJMethod(jobj, "logPerplexity")
    +            isDistributed <- callJMethod(jobj, "isDistributed")
    +            vocabSize <- callJMethod(jobj, "vocabSize")
    +            topics <- dataFrame(callJMethod(jobj, "topics", maxTermsPerTopic))
    +            vocabulary <- callJMethod(jobj, "vocabulary")
    +            return(list(docConcentration = unlist(docConcentration),
    +                        topicConcentration = topicConcentration,
    +                        logLikelihood = logLikelihood, logPerplexity = logPerplexity,
    +                        isDistributed = isDistributed, vocabSize = vocabSize,
    +                        topics = topics,
    +                        vocabulary = unlist(vocabulary)))
    +          })
    +
    +# Returns the log perplexity of a Latent Dirichlet Allocation model produced by \code{spark.lda}
    +
    +#' @return \code{spark.perplexity} returns the log perplexity of the given SparkDataFrame, or the
    +#'         log perplexity of the training data if the argument "data" is missing.
    +#' @rdname spark.lda
    +#' @aliases spark.perplexity,LDAModel-method
    +#' @export
    +#' @note spark.perplexity(LDAModel) since 2.1.0
    +setMethod("spark.perplexity", signature(object = "LDAModel", data = "SparkDataFrame"),
    +          function(object, data) {
    +            return(ifelse(missing(data), callJMethod(object@jobj, "logPerplexity"),
    +                   callJMethod(object@jobj, "computeLogPerplexity", data@sdf)))
    +         })
    +
    +# Saves the Latent Dirichlet Allocation model to the input path.
    +
    +#' @param path The directory where the model is saved
    +#' @param overwrite Overwrites or not if the output path already exists. Default is FALSE
    --- End diff --
    
    `,` at the end?




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/63863/
    Test FAILed.




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63883 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63883/consoleFull)** for PR 14229 at commit [`6781785`](https://github.com/apache/spark/commit/67817853b44b73906370fcf91d7876095ee0067a).




[GitHub] spark issue #14229: [SPARK-16447][ML][SparkR] LDA wrapper in SparkR

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14229
  
    **[Test build #63890 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/63890/consoleFull)** for PR 14229 at commit [`84cc5e7`](https://github.com/apache/spark/commit/84cc5e73523dc9a306c45ca8bc994dc05984424d).

