Posted to reviews@spark.apache.org by yanboliang <gi...@git.apache.org> on 2015/11/13 11:23:30 UTC

[GitHub] spark pull request: [SPARK-11723] [ML] [Doc] Use LibSVM data sourc...

GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/9690

    [SPARK-11723] [ML] [Doc] Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame

    Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame
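
For readers unfamiliar with the format: a LibSVM text file holds one example per line as `label index:value index:value ...`, with one-based feature indices. The sketch below, in plain Java, shows the kind of parsing that turns one such line into a label plus a sparse feature vector. It is an illustration under those assumptions only. `LibSvmLineSketch`, `parse`, and `minDimension` are hypothetical names, not Spark API, and Spark's actual reader is not reproduced here.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Spark.
final class LibSvmLineSketch {
    final double label;
    final int[] indices;    // zero-based feature indices
    final double[] values;  // values[i] belongs to feature indices[i]

    private LibSvmLineSketch(double label, int[] indices, double[] values) {
        this.label = label;
        this.indices = indices;
        this.values = values;
    }

    // Parses one line of the form "label index:value index:value ...".
    static LibSvmLineSketch parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        double label = Double.parseDouble(tokens[0]);
        int[] indices = new int[tokens.length - 1];
        double[] values = new double[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            // Indices are one-based in the file; store them zero-based.
            indices[i - 1] = Integer.parseInt(pair[0]) - 1;
            values[i - 1] = Double.parseDouble(pair[1]);
        }
        return new LibSvmLineSketch(label, indices, values);
    }

    // Smallest vector size that can hold every feature on this line.
    int minDimension() {
        int max = -1;
        for (int index : indices) {
            max = Math.max(max, index);
        }
        return max + 1;
    }

    public static void main(String[] args) {
        LibSvmLineSketch row = parse("1 3:0.5 10:2.0");
        System.out.println(row.label + " " + Arrays.toString(row.indices)
            + " " + Arrays.toString(row.values) + " dim>=" + row.minDimension());
    }
}
```

The file itself never states the total number of features, so a loader must either scan for the largest index or be told the count. That is presumably why the `DecisionTreeExample` diff in this thread passes `Some(numFeatures)` when loading the test set.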

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-11723

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9690.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9690
    
----
commit 738f73a177fca85b8570451caff7a92f4ef06285
Author: Yanbo Liang <yb...@gmail.com>
Date:   2015-11-13T08:12:08Z

    Use sqlContext.read.format(libsvm).load(***) to load DataFrame

commit a0a3cbb32284f8e14796aeb56e6e21730c34f295
Author: Yanbo Liang <yb...@gmail.com>
Date:   2015-11-13T09:14:03Z

    PySpark ml example codes use LibSVMRelation

commit 93381fae845dfaac70b6b76c8ff5a04dbc857174
Author: Yanbo Liang <yb...@gmail.com>
Date:   2015-11-13T09:43:46Z

    Update code examples in spark.ml user guide to use LIBSVM data source

commit 66d8ec970a0d31110d4530252ac370b3e1a754cf
Author: Yanbo Liang <yb...@gmail.com>
Date:   2015-11-13T09:49:53Z

    Fix bug: Java should use sqlContext.read() not sqlContext.read

commit 83a0b2cbb0defb98ac69f10d1d9c9c19bc8ce7ff
Author: Yanbo Liang <yb...@gmail.com>
Date:   2015-11-13T10:22:01Z

    fix typo

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11723] [ML] [Doc] Use LibSVM data sourc...

Posted by asfgit <gi...@git.apache.org>.
GitHub user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/9690




[GitHub] spark pull request: [SPARK-11723] [ML] [Doc] Use LibSVM data sourc...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9690#issuecomment-156389696
  
    Merged build started.




[GitHub] spark pull request: [SPARK-11723] [ML] [Doc] Use LibSVM data sourc...

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9690#issuecomment-156391154
  
    **[Test build #45858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45858/consoleFull)** for PR 9690 at commit [`83a0b2c`](https://github.com/apache/spark/commit/83a0b2cbb0defb98ac69f10d1d9c9c19bc8ce7ff).




[GitHub] spark pull request: [SPARK-11723] [ML] [Doc] Use LibSVM data sourc...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9690#issuecomment-156400777
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45858/




[GitHub] spark pull request: [SPARK-11723] [ML] [Doc] Use LibSVM data sourc...

Posted by mengxr <gi...@git.apache.org>.
GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9690#discussion_r44803784
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -82,7 +82,7 @@ private[libsvm] class LibSVMRelation(val path: String, val numFeatures: Int, val
      *     .load("data/mllib/sample_libsvm_data.txt")
      *
      *   // Java
    - *   DataFrame df = sqlContext.read.format("libsvm")
    + *   DataFrame df = sqlContext.read().format("libsvm")
    --- End diff --
    
    Thanks for catching this!




[GitHub] spark pull request: [SPARK-11723] [ML] [Doc] Use LibSVM data sourc...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9690#issuecomment-156400775
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-11723] [ML] [Doc] Use LibSVM data sourc...

Posted by AmplabJenkins <gi...@git.apache.org>.
GitHub user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9690#issuecomment-156389674
  
    Merged build triggered.




[GitHub] spark pull request: [SPARK-11723] [ML] [Doc] Use LibSVM data sourc...

Posted by mengxr <gi...@git.apache.org>.
GitHub user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/9690#issuecomment-156483533
  
    LGTM. Merged into master and branch-1.6. Thanks!




[GitHub] spark pull request: [SPARK-11723] [ML] [Doc] Use LibSVM data sourc...

Posted by yanboliang <gi...@git.apache.org>.
GitHub user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9690#discussion_r44769440
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/DecisionTreeExample.scala ---
    @@ -169,36 +169,22 @@ object DecisionTreeExample {
           algo: String,
           fracTest: Double): (DataFrame, DataFrame) = {
         val sqlContext = new SQLContext(sc)
    -    import sqlContext.implicits._
     
         // Load training data
    -    val origExamples: RDD[LabeledPoint] = loadData(sc, input, dataFormat)
    +    val origExamples: DataFrame = loadData(sqlContext, input, dataFormat)
     
         // Load or create test set
    -    val splits: Array[RDD[LabeledPoint]] = if (testInput != "") {
    +    val dataframes: Array[DataFrame] = if (testInput != "") {
           // Load testInput.
    -      val numFeatures = origExamples.take(1)(0).features.size
    -      val origTestExamples: RDD[LabeledPoint] =
    -        loadData(sc, testInput, dataFormat, Some(numFeatures))
    +      val numFeatures = origExamples.first().getAs[Vector](1).size
    +      val origTestExamples: DataFrame =
    +        loadData(sqlContext, testInput, dataFormat, Some(numFeatures))
           Array(origExamples, origTestExamples)
         } else {
           // Split input into training, test.
           origExamples.randomSplit(Array(1.0 - fracTest, fracTest), seed = 12345)
         }
     
    -    // For classification, convert labels to Strings since we will index them later with
    -    // StringIndexer.
    -    def labelsToStrings(data: DataFrame): DataFrame = {
    -      algo.toLowerCase match {
    -        case "classification" =>
    -          data.withColumn("labelString", data("label").cast(StringType))
    -        case "regression" =>
    -          data
    -        case _ =>
    -          throw new IllegalArgumentException("Algo ${params.algo} not supported.")
    -      }
    -    }
    --- End diff --
    
    `StringIndexer` casts the label column to String automatically before indexing it, so this code snippet is no longer needed.
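
To make the comment above concrete, here is a rough plain-Java sketch of what "cast to String, then index" amounts to: numeric labels are turned into strings, and each distinct string gets an index by descending frequency, so the most frequent label maps to 0. `LabelIndexSketch` and `fit` are hypothetical names for illustration, not Spark's implementation. `String.valueOf` stands in for the column cast, and breaking ties alphabetically is an assumption of this sketch only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Spark.
final class LabelIndexSketch {
    // Maps each distinct label (as a string) to an index, most frequent first.
    static Map<String, Double> fit(double[] labels) {
        // Step 1: "cast" each numeric label to a string and count occurrences.
        Map<String, Integer> counts = new HashMap<>();
        for (double label : labels) {
            counts.merge(String.valueOf(label), 1, Integer::sum);
        }
        // Step 2: order by descending count; ties broken alphabetically
        // (the tie-break rule is an assumption of this sketch).
        List<String> ordered = new ArrayList<>(counts.keySet());
        ordered.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        // Step 3: assign indices 0.0, 1.0, 2.0, ... in that order.
        Map<String, Double> index = new LinkedHashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            index.put(ordered.get(i), (double) i);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(fit(new double[] {1.0, 0.0, 1.0, 2.0, 1.0, 0.0}));
    }
}
```

Because the string conversion happens inside the indexing step, a separate helper such as the removed `labelsToStrings` adds nothing for the classification case.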




[GitHub] spark pull request: [SPARK-11723] [ML] [Doc] Use LibSVM data sourc...

Posted by mengxr <gi...@git.apache.org>.
GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9690#discussion_r44803464
  
    --- Diff: docs/ml-ensembles.md ---
    @@ -195,7 +195,7 @@ import org.apache.spark.ml.feature.*;
     import org.apache.spark.sql.DataFrame;
     
     // Load and parse the data file, converting it to a DataFrame.
    -DataFrame data = sqlContext.read.format("libsvm")
    +DataFrame data = sqlContext.read().format("libsvm")
    --- End diff --
    
    In Scala, we don't need `()` for `read`.




[GitHub] spark pull request: [SPARK-11723] [ML] [Doc] Use LibSVM data sourc...

Posted by SparkQA <gi...@git.apache.org>.
GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9690#issuecomment-156400632
  
    **[Test build #45858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45858/consoleFull)** for PR 9690 at commit [`83a0b2c`](https://github.com/apache/spark/commit/83a0b2cbb0defb98ac69f10d1d9c9c19bc8ce7ff).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.

