Posted to reviews@spark.apache.org by freeman-lab <gi...@git.apache.org> on 2014/08/20 02:27:13 UTC

[GitHub] spark pull request: [SPARK-3112][MLLIB] Add documentation and exam...

GitHub user freeman-lab opened a pull request:

    https://github.com/apache/spark/pull/2047

    [SPARK-3112][MLLIB] Add documentation and example for StreamingLR

    Added a documentation section on StreamingLR to the `MLlib - Linear Methods` guide, including a worked example.
    
    @mengxr @tdas

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/freeman-lab/spark streaming-lr-docs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2047.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2047
    
----
commit 05a113946c09f4e61c4f16b80ae3ae217e471e9f
Author: freeman <th...@gmail.com>
Date:   2014-08-20T00:23:44Z

    Added documentation and example for StreamingLR

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3112][MLLIB] Add documentation and exam...

Posted by SparkQA <gi...@git.apache.org>.

GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2047#issuecomment-52721455
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18905/consoleFull) for PR 2047 at commit [`568d250`](https://github.com/apache/spark/commit/568d250ebf47017e79f6112390c0af81ff50ab63).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3112][MLLIB] Add documentation and exam...

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2047#discussion_r16453381
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -518,6 +518,80 @@ print("Mean Squared Error = " + str(MSE))
     </div>
     </div>
     
    +## Streaming linear regression
    +
    +When data arrive in a streaming fashion, it is useful to fit regression models online, 
    +updating the parameters of the model as new data arrive. MLlib currently supports 
    +streaming linear regression using ordinary least squares. The fitting is similar
    +to that performed offline, except fitting occurs on each batch of data, so that
    +the model continually updates to reflect the data from the stream.
    +
    +### Examples
    +
    +The following example demonstrates how to load training and testing data from two different
    +input streams of text files, parse the streams as labeled points, fit a linear regression model
    +online to the first stream, and make predictions on the second stream.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +First, we import the necessary classes for parsing our input data and creating the model. 
    +
    +{% highlight scala %}
    +
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
    +
    +{% endhighlight %}
    +
    +Then we make input streams for training and testing data. We assume a Streaming Context `ssc`
    --- End diff --
    
    `Streaming Context` -> `StreamingContext` or `streaming context`


[GitHub] spark pull request: [SPARK-3112][MLLIB] Add documentation and exam...

Posted by SparkQA <gi...@git.apache.org>.

GitHub user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2047#issuecomment-52721448
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18905/consoleFull) for PR 2047 at commit [`568d250`](https://github.com/apache/spark/commit/568d250ebf47017e79f6112390c0af81ff50ab63).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3112][MLLIB] Add documentation and exam...

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2047#issuecomment-52722502
  
    This PR contains only updates to documentation, and `jekyll build` runs fine on my local machine, so I'm merging this into master and branch-1.1. @freeman-lab Thanks!


[GitHub] spark pull request: [SPARK-3112][MLLIB] Add documentation and exam...

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2047#discussion_r16453385
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -518,6 +518,80 @@ print("Mean Squared Error = " + str(MSE))
     </div>
     </div>
     
    +## Streaming linear regression
    +
    +When data arrive in a streaming fashion, it is useful to fit regression models online, 
    +updating the parameters of the model as new data arrive. MLlib currently supports 
    +streaming linear regression using ordinary least squares. The fitting is similar
    +to that performed offline, except fitting occurs on each batch of data, so that
    +the model continually updates to reflect the data from the stream.
    +
    +### Examples
    +
    +The following example demonstrates how to load training and testing data from two different
    +input streams of text files, parse the streams as labeled points, fit a linear regression model
    +online to the first stream, and make predictions on the second stream.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +First, we import the necessary classes for parsing our input data and creating the model. 
    +
    +{% highlight scala %}
    +
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
    +
    +{% endhighlight %}
    +
    +Then we make input streams for training and testing data. We assume a Streaming Context `ssc`
    +has already been created, see [Spark Streaming Programming Guide](streaming-programming-guide.html#initializing)
    +for more info. For this example, we use labeled points in training and testing streams, 
    +but in practice you will likely want to use unlabeled Vectors for test data.
    --- End diff --
    
    `Vectors` -> `vectors`
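
The unlabeled case mentioned in the quoted text could be sketched as follows. This is a minimal sketch, not part of the PR: it assumes a recent Spark release in which `Vectors.parse` is public, test files holding one vector per line in the format that method accepts (for example `[1.0,2.0,3.0]`), and an existing `StreamingContext` and initialized model. The helper name `predictUnlabeled` is hypothetical.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper (not from the PR): read unlabeled vectors from a
// directory of text files and predict a value for each one.
def predictUnlabeled(
    ssc: StreamingContext,
    model: StreamingLinearRegressionWithSGD,
    dir: String): DStream[Double] = {
  // Each line is parsed as a vector, with no label attached.
  val testVectors = ssc.textFileStream(dir).map(Vectors.parse)
  // predictOn applies the most recently updated model to each batch.
  model.predictOn(testVectors)
}
```

The model must have its initial weights set before `predictOn` is called, and nothing is computed until the streaming context is started.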


[GitHub] spark pull request: [SPARK-3112][MLLIB] Add documentation and exam...

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2047#discussion_r16453379
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -518,6 +518,80 @@ print("Mean Squared Error = " + str(MSE))
     </div>
     </div>
     
    +## Streaming linear regression
    +
    +When data arrive in a streaming fashion, it is useful to fit regression models online, 
    +updating the parameters of the model as new data arrive. MLlib currently supports 
    --- End diff --
    
    `arrive` -> `arrives`


[GitHub] spark pull request: [SPARK-3112][MLLIB] Add documentation and exam...

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2047#issuecomment-52720809
  
    @freeman-lab This looks great! Thanks a lot for the documentation!


[GitHub] spark pull request: [SPARK-3112][MLLIB] Add documentation and exam...

Posted by asfgit <gi...@git.apache.org>.

GitHub user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2047


[GitHub] spark pull request: [SPARK-3112][MLLIB] Add documentation and exam...

Posted by mengxr <gi...@git.apache.org>.

GitHub user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2047#discussion_r16453386
  
    --- Diff: docs/mllib-linear-methods.md ---
    @@ -518,6 +518,80 @@ print("Mean Squared Error = " + str(MSE))
     </div>
     </div>
     
    +## Streaming linear regression
    +
    +When data arrive in a streaming fashion, it is useful to fit regression models online, 
    +updating the parameters of the model as new data arrive. MLlib currently supports 
    +streaming linear regression using ordinary least squares. The fitting is similar
    +to that performed offline, except fitting occurs on each batch of data, so that
    +the model continually updates to reflect the data from the stream.
    +
    +### Examples
    +
    +The following example demonstrates how to load training and testing data from two different
    +input streams of text files, parse the streams as labeled points, fit a linear regression model
    +online to the first stream, and make predictions on the second stream.
    +
    +<div class="codetabs">
    +
    +<div data-lang="scala" markdown="1">
    +
    +First, we import the necessary classes for parsing our input data and creating the model. 
    +
    +{% highlight scala %}
    +
    +import org.apache.spark.mllib.linalg.Vectors
    +import org.apache.spark.mllib.regression.LabeledPoint
    +import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
    +
    +{% endhighlight %}
    +
    +Then we make input streams for training and testing data. We assume a Streaming Context `ssc`
    +has already been created, see [Spark Streaming Programming Guide](streaming-programming-guide.html#initializing)
    +for more info. For this example, we use labeled points in training and testing streams, 
    +but in practice you will likely want to use unlabeled Vectors for test data.
    +
    +{% highlight scala %}
    +
    +val trainingData = ssc.textFileStream("/training/data/dir").map(LabeledPoint.parse)
    +val testData = ssc.textFileStream("/testing/data/dir").map(LabeledPoint.parse)
    +
    +{% endhighlight %}
    +
    +We create our model by initializing the weights to 0
    +
    +{% highlight scala %}
    +
    +val model = new StreamingLinearRegressionWithSGD()
    +    .setInitialWeights(Vectors.zeros(3))
    --- End diff --
    
    set `val numFeatures = 3` and use `numFeatures` in the function call?
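
One way the suggestion above might look, as a minimal sketch rather than the text that was actually committed. It assumes `spark-mllib` is on the classpath (for example in `spark-shell`), and the value 3 is simply the number used in the quoted diff.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// Name the feature dimension once so the initial weight vector and the
// input data can be kept in agreement.
val numFeatures = 3

// Start from an all-zero weight vector of that length.
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
```

Naming the dimension makes it clearer that the length of the initial weights has to match the number of features in each labeled point.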


