Posted to user@spark.apache.org by Xiangrui Meng <me...@gmail.com> on 2014/07/01 06:13:31 UTC

Re: Spark 1.0 and Logistic Regression Python Example

You were probably using an old version of numpy, 1.4? I think this is fixed in
the latest master. Try replacing vec.dot(target) with numpy.dot(vec, target),
or use the latest master. -Xiangrui
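
For reference, a minimal sketch of the suggested workaround (the vectors below are
made-up example values, not from the original post):

    import numpy

    vec = numpy.array([1.0, 2.0, 3.0])        # hypothetical feature vector
    target = numpy.array([0.5, -0.5, 0.25])   # hypothetical weight vector

    # Older numpy releases (reportedly 1.4 in this thread) do not expose a
    # .dot method on ndarray, so vec.dot(target) raises AttributeError.
    # The module-level function works on those versions as well:
    margin = numpy.dot(vec, target)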

On Mon, Jun 30, 2014 at 2:04 PM, Sam Jacobs <sa...@us.abb.com> wrote:
> Hi,
>
>
> I modified the example code for logistic regression to compute the classification
> error. Please see below. However, the code fails when it makes a call to:
>
>
> labelsAndPreds.filter(lambda (v, p): v != p).count()
>
>
> with the error message (something related to numpy or dot product):
>
>
> File "/opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/classification.py",
> line 65, in predict
>
>     margin = _dot(x, self._coeff) + self._intercept
>
>   File "/opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/_common.py", line
> 443, in _dot
>
>     return vec.dot(target)
>
> AttributeError: 'numpy.ndarray' object has no attribute 'dot'
>
>
> FYI, I am running the code using spark-submit, i.e.:
>
>
> ./bin/spark-submit examples/src/main/python/mllib/logistic_regression2.py
>
>
>
> The code is posted below if it will be useful in any way:
>
>
> from math import exp
>
> import sys
> import time
>
> from pyspark import SparkContext
>
> from pyspark.mllib.classification import LogisticRegressionWithSGD
> from pyspark.mllib.regression import LabeledPoint
> from numpy import array
>
>
> # Load and parse the data
> def parsePoint(line):
>     values = [float(x) for x in line.split(',')]
>     if values[0] == -1:   # Convert -1 labels to 0 for MLlib
>         values[0] = 0
>     return LabeledPoint(values[0], values[1:])
>
> sc = SparkContext(appName="PythonLR")
> # start timing
> start = time.time()
> #start = time.clock()
>
> data = sc.textFile("sWAMSpark_train.csv")
> parsedData = data.map(parsePoint)
>
> # Build the model
> model = LogisticRegressionWithSGD.train(parsedData)
>
> #load test data
>
> testdata = sc.textFile("sWSpark_test.csv")
> parsedTestData = testdata.map(parsePoint)
>
> # Evaluating the model on test data
> labelsAndPreds = parsedTestData.map(lambda p: (p.label, model.predict(p.features)))
> trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count())
> print("Training Error = " + str(trainErr))
> end = time.time()
> print("Time is = " + str(end - start))
>
>
>
>
>
>
>
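
One unrelated detail in the snippet above: the mismatch count comes from the test
set, but it is divided by the training-set size and printed as "Training Error".
Assuming the intent is to report the test error, a minimal corrected sketch (same
names as in the post) would be:

    # Evaluate the model on the test set and normalize by the test-set size.
    labelsAndPreds = parsedTestData.map(lambda p: (p.label, model.predict(p.features)))
    testErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedTestData.count())
    print("Test Error = " + str(testErr))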

RE: Spark 1.0 and Logistic Regression Python Example

Posted by Sam Jacobs <sa...@us.abb.com>.
Thanks Xiangrui, your suggestion fixed the problem. I will see if I can upgrade numpy/Python for a permanent fix. My current versions of Python and numpy are 2.6 and 4.1.9, respectively.

Thanks,

Sam  
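
A minimal check, in case it helps confirm whether an upgrade is needed: verify
whether the installed numpy already provides the ndarray .dot method (run with
the same Python that PySpark uses):

    import numpy

    print(numpy.__version__)              # the thread points at an old 1.4-era release
    print(hasattr(numpy.ndarray, "dot"))  # False on releases that trigger the AttributeError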
