Posted to issues@spark.apache.org by "Noel Smith (JIRA)" <ji...@apache.org> on 2015/08/24 20:48:48 UTC

[jira] [Created] (SPARK-10188) Pyspark CrossValidator with RMSE select incorrect model

Noel Smith created SPARK-10188:
----------------------------------

             Summary: Pyspark CrossValidator with RMSE select incorrect model
                 Key: SPARK-10188
                 URL: https://issues.apache.org/jira/browse/SPARK-10188
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.5.0
            Reporter: Noel Smith


PySpark {{CrossValidator}} selects the wrong model when RMSE is used as the evaluation metric.

In the example below, it should select the {{LinearRegression}} with zero regularization, since that gives the most accurate result, but it instead selects the one with the largest regularization.

Probably related to: SPARK-10097

{code}
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.feature import Binarizer
from pyspark.mllib.linalg import Vectors
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Label = 2 * feature
train = sqlContext.createDataFrame([
    (Vectors.dense([10.0]), 20.0), 
    (Vectors.dense([100.0]), 200.0), 
    (Vectors.dense([1000.0]), 2000.0)] * 10,
    ["features", "label"])

test = sqlContext.createDataFrame([
    (Vectors.dense([1000.0]),)],  
    ["features"])

# Expected prediction: 2000.0
print(LinearRegression(regParam=0.0).fit(train).transform(test).collect()) # Predicts 2000.0 (perfect)
print(LinearRegression(regParam=100.0).fit(train).transform(test).collect()) # Predicts 1869.31
print(LinearRegression(regParam=1000000.0).fit(train).transform(test).collect()) # Predicts 741.08 (worst)

# Cross-validation
lr = LinearRegression()
rmse_eval = RegressionEvaluator() # metricName defaults to "rmse"
grid = (ParamGridBuilder()
    .addGrid( lr.regParam, [0.0, 100.0, 1000000.0] )
    .build())
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=rmse_eval)
cv_model = cv.fit(train)

print(cv_model.bestModel.transform(test).collect()) # Predicts 741.08 (i.e. worst model selected)
{code}
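
One plausible explanation, not confirmed in this report, is that the selection step treats a larger metric value as better, which is wrong for RMSE. The sketch below is plain Python with no Spark dependency and illustrates that assumption using the three predictions quoted above. The helper names {{rmse}} and {{select_best}} and the {{larger_is_better}} flag are hypothetical and are not part of the PySpark API.

```python
import math


def rmse(predictions, labels):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared) / len(squared))


def select_best(metrics, larger_is_better):
    """Return the index of the best metric (hypothetical helper)."""
    chooser = max if larger_is_better else min
    return chooser(range(len(metrics)), key=lambda i: metrics[i])


# regParam values and the predictions reported above for feature 1000.0
reg_params = [0.0, 100.0, 1000000.0]
predictions = [2000.0, 1869.31, 741.08]
label = 2000.0

# RMSE on the single test point, one value per candidate model
metrics = [rmse([p], [label]) for p in predictions]

# Assumed faulty behaviour: the highest RMSE wins, so the worst model is chosen
buggy = reg_params[select_best(metrics, larger_is_better=True)]

# Expected behaviour for an error metric: the lowest RMSE wins
fixed = reg_params[select_best(metrics, larger_is_better=False)]

print("largest metric selects regParam =", buggy)
print("smallest metric selects regParam =", fixed)
```

If this assumption holds, the selection step needs to know the direction of each metric: smaller is better for error metrics such as RMSE, larger is better for scores such as R2.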


