Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/01 21:00:38 UTC

[GitHub] [spark] ancasarb opened a new pull request #24509: Linear Regression - validate training related params such as loss only during fitting phase

URL: https://github.com/apache/spark/pull/24509
 
 
   ## What changes were proposed in this pull request?
   
   When the transform(...) method is called on a LinearRegressionModel created directly from coefficients and an intercept, the following exception is thrown.
   
   ```
   java.util.NoSuchElementException: Failed to find a default value for loss
   	at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
   	at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
   	at scala.Option.getOrElse(Option.scala:121)
   	at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779)
   	at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
   	at org.apache.spark.ml.param.Params$class.$(params.scala:786)
   	at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
   	at org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
   	at org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
   	at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192)
   	at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
   	at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
   	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
   	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
   	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
   	at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311)
   	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
   	at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
   ```
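   
   For reference, here is a rough standalone repro. The LinearRegressionModel constructor is private[ml], so this sketch assumes it is compiled under the org.apache.spark.ml.regression package; the object name is just illustrative and not part of this patch.
   
   ```scala
   package org.apache.spark.ml.regression
   
   import org.apache.spark.ml.linalg.Vectors
   import org.apache.spark.sql.SparkSession
   
   // Hypothetical repro object, not part of this patch.
   object LossParamRepro {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[1]").getOrCreate()
       import spark.implicits._
   
       // Build the model directly from coefficients and an intercept;
       // fit() never runs, so training-only params such as loss are never set.
       val model = new LinearRegressionModel("linReg", Vectors.dense(1.0, 2.0), 0.5)
       val df = Seq(Tuple1(Vectors.dense(3.0, 4.0))).toDF("features")
   
       // transform() -> transformSchema() -> validateAndTransformSchema()
       // reads $(loss) and fails with the exception above.
       model.transform(df).show()
       spark.stop()
     }
   }
   ```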
   
   This happens because validateAndTransformSchema() is called during both the training and scoring phases, but checks against training-related params such as loss should only be performed during training. I think that's the right fix, but please correct me if I'm missing anything :)
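   
   In code, the fix amounts to guarding those checks behind the existing `fitting` flag in LinearRegressionParams.validateAndTransformSchema, roughly like this (a simplified sketch; the exact checks and messages may differ from the patch):
   
   ```scala
   // Simplified sketch of LinearRegressionParams.validateAndTransformSchema;
   // Huber, Normal, $(...), and the params referenced here come from the
   // surrounding LinearRegression.scala file.
   override protected def validateAndTransformSchema(
       schema: StructType,
       fitting: Boolean,
       featuresDataType: DataType): StructType = {
     if (fitting) {
       // Training-only validation: loss and related params are only
       // guaranteed to be set while fit() is running.
       if ($(loss) == Huber) {
         require($(solver) != Normal,
           "LinearRegression with huber loss doesn't support the normal solver.")
         require($(elasticNetParam) == 0.0,
           "LinearRegression with huber loss only supports L2 regularization.")
       }
     }
     super.validateAndTransformSchema(schema, fitting, featuresDataType)
   }
   ```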
   
   This issue was first reported for mleap (https://github.com/combust/mleap/issues/455): when we serialize Spark transformers for mleap, we only serialize the params that are relevant for scoring. We do have the option to deserialize those transformers back into Spark for scoring again, but in that case we no longer have all the training params.
   
   ## How was this patch tested?
   Added a unit test to check this scenario. 
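   A sketch of what such a test might look like (illustrative only; it assumes a ScalaTest-style suite with a SparkSession in scope, as in Spark's own ML test suites, and is not the exact test from the patch):
   
   ```scala
   test("model created directly from coefficients can transform without training params") {
     import spark.implicits._
   
     // No fit() call, so training-only params such as loss are never set.
     val model = new LinearRegressionModel("linReg", Vectors.dense(1.0, 2.0), 0.5)
     val df = Seq(Tuple1(Vectors.dense(3.0, 4.0))).toDF("features")
   
     // Before the fix this threw:
     // java.util.NoSuchElementException: Failed to find a default value for loss
     val predictions = model.transform(df)
     assert(predictions.columns.contains("prediction"))
   }
   ```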
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org