Posted to user@spark.apache.org by femibyte <fe...@gmail.com> on 2020/01/03 05:52:27 UTC

MLeap and Spark ML SQLTransformer

I am trying to serialize a PySpark ML model with MLeap. However, the
model uses a SQLTransformer for some column-based transformations,
e.g. adding log-scaled versions of some columns. MLeap doesn't support
SQLTransformer - see https://github.com/combust/mleap/issues/126 - so
I've implemented the first of the two suggestions given there:

- For non-row operations, move the SQL out of the ML Pipeline that you
  plan to serialize.
- For row-based operations, use the available ML transformers or write
  a custom transformer (this is where the custom transformer
  documentation will help).

I've externalized the SQL transformation on the training data used to
build the model, and I do the same for the input data when I run the
model for evaluation.
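
Concretely, the externalized step looks roughly like this (the column
and DataFrame names here are placeholders, not the real ones; the real
query log-scales several columns):

    from pyspark.sql import functions as F

    # Same logic the SQLTransformer used to apply, e.g.
    # "SELECT *, LOG(amount) AS log_amount FROM __THIS__",
    # now run directly on the DataFrame before the pipeline sees it.
    def add_log_columns(df):
        return df.withColumn("log_amount", F.log(F.col("amount")))

    train_df = add_log_columns(raw_train_df)  # before fitting Model 2
    eval_df = add_log_columns(raw_eval_df)    # same step at scoring time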

The problem I'm having is that I'm unable to obtain the same results across
the 2 models.

*Model 1* - Pure Spark ML pipeline containing:

   SQLTransformer -> StringIndexer -> OneHotEncoderEstimator ->
   VectorAssembler -> RandomForestClassifier
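
In code the pipeline looks roughly like this (column names and
parameters are simplified placeholders):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (SQLTransformer, StringIndexer,
                                    OneHotEncoderEstimator, VectorAssembler)
    from pyspark.ml.classification import RandomForestClassifier

    # "amount" and "category" are placeholder column names
    sql = SQLTransformer(
        statement="SELECT *, LOG(amount) AS log_amount FROM __THIS__")
    indexer = StringIndexer(inputCol="category", outputCol="category_idx")
    encoder = OneHotEncoderEstimator(inputCols=["category_idx"],
                                     outputCols=["category_vec"])
    assembler = VectorAssembler(inputCols=["category_vec", "log_amount"],
                                outputCol="features")
    rf = RandomForestClassifier(featuresCol="features", labelCol="label")

    pipeline1 = Pipeline(stages=[sql, indexer, encoder, assembler, rf])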

*Model 2* - Externalized version, with the SQL queries run on the
training data before the pipeline is fitted.
Its stages are everything after SQLTransformer in Model 1:

   StringIndexer -> OneHotEncoderEstimator -> VectorAssembler ->
   RandomForestClassifier

I'm wondering how I could go about debugging this problem. Is there a
way to compare the results after each stage to see where the
differences show up? Any suggestions are appreciated.
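
The only approach I've thought of so far is to walk the fitted stages
manually and diff the intermediate DataFrames, along these lines
(model1/model2 are the fitted PipelineModels, raw_df is a placeholder
for the evaluation data, and add_log_columns is the externalized SQL
step from above):

    # Collect the intermediate DataFrame produced after each fitted stage.
    def stage_outputs(pipeline_model, df):
        outputs = []
        for stage in pipeline_model.stages:
            df = stage.transform(df)
            outputs.append((stage.uid, df))
        return outputs

    outs1 = stage_outputs(model1, raw_df)                   # SQLTransformer first
    outs2 = stage_outputs(model2, add_log_columns(raw_df))  # SQL already applied

    # Skip Model 1's SQLTransformer output so the stages line up, then
    # diff the rows (assumes the intermediate schemas actually match).
    for (uid1, df1), (uid2, df2) in zip(outs1[1:], outs2):
        print(uid1, "vs", uid2, "diff rows:", df1.exceptAll(df2).count())

Before diffing rows it's probably also worth comparing the fitted
parameters of each stage (e.g. StringIndexer's labels), since
StringIndexer orders labels by frequency and a different label order
would shift the one-hot encoding for every downstream stage.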



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org