Posted to issues@spark.apache.org by "Sean Owen (Jira)" <ji...@apache.org> on 2019/08/19 22:02:00 UTC

[jira] [Resolved] (SPARK-28434) Decision Tree model isn't equal after save and load

     [ https://issues.apache.org/jira/browse/SPARK-28434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-28434.
-------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 25485
[https://github.com/apache/spark/pull/25485]

> Decision Tree model isn't equal after save and load
> ---------------------------------------------------
>
>                 Key: SPARK-28434
>                 URL: https://issues.apache.org/jira/browse/SPARK-28434
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.4.3
>         Environment: spark from master
>            Reporter: Ievgen Prokhorenko
>            Assignee: Ievgen Prokhorenko
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> The file `mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala` has a TODO on line 628:
>  
> {code:java}
> // TODO: Check other fields besides the information gain.
> {code}
> If, in addition to the existing check of the InformationGainStats gain value, I add a check of another field, for instance impurity, the test fails because the values differ between the saved model and the one restored from disk.
>  
> See PR with an example.
>  
> The tests are executed with this command:
>  
> {code:java}
> build/mvn -e -Dtest=none -DwildcardSuites=org.apache.spark.mllib.tree.DecisionTreeSuite test
> {code}
>  
> Excerpts from the output of the command above:
> {code:java}
> ...
> - model save/load *** FAILED ***
> checkEqual failed since the two trees were not identical.
> TREE A:
> DecisionTreeModel classifier of depth 2 with 5 nodes
> If (feature 0 <= 0.5)
> Predict: 0.0
> Else (feature 0 > 0.5)
> If (feature 1 in {0.0,1.0})
> Predict: 0.0
> Else (feature 1 not in {0.0,1.0})
> Predict: 0.0
> TREE B:
> DecisionTreeModel classifier of depth 2 with 5 nodes
> If (feature 0 <= 0.5)
> Predict: 0.0
> Else (feature 0 > 0.5)
> If (feature 1 in {0.0,1.0})
> Predict: 0.0
> Else (feature 1 not in {0.0,1.0})
> Predict: 0.0 (DecisionTreeSuite.scala:610)
> ...{code}
> If I add some debug output to `DecisionTreeSuite.checkEqual`:
>  
> {code:java}
> val aStats = a.stats
> val bStats = b.stats
> println(s"id ${a.id} ${b.id}")
> println(s"impurity ${aStats.get.impurity} ${bStats.get.impurity}")
> println(s"leftImpurity ${aStats.get.leftImpurity} ${bStats.get.leftImpurity}")
> println(s"rightImpurity ${aStats.get.rightImpurity} ${bStats.get.rightImpurity}")
> println(s"leftPredict ${aStats.get.leftPredict} ${bStats.get.leftPredict}")
> println(s"rightPredict ${aStats.get.rightPredict} ${bStats.get.rightPredict}")
> println(s"gain ${aStats.get.gain} ${bStats.get.gain}")
> {code}
>  
> Then the output of the test command shows that, among the stats fields, only the `gain` values are equal:
>  
> {code:java}
> id 1 1
> impurity 0.2 0.5
> leftImpurity 0.3 0.5
> rightImpurity 0.4 0.5
> leftPredict 1.0 (prob = 0.4) 0.0 (prob = 1.0)
> rightPredict 0.0 (prob = 0.6) 0.0 (prob = 1.0)
> gain 0.1 0.1
> {code}
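
A field-by-field comparison makes this kind of mismatch easier to spot than a series of println calls. The following is only a minimal sketch: `SplitStats`, `StatsDiff`, and `differingFields` are hypothetical names introduced for illustration, not Spark's InformationGainStats class or the actual `checkEqual` helper, and the tolerance is an assumed value. The numbers are copied from the debug output above.

```scala
// Hypothetical stand-in for the numeric stats fields printed in the debug
// output above; this is NOT Spark's InformationGainStats class.
final case class SplitStats(
    gain: Double,
    impurity: Double,
    leftImpurity: Double,
    rightImpurity: Double)

object StatsDiff {
  // Returns the names of the fields whose values differ by more than `tol`.
  // The default tolerance is an assumption chosen for this sketch.
  def differingFields(a: SplitStats, b: SplitStats, tol: Double = 1e-9): Seq[String] = {
    val fields = Seq(
      ("gain", a.gain, b.gain),
      ("impurity", a.impurity, b.impurity),
      ("leftImpurity", a.leftImpurity, b.leftImpurity),
      ("rightImpurity", a.rightImpurity, b.rightImpurity))
    fields.collect { case (name, x, y) if math.abs(x - y) > tol => name }
  }

  def main(args: Array[String]): Unit = {
    // First and second columns of the debug output above (tree A and tree B).
    val a = SplitStats(gain = 0.1, impurity = 0.2, leftImpurity = 0.3, rightImpurity = 0.4)
    val b = SplitStats(gain = 0.1, impurity = 0.5, leftImpurity = 0.5, rightImpurity = 0.5)
    println(differingFields(a, b).mkString(", "))
    // prints: impurity, leftImpurity, rightImpurity
  }
}
```

An assertion that `differingFields` returns an empty sequence would then fail on every mismatched field at once, rather than on the first one checked.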



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
