Posted to user@spark.apache.org by Marco Mistroni <mm...@gmail.com> on 2016/06/10 18:39:17 UTC

Pls assist: Spark DecisionTree question

Hi all,
I am trying to run an ML program against some data, using decision trees.
To fine-tune the parameters, I am running this loop to find the optimal
values for impurity, depth and bins:

import org.apache.spark.mllib.tree.DecisionTree

// Try every combination of impurity, depth and bins, and record its accuracy
for (impurity <- Array("gini", "entropy");
     depth    <- Array(1, 2, 3, 4, 5);
     bins     <- Array(10, 20, 25, 28)) yield {
  val model = DecisionTree.trainClassifier(
    trainingData, numClasses, categoricalFeaturesInfo,
    impurity, depth, bins)

  val accuracy = getMetrics(model, testData).precision
  ((impurity, depth, bins), accuracy)
}

Could anyone explain why, if I run my program multiple times against the SAME
data, I get different optimal results for the parameters above?
I assumed that running the loop above against the same data would always give
the same results.
To give you an example, run 1 returned the following top results:

((gini,4,28),0.8)
((gini,4,25),0.8)
((gini,3,28),0.8)
((gini,3,25),0.8)
((entropy,3,28),0.7333333333333333)

while run 2 gave me these top results:

((entropy,2,28),0.6842105263157895)
((entropy,2,25),0.6842105263157895)
((entropy,2,20),0.6842105263157895)
((entropy,2,10),0.6842105263157895)
((entropy,1,28),0.684210526315789

Could anyone explain why?

Kind regards,
 marco
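
One thing the numbers themselves suggest: 0.7333333333333333 is 11/15 and 0.6842105263157895 is 13/19, so the accuracies in run 1 and run 2 appear to come from test sets of different sizes (a multiple of 15 versus a multiple of 19). That would be consistent with trainingData and testData being produced by an unseeded random split, although the post does not show how they are built, so this is only an assumption. The sketch below is plain Scala with no Spark dependency, and the object and method names are made up for illustration. It recovers the smallest test-set size consistent with each reported accuracy.

```scala
// Hypothetical helper, not part of Spark: for a reported accuracy, find the
// smallest test-set size n such that some whole number of correct
// predictions k gives k / n equal to that accuracy (within rounding error).
object AccuracyDenominators {

  def smallestDenominator(accuracy: Double, maxN: Int = 1000): Option[Int] =
    (1 to maxN).find { n =>
      val k = math.round(accuracy * n)            // nearest whole count of correct predictions
      math.abs(k.toDouble / n - accuracy) < 1e-12 // does k / n reproduce the reported value?
    }

  def main(args: Array[String]): Unit = {
    // Accuracy values copied from the two runs quoted in the post
    val run1 = Seq(0.8, 0.7333333333333333)
    val run2 = Seq(0.6842105263157895)

    (run1 ++ run2).foreach { a =>
      println(s"$a -> smallest consistent test-set size: ${smallestDenominator(a)}")
    }
  }
}
```

If the split does come from RDD.randomSplit, passing a fixed seed, for example data.randomSplit(Array(0.7, 0.3), seed = 42L) (the name data, the weights and the seed are placeholders), should make every run train and evaluate on the same rows, which would be a reasonable first thing to rule out.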