Posted to user@spark.apache.org by Mark Alen <li...@yahoo.com.INVALID> on 2015/08/19 01:54:52 UTC

[mllib] Random forest maxBins and confidence in training points

Hi everyone, 
I have two questions regarding the random forest implementation in mllib
1- maxBins: Say the value of a feature is in [0, 100]. In my dataset there are a lot of data points in [0, 10], one data point at 100, and nothing in (10, 100). I am wondering how the binning works in this case. I obviously don't want all my points in [0, 10] to fall into the same bin while the other bins stay empty. Would mllib do any smart reallocation of bins so that each bin gets some data points and no single bin ends up with all of them?
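To make the concern concrete, here is a small standalone sketch (plain NumPy, not MLlib code) contrasting equal-width bins with equal-frequency (quantile-based) bins on a feature distributed as described: many points in [0, 10] and a single outlier at 100. With equal-width bins nearly everything collapses into one bin, while quantile-based split points spread the dense region across all bins.

```python
# Illustration only -- synthetic data mimicking the distribution in the
# question: 999 points uniform in [0, 10] plus one point at 100.
import numpy as np

rng = np.random.default_rng(0)
values = np.append(rng.uniform(0, 10, size=999), 100.0)

# Equal-width binning over [0, 100]: almost every point lands in bin 0.
width_edges = np.linspace(0, 100, num=5)            # 4 bins of width 25
width_counts, _ = np.histogram(values, bins=width_edges)

# Equal-frequency binning: bin edges are quantiles of the data, so the
# dense [0, 10] region is split across the bins instead of collapsing.
quantile_edges = np.quantile(values, [0.0, 0.25, 0.5, 0.75, 1.0])
freq_counts, _ = np.histogram(values, bins=quantile_edges)

print(width_counts)   # -> [999   0   0   1]
print(freq_counts)    # -> [250 250 250 250]
```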
2- Is there any way to do this in Spark? http://stats.stackexchange.com/questions/165062/incorporating-the-confidence-in-the-training-data-into-the-ml-model
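For what it's worth, since the RandomForest API does not expose per-sample weights, one common workaround (a rough approximation, not a built-in feature) is to replicate each training point in proportion to its confidence before training. The function name and parameters below are hypothetical:

```python
# Hedged sketch: approximate per-sample confidence by replicating rows.
# A point with confidence 1.0 appears max_copies times; lower-confidence
# points appear fewer times, so the forest effectively down-weights them.
import random

def replicate_by_confidence(points, max_copies=10, seed=42):
    """points: iterable of (features, label, confidence), confidence in (0, 1]."""
    rng = random.Random(seed)
    weighted = []
    for features, label, conf in points:
        copies = max(1, round(conf * max_copies))
        weighted.extend([(features, label)] * copies)
    rng.shuffle(weighted)  # avoid long runs of identical rows
    return weighted

data = [([1.0, 2.0], 0, 1.0), ([3.0, 4.0], 1, 0.2)]
print(len(replicate_by_confidence(data)))  # -> 12 (10 copies + 2 copies)
```

The obvious trade-off is a larger training set; for an RDD-based pipeline the same idea could be applied with a flatMap before training.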
Thanks a lot,
Mark