Posted to user@spark.apache.org by janardhan shetty <ja...@gmail.com> on 2016/10/01 02:34:07 UTC

Re: Spark ML Decision Trees Algorithm

It would be good to know which paper inspired the decision tree
implementation that we use in Spark 2.0.

On Fri, Sep 30, 2016 at 4:44 PM, Peter Figliozzi <pe...@gmail.com>
wrote:

> It's a good question.  People have been publishing papers on decision
> trees and various methods of constructing and pruning them for over 30
> years.  I think it's rather a question for a historian at this point.
>
> On Fri, Sep 30, 2016 at 5:08 PM, janardhan shetty <ja...@gmail.com>
> wrote:
>
>> I read this explanation, but I am wondering whether the algorithm is based
>> on a research paper that I could read for a more detailed understanding.
>>
>> On Fri, Sep 30, 2016 at 1:36 PM, Kevin Mellott <kevin.r.mellott@gmail.com
>> > wrote:
>>
>>> The documentation details the algorithm being used at
>>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>>
>>> Thanks,
>>> Kevin
>>>
>>> On Fri, Sep 30, 2016 at 1:14 AM, janardhan shetty <
>>> janardhanp22@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Any help here is appreciated ..
>>>>
>>>> On Wed, Sep 28, 2016 at 11:34 AM, janardhan shetty <
>>>> janardhanp22@gmail.com> wrote:
>>>>
>>>>> Is there a reference to the research paper which is implemented in
>>>>> spark 2.0 ?
>>>>>
>>>>> On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty <
>>>>> janardhanp22@gmail.com> wrote:
>>>>>
>>>>>> Which algorithm does Spark use under the covers for decision trees?
>>>>>> For example, scikit-learn (Python) uses an optimised version of the
>>>>>> CART algorithm.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Spark ML Decision Trees Algorithm

Posted by "颜发才 (Yan Facai)" <ya...@gmail.com>.
Perhaps the best way is to read the code.
The decision tree is implemented as a random forest with a single tree, whose
entry point is the `run` method:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L88

I'm not familiar with the named decision tree algorithms, such as
ID4 or CART. However, I believe that sklearn's decision tree implementation
is quite similar to Spark's. Some differences are
listed below:
1. Continuous features.
    sklearn uses all candidate values to find the best split, while Spark
groups all candidate values into a fixed number of bins.

2. Tree building.
    sklearn provides two methods, depth-first and best-first, while Spark
has only one: depth-first.

3. Number of splits.
    sklearn creates one split per iteration, while Spark can perform several
splits in parallel.
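
To make the first difference concrete, here is a rough sketch in plain Python. It is not the actual code of either library. The helper names `exact_thresholds` and `binned_thresholds` are made up for illustration, and the quantile rule is a simplifying assumption. The `max_bins` argument only mirrors the idea behind Spark's `maxBins` parameter.

```python
from typing import List


def exact_thresholds(values: List[float]) -> List[float]:
    """Exhaustive search: one candidate threshold between every pair of
    consecutive distinct values (hypothetical helper, not library code)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def binned_thresholds(values: List[float], max_bins: int) -> List[float]:
    """Binned search: at most max_bins - 1 candidate thresholds, taken at
    evenly spaced positions of the sorted values (assumed quantile rule)."""
    distinct = sorted(set(values))
    if len(distinct) <= max_bins:
        # Few distinct values: binning would not save anything.
        return exact_thresholds(values)
    ordered = sorted(values)
    n = len(ordered)
    thresholds: List[float] = []
    for i in range(1, max_bins):
        t = ordered[(i * n) // max_bins]
        # Skip repeated thresholds caused by heavily duplicated values.
        if not thresholds or t > thresholds[-1]:
            thresholds.append(t)
    return thresholds


values = [float(v) for v in range(1, 101)]
exact = exact_thresholds(values)
binned = binned_thresholds(values, max_bins=4)
print(len(exact), "exact candidates vs", len(binned), "binned candidates")
```

With the binned variant, the number of candidate splits per feature is bounded by the bin count instead of growing with the data, which is the trade-off described in item 1.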

If I'm wrong, please let me know.


