Posted to issues@ignite.apache.org by "Anton Dmitriev (JIRA)" <ji...@apache.org> on 2018/04/06 19:06:00 UTC

[jira] [Updated] (IGNITE-8059) Integrate decision tree with partition based dataset

     [ https://issues.apache.org/jira/browse/IGNITE-8059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Dmitriev updated IGNITE-8059:
-----------------------------------
    Description: 
A partition-based dataset (a new underlying infrastructure component) was added as part of IGNITE-7437, and now we need to adapt the decision tree algorithm to work on top of this infrastructure.
----
The way the decision tree algorithm is implemented on top of row-partitioned data is described below.

First, the basic idea behind any decision tree, for both regression and classification, is to find the *data split* that minimizes an *impurity measure* such as the [Gini coefficient|https://en.wikipedia.org/wiki/Gini_coefficient], [entropy|https://en.wikipedia.org/wiki/Entropy_(information_theory)] or [mean squared error|https://en.wikipedia.org/wiki/Mean_squared_error]. To calculate the best split, we need to build a _function_ that describes the dependency between the split point (independent variable) and the impurity measure (dependent variable), and then find the minimum of this _function_.
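
As an illustration of the idea above, here is a minimal single-node sketch in plain Python. It is not taken from the Ignite code base: the function names (impurity_by_split, best_split), the choice of midpoints between neighbouring feature values as candidate split points, and the use of mean squared error on a single feature are all assumptions made for this example.

```python
def sse(values):
    """Sum of squared deviations from the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def impurity_by_split(xs, ys):
    """Tabulate the function: split point -> mean squared error after the split.

    Candidate split points are midpoints between distinct neighbouring
    feature values (an assumption of this sketch).
    """
    points = sorted(zip(xs, ys))
    n = len(points)
    table = []
    for i in range(1, n):
        if points[i - 1][0] == points[i][0]:
            continue  # no split point fits between equal feature values
        threshold = (points[i - 1][0] + points[i][0]) / 2
        left = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        # Each side predicts its own mean, so the error of the split is
        # the pooled squared error of both sides divided by the row count.
        table.append((threshold, (sse(left) + sse(right)) / n))
    return table


def best_split(xs, ys):
    """Return (threshold, impurity) at the minimum, or None if no split exists."""
    table = impurity_by_split(xs, ys)
    return min(table, key=lambda entry: entry[1]) if table else None


print(best_split([1, 2, 3, 10, 11, 12], [1, 1, 1, 5, 5, 5]))  # (6.5, 0.0)
```

In this toy data set the labels change between x = 3 and x = 10, so the minimum of the tabulated function is at the midpoint 6.5, where both sides are perfectly homogeneous.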

In a distributed system where the data is partitioned by row, we can calculate such a _function_ on every node, compress it somehow, and then pass it to the master node. On the master node, we need to aggregate the _functions_ received from all nodes and then find the minimum of the resulting _function_. This is how the decision tree algorithm is implemented in the Apache Ignite ML module.
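
The distributed scheme can be sketched in the same spirit. The code below is a hypothetical illustration, not the Ignite ML implementation: it assumes a fixed list of candidate thresholds known to every node, and it takes "compression" to mean keeping only additive statistics (count, sum, and sum of squares of the labels on each side of each threshold). These statistics can be added element-wise on the master node, after which the mean squared error is recovered from the merged totals. All names (partition_stats, merge_stats, best_split) are made up for this example.

```python
from functools import reduce


def partition_stats(rows, thresholds):
    """Runs on one node: summarize local (x, y) rows for every threshold.

    For each threshold returns a pair (left, right), where each side is
    [count, sum of labels, sum of squared labels].
    """
    stats = []
    for t in thresholds:
        left, right = [0, 0.0, 0.0], [0, 0.0, 0.0]
        for x, y in rows:
            side = left if x <= t else right
            side[0] += 1
            side[1] += y
            side[2] += y * y
        stats.append((left, right))
    return stats


def merge_stats(a, b):
    """Runs on the master node: add two per-partition summaries element-wise."""
    return [
        ([p + q for p, q in zip(l1, l2)], [p + q for p, q in zip(r1, r2)])
        for (l1, r1), (l2, r2) in zip(a, b)
    ]


def sse(side):
    """Squared error around the mean, recovered from [count, sum, sum of squares]."""
    count, total, total_sq = side
    return total_sq - total * total / count if count else 0.0


def best_split(partitions, thresholds):
    """Merge all partition summaries and return (threshold, impurity) at the minimum.

    Assumes at least one partition and at least one row overall.
    """
    merged = reduce(merge_stats, (partition_stats(p, thresholds) for p in partitions))
    scored = [
        (t, (sse(left) + sse(right)) / (left[0] + right[0]))
        for t, (left, right) in zip(thresholds, merged)
    ]
    return min(scored, key=lambda entry: entry[1])


partitions = [
    [(1, 1.0), (10, 5.0)],
    [(2, 1.0), (11, 5.0)],
    [(3, 1.0), (12, 5.0)],
]
print(best_split(partitions, [1.5, 6.5, 11.5]))  # (6.5, 0.0)
```

Note that per-partition impurity values themselves cannot simply be added, because each partition has its own label mean. That is why this sketch ships the underlying counts and sums rather than the impurity values.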

  was: A partition-based dataset (a new underlying infrastructure component) was added as part of IGNITE-7437, and now we need to adapt the decision tree algorithm to work on top of this infrastructure.


> Integrate decision tree with partition based dataset
> ----------------------------------------------------
>
>                 Key: IGNITE-8059
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8059
>             Project: Ignite
>          Issue Type: Improvement
>          Components: ml
>            Reporter: Anton Dmitriev
>            Assignee: Anton Dmitriev
>            Priority: Major
>             Fix For: 2.5
>
>
> A partition-based dataset (a new underlying infrastructure component) was added as part of IGNITE-7437, and now we need to adapt the decision tree algorithm to work on top of this infrastructure.
> ----
> The way the decision tree algorithm is implemented on top of row-partitioned data is described below.
> First, the basic idea behind any decision tree, for both regression and classification, is to find the *data split* that minimizes an *impurity measure* such as the [Gini coefficient|https://en.wikipedia.org/wiki/Gini_coefficient], [entropy|https://en.wikipedia.org/wiki/Entropy_(information_theory)] or [mean squared error|https://en.wikipedia.org/wiki/Mean_squared_error]. To calculate the best split, we need to build a _function_ that describes the dependency between the split point (independent variable) and the impurity measure (dependent variable), and then find the minimum of this _function_.
> In a distributed system where the data is partitioned by row, we can calculate such a _function_ on every node, compress it somehow, and then pass it to the master node. On the master node, we need to aggregate the _functions_ received from all nodes and then find the minimum of the resulting _function_. This is how the decision tree algorithm is implemented in the Apache Ignite ML module.


