Posted to issues@madlib.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/04/18 19:09:41 UTC

[jira] [Commented] (MADLIB-1057) Reduce memory footprint for DT

    [ https://issues.apache.org/jira/browse/MADLIB-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973311#comment-15973311 ] 

ASF GitHub Bot commented on MADLIB-1057:
----------------------------------------

Github user iyerr3 commented on the issue:

    https://github.com/apache/incubator-madlib/pull/117
  
    No, that's a separate JIRA: MADLIB-1057
    <https://issues.apache.org/jira/browse/MADLIB-1057>. This one is just about
    setting the defaults to more reasonable values, considering the data that
    users have shared.
    
    The commit is a little more than just changing two numbers, since I also
    updated the way these defaults are set. Previously they were set in the
    overloaded function declarations (in SQL). I changed this to set the
    defaults in the main function definition, eliminating the redundancy.
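
    For illustration, a minimal Python sketch of the pattern (the names and
    default values below are hypothetical; the actual change is in MADlib's
    SQL layer):

        # Hypothetical sketch: resolve defaults once in the main definition
        # instead of repeating them across overloaded declarations.
        DEFAULT_MAX_DEPTH = 7   # illustrative values, not MADlib's actual defaults
        DEFAULT_N_BINS = 20

        def tree_train(source_table, max_depth=None, n_bins=None):
            # Callers that omit an argument pass None (like SQL's DEFAULT);
            # this function is the single source of truth for the defaults.
            if max_depth is None:
                max_depth = DEFAULT_MAX_DEPTH
            if n_bins is None:
                n_bins = DEFAULT_N_BINS
            return max_depth, n_bins  # actual training would happen here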
    
    Thanks,
    Rahul



> Reduce memory footprint for DT
> ------------------------------
>
>                 Key: MADLIB-1057
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1057
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Decision Tree
>            Reporter: Frank McQuillan
>            Assignee: Rahul Iyer
>             Fix For: v1.11
>
>
> Follow-on from the spike
> https://issues.apache.org/jira/browse/MADLIB-1035
> Step 1
> As a MADlib developer, I want to recreate the RF memory issue (reported in https://issues.apache.org/jira/browse/MADLIB-1035).
> The current datasets we have are:
> dt_adult: 32K rows, 14 columns
> ecommerce: 1M rows, 4 columns (ecommerce isn’t actually suitable for DT/RF)
> We need a table with ~2.2M rows and ~130 features (the actual target table has ~1300 features). Randomly filling the values might help in diagnosing the issue, but ideally we want a somewhat sensible dataset: the problem seems to involve relatively short trees (depth 5), and a random dataset will probably fill the whole tree, which might not be true for a structured dataset.
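> A structured synthetic dataset of roughly the right shape could be generated along these lines (a sketch only; the column layout and dependence structure are made up for illustration):
>     import numpy as np
>     import pandas as pd
>
>     # ~2.2 GB of float64 at full size; shrink n_rows to experiment locally.
>     n_rows, n_features = 2200000, 130
>     rng = np.random.default_rng(42)
>
>     # A few "driver" columns plus correlated noise gives the data real
>     # structure, so a short tree does not need every branch.
>     drivers = rng.normal(size=(n_rows, 5))
>     coef = rng.normal(size=(5, n_features - 5))
>     noise = rng.normal(scale=0.1, size=(n_rows, n_features - 5))
>     X = np.hstack([drivers, drivers @ coef + noise])
>
>     # The label depends only on the drivers, so a depth-5 tree can fit it
>     # without growing a complete tree.
>     y = (drivers[:, 0] + drivers[:, 1] * drivers[:, 2] > 0).astype(int)
>
>     df = pd.DataFrame(X, columns=["f%d" % i for i in range(n_features)])
>     df["label"] = y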
> Step 2
> Refactor DT for a smaller memory footprint.
> The tree accumulator has two matrices, one for continuous and one for categorical variables.
> The whole structure is recreated at every level.
> Each matrix has 2^i rows (i is the level).
> The categorical matrix size depends on the total number of categories (weather: {sunny, cloudy, rainy} and isWeekend: {true, false} give a total of 3+2=5).
> The continuous matrix size depends on the number of continuous features * the number of bins.
> The tree accumulator works like an array, not a linked list: even if the output is not a complete tree, the accumulator creates rows for nonexistent branches in their proper order and fills them with zeros (see the footprint sketch below).
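> To make the per-level cost concrete, a back-of-envelope sketch (assuming 8-byte values per cell; the parameter values in the example call are hypothetical):
>     # Rough size of the two accumulator matrices at a given level,
>     # using the shapes described above.
>     def accumulator_bytes(level, total_categories, n_continuous, n_bins,
>                           bytes_per_value=8):
>         rows = 2 ** level                         # complete-tree rows, 2^i
>         cat_cells = rows * total_categories       # categorical matrix
>         con_cells = rows * n_continuous * n_bins  # continuous matrix
>         return (cat_cells + con_cells) * bytes_per_value
>
>     # Example: 1000 continuous features, 20 bins, 5000 total categories;
>     # the footprint doubles with every level.
>     for level in range(6):
>         print(level, accumulator_bytes(level, 5000, 1000, 20))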
> The refactored version would create a small index table with the same number of rows as the old tree accumulator (a complete tree), but only a single index column that points to the corresponding row of the new, compact tree accumulator.
> This will allow us to keep most of the internal function interfaces the same, but the code that accesses (reads/writes) the tree accumulator will have to change, as sketched below.
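> A minimal Python sketch of that indirection (the real accumulator is a C++ structure; all names here are hypothetical):
>     import numpy as np
>
>     class IndexedAccumulator:
>         """Complete-tree node positions map through an index table into
>         compact storage, so rows exist only for branches actually seen."""
>         def __init__(self, level, n_cols):
>             # Index table: same length as the old complete-tree accumulator,
>             # but a single column; -1 marks a nonexistent branch.
>             self.index = np.full(2 ** level, -1, dtype=np.int64)
>             self.rows = []                  # compact storage, grown on demand
>             self.n_cols = n_cols
>
>         def write(self, node_pos, values):
>             if self.index[node_pos] == -1:  # allocate on first touch
>                 self.index[node_pos] = len(self.rows)
>                 self.rows.append(np.zeros(self.n_cols))
>             self.rows[self.index[node_pos]] += values
>
>         def read(self, node_pos):
>             i = self.index[node_pos]
>             return self.rows[i] if i >= 0 else np.zeros(self.n_cols)
> Callers keep addressing nodes by their complete-tree position, so most interfaces stay the same; only the read/write paths go through the extra index lookup, and memory is allocated only for branches that exist.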


