Posted to issues@madlib.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/06/05 23:42:12 UTC

[jira] [Commented] (MADLIB-1117) Add "columns to process per pass" as an optional param for summary()

    [ https://issues.apache.org/jira/browse/MADLIB-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037855#comment-16037855 ] 

ASF GitHub Bot commented on MADLIB-1117:
----------------------------------------

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/incubator-madlib/pull/138

    Summary: Add param to determine num of cols per run

    JIRA: MADLIB-1117
    
    Summary used a hard-coded maximum of 15 columns per run.
    This limit was put in place to avoid out-of-memory errors in most cases.
    However, it also increases the run time, since more columns could be
    summarized in a single run for a simpler data set (one that leads to
    smaller sketch data structures).
    
    This commit adds a new parameter allowing users to set this limit,
    while retaining the old default of 15 columns.
    
    Closes #138

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/iyerr3/incubator-madlib feature/summary_add_parameter

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-madlib/pull/138.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #138
    
----
commit 1cca783b63111d004662f314cef67e9be8bb9a92
Author: Rahul Iyer <ri...@apache.org>
Date:   2017-06-05T23:36:50Z

    Summary: Add param to determine num of cols per run
    
    JIRA: MADLIB-1117
    
    Summary used a hard-coded maximum of 15 columns per run.
    This limit was put in place to avoid out-of-memory errors in most cases.
    However, it also increases the run time, since more columns could be
    summarized in a single run for a simpler data set (one that leads to
    smaller sketch data structures).
    
    This commit adds a new parameter allowing users to set this limit,
    while retaining the old default of 15 columns.
    
    Closes #138

----


> Add "columns to process per pass" as an optional param for summary()
> --------------------------------------------------------------------
>
>                 Key: MADLIB-1117
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1117
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Sketch-based Estimators
>            Reporter: Frank McQuillan
>            Assignee: Rahul Iyer
>            Priority: Minor
>             Fix For: v1.12
>
>
> Context
> The summary() function
> http://madlib.incubator.apache.org/docs/latest/group__grp__summary.html
> currently processes 15 columns per pass to keep memory usage below a 1 GB limit. The value of 15 is somewhat arbitrary, since memory usage depends on many things, including the data set and which params in summary() are set. If more columns could be processed per pass, summary() would run faster.
> Story
> As a MADlib developer, I want to add "columns to process per pass" as an optional param for the summary() function. Default: 15 columns (which is the current setting). Suggested param name: "columns_per_pass", though if you have a better name, that's fine.
> Acceptance
> 1) Add new optional parameter and update docs.  Please add a note so it is clear what this control does.
> 2) Write and pass tests.
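
For illustration only, here is a minimal Python sketch of what a "columns per pass" limit means: the column list is split into groups of at most N columns, and each group would be summarized in its own pass. This is not MADlib's actual implementation. The helper name split_into_passes and the sample column names are hypothetical; columns_per_pass is the param name suggested in the issue, with the default of 15 taken from the description above.

```python
def split_into_passes(columns, columns_per_pass=15):
    """Yield successive groups of at most columns_per_pass column names.

    Hypothetical helper, not part of MADlib: it only shows how a
    per-pass column limit partitions the work.
    """
    if columns_per_pass < 1:
        raise ValueError("columns_per_pass must be a positive integer")
    for start in range(0, len(columns), columns_per_pass):
        yield columns[start:start + columns_per_pass]


# Made-up table with 40 columns.
columns = ["col_%d" % i for i in range(40)]

# Default limit of 15: passes of 15, 15 and 10 columns.
default_passes = list(split_into_passes(columns))

# A larger limit means fewer passes over the data, at the cost of
# holding more per-column state in memory during each pass.
wider_passes = list(split_into_passes(columns, columns_per_pass=20))

print(len(default_passes), len(wider_passes))  # prints "3 2"
```

With a simple data set, raising the limit reduces the number of passes, which is the speed-up the issue describes. Whether the larger pass still fits in memory depends on the data and on which summary() options are enabled.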


