Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2019/02/13 19:22:00 UTC
[jira] [Comment Edited] (MADLIB-1301) Improve correlation and covariance memory usage with large number of groups

    [ https://issues.apache.org/jira/browse/MADLIB-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767483#comment-16767483 ] 

Frank McQuillan edited comment on MADLIB-1301 at 2/13/19 7:21 PM:
------------------------------------------------------------------

One idea is to do something like we do in 
http://madlib.apache.org/docs/latest/group__grp__summary.html

with the parameter

{code}
n_cols_per_run (optional)

INTEGER, default: 15. The number of columns to collect summary statistics in one pass of the data. This parameter determines the number of passes through the data. For example, with a total of 40 columns to summarize and 'n_cols_per_run = 15', there will be 3 passes through the data, with each pass summarizing a maximum of 15 columns.

Note
This parameter should be used with caution. Increasing this parameter could decrease the total run time (if number of passes decreases), but will increase the memory consumption during each run. Since PostgreSQL limits the memory available for a single aggregate run, this increased memory consumption could result in an out-of-memory termination error.
{code}

That is, limit the number of groups processed per pass over the data. The default could be "all", as it is now, and the user could then reduce it if there are memory issues.
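
A rough sketch of the batching idea in Python, just to make the proposal concrete. The parameter name n_groups_per_run and both helper functions are hypothetical; nothing like them exists in MADlib today, and the real change would live in the correlation/covariance module. Passing None stands in for the "all" default.

```python
from math import ceil
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def n_passes(n_groups: int, n_groups_per_run: Optional[int] = None) -> int:
    """Number of passes over the data needed to cover all groups.

    n_groups_per_run is a hypothetical parameter; None means "all groups
    in a single pass", which mirrors the current behaviour.
    """
    if n_groups == 0:
        return 0
    if n_groups_per_run is None:
        return 1
    if n_groups_per_run < 1:
        raise ValueError("n_groups_per_run must be a positive integer")
    return ceil(n_groups / n_groups_per_run)


def group_batches(
    groups: Sequence[T], n_groups_per_run: Optional[int] = None
) -> Iterator[List[T]]:
    """Yield the group values to process in each pass, in order."""
    if not groups:
        return
    if n_groups_per_run is None:
        # Default: every group in one pass.
        yield list(groups)
        return
    if n_groups_per_run < 1:
        raise ValueError("n_groups_per_run must be a positive integer")
    for start in range(0, len(groups), n_groups_per_run):
        yield list(groups[start:start + n_groups_per_run])
```

With the same arithmetic as the n_cols_per_run example, n_passes(40, 15) gives 3. Each batch from group_batches would then drive one aggregate run restricted to those group values, trading extra passes over the data for lower memory use per run.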



> Improve correlation and covariance memory usage with large number of groups
> ---------------------------------------------------------------------------
>
>                 Key: MADLIB-1301
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1301
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Descriptive Statistics
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v2.0
>
>
> When correlation and covariance are run with a large number of groups (hundreds), they can run out of memory.  Increasing statement_mem helps, but this JIRA is to investigate and improve memory usage with large numbers of groups.
> Sample findings on correlation for 300K input data set:
> || #groups || statement mem 186M || statement mem 200M || statement mem 500M || statement mem 1000M ||
> | 6 | Success | Success | Success | - |
> | 127 | Success | Success | - | - |
> | 930 | Fail | Fail | Success | - |
> | 1213 | Fail | Fail | Success | - |
> | 4852 | Fail | Fail | Fail | Fail |


