Posted to issues@madlib.apache.org by "Frank McQuillan (Jira)" <ji...@apache.org> on 2021/01/16 02:28:00 UTC

[jira] [Comment Edited] (MADLIB-1460) Prevent an "integer out of range" exception in linear regression train

    [ https://issues.apache.org/jira/browse/MADLIB-1460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266484#comment-17266484 ] 

Frank McQuillan edited comment on MADLIB-1460 at 1/16/21, 2:27 AM:
-------------------------------------------------------------------

(2)
Run with 5 billion rows (which is > 2^32):

Train
{code}
DROP TABLE IF EXISTS tab1;
CREATE TABLE tab1(
     indep_var BIGINT,
     dep_var BIGINT
);
INSERT INTO tab1 VALUES(generate_series(1,5000000000), generate_series(1,5000000000));

DROP TABLE IF EXISTS test_linregr, test_linregr_summary;
SELECT madlib.linregr_train( 'tab1',
                             'test_linregr',
                             'dep_var',
                             'ARRAY[1, indep_var]'
                           );
 linregr_train
---------------

(1 row)

Time: 375514.851 ms
{code}
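
As an additional check on the training output, the row count recorded in the summary table can be inspected directly. This is only a sketch: it assumes the summary table exposes the count in a column named num_rows_processed, as described in the MADlib linear regression documentation, and it relies on pg_typeof, which is available in PostgreSQL 8.4 and later. Adjust the column name if your version differs.
{code}
-- Assumption: the summary table stores the record count in num_rows_processed.
-- pg_typeof reports the type of the column value, which shows whether the
-- count is held as integer or bigint.
SELECT num_rows_processed,
       pg_typeof(num_rows_processed) AS count_type
FROM test_linregr_summary;
{code}
A count_type of bigint would indicate that the column can hold the row count of the 5 billion row input used here.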

Predict
{code}
test=# create table tab_pred_out as select madlib.linregr_predict(coef,
                              ARRAY[1, indep_var]
                             ) as predict from tab1,test_linregr;

SELECT 5000000000
{code}

Note: the prediction needs to be run as a CTAS (CREATE TABLE AS SELECT) query. Otherwise it fails with the error "PGresult cannot support more than INT_MAX tuples". This is because, before printing to the console, the client (psql, via libpq) tries to store the entire query result in a PGresult object, which cannot hold more than INT_MAX rows.

{code}
test=# select madlib.linregr_predict(coef,
                              ARRAY[1, indep_var]
                             ) from tab1,test_linregr;

PGresult cannot support more than INT_MAX tuples
{code}
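
To spot-check the predictions without hitting this limit, one option is to query the CTAS output table and return only a small or aggregated result to the client. A minimal sketch, using the tab_pred_out table and predict column created above:
{code}
-- count(*) returns a single bigint row, so the client-side result stays tiny.
SELECT count(*) FROM tab_pred_out;

-- Look at a handful of predictions rather than all 5 billion.
SELECT predict FROM tab_pred_out LIMIT 10;
{code}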



> Prevent an "integer out of range" exception in linear regression train
> ----------------------------------------------------------------------
>
>                 Key: MADLIB-1460
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1460
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Linear Regression
>            Reporter: Daniel Daniel
>            Priority: Minor
>             Fix For: v1.18.0
>
>
> Linear regression training results in 2 output tables (*neither is optional*):
>  * The *primary* output table, which includes the computed coefficients.
>  * A *summary* output table, which contains a single row.
> +Scenario+
> Running linear regression training in PostgreSQL on an input table that has *more than 2^31 records* fails with an "*integer out of range*" exception, even if a grouping column is specified.
> +Source+
> *The summary table* has a column that stores *the total number of records* involved in the computation. The column's data type is a *signed integer*. However, the total number of records is computed as a *BIGINT*. Therefore, when the total number of records in the input table exceeds the maximum value of a signed integer (i.e., 2^31 - 1), an "integer out of range" exception is thrown.
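> The same failure can be reproduced in isolation, independent of MADlib. A minimal sketch in plain PostgreSQL, where 2147483648 is 2^31, one past the largest signed 4-byte integer:
> {code}
> -- Fits in a BIGINT.
> SELECT 2147483648::bigint;
> -- Narrowing the same value to INTEGER raises: ERROR:  integer out of range
> SELECT 2147483648::bigint::integer;
> {code}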
> +Solution+
> A simple solution is to change the data type of the column from a *signed integer* to a *BIGINT*.
> +Test+
> We have executed the linear regression training function with and without the suggested modification on an input table having between 2^31 and 2^32 records. Without the modification, an "integer out of range" exception was thrown. With the modification, training completed without error.