Posted to user@madlib.apache.org by LUYAO CHEN <lu...@hotmail.com> on 2018/07/23 19:34:01 UTC
PostgreSQL crashed during random forest training
Dear user group,
I ran into a problem when training grouped data with random forest (300 features). A small data set was fine (e.g., 56K instances in 56 groups), but training failed for 240K instances in 250 groups. Postgres forcibly disconnected the session after showing the message below in verbose mode:
NOTICE: view "__madlib_temp_60124179_1532371657_7130296__" will be a temporary view
NOTICE: sql_create_empty_result_table:
CREATE TABLE analysis.dx_rf_train_output_1 (
gid integer,
sample_id integer,
tree madlib.bytea8);
NOTICE: sql_refresh_training_pois_cnt:
TRUNCATE TABLE __madlib_temp_91155016_1532371657_5660955__ CASCADE;
INSERT INTO __madlib_temp_91155016_1532371657_5660955__
SELECT
*,
madlib.poisson_random(1) AS poisson_count
FROM
(
SELECT
*,
0.::double precision AS __madlib_temp_14328459_1532371657_7318497__
FROM analysis.dxpredict_svec
) subq
WHERE __madlib_temp_14328459_1532371657_7318497__ < 1
NOTICE:
src_cnt: 158360,
oob_cnt: 92418,
dup_cnt: 250617.
NOTICE: Started tree building for all groups
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
PostgreSQL did not capture a detailed log even after I set log_statement to "all":
2018-07-23 14:47:50.229 EDT [1090] LOG: server process (PID 1980) was terminated by signal 11: Segmentation fault
2018-07-23 14:47:50.229 EDT [1090] DETAIL: Failed process was running: SELECT madlib.forest_train('analysis.dxpredict_svec',
'analysis.dx_rf_train_output_1',
'rowid',
'positive',
'*',
'rowid,positive,case_icd',
'case_icd',
30::integer,
30::integer,
TRUE::boolean,
1::integer,
10::integer,
3::integer,
1::integer,
10::integer,
NULL,
TRUE
);
2018-07-23 14:47:50.229 EDT [1090] LOG: terminating any other active server processes
2018-07-23 14:47:50.229 EDT [1401] WARNING: terminating connection because of crash of another server process
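The three counts in the verbose NOTICE output above (src_cnt, oob_cnt, dup_cnt) can be sanity-checked with plain arithmetic. The sketch below assumes, without confirmation from the MADlib source, that src_cnt is the number of rows drawn at least once by madlib.poisson_random(1), oob_cnt is the number of rows drawn zero times, and dup_cnt is the sum of the drawn counts. Under that reading the table holds 250,778 rows, and the out-of-bag share should be close to exp(-1), the probability that a Poisson(1) draw is zero.

```sql
-- Counts copied from the NOTICE output; their interpretation is an assumption.
WITH counts AS (
    SELECT 158360::numeric AS src_cnt,
           92418::numeric  AS oob_cnt,
           250617::numeric AS dup_cnt
)
SELECT src_cnt + oob_cnt                       AS total_rows,
       round(oob_cnt / (src_cnt + oob_cnt), 4) AS observed_oob_fraction,
       round(exp(-1.0), 4)                     AS expected_oob_fraction,  -- P(Poisson(1) = 0)
       round(dup_cnt / (src_cnt + oob_cnt), 4) AS mean_poisson_count      -- should be near 1
FROM counts;
```

If the observed and expected fractions agree (both come out at roughly 0.37 here), the sampling step behaved normally, which would point at the tree-building step that follows it rather than at the bootstrap.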
Re: PostgreSQL crashed during random forest training
Posted by Orhan Kislal <ok...@pivotal.io>.
Hi Luyao Chen,
I was wondering if you are still experiencing this issue? If not, we will
close the JIRA. Otherwise, it would be helpful to add your system
information to the JIRA (OS, database and MADlib version).
Thanks,
Orhan Kislal
On Thu, Oct 18, 2018 at 10:59 AM Orhan Kislal <ok...@pivotal.io> wrote:
> Hi Luyao Chen,
>
> I started looking into this bug. Currently, I am running it on OSX 10.13,
> Postgres 10, MADlib 1.15.1 without issue. Could you share your system
> information? OS, database and MADlib version would be most helpful.
>
> Thanks,
>
> Orhan
>
> On Sat, Jul 28, 2018 at 2:28 PM Frank McQuillan <fm...@pivotal.io>
> wrote:
>
>> thanks, I added this info to the jira
>>
>> On Fri, Jul 27, 2018 at 7:23 AM, LUYAO CHEN <lu...@hotmail.com>
>> wrote:
>>
>>> The similar problem happened in decision tree. ( with the same set of
>>> data ).
>>>
>>> I got the error (dmesg) that "
>>> [ 4289.020198] postmaster[1840]: segfault at 0 ip 00007f17cd5f4ea3 sp
>>> 00007ffdf867dd50 error 4 in libmadlib.so[7f17cd2ec000+64a000]"
>>>
>>>
>>>
>>>
>>> Regards,
>>> Luyao Chen
>>>
>>> ------------------------------
>>> *From:* Frank McQuillan <fm...@pivotal.io>
>>> *Sent:* Tuesday, July 24, 2018 2:13 PM
>>>
>>> *To:* user@madlib.apache.org
>>> *Subject:* Re: PostgreSQL crashed during random forest training
>>>
>>> Thank you, we created a JIRA to investigate this
>>> https://issues.apache.org/jira/browse/MADLIB-1257
>>>
>>> On Tue, Jul 24, 2018 at 10:31 AM, LUYAO CHEN <lu...@hotmail.com>
>>> wrote:
>>>
>>> Another observation - It crashed with 84 groups and 73K instance. In
>>> this scenario, I shall have pretty enough memory and disk.
>>>
>>> Also seems during the increasing of the groups, it used a lot of
>>> temporary disk space when the data is over certain groups.
>>>
>>>
>>> Regards,
>>>
>>> ------------------------------
>>> *From:* LUYAO CHEN <lu...@hotmail.com>
>>> *Sent:* Tuesday, July 24, 2018 9:15 AM
>>> *To:* user@madlib.apache.org
>>> *Subject:* Re: PostgreSQL crashed during random forest training
>>>
>>>
>>> Hi Frank,
>>>
>>>
>>> You may refer to the enclosed dump data for the training table, and I
>>> used the below SQL for random forest.
>>>
>>>
>>> DROP TABLE IF EXISTS train_output, train_output_group,
>>> train_output_summary;
>>> SELECT madlib.forest_train('train_data', -- source table
>>> 'train_output', -- output model table
>>> 'rowid', -- id column
>>> 'positive', -- response
>>> 'features', -- features
>>> NULL, -- exclude columns
>>> 'caseid', -- grouping columns
>>> 30::integer, -- number of trees
>>> 30::integer, -- number of random
>>> features
>>> TRUE::boolean, -- variable importance
>>> 1::integer, -- num_permutations
>>> 10::integer, -- max depth
>>> 3::integer, -- min split
>>> 1::integer, -- min bucket
>>> 10::integer, -- number of splits per
>>> continuous variable
>>> NULL, -- null handling parameter
>>> TRUE -- verbose
>>> );
>>>
>>> Regards,
>>> Luyao Chen
>>>
>>> ------------------------------
>>> *From:* Frank McQuillan <fm...@pivotal.io>
>>> *Sent:* Monday, July 23, 2018 4:59 PM
>>> *To:* user@madlib.apache.org
>>> *Subject:* Re: PostgreSQL crashed during random forest training
>>>
>>> Hi Luyao Chen
>>>
>>> It's hard to debug just looking at that trace.
>>>
>>> 1) If you increase your data size to more than 56K instances in 56
>>> groups, does it work? e.g., double it to approx 112K instances and 112
>>> groups.
>>>
>>> 2) Is it possible of you could share a sample of your data so that we
>>> could try? If not, perhaps anonymize a sample of the data so that we can
>>> multiply it out to make it bigger? Then we could take a closer look.
>>>
>>> Frank
>>>
>>
Re: PostgreSQL crashed during random forest training
Posted by Orhan Kislal <ok...@pivotal.io>.
Hi Luyao Chen,
I started looking into this bug. Currently, I am running it on OSX 10.13,
Postgres 10, MADlib 1.15.1 without issue. Could you share your system
information? OS, database and MADlib version would be most helpful.
Thanks,
Orhan
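The version details requested above can be collected from a psql session. This is a minimal sketch that assumes MADlib is installed in the schema named madlib, as the calls elsewhere in this thread suggest.

```sql
-- PostgreSQL version string; it also reports the platform and compiler the server was built on.
SELECT version();

-- MADlib version and build information (assumes the install schema is "madlib").
SELECT madlib.version();
```

The output of both statements can be pasted into the JIRA as-is.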
Re: PostgreSQL crashed during random forest training
Posted by Frank McQuillan <fm...@pivotal.io>.
Thanks, I added this info to the JIRA.
Re: PostgreSQL crashed during random forest training
Posted by LUYAO CHEN <lu...@hotmail.com>.
A similar problem occurred with decision tree (using the same data set).
dmesg reported the following error:
[ 4289.020198] postmaster[1840]: segfault at 0 ip 00007f17cd5f4ea3 sp 00007ffdf867dd50 error 4 in libmadlib.so[7f17cd2ec000+64a000]
Regards,
Luyao Chen
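The dmesg line above can be decoded a little further. "segfault at 0" means the faulting address was 0, which is typical of a null pointer dereference. Subtracting the library's load address (7f17cd2ec000) from the instruction pointer (00007f17cd5f4ea3) gives the offset of the faulting instruction inside libmadlib.so. The error code is a bit mask on x86: bit 0 is set for a protection violation (clear for a not-present page), bit 1 for a write, bit 2 for user mode. The sketch below does this arithmetic in PostgreSQL; the column aliases are only illustrative.

```sql
-- Offset of the faulting instruction within libmadlib.so (ip minus load address), in hex.
SELECT to_hex(x'7f17cd5f4ea3'::bigint - x'7f17cd2ec000'::bigint) AS lib_offset;

-- Decode the x86 page-fault error code 4 into its three low bits.
SELECT (4 & 1) <> 0 AS protection_violation,  -- false: the page was not present
       (4 & 2) <> 0 AS was_write,             -- false: the access was a read
       (4 & 4) <> 0 AS user_mode;             -- true: the fault occurred in user mode
```

So the crash was a user-mode read through a null pointer inside libmadlib.so. If the library was built with debug symbols, the computed offset can be given to a tool such as addr2line (with -e pointing at the same libmadlib.so build) to find the source location.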
Re: PostgreSQL crashed during random forest training
Posted by Frank McQuillan <fm...@pivotal.io>.
Thank you, we created a JIRA to investigate this
https://issues.apache.org/jira/browse/MADLIB-1257
Re: PostgreSQL crashed during random forest training
Posted by LUYAO CHEN <lu...@hotmail.com>.
Another observation: it also crashed with 84 groups and 73K instances. In this scenario, I should have plenty of memory and disk.
It also seems that, as the number of groups increases, training uses a lot of temporary disk space once the data exceeds a certain number of groups.
Regards,
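The temporary disk usage mentioned above can be measured rather than estimated. A minimal sketch using PostgreSQL's cumulative statistics follows: run it before and after a training attempt and compare the two results. Note that the counters are cumulative for the whole database, so other sessions also contribute to them.

```sql
-- Temporary files and bytes written by queries in the current database so far.
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_size
FROM pg_stat_database
WHERE datname = current_database();

-- Per-operation memory budget; sorts and hashes that exceed it spill to temporary files.
SHOW work_mem;
```

A large jump in temp_bytes between the two runs would confirm that the grouped training spills to disk, although it would not by itself explain the segmentation fault.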
Re: PostgreSQL crashed during random forest training
Posted by LUYAO CHEN <lu...@hotmail.com>.
Hi Frank,
Please see the enclosed dump of the training table. I used the SQL below to train the random forest.
DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
SELECT madlib.forest_train('train_data', -- source table
'train_output', -- output model table
'rowid', -- id column
'positive', -- response
'features', -- features
NULL, -- exclude columns
'caseid', -- grouping columns
30::integer, -- number of trees
30::integer, -- number of random features
TRUE::boolean, -- variable importance
1::integer, -- num_permutations
10::integer, -- max depth
3::integer, -- min split
1::integer, -- min bucket
10::integer, -- number of splits per continuous variable
NULL, -- null handling parameter
TRUE -- verbose
);
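Before rerunning the call above, it can help to confirm how many instances and groups the source table actually holds. A minimal sketch, assuming only the table and column names from the call above (train_data, caseid); the output aliases are illustrative:

```sql
-- Total instances and number of distinct groups in the training table.
SELECT count(*)               AS instances,
       count(DISTINCT caseid) AS n_groups
FROM train_data;

-- The ten largest groups, to see whether a few groups dominate the data.
SELECT caseid, count(*) AS instances
FROM train_data
GROUP BY caseid
ORDER BY instances DESC
LIMIT 10;
```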
Regards,
Luyao Chen
________________________________
From: Frank McQuillan <fm...@pivotal.io>
Sent: Monday, July 23, 2018 4:59 PM
To: user@madlib.apache.org
Subject: Re: PostgreSQL crashed during random forest training
Hi Luyao Chen,
It's hard to debug from that trace alone.
1) If you increase your data size beyond 56K instances in 56 groups, does it still work? For example, double it to approximately 112K instances in 112 groups.
2) Could you share a sample of your data so that we can try it? If not, perhaps you could anonymize a sample so that we can multiply it out to make it bigger. Then we could take a closer look.
Frank
Re: PostgreSQL crashed during random forest training
Posted by Frank McQuillan <fm...@pivotal.io>.
Hi Luyao Chen,
It's hard to debug from that trace alone.
1) If you increase your data size beyond 56K instances in 56 groups,
does it still work? For example, double it to approximately 112K
instances in 112 groups.
2) Could you share a sample of your data so that we can try it? If not,
perhaps you could anonymize a sample so that we can multiply it out to
make it bigger. Then we could take a closer look.
Frank
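One way to run the intermediate-size test from point 1 is to copy a fixed number of groups into a separate table and train on that. This is a minimal sketch, not a tested procedure: it assumes the table and column names from the forest_train call posted in this thread (train_data, caseid), and train_data_subset is a hypothetical table name. The resulting number of instances depends on how large the selected groups are.

```sql
-- Hypothetical table name, used only for this test.
DROP TABLE IF EXISTS train_data_subset;

-- Keep every row belonging to the first 112 groups (ordered by caseid).
CREATE TABLE train_data_subset AS
SELECT t.*
FROM train_data AS t
WHERE t.caseid IN (
    SELECT caseid
    FROM train_data
    GROUP BY caseid
    ORDER BY caseid
    LIMIT 112
);
```

The same forest_train call can then be pointed at train_data_subset, and the LIMIT raised or lowered to narrow down the size at which the crash starts.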