Posted to issues@madlib.apache.org by "Domino Valdano (JIRA)" <ji...@apache.org> on 2019/05/22 23:19:00 UTC
[jira] [Comment Edited] (MADLIB-1326) DL: Dev-check fails when keras_fit is called after array_scalar_mult

    [ https://issues.apache.org/jira/browse/MADLIB-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846274#comment-16846274 ] 

Domino Valdano edited comment on MADLIB-1326 at 5/22/19 11:18 PM:
------------------------------------------------------------------

We were able to reproduce this error in a Docker image running Ubuntu 16.04.6 LTS (Xenial Xerus), but we could not reproduce it on OSX (both with keras 2.2.4 and tensorflow 1.1.13).

On Ubuntu, the simplest repro we found was running dev-check on this _debug.sql_in_ file:
{code:java}
drop table if exists small_unbatched;
create table small_unbatched AS select ARRAY[1] AS x;

drop table if exists small_batched;
create table small_batched as select array_scalar_mult(x,1) as x from small_unbatched;

select dummy() FROM small_unbatched;
{code}
Interestingly, it does not crash if you add:
{code:java}
select dummy();
{code}
at the top of the test file, so that dummy() is called once before and once after array_scalar_mult. This looks suspiciously like another case of tensorflow holding on to resources until the process dies, even after the plpy function has returned. (This is discussed in some tensorflow forums, but unfortunately the tensorflow devs have said that it is expected behavior and they don't intend to fix it.)
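
One possible reason the call order matters, assuming PL/Python keeps a single interpreter alive for the lifetime of the backend process: Python caches imported modules in sys.modules, so a second {{import keras}} in the same process is only a cache lookup and does not re-run the library's initialization. The sketch below illustrates just that caching behavior, using the stdlib module colorsys as a stand-in for keras. It does not reproduce the segfault, and the helper name import_and_report is made up for illustration.

```python
import importlib
import sys


def import_and_report(name):
    """Import `name` and report whether this process had already loaded it."""
    already_loaded = name in sys.modules
    module = importlib.import_module(name)
    return module, already_loaded


# colorsys is a small stdlib stand-in for keras (an assumption made for this
# sketch). Drop any cached copy so the first call performs a real import.
sys.modules.pop("colorsys", None)

first_module, first_cached = import_and_report("colorsys")
second_module, second_cached = import_and_report("colorsys")

print(first_cached)                   # False: the module was loaded by this call
print(second_cached)                  # True: served from the sys.modules cache
print(first_module is second_module)  # True: the same module object is reused
```

If this is what is happening, the early dummy() call does the real import while the process is still in a state where it succeeds, and the later call never repeats that work.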


was (Author: dvaldano):
We were able to reproduce this error in a Docker image running Ubuntu 16.04.6 LTS (Xenial Xerus), but we could not reproduce it on OSX (both with keras 2.2.4 and tensorflow 1.1.13).

On Ubuntu, the simplest repro we found was running dev-check on this _debug.sql_in_ file:
{code:java}
drop table if exists small_unbatched;
create table small_unbatched (x real[],y int);
insert into small_unbatched (x,y) values(ARRAY[1,2,3],4);
insert into small_unbatched (x,y) values(ARRAY[4,1,9],5);
insert into small_unbatched (x,y) values(ARRAY[8,3,1],1);
insert into small_unbatched (x,y) values(ARRAY[2,2,0],2);
insert into small_unbatched (x,y) values(ARRAY[1,7,6],9);
drop table if exists small_batched;
create table small_batched as select array_scalar_mult(x,5::real) as x from small_unbatched;
select dummy();
{code}
where the dummy function is added to _module/deep_learning/madlib_keras.sql_in_

Interestingly, it does not crash if you add:
{code:java}
select dummy();
{code}
at the top of the test file, so that dummy() is called once before and once after array_scalar_mult. This looks suspiciously like another case of tensorflow holding on to resources until the process dies, even after the plpy function has returned. (This is discussed in some tensorflow forums, but unfortunately the tensorflow devs have said that it is expected behavior and they don't intend to fix it.)

> DL: Dev-check fails when keras_fit is called after array_scalar_mult
> --------------------------------------------------------------------
>
>                 Key: MADLIB-1326
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1326
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Deep Learning
>            Reporter: Nandish Jayaram
>            Priority: Major
>             Fix For: v1.16
>
>
> In madlib_keras dev-check, we create the input data to fit using {{minibatch_preprocessor_dl()}}. This function internally calls {{array_scalar_mult()}}. If we call either of these functions followed by {{madlib_keras_fit()}}, then the following error pops up:
> {code:java}
> NOTICE:  Releasing segworker groups to finish aborting the transaction.
> ERROR:  could not connect to segment: initialization of segworker group failed (cdbgang.c:237)
> {code}
> Digging further into Postgres logs suggests that there was a segmentation fault, and it seems like it's happening the moment {{import keras}} is called in {{madlib_keras_fit()}}.
> This issue was first noticed while working on MADLIB-1304 (which was closed with [this commit|https://github.com/apache/madlib/commit/241074ae68cb8e15437f98abf1c2e3c7bb3471ae]), as the comment [in this line|https://github.com/apache/madlib/commit/241074ae68cb8e15437f98abf1c2e3c7bb3471ae#diff-f89c193e163bfe0e7e3821445e38fa97R29] suggests. At that time the failure happened on Greenplum, and Postgres did not yet support deep learning. It was noticed again while working on MADLIB-1311, which added Postgres support. At that point, the failure happened on Postgres and there were no failures on Greenplum.
> While working on MADLIB-1311, we tried a couple of things and observed an odd behavior. We created a dummy function:
> {code:java}
> create function dummy()
> returns void as
> $$
> import keras
> $$
> language plpythonu;
> {code}
> If we ran {{select dummy()}} *before* running {{minibatch_preprocessor_dl()}} or {{array_scalar_mult()}}, then the whole dev-check passes. But running the same function right after calling either of those functions causes a failure.
> So it looks like any UDF that calls {{import keras}} *must* be run *before* calling {{minibatch_preprocessor_dl()}} or {{array_scalar_mult()}}.
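
The ordering constraint described in the issue could be enforced from a client script along the following lines. This is only a sketch under assumptions: it presumes a DB-API style cursor with an execute() method, that every statement runs on the same connection, and that the dummy() function from the issue has already been created. The names run_with_keras_warmup and RecordingCursor are hypothetical; RecordingCursor only records SQL so the ordering can be inspected without a database.

```python
def run_with_keras_warmup(cursor, statements, warmup_sql="select dummy();"):
    """Run the keras-importing UDF first, then the remaining statements in order."""
    cursor.execute(warmup_sql)
    for statement in statements:
        cursor.execute(statement)


class RecordingCursor:
    """Hypothetical stand-in for a DB-API cursor; it only records the SQL it is given."""

    def __init__(self):
        self.executed = []

    def execute(self, sql):
        self.executed.append(sql)


cursor = RecordingCursor()
run_with_keras_warmup(cursor, [
    "create table small_batched as "
    "select array_scalar_mult(x,1) as x from small_unbatched;",
    "select dummy() FROM small_unbatched;",
])

# The warm-up call is issued before anything that touches array_scalar_mult.
for sql in cursor.executed:
    print(sql)
```

With a real driver, the RecordingCursor would be replaced by a cursor on an open connection. Whether a warm-up on the master is enough on Greenplum, where segments run in separate processes, is not something this sketch establishes.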


