Posted to issues@madlib.apache.org by "Domino Valdano (Jira)" <ji...@apache.org> on 2020/05/11 20:02:00 UTC

[jira] [Updated] (MADLIB-1426) Without GPU's, FitMultipleModel fails in evaluate()

     [ https://issues.apache.org/jira/browse/MADLIB-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Domino Valdano updated MADLIB-1426:
-----------------------------------
    Description: 
Whenever I try to run {{madlib_keras_fit_multiple_model()}} on a system without GPUs, it always fails in {{evaluate()}}, complaining that device {{/gpu:0}} is not available.  This happens regardless of whether {{use_gpus=False}} or {{use_gpus=True}}.

My platform is OSX 10.14.1 with the latest version of MADlib (1.17.0) and gpdb5.  I think I've also seen this happen on CentOS with gpdb6, so I believe this bug affects all platforms, but I'm not entirely sure of that.  It may be specific to OSX or gpdb5.

The problem happens in {{internal_keras_eval_transition()}} in {{madlib_keras.py_in}}.
With {{use_gpus=False}}, it runs:

```
with K.tf.device(device_name):
    res = segment_model.evaluate(x_val, y_val)
```

I added a {{plpy.info}} statement to print {{device_name}} at the beginning of this function.  I also printed the value of {{use_gpus}} on master before training begins.  While {{use_gpus}} is {{False}}, {{device_name}} on the segments is set to {{/gpu:0}}.  This is the bug: it should be {{/cpu:0}}.
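A minimal sketch of the kind of guard the transition function presumably needs before entering {{K.tf.device(...)}}. The helper name and the {{gpu_index}} argument here are hypothetical, not MADlib's actual API:

```python
# Hypothetical sketch: derive the TensorFlow device string from the
# use_gpus flag instead of unconditionally passing '/gpu:0' to the segments.
def choose_device_name(use_gpus, gpu_index=0):
    """Pin evaluation to a GPU only when GPUs were actually requested."""
    if use_gpus:
        return '/gpu:{0}'.format(gpu_index)
    # CPU-only systems (like the one in this report) must land here.
    return '/cpu:0'
```

With a check like this, a GPU-less host would run {{evaluate()}} under {{/cpu:0}} instead of raising the {{InvalidArgumentError}} shown in the log.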


This is the resulting error message:
```
INFO:  00000: use_gpus = False
...
INFO:  00000: device_name = /gpu:0  (seg1 slice1 127.0.0.1:25433 pid=90300)
CONTEXT:  PL/Python function "internal_keras_eval_transition"
LOCATION:  PLy_output, plpython.c:4773
psql:../run_fit_mult_iris.sql:1: ERROR:  XX000: plpy.SPIError: tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation group_deps: Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device. (plpython.c:5038)  (seg0 slice1 127.0.0.1:25432 pid=90299) (plpython.c:5038)
DETAIL:
[[{{node group_deps}} = NoOp[_device="/device:GPU:0"](^loss/mul, ^metrics/acc/Mean)]]
Traceback (most recent call last):
  PL/Python function "internal_keras_eval_transition", line 6, in <module>
    return madlib_keras.internal_keras_eval_transition(**globals())
  PL/Python function "internal_keras_eval_transition", line 782, in internal_keras_eval_transition
  PL/Python function "internal_keras_eval_transition", line 1112, in evaluate
  PL/Python function "internal_keras_eval_transition", line 391, in test_loop
  PL/Python function "internal_keras_eval_transition", line 2714, in __call__
  PL/Python function "internal_keras_eval_transition", line 2670, in _call
  PL/Python function "internal_keras_eval_transition", line 2622, in _make_callable
  PL/Python function "internal_keras_eval_transition", line 1469, in _make_callable_from_options
  PL/Python function "internal_keras_eval_transition", line 1351, in _extend_graph
PL/Python function "internal_keras_eval_transition"
CONTEXT:  Traceback (most recent call last):
  PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
    fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
  PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper
  PL/Python function "madlib_keras_fit_multiple_model", line 216, in __init__
  PL/Python function "madlib_keras_fit_multiple_model", line 230, in fit_multiple_model
  PL/Python function "madlib_keras_fit_multiple_model", line 270, in train_multiple_model
  PL/Python function "madlib_keras_fit_multiple_model", line 302, in evaluate_model
  PL/Python function "madlib_keras_fit_multiple_model", line 417, in compute_loss_and_metrics
  PL/Python function "madlib_keras_fit_multiple_model", line 739, in get_loss_metric_from_keras_eval
PL/Python function "madlib_keras_fit_multiple_model"
LOCATION:  PLy_elog, plpython.c:5038
```

  was:
Whenever I try to run `madlib_keras_fit_multiple_model()` on a system without GPU's, it always fails in evaluate complaining that device `gpu0` is not available.  This happens regardless of whether use_gpus=False or use_gpus=True.

My platform is OSX 10.14.1 with latest version of madlib (1.17.0) and gpdb5.  I think I've also seen this happen on CentOS in gpdb6, so I believe this is a bug that affects all platforms, but not entirely sure of that.  Possibly specific to OSX or gpdb5.

The problem happens in `internal_keras_eval_transition()` in `madlib_keras.py_in`.
With `use_gpus=False`, it calls:

```
with K.tf.device(device_name):
        res = segment_model.evaluate(x_val, y_val)
```
with `device_name='/gpu0'`

I know this because I added a plpy.info statement to print `device_name` at the beginning of this function.  I also printed the value of `use_gpus` on master before training begins:
```
INFO:  00000: use_gpus = False
```
This is what the error looks like:
```
INFO:  00000: device_name = /gpu:0  (seg1 slice1 127.0.0.1:25433 pid=90300)
CONTEXT:  PL/Python function "internal_keras_eval_transition"
LOCATION:  PLy_output, plpython.c:4773
psql:../run_fit_mult_iris.sql:1: ERROR:  XX000: plpy.SPIError: tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation group_deps: Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device. (plpython.c:5038)  (seg0 slice1 127.0.0.1:25432 pid=90299) (plpython.c:5038)
DETAIL:
[[{{node group_deps}} = NoOp[_device="/device:GPU:0"](^loss/mul, ^metrics/acc/Mean)]]
Traceback (most recent call last):
  PL/Python function "internal_keras_eval_transition", line 6, in <module>
    return madlib_keras.internal_keras_eval_transition(**globals())
  PL/Python function "internal_keras_eval_transition", line 782, in internal_keras_eval_transition
  PL/Python function "internal_keras_eval_transition", line 1112, in evaluate
  PL/Python function "internal_keras_eval_transition", line 391, in test_loop
  PL/Python function "internal_keras_eval_transition", line 2714, in __call__
  PL/Python function "internal_keras_eval_transition", line 2670, in _call
  PL/Python function "internal_keras_eval_transition", line 2622, in _make_callable
  PL/Python function "internal_keras_eval_transition", line 1469, in _make_callable_from_options
  PL/Python function "internal_keras_eval_transition", line 1351, in _extend_graph
PL/Python function "internal_keras_eval_transition"
CONTEXT:  Traceback (most recent call last):
  PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
    fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
  PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper
  PL/Python function "madlib_keras_fit_multiple_model", line 216, in __init__
  PL/Python function "madlib_keras_fit_multiple_model", line 230, in fit_multiple_model
  PL/Python function "madlib_keras_fit_multiple_model", line 270, in train_multiple_model
  PL/Python function "madlib_keras_fit_multiple_model", line 302, in evaluate_model
  PL/Python function "madlib_keras_fit_multiple_model", line 417, in compute_loss_and_metrics
  PL/Python function "madlib_keras_fit_multiple_model", line 739, in get_loss_metric_from_keras_eval
PL/Python function "madlib_keras_fit_multiple_model"
LOCATION:  PLy_elog, plpython.c:5038
```


> Without GPU's, FitMultipleModel fails in evaluate()
> ---------------------------------------------------
>
>                 Key: MADLIB-1426
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1426
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Deep Learning
>            Reporter: Domino Valdano
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)