Posted to issues@madlib.apache.org by "Domino Valdano (Jira)" <ji...@apache.org> on 2020/07/17 00:00:00 UTC

[jira] [Updated] (MADLIB-1443) Crash in fit_multiple when any model reaches loss=nan

     [ https://issues.apache.org/jira/browse/MADLIB-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Domino Valdano updated MADLIB-1443:
-----------------------------------
    Description: 
There's a crash that can happen in {{fit_multiple}} (and probably in {{fit}} as well, though I haven't tested it) when the loss becomes nan for a model.  The compile params I used were:

$$loss='categorical_crossentropy',optimizer='SGD(lr=0.05, momentum=1.1)',metrics=['accuracy']$$

Clearly this was not a great choice for the momentum hyperparameter, but Keras accepts it and trains all the way through with no errors or exceptions.  The problem is that the loss becomes infinite (or undefined?) at some point.  All 8 models trained for 10 hours and printed out their results, and then {{madlib_keras_fit_multiple}} crashed while trying to write out the final info table:

Training set after iteration 1:
 mst_key=7: metric=0.446168005466, loss=2.39643478394
 mst_key=12: metric=0.00999999977648, loss=nan
 mst_key=11: metric=0.165068000555, loss=4.0407166481

...

Validation set after iteration 1:
 mst_key=7: metric=0.359100013971, loss=2.89618015289
 mst_key=12: metric=0.00999999977648, loss=nan
 mst_key=11: metric=0.151299998164, loss=4.0829615593

...

CONTEXT: PL/Python function "madlib_keras_fit_multiple_model"
psql:run_fit_mult100.sql:14: ERROR: spiexceptions.UndefinedColumn: column "nan" does not exist

LINE 4: training_loss_final = nan,
                                 ^

QUERY:
 UPDATE places100_mult_model_444_july7_info SET
 training_metrics_final = 0.00999999977648,
 training_loss_final = nan,
 metrics_elapsed_time = ARRAY[33260.02720808983],
 training_metrics = ARRAY[0.009999999776482582],
 training_loss = ARRAY[nan]
 WHERE mst_key = 12

CONTEXT: Traceback (most recent call last):

PL/Python function "madlib_keras_fit_multiple_model", line 23, in <module>
 fit_obj = madlib_keras_fit_multiple_model.FitMultipleModel(**globals())
 PL/Python function "madlib_keras_fit_multiple_model", line 42, in wrapper
 PL/Python function "madlib_keras_fit_multiple_model", line 195, in __init__
 PL/Python function "madlib_keras_fit_multiple_model", line 543, in insert_info_table
 PL/Python function "madlib_keras_fit_multiple_model", line 539, in update_info_table
 PL/Python function "madlib_keras_fit_multiple_model"
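
For what it's worth, the direct cause looks like how the float gets rendered into the query text: Python turns float('nan') into the bare token nan, which Postgres parses as a column reference, while the quoted literal 'NaN' (cast to double precision) is accepted for float8 columns. A minimal sketch of the difference, outside of MADlib (the table name below is made up):

# Illustration only -- not MADlib code; "model_info" is a made-up table name.
loss = float('nan')

# Roughly what seems to happen today: formatting the float produces the bare
# token "nan", which Postgres tries to resolve as a column name.
broken = "UPDATE model_info SET training_loss_final = {0} WHERE mst_key = 12".format(loss)
# -> ERROR:  column "nan" does not exist

# Postgres does accept NaN in double precision columns, but only as a quoted literal.
fixed = "UPDATE model_info SET training_loss_final = 'NaN'::double precision WHERE mst_key = 12"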

So even though most of the models trained fine, the error rolled back all of the output, and they all have to be trained from scratch again. Maybe while we're at it, we should look for other places where _nan_ might occur.
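
One possible direction, just a rough sketch (the helper names below are hypothetical, not existing MADlib functions): render non-finite floats as quoted Postgres literals before they get formatted into the info-table UPDATE, so a single nan loss can't abort the whole run:

import math

def sql_float_literal(value):
    # NaN and +/-Infinity must be passed to Postgres as quoted strings
    # ('NaN', 'Infinity', '-Infinity'); a bare nan token is parsed as a
    # column name, which is exactly the error above.
    if value is None:
        return "NULL"
    if math.isnan(value):
        return "'NaN'::double precision"
    if math.isinf(value):
        return "'{0}Infinity'::double precision".format('-' if value < 0 else '')
    return repr(float(value))

def sql_float_array_literal(values):
    # Same idea for the array-valued columns (training_loss, training_metrics, ...).
    return "ARRAY[{0}]::double precision[]".format(
        ", ".join(sql_float_literal(v) for v in values))

# e.g. sql_float_literal(float('nan'))  -> "'NaN'::double precision"
#      sql_float_array_literal([0.00999, float('nan')])
#      -> "ARRAY[0.00999, 'NaN'::double precision]::double precision[]"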



> Crash in fit_multiple when any model reaches loss=nan
> -----------------------------------------------------
>
>                 Key: MADLIB-1443
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1443
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Deep Learning
>            Reporter: Domino Valdano
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)