[GitHub] alexmosc opened a new issue #9358: Why does running 1 round of MXNet model training produce Train-mse=NaN?

   If I run just 1 round of MXNet model training with `mx.model.FeedForward.create`, I get NaN as the training error. Is this intentional?
   > sessionInfo()
   R version 3.4.0 (2017-04-21)
   Platform: x86_64-w64-mingw32/x64 (64-bit)
   Running under: Windows >= 8 x64 (build 9200)
   Matrix products: default
   [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
   [5] LC_TIME=English_United States.1252    
   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base     
   other attached packages:
    [1] mxnet_0.10.1         pryr_0.1.3           quantregForest_1.3-6 RColorBrewer_1.1-2   randomForest_4.6-12  ggjoy_0.4.0          ggridges_0.4.1      
    [8] DT_0.2               caret_6.0-77         lattice_0.20-35      FSelector_0.21       scales_0.5.0         nnet_7.3-12          infotheo_1.2.0      
   [15] cluster_2.0.6        forecast_8.2         gridExtra_2.3        kableExtra_0.6.1     knitr_1.17           rmarkdown_1.8        markdown_0.8        
   [22] TTR_0.23-2           tseries_0.10-42      ggplot2_2.2.1        magrittr_1.5         data.table_1.10.4-3 
   loaded via a namespace (and not attached):
    [1] colorspace_1.3-2   class_7.3-14       rprojroot_1.2      rstudioapi_0.7     DRR_0.0.2          prodlim_1.6.1      lubridate_1.7.1    xml2_1.1.1        
    [9] codetools_0.2-15   splines_3.4.0      mnormt_1.5-5       robustbase_0.92-8  RcppRoll_0.2.2     jsonlite_1.5       entropy_1.2.1      rJava_0.9-9       
   [17] broom_0.4.3        ddalpha_1.3.1      kernlab_0.9-25     sfsmisc_1.1-1      DiagrammeR_0.9.2   readr_1.1.1        compiler_3.4.0     httr_1.3.1        
   [25] backports_1.1.1    assertthat_0.2.0   Matrix_1.2-9       lazyeval_0.2.1     visNetwork_2.0.1   htmltools_0.3.6    tools_3.4.0        bindrcpp_0.2      
   [33] igraph_1.1.2       gtable_0.2.0       glue_1.2.0         reshape2_1.4.2     dplyr_0.7.4        Rcpp_0.12.14       rgexf_0.15.3       fracdiff_1.4-2    
   [41] nlme_3.1-131       iterators_1.0.8    psych_1.7.8        lmtest_0.9-35      timeDate_3042.101  gower_0.1.2        stringr_1.2.0      rvest_0.3.2       
   [49] RWekajars_3.9.1-5  XML_3.98-1.9       DEoptimR_1.0-8     MASS_7.3-47        zoo_1.8-0          ipred_0.9-6        hms_0.4.0          parallel_3.4.0    
   [57] quantmod_0.4-11    curl_3.0           downloader_0.4     rpart_4.1-11       stringi_1.1.6      Rook_1.1-1         foreach_1.4.3      RWeka_0.4-36      
   [65] lava_1.5.1         rlang_0.1.4        pkgconfig_2.0.1    evaluate_0.10.1    purrr_0.2.4        bindr_0.1          recipes_0.1.1      htmlwidgets_0.9   
   [73] CVST_0.2-1         tidyselect_0.2.3   plyr_1.8.4         R6_2.2.2           dimRed_0.1.0       foreign_0.8-67     withr_2.1.0        xts_0.10-0        
   [81] survival_2.41-3    tibble_1.3.4       viridis_0.4.0      grid_3.4.0         influenceR_0.1.0   ModelMetrics_1.1.0 digest_0.6.12      tidyr_0.7.2       
   [89] brew_1.0-6         stats4_3.4.0       munsell_0.4.3      viridisLite_0.2.0  quadprog_1.5-5
   Start training with 1 devices
   [1] Train-mse=NaN
   hidden_u_1 <- 10
   activ_hidden_1 <- 'tanh'
   hidden_u_2 <- 1
   learn_rate <- 0.001
   initializer <- mx.init.uniform(1)
   optimizer <- 'rmsprop' #sgd
   loss <- mx.metric.mse
   device.cpu <- mx.cpu()
   mini_batch <- 64 #8
   rounds <- 1 #2
   ## data symbols
   nn_data <- mx.symbol.Variable('data')
   nn_label <- mx.symbol.Variable('label')
   ## first fully connected layer
   fc1 <- mx.symbol.FullyConnected(data = nn_data, num_hidden = hidden_u_1)
   activ1 <- mx.symbol.Activation(data = fc1, act.type = activ_hidden_1)
   ## second fully connected layer
   fc2 <- mx.symbol.FullyConnected(data = activ1, num_hidden = hidden_u_2)
   q_func <- mx.symbol.LinearRegressionOutput(data = fc2, label = nn_label, name = 'regr')
   # random training data: one row per sample, 10 features
   train.x <- matrix(rnorm(mini_batch * 10, 0, 1), ncol = 10)
   train.y <- rnorm(mini_batch, 0, 1)
   # initialize and train NN
   nn_model <- mx.model.FeedForward.create(
        symbol = q_func,
        X = train.x,
        y = train.y,
        ctx = device.cpu,
        num.round = rounds,
        array.batch.size = mini_batch, #60
        optimizer = optimizer,
        eval.metric = loss,
        learning.rate = learn_rate,
        initializer = initializer
   )
   ## What have you tried to solve it?
   If I use 2 or more rounds, or a mini-batch size smaller than the number of samples in my dataset, I get a numeric value for the training error.
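   For concreteness, here is a minimal sketch of those two variants. It assumes the objects from the example above (`q_func`, `train.x`, `train.y`, `device.cpu`, `optimizer`, `loss`, `learn_rate`, `initializer`) are already defined in the session. `train_variant` is a hypothetical helper name used only for illustration, not part of mxnet.

   ```r
   library(mxnet)

   # Hypothetical helper (not part of mxnet): reruns the training call from the
   # example above with a given number of rounds and batch size.
   # Assumes q_func, train.x, train.y, device.cpu, optimizer, loss, learn_rate
   # and initializer already exist in the session.
   train_variant <- function(rounds, batch_size) {
     mx.model.FeedForward.create(
       symbol = q_func,
       X = train.x,
       y = train.y,
       ctx = device.cpu,
       num.round = rounds,
       array.batch.size = batch_size,
       optimizer = optimizer,
       eval.metric = loss,
       learning.rate = learn_rate,
       initializer = initializer
     )
   }

   # Variant 1: two rounds, batch size equal to the dataset size (64 samples)
   m1 <- train_variant(rounds = 2, batch_size = 64)

   # Variant 2: one round, batch size smaller than the dataset (32 < 64)
   m2 <- train_variant(rounds = 1, batch_size = 32)
   ```

   Only `num.round` and `array.batch.size` differ from the failing call, so the NaN appears to be specific to the combination of a single round and a single batch covering the whole dataset.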

