Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/09/25 22:27:27 UTC

[GitHub] [incubator-mxnet] salmanmashayekh commented on issue #16230: Loading Sagemaker NTM Artifacts

URL: https://github.com/apache/incubator-mxnet/issues/16230#issuecomment-535248289
 
 
   Thank you both @lanking520 and @ThomasDelteil!
   
   Here is a MWE:
   
    ```
    import boto3
    import numpy as np
    import s3fs
    import sagemaker
    import sagemaker.amazon.common as smac
    from sagemaker import get_execution_role
    from sagemaker.amazon.amazon_estimator import get_image_uri
    from sklearn.feature_extraction.text import CountVectorizer

    fs = s3fs.S3FileSystem()
   
   
   # create the document corpus
   docs = [
       "Python is an interpreted, high-level, general-purpose programming language.",
       "Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.",
       "Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.",
       "Python is dynamically typed and garbage-collected.",
       "It supports multiple programming paradigms, including procedural, object-oriented, and functional programming.",
       "Python is often described as a batteries included language due to its comprehensive standard library.",
       "Python was conceived in the late 1980s as a successor to the ABC language.",
       "Python 2.0, released 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles.",
       "Python 3.0, released 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3.",
       "Due to concern about the amount of code written for Python 2, support for Python 2.7 (the last release in the 2.x series) was extended to 2020.",
       "Language developer Guido van Rossum shouldered sole responsibility for the project until July 2018 but now shares his leadership as a member of a five-person steering council.",
       "The Python 2 language, i.e. Python 2.7.x, is sunsetting on January 1, 2020, and the Python team of volunteers will not fix security issues, or improve in other ways after that date.",
       "With the end-of-life, only Python 3.6.x and later, e.g.",
       "Python 3.8 which should be released in October 2019 (currently in beta), will be supported.",
       "Python interpreters are available for many operating systems.",
       "A global community of programmers develops and maintains CPython, an open source reference implementation.",
       "A non-profit organization, the Python Software Foundation, manages and directs resources for Python and CPython development.",
   ]
   DOCS_NO = len(docs)
   
    # fit the vectorizer and transform the docs
   count_vectorizer = CountVectorizer(input="content", max_df=0.8, min_df=5, max_features=10)
   transformed_docs = count_vectorizer.fit_transform(docs)
   transformed_docs = transformed_docs.astype(np.float32) # transformed_docs is a 17x8 sparse matrix
   
   # s3 params
   BUCKET = "my-bucket"
   PATH = "mxnet-mwe"
   TRAIN_DIR  = f"{PATH}/train"
   AUX_DIR    = f"{PATH}/auxiliary"
   OUTPUT_DIR = f"{PATH}/output"
   
   # store the transformed docs into s3
   with fs.open(f"s3://{BUCKET}/{TRAIN_DIR}/transformed_docs.protobuf", "wb") as buf:
       smac.write_spmatrix_to_sparse_tensor(buf, transformed_docs)
       
   # store the vocab into s3
   vocab_dict = {i:w for w,i in count_vectorizer.vocabulary_.items()}
   vocab = [str(vocab_dict[i]) for i in range(len(vocab_dict))]
   with fs.open(f"s3://{BUCKET}/{AUX_DIR}/vocab.txt", "wb") as f:
       f.write(("\n".join(vocab)).encode())
   
   # get the training image, role, and session
   container = get_image_uri(boto3.Session().region_name, 'ntm')
   role = get_execution_role()
   session = sagemaker.Session()
   
   # instantiate the estimator
   ntm = sagemaker.estimator.Estimator(
       container,
       role, 
       train_instance_count = 1, 
       train_instance_type = "ml.p2.xlarge",
       output_path = f"s3://{BUCKET}/{OUTPUT_DIR}",
       sagemaker_session = session,
   )
   
   # set hyper-params
   ntm.set_hyperparameters(
       num_topics = 3,
       feature_dim = len(vocab),
   )
   
   # set training and auxiliary channels
   train_channel = sagemaker.session.s3_input(
       f"s3://{BUCKET}/{TRAIN_DIR}/transformed_docs.protobuf",
       content_type = 'application/x-recordio-protobuf'
   )
   auxiliary_channel = sagemaker.session.s3_input(
       f"s3://{BUCKET}/{AUX_DIR}/vocab.txt",
       content_type = 'text/plain'
   )
   
   # fit the model
   ntm.fit(
       {
           'train': train_channel,
           'auxiliary': auxiliary_channel,
       }
   )
   ```
   
   The training log is as follows:
   
   ```
   2019-09-25 21:56:29 Starting - Starting the training job...
   2019-09-25 21:56:35 Starting - Launching requested ML instances......
   2019-09-25 21:57:34 Starting - Preparing the instances for training......
   2019-09-25 21:58:45 Downloading - Downloading input data...
   2019-09-25 21:59:04 Training - Downloading the training image..Docker entrypoint called with argument(s): train
   /opt/amazon/lib/python2.7/site-packages/pandas/util/nosetester.py:13: DeprecationWarning: Importing from numpy.testing.nosetester is deprecated, import from numpy.testing instead.
     from numpy.testing import nosetester
   [09/25/2019 21:59:36 INFO 140704525621056] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/default-input.json: {u'num_patience_epochs': u'3', u'clip_gradient': u'Inf', u'encoder_layers': u'auto', u'optimizer': u'adadelta', u'_kvstore': u'auto_gpu', u'rescale_gradient': u'1.0', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'learning_rate': u'0.01', u'_data_format': u'record', u'sub_sample': u'1.0', u'epochs': u'50', u'weight_decay': u'0.0', u'_num_kv_servers': u'auto', u'encoder_layers_activation': u'sigmoid', u'mini_batch_size': u'256', u'tolerance': u'0.001', u'batch_norm': u'false'}
   [09/25/2019 21:59:36 INFO 140704525621056] Reading provided configuration from /opt/ml/input/config/hyperparameters.json: {u'feature_dim': u'8', u'num_topics': u'3'}
   [09/25/2019 21:59:36 INFO 140704525621056] Final configuration: {u'num_patience_epochs': u'3', u'clip_gradient': u'Inf', u'encoder_layers': u'auto', u'optimizer': u'adadelta', u'_kvstore': u'auto_gpu', u'rescale_gradient': u'1.0', u'_tuning_objective_metric': u'', u'_num_gpus': u'auto', u'learning_rate': u'0.01', u'_data_format': u'record', u'sub_sample': u'1.0', u'epochs': u'50', u'feature_dim': u'8', u'weight_decay': u'0.0', u'num_topics': u'3', u'_num_kv_servers': u'auto', u'encoder_layers_activation': u'sigmoid', u'mini_batch_size': u'256', u'tolerance': u'0.001', u'batch_norm': u'false'}
   [09/25/2019 21:59:37 INFO 140704525621056] nvidia-smi took: 0.0503778457642 secs to identify 1 gpus
   Process 1 is a worker.
   [09/25/2019 21:59:37 INFO 140704525621056] Using default worker.
   [09/25/2019 21:59:37 INFO 140704525621056] Initializing
   /opt/amazon/lib/python2.7/site-packages/ai_algorithms_sdk/config/config_helper.py:122: DeprecationWarning: deprecated
     warnings.warn("deprecated", DeprecationWarning)
   [09/25/2019 21:59:37 INFO 140704525621056] /opt/ml/input/data/auxiliary
   [09/25/2019 21:59:37 INFO 140704525621056] vocab.txt
   [09/25/2019 21:59:37 INFO 140704525621056] Vocab file vocab.txt is expected at /opt/ml/input/data/auxiliary
   [09/25/2019 21:59:37 INFO 140704525621056] Loading pre-trained token embedding vectors from /opt/amazon/lib/python2.7/site-packages/algorithm/s3_binary/glove.6B.50d.txt
   
   2019-09-25 21:59:58 Uploading - Uploading generated training model
   2019-09-25 21:59:58 Completed - Training job completed
   [09/25/2019 21:59:49 WARNING 140704525621056] 0 out of 8 in vocabulary do not have embeddings! Default vector used for unknown embedding!
   [09/25/2019 21:59:49 INFO 140704525621056] Vocab embedding shape
   [09/25/2019 21:59:49 INFO 140704525621056] Number of GPUs being used: 1
   [09/25/2019 21:59:53 INFO 140704525621056] Create Store: device
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Records Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Max Records Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Reset Count": {"count": 1, "max": 0, "sum": 0.0, "min": 0}}, "EndTime": 1569448793.550828, "Dimensions": {"Host": "algo-1", "Meta": "init_train_data_iter", "Operation": "training", "Algorithm": "AWS/NTM"}, "StartTime": 1569448793.55079}
   
   [2019-09-25 21:59:53.551] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 0, "duration": 16503, "num_examples": 1, "num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 1
   [2019-09-25 21:59:53.571] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 2, "duration": 19, "num_examples": 1, "num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 1 on 17 examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 0.13865429163
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 0.000402930745622
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 0.13825134933
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 0.13865429163
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, epoch=1, train total_loss <loss>=0.13865429163
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.03s, val: 0.00s, epoch: 0.03s
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, completed 2 % of epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Total Records Seen": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Max Records Seen Between Resets": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, "max": 2, "sum": 2.0, "min": 2}}, "EndTime": 1569448793.577662, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 0}, "StartTime": 1569448793.551103}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, train throughput=636.442222897 records/second
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 2
   [2019-09-25 21:59:53.591] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 5, "duration": 13, "num_examples": 1, "num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 2 on 17 examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 0.137792155147
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 0.000175397624844
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 0.137616753578
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 0.137792155147
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, epoch=2, train total_loss <loss>=0.137792155147
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.01s, val: 0.00s, epoch: 0.02s
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, completed 4 % of epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": {"count": 1, "max": 2, "sum": 2.0, "min": 2}, "Total Records Seen": {"count": 1, "max": 34, "sum": 34.0, "min": 34}, "Max Records Seen Between Resets": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, "max": 4, "sum": 4.0, "min": 4}}, "EndTime": 1569448793.597192, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 1}, "StartTime": 1569448793.577978}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, train throughput=876.919088438 records/second
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 3
   [2019-09-25 21:59:53.609] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 8, "duration": 11, "num_examples": 1, "num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 3 on 17 examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 0.136722803116
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 5.14156054123e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 0.136671379209
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 0.136722803116
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, epoch=3, train total_loss <loss>=0.136722803116
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.01s, val: 0.00s, epoch: 0.02s
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, completed 6 % of epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": {"count": 1, "max": 3, "sum": 3.0, "min": 3}, "Total Records Seen": {"count": 1, "max": 51, "sum": 51.0, "min": 51}, "Max Records Seen Between Resets": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, "max": 6, "sum": 6.0, "min": 6}}, "EndTime": 1569448793.615088, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 2}, "StartTime": 1569448793.597521}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, train throughput=958.182731976 records/second
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 4
   [2019-09-25 21:59:53.628] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 11, "duration": 12, "num_examples": 1, "num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 4 on 17 examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 0.136693164706
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 4.85915334139e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 0.13664457202
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 0.136693164706
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, epoch=4, train total_loss <loss>=0.136693164706
   [09/25/2019 21:59:53 INFO 140704525621056] patience losses:[0.13865429162979126, 0.13779215514659882, 0.13672280311584473] min patience loss:0.136722803116 current loss:0.136693164706 absolute loss difference:2.96384096146e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Bad epoch: loss has not improved (enough). Bad count:1
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.01s, val: 0.00s, epoch: 0.02s
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, completed 8 % of epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": {"count": 1, "max": 4, "sum": 4.0, "min": 4}, "Total Records Seen": {"count": 1, "max": 68, "sum": 68.0, "min": 68}, "Max Records Seen Between Resets": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, "max": 8, "sum": 8.0, "min": 8}}, "EndTime": 1569448793.633979, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 3}, "StartTime": 1569448793.615416}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, train throughput=908.019866032 records/second
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 5
   [2019-09-25 21:59:53.648] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 14, "duration": 14, "num_examples": 1, "num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 5 on 17 examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 0.136731252074
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 3.38535865012e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 0.136697411537
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 0.136731252074
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, epoch=5, train total_loss <loss>=0.136731252074
   [09/25/2019 21:59:53 INFO 140704525621056] patience losses:[0.13779215514659882, 0.13672280311584473, 0.13669316470623016] min patience loss:0.136693164706 current loss:0.136731252074 absolute loss difference:3.80873680115e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Bad epoch: loss has not improved (enough). Bad count:2
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.02s, val: 0.00s, epoch: 0.02s
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, completed 10 % of epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": {"count": 1, "max": 5, "sum": 5.0, "min": 5}, "Total Records Seen": {"count": 1, "max": 85, "sum": 85.0, "min": 85}, "Max Records Seen Between Resets": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, "max": 10, "sum": 10.0, "min": 10}}, "EndTime": 1569448793.649995, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 4}, "StartTime": 1569448793.634279}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, train throughput=1073.22869442 records/second
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 6
   [2019-09-25 21:59:53.665] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 17, "duration": 14, "num_examples": 1, "num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 6 on 17 examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 0.135983735323
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 2.08785422728e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 0.135962858796
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 0.135983735323
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, epoch=6, train total_loss <loss>=0.135983735323
   [09/25/2019 21:59:53 INFO 140704525621056] patience losses:[0.13672280311584473, 0.13669316470623016, 0.13673125207424164] min patience loss:0.136693164706 current loss:0.135983735323 absolute loss difference:0.000709429383278
   [09/25/2019 21:59:53 INFO 140704525621056] Bad epoch: loss has not improved (enough). Bad count:3
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.02s, val: 0.01s, epoch: 0.02s
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, completed 12 % of epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": {"count": 1, "max": 6, "sum": 6.0, "min": 6}, "Total Records Seen": {"count": 1, "max": 102, "sum": 102.0, "min": 102}, "Max Records Seen Between Resets": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, "max": 12, "sum": 12.0, "min": 12}}, "EndTime": 1569448793.673135, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 5}, "StartTime": 1569448793.650263}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, train throughput=737.784344767 records/second
   [09/25/2019 21:59:53 INFO 140704525621056] 
   [09/25/2019 21:59:53 INFO 140704525621056] # Starting training for epoch 7
   [2019-09-25 21:59:53.691] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 20, "duration": 17, "num_examples": 1, "num_bytes": 816}
   [09/25/2019 21:59:53 INFO 140704525621056] # Finished training epoch 7 on 17 examples from 1 batches, each of size 256.
   [09/25/2019 21:59:53 INFO 140704525621056] Metrics for Training:
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) total: 0.135596558452
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) kld: 5.20053035871e-05
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) recons: 0.135544538498
   [09/25/2019 21:59:53 INFO 140704525621056] Loss (name: value) logppx: 0.135596558452
   [09/25/2019 21:59:53 INFO 140704525621056] #quality_metric: host=algo-1, epoch=7, train total_loss <loss>=0.135596558452
   [09/25/2019 21:59:53 INFO 140704525621056] patience losses:[0.13669316470623016, 0.13673125207424164, 0.13598373532295227] min patience loss:0.135983735323 current loss:0.135596558452 absolute loss difference:0.0003871768713
   [09/25/2019 21:59:53 INFO 140704525621056] Bad epoch: loss has not improved (enough). Bad count:4
   [09/25/2019 21:59:53 INFO 140704525621056] Bad epochs exceeded patience. Stopping training early!
   [09/25/2019 21:59:53 INFO 140704525621056] Timing: train: 0.02s, val: 0.00s, epoch: 0.02s
   [09/25/2019 21:59:53 INFO 140704525621056] Early stop condition met. Stopping training.
   [09/25/2019 21:59:53 INFO 140704525621056] #progress_metric: host=algo-1, completed 100 % epochs
   #metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Total Batches Seen": {"count": 1, "max": 7, "sum": 7.0, "min": 7}, "Total Records Seen": {"count": 1, "max": 119, "sum": 119.0, "min": 119}, "Max Records Seen Between Resets": {"count": 1, "max": 17, "sum": 17.0, "min": 17}, "Reset Count": {"count": 1, "max": 14, "sum": 14.0, "min": 14}}, "EndTime": 1569448793.696538, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/NTM", "epoch": 6}, "StartTime": 1569448793.673541}
   
   [09/25/2019 21:59:53 INFO 140704525621056] #throughput_metric: host=algo-1, train throughput=733.51886181 records/second
   [09/25/2019 21:59:53 WARNING 140704525621056] wait_for_all_workers will not sync workers since the kv store is not running distributed
   [09/25/2019 21:59:53 INFO 140704525621056] Best model based on early stopping at epoch 7. Best loss: 0.135596558452
   [09/25/2019 21:59:53 INFO 140704525621056] Topics from epoch:final (num_topics:3) [wetc 0.63, tu 0.33]:
   [09/25/2019 21:59:53 INFO 140704525621056] [0.63, 0.33] python and the for in is language of
   [09/25/2019 21:59:53 INFO 140704525621056] [0.63, 0.33] in and of python the for is language
   [09/25/2019 21:59:53 INFO 140704525621056] [0.63, 0.33] python and language for of the in is
   [09/25/2019 21:59:53 INFO 140704525621056] Serializing model to /opt/ml/model/model_algo-1
   [09/25/2019 21:59:53 INFO 140704525621056] Saved checkpoint to "/tmp/tmp3kswyI/state-0001.params"
   [09/25/2019 21:59:53 INFO 140704525621056] Test data is not provided.
   #metrics {"Metrics": {"totaltime": {"count": 1, "max": 16767.024993896484, "sum": 16767.024993896484, "min": 16767.024993896484}, "finalize.time": {"count": 1, "max": 12.12000846862793, "sum": 12.12000846862793, "min": 12.12000846862793}, "initialize.time": {"count": 1, "max": 16501.816987991333, "sum": 16501.816987991333, "min": 16501.816987991333}, "model.serialize.time": {"count": 1, "max": 3.983020782470703, "sum": 3.983020782470703, "min": 3.983020782470703}, "setuptime": {"count": 1, "max": 66.67494773864746, "sum": 66.67494773864746, "min": 66.67494773864746}, "early_stop.time": {"count": 7, "max": 6.479024887084961, "sum": 24.256229400634766, "min": 0.23102760314941406}, "update.time": {"count": 7, "max": 26.340961456298828, "sum": 141.98827743530273, "min": 15.537023544311523}, "epochs": {"count": 1, "max": 50, "sum": 50.0, "min": 50}}, "EndTime": 1569448793.714689, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "AWS/NTM"}, "StartTime": 1569448777.046916}
   ```
   
   When the training is done, it generates the following zip file, which includes a `metadata`, a `symbol`, and a `parameters` file: https://drive.google.com/open?id=1TLnIrnmB1SzPqN7cgyECql1Ri74isPKG
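
    To get at the individual files, I just unzip that archive locally. A minimal sketch, assuming the download is saved as `model_algo-1` (the local file name is my assumption):

    ```
    import zipfile

    # extract the metadata, symbol, and parameters files from the artifact
    with zipfile.ZipFile("model_algo-1") as zf:
        zf.extractall("ntm_artifacts")
    ```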
   
    I am trying to load the `symbol`/`parameters` files into an MXNet model so that I can make predictions locally (outside of SageMaker).

    Could you share a code snippet that builds an MXNet model from these artifacts and predicts on `transformed_docs`?
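
    For reference, here is the kind of loading code I have been attempting. This is only a sketch based on the standard MXNet checkpoint layout; the file names (`symbol.json`, `params`) and the `arg:`/`aux:` key prefixes are my assumptions about the archive contents:

    ```
    import mxnet as mx

    # load the network definition saved by the NTM container
    sym = mx.sym.load("symbol.json")

    # standard MXNet checkpoints store weights as "arg:<name>" / "aux:<name>" entries
    save_dict = mx.nd.load("params")
    arg_params = {k[4:]: v for k, v in save_dict.items() if k.startswith("arg:")}
    aux_params = {k[4:]: v for k, v in save_dict.items() if k.startswith("aux:")}

    # bind a module for inference on the 17x8 document-term matrix
    mod = mx.mod.Module(symbol=sym, data_names=["data"], label_names=None)
    mod.bind(data_shapes=[("data", transformed_docs.shape)], for_training=False)
    mod.set_params(arg_params, aux_params, allow_missing=True)

    # run the transformed docs through the network (densifying the sparse matrix)
    batch = mx.io.DataBatch(data=[mx.nd.array(transformed_docs.toarray())])
    mod.forward(batch, is_train=False)
    outputs = mod.get_outputs()[0].asnumpy()  # presumably the per-document topic proportions
    ```

    I am not sure the data name and the output interpretation above are right for the NTM symbol, which is why I am asking for the correct snippet.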
   
   
