Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/12/15 07:31:49 UTC

[GitHub] safrooze closed pull request #9083: Nceloss

safrooze closed pull request #9083: Nceloss
URL: https://github.com/apache/incubator-mxnet/pull/9083
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/example/nce-loss/README.md b/example/nce-loss/README.md
index 8da444df0d..c1e8b6bc92 100644
--- a/example/nce-loss/README.md
+++ b/example/nce-loss/README.md
@@ -1,32 +1,43 @@
-#Examples of NCE Loss
+# Examples of NCE Loss
 
-nce-loss is used to speedup multi-class classification when class num is huge.
+[Noise-contrastive estimation](http://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf) loss (nce-loss) is used to speed up multi-class classification when the number of classes is very large.
+
+Examples in this folder use the [text8](http://mattmahoney.net/dc/textdata.html) dataset, which is 100MB of cleaned-up English Wikipedia XML data. Wikipedia data is multi-licensed under the [Creative Commons Attribution-ShareAlike 3.0 License](https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License) (CC-BY-SA) and the [GNU Free Documentation License](https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License) (GFDL). For information on the licensing of Wikipedia data, please visit [here](https://en.wikipedia.org/wiki/Wikipedia:Database_download).
 
 ## Toy example
 
-* toy_softmax.py: a multi class example using softmax output
-* toy_nce.py: a multi-class example using nce loss
+* toy_softmax.py: a multi-class example using softmax output. Command to start training on CPU:
+```
+python toy_softmax.py
+```
 
-## Word2Vec
+* toy_nce.py: an example equivalent to toy_softmax above, except that it uses nce loss. Command to start training on CPU:
+```
+python toy_nce.py
+```
 
-* word2vec.py: a CBOW word2vec example using nce loss
+## Dataset Download
 
-You can run it by
+The dataset used in the following examples is the [text8](http://mattmahoney.net/dc/textdata.html) dataset mentioned above. The example scripts expect the dataset to exist in a folder named 'data'. The included get_text8.sh script downloads the dataset into the correct path. Command to download:
 
 ```
 ./get_text8.sh
+```
+
+## Word2Vec
+
+* word2vec.py: a CBOW word2vec example using nce loss. You need to [download the text8 dataset](#dataset-download) before running this script. Command to start training on CPU (pass -g for training on GPU):
+
+```
 python word2vec.py
 
 ```
 
 ## LSTM
 
-* lstm_word.py: a lstm example use nce loss
-
-You can run it by
+* lstm_word.py: an LSTM example using nce loss. You need to [download the text8 dataset](#dataset-download) before running this script. Pass -h (or --help) to see all command line options. Command to start training on CPU (pass -g for training on GPU):
 
 ```
-./get_text8.sh
 python lstm_word.py
 ```
 
@@ -37,14 +48,12 @@ You can refer to [http://www.jianshu.com/p/e439b43ea464](http://www.jianshu.com/
 
 ## Word2Vec in NCE-loss with Subword Representation
 
+This example reproduces the work Microsoft Research presented at CIKM'14, which is a basis of DSSM ([Deep Semantic Similarity Model](https://www.microsoft.com/en-us/research/project/dssm/)); you can find its lectures [here](https://www.microsoft.com/en-us/research/publication/deep-learning-for-natural-language-processing-theory-and-practice-tutorial/). You need to [download the text8 dataset](#dataset-download) before running this script. Command to start training on CPU (pass -g for training on GPU):
+
 ```
-./get_text8.sh
 python wordvec_subwords.py
 ```
 
-Reproducing the work Microsoft Research presented in CIKM'14, in which it's a basis of DSSM([Deep Semantics Similarity Model](https://www.microsoft.com/en-us/research/project/dssm/)), you can get its lectures [here](https://www.microsoft.com/en-us/research/publication/deep-learning-for-natural-language-processing-theory-and-practice-tutorial/).
-
-
 ### Motivation
 
 The motivation is to design a more robust and scalable word vector system, by reducing the size of lookup-table, and handle unknown words(out-of-vocabulary) better.
@@ -54,9 +63,9 @@ The motivation is to design a more robust and scalable word vector system, by re
 
 ### Basics
 
-<img src="https://github.com/zihaolucky/mxnet/blob/example/word2vec-nce-loss-with-subword-representations/example/nce-loss-subword-repr/slide1.png" width="700">
+<img src="images/slide1.png" width="700">
 
-<img src="https://github.com/zihaolucky/mxnet/blob/example/word2vec-nce-loss-with-subword-representations/example/nce-loss-subword-repr/slide2.png" width="700">
+<img src="images/slide2.png" width="700">
 
 Note that this word embedding method uses sub-word units to represent a word, while we still train word2vec model in its original way, the only difference is the vector representation of a word is no longer the word itself, but use several sub-word units' addition.
 
@@ -64,18 +73,16 @@ If you use sub-word sequence and feed into a word2vec training processing, it co
 
 ### Analysis
 
-> Experiment data on MacBook Pro'16 with 4 cpus.
-
-Here we print the training log below, using text8 data, to get some intuitions on its benefits:
+This experiment was performed on a MacBook Pro with 4 CPUs. The training log below, produced on the text8 data, gives some intuition about the benefits:
 
 *With subword units representation*
 
-It converge much faster.
+The network training converges much faster.
 
 ```
 2016-11-26 19:07:31,742 Start training with [cpu(0), cpu(1), cpu(2), cpu(3)]
 2016-11-26 19:07:31,783 DataIter start.
-2016-11-26 19:07:45,099 Epoch[0] Batch [50]		Speed: 4020.37 samples/sec	Train-nce-auc=0.693178
+2016-11-26 19:07:45,099 Epoch[0] Batch [50]	Speed: 4020.37 samples/sec	Train-nce-auc=0.693178
 2016-11-26 19:07:57,870 Epoch[0] Batch [100]	Speed: 4009.19 samples/sec	Train-nce-auc=0.741482
 2016-11-26 19:08:10,196 Epoch[0] Batch [150]	Speed: 4153.73 samples/sec	Train-nce-auc=0.764026
 2016-11-26 19:08:22,497 Epoch[0] Batch [200]	Speed: 4162.61 samples/sec	Train-nce-auc=0.785248
@@ -93,18 +100,6 @@ It converge much faster.
 2016-11-26 19:10:53,362 Epoch[0] Batch [800]	Speed: 4123.59 samples/sec	Train-nce-auc=0.834170
 2016-11-26 19:11:05,645 Epoch[0] Batch [850]	Speed: 4168.32 samples/sec	Train-nce-auc=0.836135
 2016-11-26 19:11:18,035 Epoch[0] Batch [900]	Speed: 4132.51 samples/sec	Train-nce-auc=0.842253
-2016-11-26 19:11:30,257 Epoch[0] Batch [950]	Speed: 4189.27 samples/sec	Train-nce-auc=0.834119
-2016-11-26 19:11:42,600 Epoch[0] Batch [1000]	Speed: 4148.01 samples/sec	Train-nce-auc=0.828049
-2016-11-26 19:11:54,850 Epoch[0] Batch [1050]	Speed: 4179.55 samples/sec	Train-nce-auc=0.844856
-2016-11-26 19:12:07,052 Epoch[0] Batch [1100]	Speed: 4196.35 samples/sec	Train-nce-auc=0.856587
-2016-11-26 19:12:19,286 Epoch[0] Batch [1150]	Speed: 4185.10 samples/sec	Train-nce-auc=0.845370
-2016-11-26 19:12:31,703 Epoch[0] Batch [1200]	Speed: 4123.25 samples/sec	Train-nce-auc=0.851430
-2016-11-26 19:12:44,177 Epoch[0] Batch [1250]	Speed: 4104.76 samples/sec	Train-nce-auc=0.851357
-2016-11-26 19:12:56,497 Epoch[0] Batch [1300]	Speed: 4155.90 samples/sec	Train-nce-auc=0.854957
-2016-11-26 19:13:08,839 Epoch[0] Batch [1350]	Speed: 4148.39 samples/sec	Train-nce-auc=0.853684
-2016-11-26 19:13:21,052 Epoch[0] Batch [1400]	Speed: 4192.37 samples/sec	Train-nce-auc=0.849442
-2016-11-26 19:13:33,386 Epoch[0] Batch [1450]	Speed: 4151.24 samples/sec	Train-nce-auc=0.853365
-2016-11-26 19:13:45,709 Epoch[0] Batch [1500]	Speed: 4154.65 samples/sec	Train-nce-auc=0.855938
 ```
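
To make the NCE description in the README diff above concrete, here is a minimal NumPy sketch (not part of this pull request; the function name and shapes are illustrative only) of the binary objective that the nce_loss symbol optimizes per position: the hidden vector is scored against the true label embedding and a few sampled noise embeddings, and each score is trained toward its 0/1 label weight with a logistic loss.

```
import numpy as np

def nce_objective(hidden, label_embeds, label_weights):
    """hidden: (num_hidden,), label_embeds: (num_label, num_hidden),
    label_weights: (num_label,) with 1.0 for the true word, 0.0 for noise."""
    scores = label_embeds.dot(hidden)          # one score per candidate word
    probs = 1.0 / (1.0 + np.exp(-scores))      # sigmoid, as LogisticRegressionOutput applies
    eps = 1e-12
    # binary cross-entropy over the true word and the sampled noise words
    return -np.mean(label_weights * np.log(probs + eps)
                    + (1.0 - label_weights) * np.log(1.0 - probs + eps))

# toy usage: 1 true word plus 4 noise words, 100-dimensional embeddings
rng = np.random.RandomState(0)
hidden = rng.randn(100)
label_embeds = rng.randn(5, 100)
label_weights = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
print(nce_objective(hidden, label_embeds, label_weights))
```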
 
 
diff --git a/example/nce-loss/images/slide1.png b/example/nce-loss/images/slide1.png
new file mode 100644
index 0000000000..8ef39d2fe6
Binary files /dev/null and b/example/nce-loss/images/slide1.png differ
diff --git a/example/nce-loss/images/slide2.png b/example/nce-loss/images/slide2.png
new file mode 100644
index 0000000000..4629d00fff
Binary files /dev/null and b/example/nce-loss/images/slide2.png differ
diff --git a/example/nce-loss/lstm_word.py b/example/nce-loss/lstm_word.py
index 23729917d9..bd1817cd8c 100644
--- a/example/nce-loss/lstm_word.py
+++ b/example/nce-loss/lstm_word.py
@@ -15,17 +15,19 @@
 # specific language governing permissions and limitations
 # under the License.
 
-# pylint:skip-file
+# pylint: disable=invalid-name, missing-docstring, too-many-arguments, deprecated-module
+# pylint: disable=too-many-locals, len-as-condition, too-many-instance-attributes
 from __future__ import print_function
-import logging
-import sys, random, time, math
-sys.path.insert(0, "../../python")
-import mxnet as mx
-import numpy as np
+
 from collections import namedtuple
-from nce import *
-from operator import itemgetter
+import logging
+import math
 from optparse import OptionParser
+import random
+
+import mxnet as mx
+from nce import nce_loss, NceLSTMAuc
+
 
 LSTMState = namedtuple("LSTMState", ["c", "h"])
 LSTMParam = namedtuple("LSTMParam", ["i2h_weight", "i2h_bias",
@@ -35,19 +37,20 @@
                                      "seq_data", "seq_labels", "seq_outputs",
                                      "param_blocks"])
 
-def lstm(num_hidden, indata, prev_state, param, seqidx, layeridx, dropout=0.):
+
+def lstm(num_hidden_, indata, prev_state, param, seqidx, layeridx, dropout=0.):
     """LSTM Cell symbol"""
     if dropout > 0.:
         indata = mx.sym.Dropout(data=indata, p=dropout)
     i2h = mx.sym.FullyConnected(data=indata,
                                 weight=param.i2h_weight,
                                 bias=param.i2h_bias,
-                                num_hidden=num_hidden * 4,
+                                num_hidden=num_hidden_ * 4,
                                 name="t%d_l%d_i2h" % (seqidx, layeridx))
     h2h = mx.sym.FullyConnected(data=prev_state.h,
                                 weight=param.h2h_weight,
                                 bias=param.h2h_bias,
-                                num_hidden=num_hidden * 4,
+                                num_hidden=num_hidden_ * 4,
                                 name="t%d_l%d_h2h" % (seqidx, layeridx))
     gates = i2h + h2h
     slice_gates = mx.sym.SliceChannel(gates, num_outputs=4,
@@ -61,10 +64,10 @@ def lstm(num_hidden, indata, prev_state, param, seqidx, layeridx, dropout=0.):
     return LSTMState(c=next_c, h=next_h)
 
 
-def get_net(vocab_size, seq_len, num_label, num_lstm_layer, num_hidden):
+def get_net(vocab_size, seq_len_, num_label_, num_lstm_layer_, num_hidden_):
     param_cells = []
     last_states = []
-    for i in range(num_lstm_layer):
+    for i in range(num_lstm_layer_):
         param_cells.append(LSTMParam(i2h_weight=mx.sym.Variable("l%d_i2h_weight" % i),
                                      i2h_bias=mx.sym.Variable("l%d_i2h_bias" % i),
                                      h2h_weight=mx.sym.Variable("l%d_h2h_weight" % i),
@@ -78,37 +81,37 @@ def get_net(vocab_size, seq_len, num_label, num_lstm_layer, num_hidden):
     label_weight = mx.sym.Variable('label_weight')
     embed_weight = mx.sym.Variable('embed_weight')
     label_embed_weight = mx.sym.Variable('label_embed_weight')
-    data_embed = mx.sym.Embedding(data = data, input_dim = vocab_size,
-                                  weight = embed_weight,
-                                  output_dim = 100, name = 'data_embed')
-    datavec = mx.sym.SliceChannel(data = data_embed,
-                                  num_outputs = seq_len,
-                                  squeeze_axis = True, name = 'data_slice')
-    labelvec = mx.sym.SliceChannel(data = label,
-                                   num_outputs = seq_len,
-                                   squeeze_axis = True, name = 'label_slice')
-    labelweightvec = mx.sym.SliceChannel(data = label_weight,
-                                         num_outputs = seq_len,
-                                         squeeze_axis = True, name = 'label_weight_slice')
+    data_embed = mx.sym.Embedding(data=data, input_dim=vocab_size,
+                                  weight=embed_weight,
+                                  output_dim=100, name='data_embed')
+    datavec = mx.sym.SliceChannel(data=data_embed,
+                                  num_outputs=seq_len_,
+                                  squeeze_axis=True, name='data_slice')
+    labelvec = mx.sym.SliceChannel(data=label,
+                                   num_outputs=seq_len_,
+                                   squeeze_axis=True, name='label_slice')
+    labelweightvec = mx.sym.SliceChannel(data=label_weight,
+                                         num_outputs=seq_len_,
+                                         squeeze_axis=True, name='label_weight_slice')
     probs = []
-    for seqidx in range(seq_len):
+    for seqidx in range(seq_len_):
         hidden = datavec[seqidx]
 
-        for i in range(num_lstm_layer):
-            next_state = lstm(num_hidden, indata = hidden,
-                              prev_state = last_states[i],
-                              param = param_cells[i],
-                              seqidx = seqidx, layeridx = i)
+        for i in range(num_lstm_layer_):
+            next_state = lstm(num_hidden_, indata=hidden,
+                              prev_state=last_states[i],
+                              param=param_cells[i],
+                              seqidx=seqidx, layeridx=i)
             hidden = next_state.h
             last_states[i] = next_state
 
-        probs.append(nce_loss(data = hidden,
-                              label = labelvec[seqidx],
-                              label_weight = labelweightvec[seqidx],
-                              embed_weight = label_embed_weight,
-                              vocab_size = vocab_size,
-                              num_hidden = 100,
-                              num_label = num_label))
+        probs.append(nce_loss(data=hidden,
+                              label=labelvec[seqidx],
+                              label_weight=labelweightvec[seqidx],
+                              embed_weight=label_embed_weight,
+                              vocab_size=vocab_size,
+                              num_hidden=100,
+                              num_label=num_label_))
     return mx.sym.Group(probs)
 
 
@@ -135,6 +138,7 @@ def load_data(name):
         negative += [i for _ in range(v)]
     return data, negative, vocab, freq
 
+
 class SimpleBatch(object):
     def __init__(self, data_names, data, label_names, label):
         self.data = data
@@ -152,26 +156,25 @@ def provide_label(self):
 
 
 class DataIter(mx.io.DataIter):
-    def __init__(self, name, batch_size, seq_len, num_label, init_states):
+    def __init__(self, name, batch_size_, seq_len_, num_label_, init_states_):
         super(DataIter, self).__init__()
-        self.batch_size = batch_size
+        self.batch_size = batch_size_
         self.data, self.negative, self.vocab, self.freq = load_data(name)
         self.vocab_size = 1 + len(self.vocab)
-        print(self.vocab_size)
-        self.seq_len = seq_len
-        self.num_label = num_label
-        self.init_states = init_states
+        print("Vocabulary Size: {}".format(self.vocab_size))
+        self.seq_len = seq_len_
+        self.num_label = num_label_
+        self.init_states = init_states_
         self.init_state_names = [x[0] for x in self.init_states]
-        self.init_state_arrays = [mx.nd.zeros(x[1]) for x in init_states]
-        self.provide_data = [('data', (batch_size, seq_len))] + init_states
-        self.provide_label = [('label', (self.batch_size, seq_len, num_label)),
-                              ('label_weight', (self.batch_size, seq_len, num_label))]
+        self.init_state_arrays = [mx.nd.zeros(x[1]) for x in init_states_]
+        self.provide_data = [('data', (batch_size_, seq_len_))] + init_states_
+        self.provide_label = [('label', (self.batch_size, seq_len_, num_label_)),
+                              ('label_weight', (self.batch_size, seq_len_, num_label_))]
 
     def sample_ne(self):
         return self.negative[random.randint(0, len(self.negative) - 1)]
 
     def __iter__(self):
-        print('begin')
         batch_data = []
         batch_label = []
         batch_label_weight = []
@@ -200,13 +203,16 @@ def __iter__(self):
     def reset(self):
         pass
 
+
 if __name__ == '__main__':
     head = '%(asctime)-15s %(message)s'
     logging.basicConfig(level=logging.DEBUG, format=head)
 
     parser = OptionParser()
-    parser.add_option("-g", "--gpu", action = "store_true", dest = "gpu", default = False,
-                      help = "use gpu")
+    parser.add_option("-g", "--gpu", action="store_true", dest="gpu", default=False,
+                      help="use gpu")
+    options, args = parser.parse_args()
+
     batch_size = 1024
     seq_len = 5
     num_label = 6
@@ -217,25 +223,30 @@ def reset(self):
     init_h = [('l%d_init_h'%l, (batch_size, num_hidden)) for l in range(num_lstm_layer)]
     init_states = init_c + init_h
 
-
     data_train = DataIter("./data/text8", batch_size, seq_len, num_label,
                           init_states)
 
     network = get_net(data_train.vocab_size, seq_len, num_label, num_lstm_layer, num_hidden)
-    options, args = parser.parse_args()
+
     devs = mx.cpu()
-    if options.gpu == True:
+    if options.gpu:
         devs = mx.gpu()
-    model = mx.model.FeedForward(ctx = devs,
-                                 symbol = network,
-                                 num_epoch = 20,
-                                 learning_rate = 0.3,
-                                 momentum = 0.9,
-                                 wd = 0.0000,
-                                 initializer=mx.init.Xavier(factor_type="in", magnitude=2.34))
 
-    metric = NceLSTMAuc()
-    model.fit(X = data_train,
-              eval_metric = metric,
-              batch_end_callback = mx.callback.Speedometer(batch_size, 50),)
+    model = mx.mod.Module(
+        symbol=network,
+        data_names=[x[0] for x in data_train.provide_data],
+        label_names=[y[0] for y in data_train.provide_label],
+        context=[devs]
+    )
 
+    print("Training on {}".format("GPU" if options.gpu else "CPU"))
+    metric = NceLSTMAuc()
+    model.fit(
+        train_data=data_train,
+        num_epoch=20,
+        optimizer='sgd',
+        optimizer_params={'learning_rate': 0.3, 'momentum': 0.9, 'wd': 0.0000},
+        initializer=mx.init.Xavier(factor_type='in', magnitude=2.34),
+        eval_metric=metric,
+        batch_end_callback=mx.callback.Speedometer(batch_size, 50)
+    )
diff --git a/example/nce-loss/nce.py b/example/nce-loss/nce.py
index 7f57dfdb75..d3a4baa6c5 100644
--- a/example/nce-loss/nce.py
+++ b/example/nce-loss/nce.py
@@ -15,48 +15,51 @@
 # specific language governing permissions and limitations
 # under the License.
 
-# pylint:skip-file
-import sys
-sys.path.insert(0, "../../python")
+# pylint: disable=invalid-name, missing-docstring, too-many-arguments, unused-argument
+from __future__ import print_function
+
+from operator import itemgetter
+
 import mxnet as mx
 import numpy as np
-from operator import itemgetter
+
 
 def nce_loss(data, label, label_weight, embed_weight, vocab_size, num_hidden, num_label):
-    label_embed = mx.sym.Embedding(data = label, input_dim = vocab_size,
-                                   weight = embed_weight,
-                                   output_dim = num_hidden, name = 'label_embed')
-    data = mx.sym.Reshape(data = data, shape = (-1, 1, num_hidden))
+    label_embed = mx.sym.Embedding(data=label, input_dim=vocab_size,
+                                   weight=embed_weight,
+                                   output_dim=num_hidden, name='label_embed')
+    data = mx.sym.Reshape(data=data, shape=(-1, 1, num_hidden))
     pred = mx.sym.broadcast_mul(data, label_embed)
-    pred = mx.sym.sum(data = pred, axis = 2)
-    return mx.sym.LogisticRegressionOutput(data = pred,
-                                           label = label_weight)
+    pred = mx.sym.sum(data=pred, axis=2)
+    return mx.sym.LogisticRegressionOutput(data=pred,
+                                           label=label_weight)
 
 
-def nce_loss_subwords(data, label, label_mask, label_weight, embed_weight, vocab_size, num_hidden, num_label):
+def nce_loss_subwords(
+        data, label, label_mask, label_weight, embed_weight, vocab_size, num_hidden, num_label):
     """NCE-Loss layer under subword-units input.
     """
     # get subword-units embedding.
-    label_units_embed = mx.sym.Embedding(data = label,
-                                         input_dim = vocab_size,
-                                         weight = embed_weight,
-                                         output_dim = num_hidden)
+    label_units_embed = mx.sym.Embedding(data=label,
+                                         input_dim=vocab_size,
+                                         weight=embed_weight,
+                                         output_dim=num_hidden)
     # get valid subword-units embedding with the help of label_mask
     # it's achieve by multiply zeros to useless units in order to handle variable-length input.
-    label_units_embed = mx.sym.broadcast_mul(lhs = label_units_embed,
-                                             rhs = label_mask,
-                                             name = 'label_units_embed')
+    label_units_embed = mx.sym.broadcast_mul(lhs=label_units_embed,
+                                             rhs=label_mask,
+                                             name='label_units_embed')
     # sum over them to get label word embedding.
-    label_embed = mx.sym.sum(label_units_embed, axis=2, name = 'label_embed')
+    label_embed = mx.sym.sum(label_units_embed, axis=2, name='label_embed')
 
     # by boardcast_mul and sum you can get prediction scores in all num_label inputs,
     # which is easy to feed into LogisticRegressionOutput and make your code more concise.
-    data = mx.sym.Reshape(data = data, shape = (-1, 1, num_hidden))
+    data = mx.sym.Reshape(data=data, shape=(-1, 1, num_hidden))
     pred = mx.sym.broadcast_mul(data, label_embed)
-    pred = mx.sym.sum(data = pred, axis = 2)
+    pred = mx.sym.sum(data=pred, axis=2)
 
-    return mx.sym.LogisticRegressionOutput(data = pred,
-                                           label = label_weight)
+    return mx.sym.LogisticRegressionOutput(data=pred,
+                                           label=label_weight)
 
 
 class NceAccuracy(mx.metric.EvalMetric):
@@ -71,6 +74,7 @@ def update(self, labels, preds):
                 self.sum_metric += 1
             self.num_inst += 1
 
+
 class NceAuc(mx.metric.EvalMetric):
     def __init__(self):
         super(NceAuc, self).__init__('nce-auc')
@@ -82,12 +86,12 @@ def update(self, labels, preds):
         for i in range(preds.shape[0]):
             for j in range(preds.shape[1]):
                 tmp.append((label_weight[i][j], preds[i][j]))
-        tmp = sorted(tmp, key = itemgetter(1), reverse = True)
+        tmp = sorted(tmp, key=itemgetter(1), reverse=True)
         m = 0.0
         n = 0.0
         z = 0.0
         k = 0
-        for a, b in tmp:
+        for a, _ in tmp:
             if a > 0.5:
                 m += 1.0
                 z += len(tmp) - k
@@ -100,6 +104,7 @@ def update(self, labels, preds):
         self.sum_metric += z
         self.num_inst += 1
 
+
 class NceLSTMAuc(mx.metric.EvalMetric):
     def __init__(self):
         super(NceLSTMAuc, self).__init__('nce-lstm-auc')
@@ -115,12 +120,12 @@ def update(self, labels, preds):
         for i in range(preds.shape[0]):
             for j in range(preds.shape[1]):
                 tmp.append((label_weight[i][j], preds[i][j]))
-        tmp = sorted(tmp, key = itemgetter(1), reverse = True)
+        tmp = sorted(tmp, key=itemgetter(1), reverse=True)
         m = 0.0
         n = 0.0
         z = 0.0
         k = 0
-        for a, b in tmp:
+        for a, _ in tmp:
             if a > 0.5:
                 m += 1.0
                 z += len(tmp) - k
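
As a usage sketch (not part of this pull request), the nce_loss symbol defined in nce.py above can be wired into a small network in the same way toy_nce.py does; the vocabulary size and dimensions below are arbitrary placeholders.

```
import mxnet as mx
from nce import nce_loss  # assumes this script sits next to nce.py

vocab_size, num_label, num_hidden = 10000, 6, 100

data = mx.sym.Variable('data')
label = mx.sym.Variable('label')
label_weight = mx.sym.Variable('label_weight')
embed_weight = mx.sym.Variable('embed_weight')

# project the input, then score it against embedded labels with a logistic output
hidden = mx.sym.FullyConnected(data=data, num_hidden=num_hidden)
loss = nce_loss(data=hidden,
                label=label,
                label_weight=label_weight,
                embed_weight=embed_weight,
                vocab_size=vocab_size,
                num_hidden=num_hidden,
                num_label=num_label)
print(loss.list_arguments())
```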
diff --git a/example/nce-loss/toy_nce.py b/example/nce-loss/toy_nce.py
index 39da7c7790..49f0a4ac24 100644
--- a/example/nce-loss/toy_nce.py
+++ b/example/nce-loss/toy_nce.py
@@ -15,30 +15,34 @@
 # specific language governing permissions and limitations
 # under the License.
 
-# pylint:skip-file
+# pylint: disable=invalid-name, missing-docstring, too-many-arguments
+from __future__ import print_function
+
 import logging
-import sys, random, time
-sys.path.insert(0, "../../python")
+import random
+
 import mxnet as mx
 import numpy as np
-from collections import namedtuple
-from nce import *
+from nce import nce_loss, NceAccuracy
+
 
-def get_net(vocab_size, num_label):
+def get_net(vocab_size_, num_label_):
     data = mx.sym.Variable('data')
     label = mx.sym.Variable('label')
     label_weight = mx.sym.Variable('label_weight')
     embed_weight = mx.sym.Variable('embed_weight')
-    pred = mx.sym.FullyConnected(data = data, num_hidden = 100)
-    ret = nce_loss(data = pred,
-                    label = label,
-                    label_weight = label_weight,
-                    embed_weight = embed_weight,
-                    vocab_size = vocab_size,
-                    num_hidden = 100,
-                    num_label = num_label)
+    pred = mx.sym.FullyConnected(data=data, num_hidden=100)
+    ret = nce_loss(
+        data=pred,
+        label=label,
+        label_weight=label_weight,
+        embed_weight=embed_weight,
+        vocab_size=vocab_size_,
+        num_hidden=100,
+        num_label=num_label_)
     return ret
 
+
 class SimpleBatch(object):
     def __init__(self, data_names, data, label_names, label):
         self.data = data
@@ -56,16 +60,16 @@ def provide_label(self):
 
 
 class DataIter(mx.io.DataIter):
-    def __init__(self, count, batch_size, vocab_size, num_label, feature_size):
+    def __init__(self, count, batch_size_, vocab_size_, num_label_, feature_size_):
         super(DataIter, self).__init__()
-        self.batch_size = batch_size
+        self.batch_size = batch_size_
         self.count = count
-        self.vocab_size = vocab_size
-        self.num_label = num_label
-        self.feature_size = feature_size
-        self.provide_data = [('data', (batch_size, feature_size))]
-        self.provide_label = [('label', (self.batch_size, num_label)),
-                              ('label_weight', (self.batch_size, num_label))]
+        self.vocab_size = vocab_size_
+        self.num_label = num_label_
+        self.feature_size = feature_size_
+        self.provide_data = [('data', (batch_size_, feature_size_))]
+        self.provide_label = [('label', (self.batch_size, num_label_)),
+                              ('label_weight', (self.batch_size, num_label_))]
 
     def mock_sample(self):
         ret = np.zeros(self.feature_size)
@@ -82,11 +86,11 @@ def mock_sample(self):
         return ret, la
 
     def __iter__(self):
-        for _ in range(self.count / self.batch_size):
+        for _ in range(self.count // self.batch_size):
             data = []
             label = []
             label_weight = []
-            for i in range(self.batch_size):
+            for _ in range(self.batch_size):
                 d, l = self.mock_sample()
                 data.append(d)
                 label.append(l)
@@ -100,6 +104,7 @@ def __iter__(self):
     def reset(self):
         pass
 
+
 if __name__ == '__main__':
     head = '%(asctime)-15s %(message)s'
     logging.basicConfig(level=logging.DEBUG, format=head)
@@ -113,17 +118,21 @@ def reset(self):
     data_test = DataIter(1000, batch_size, vocab_size, num_label, feature_size)
 
     network = get_net(vocab_size, num_label)
-    devs = [mx.cpu()]
-    model = mx.model.FeedForward(ctx = devs,
-                                 symbol = network,
-                                 num_epoch = 20,
-                                 learning_rate = 0.03,
-                                 momentum = 0.9,
-                                 wd = 0.00001,
-                                 initializer=mx.init.Xavier(factor_type="in", magnitude=2.34))
+    model = mx.mod.Module(
+        symbol=network,
+        data_names=[x[0] for x in data_train.provide_data],
+        label_names=[y[0] for y in data_train.provide_label],
+        context=[mx.cpu()]
+    )
 
     metric = NceAccuracy()
-    model.fit(X = data_train, eval_data = data_test,
-              eval_metric = metric,
-              batch_end_callback = mx.callback.Speedometer(batch_size, 50),)
-
+    model.fit(
+        train_data=data_train,
+        eval_data=data_test,
+        num_epoch=20,
+        optimizer='sgd',
+        optimizer_params={'learning_rate': 0.03, 'momentum': 0.9, 'wd': 0.00001},
+        initializer=mx.init.Xavier(factor_type='in', magnitude=2.34),
+        eval_metric=metric,
+        batch_end_callback=mx.callback.Speedometer(batch_size, 50)
+    )
diff --git a/example/nce-loss/toy_softmax.py b/example/nce-loss/toy_softmax.py
index ff6ff4327c..2d5f898639 100644
--- a/example/nce-loss/toy_softmax.py
+++ b/example/nce-loss/toy_softmax.py
@@ -15,24 +15,28 @@
 # specific language governing permissions and limitations
 # under the License.
 
-# pylint:skip-file
+# pylint: disable=invalid-name, missing-docstring, too-many-arguments
+from __future__ import print_function
+
+from collections import namedtuple
 import logging
-import sys, random, time
-sys.path.insert(0, "../../python")
+import random
+
 import mxnet as mx
 import numpy as np
-from collections import namedtuple
 
 ToyModel = namedtuple("ToyModel", ["ex", "symbol", "param_blocks"])
 
-def get_net(vocab_size):
+
+def get_net(vocab_size_):
     data = mx.sym.Variable('data')
     label = mx.sym.Variable('label')
-    pred = mx.sym.FullyConnected(data = data, num_hidden = 100)
-    pred = mx.sym.FullyConnected(data = pred, num_hidden = vocab_size)
-    sm = mx.sym.SoftmaxOutput(data = pred, label = label)
+    pred = mx.sym.FullyConnected(data=data, num_hidden=100)
+    pred = mx.sym.FullyConnected(data=pred, num_hidden=vocab_size_)
+    sm = mx.sym.SoftmaxOutput(data=pred, label=label)
     return sm
 
+
 class SimpleBatch(object):
     def __init__(self, data_names, data, label_names, label):
         self.data = data
@@ -50,14 +54,14 @@ def provide_label(self):
 
 
 class DataIter(mx.io.DataIter):
-    def __init__(self, count, batch_size, vocab_size, num_label, feature_size):
+    def __init__(self, count, batch_size_, vocab_size_, num_label_, feature_size_):
         super(DataIter, self).__init__()
-        self.batch_size = batch_size
+        self.batch_size = batch_size_
         self.count = count
-        self.vocab_size = vocab_size
-        self.num_label = num_label
-        self.feature_size = feature_size
-        self.provide_data = [('data', (batch_size, feature_size))]
+        self.vocab_size = vocab_size_
+        self.num_label = num_label_
+        self.feature_size = feature_size_
+        self.provide_data = [('data', (batch_size_, feature_size_))]
         self.provide_label = [('label', (self.batch_size,))]
 
     def mock_sample(self):
@@ -73,10 +77,10 @@ def mock_sample(self):
         return ret, s % self.vocab_size
 
     def __iter__(self):
-        for _ in range(self.count / self.batch_size):
+        for _ in range(self.count // self.batch_size):
             data = []
             label = []
-            for i in range(self.batch_size):
+            for _ in range(self.batch_size):
                 d, l = self.mock_sample()
                 data.append(d)
                 label.append(l)
@@ -89,6 +93,7 @@ def __iter__(self):
     def reset(self):
         pass
 
+
 if __name__ == '__main__':
     head = '%(asctime)-15s %(message)s'
     logging.basicConfig(level=logging.DEBUG, format=head)
@@ -102,15 +107,20 @@ def reset(self):
     data_test = DataIter(1000, batch_size, vocab_size, num_label, feature_size)
 
     network = get_net(vocab_size)
-    devs = mx.cpu()
-    model = mx.model.FeedForward(ctx = devs,
-                                 symbol = network,
-                                 num_epoch = 20,
-                                 learning_rate = 0.03,
-                                 momentum = 0.9,
-                                 wd = 0.0000,
-                                 initializer=mx.init.Xavier(factor_type="in", magnitude=2.34))
-
-    model.fit(X = data_train, eval_data = data_test,
-              batch_end_callback = mx.callback.Speedometer(batch_size, 50),)
 
+    model = mx.mod.Module(
+        symbol=network,
+        data_names=[x[0] for x in data_train.provide_data],
+        label_names=[y[0] for y in data_train.provide_label],
+        context=[mx.cpu()]
+    )
+
+    model.fit(
+        train_data=data_train,
+        eval_data=data_test,
+        num_epoch=20,
+        optimizer='sgd',
+        optimizer_params={'learning_rate': 0.03, 'momentum': 0.9, 'wd': 0.0000},
+        initializer=mx.init.Xavier(factor_type='in', magnitude=2.34),
+        batch_end_callback=mx.callback.Speedometer(batch_size, 50)
+    )
diff --git a/example/nce-loss/wordvec.py b/example/nce-loss/wordvec.py
index 887d586ff3..2c63b307da 100644
--- a/example/nce-loss/wordvec.py
+++ b/example/nce-loss/wordvec.py
@@ -15,39 +15,41 @@
 # specific language governing permissions and limitations
 # under the License.
 
-# pylint:skip-file
+# pylint: disable=invalid-name, missing-docstring, deprecated-module, len-as-condition
+# pylint: disable=too-many-instance-attributes
 from __future__ import print_function
+
 import logging
-import sys, random, time, math
-sys.path.insert(0, "../../python")
-import mxnet as mx
-import numpy as np
-from collections import namedtuple
-from nce import *
-from operator import itemgetter
+import math
+import random
 from optparse import OptionParser
 
-def get_net(vocab_size, num_input, num_label):
+import mxnet as mx
+from nce import nce_loss, NceAuc
+
+
+def get_net(vocab_size, num_input, num_label_):
     data = mx.sym.Variable('data')
     label = mx.sym.Variable('label')
     label_weight = mx.sym.Variable('label_weight')
     embed_weight = mx.sym.Variable('embed_weight')
-    data_embed = mx.sym.Embedding(data = data, input_dim = vocab_size,
-                                  weight = embed_weight,
-                                  output_dim = 100, name = 'data_embed')
-    datavec = mx.sym.SliceChannel(data = data_embed,
-                                     num_outputs = num_input,
-                                     squeeze_axis = 1, name = 'data_slice')
+    data_embed = mx.sym.Embedding(data=data, input_dim=vocab_size,
+                                  weight=embed_weight,
+                                  output_dim=100, name='data_embed')
+    datavec = mx.sym.SliceChannel(data=data_embed,
+                                  num_outputs=num_input,
+                                  squeeze_axis=1, name='data_slice')
     pred = datavec[0]
     for i in range(1, num_input):
         pred = pred + datavec[i]
-    return nce_loss(data = pred,
-                    label = label,
-                    label_weight = label_weight,
-                    embed_weight = embed_weight,
-                    vocab_size = vocab_size,
-                    num_hidden = 100,
-                    num_label = num_label)
+    return nce_loss(data=pred,
+                    label=label,
+                    label_weight=label_weight,
+                    embed_weight=embed_weight,
+                    vocab_size=vocab_size,
+                    num_hidden=100,
+                    num_label=num_label_)
+
 
 def load_data(name):
     buf = open(name).read()
@@ -72,6 +74,7 @@ def load_data(name):
         negative += [i for _ in range(v)]
     return data, negative, vocab, freq
 
+
 class SimpleBatch(object):
     def __init__(self, data_names, data, label_names, label):
         self.data = data
@@ -89,34 +92,32 @@ def provide_label(self):
 
 
 class DataIter(mx.io.DataIter):
-    def __init__(self, name, batch_size, num_label):
+    def __init__(self, name, batch_size_, num_label_):
         super(DataIter, self).__init__()
-        self.batch_size = batch_size
+        self.batch_size = batch_size_
         self.data, self.negative, self.vocab, self.freq = load_data(name)
         self.vocab_size = 1 + len(self.vocab)
-        print(self.vocab_size)
-        self.num_label = num_label
-        self.provide_data = [('data', (batch_size, num_label - 1))]
-        self.provide_label = [('label', (self.batch_size, num_label)),
-                              ('label_weight', (self.batch_size, num_label))]
+        print("Vocabulary Size: {}".format(self.vocab_size))
+        self.num_label = num_label_
+        self.provide_data = [('data', (batch_size_, num_label_ - 1))]
+        self.provide_label = [('label', (self.batch_size, num_label_)),
+                              ('label_weight', (self.batch_size, num_label_))]
 
     def sample_ne(self):
         return self.negative[random.randint(0, len(self.negative) - 1)]
 
     def __iter__(self):
-        print('begin')
         batch_data = []
         batch_label = []
         batch_label_weight = []
         start = random.randint(0, self.num_label - 1)
         for i in range(start, len(self.data) - self.num_label - start, self.num_label):
-            context = self.data[i: i + self.num_label / 2] \
-                      + self.data[i + 1 + self.num_label / 2: i + self.num_label]
-            target_word = self.data[i + self.num_label / 2]
+            context = self.data[i: i + self.num_label // 2] \
+                      + self.data[i + 1 + self.num_label // 2: i + self.num_label]
+            target_word = self.data[i + self.num_label // 2]
             if self.freq[target_word] < 5:
                 continue
-            target = [target_word] \
-                     + [self.sample_ne() for _ in range(self.num_label - 1)]
+            target = [target_word] + [self.sample_ne() for _ in range(self.num_label - 1)]
             target_weight = [1.0] + [0.0 for _ in range(self.num_label - 1)]
             batch_data.append(context)
             batch_label.append(target)
@@ -134,13 +135,16 @@ def __iter__(self):
     def reset(self):
         pass
 
+
 if __name__ == '__main__':
     head = '%(asctime)-15s %(message)s'
     logging.basicConfig(level=logging.DEBUG, format=head)
 
     parser = OptionParser()
-    parser.add_option("-g", "--gpu", action = "store_true", dest = "gpu", default = False,
-                      help = "use gpu")
+    parser.add_option("-g", "--gpu", action="store_true", dest="gpu", default=False,
+                      help="use gpu")
+    options, args = parser.parse_args()
+
     batch_size = 256
     num_label = 5
 
@@ -148,21 +152,25 @@ def reset(self):
 
     network = get_net(data_train.vocab_size, num_label - 1, num_label)
 
-    options, args = parser.parse_args()
     devs = mx.cpu()
-    if options.gpu == True:
+    if options.gpu:
         devs = mx.gpu()
-    model = mx.model.FeedForward(ctx = devs,
-                                 symbol = network,
-                                 num_epoch = 20,
-                                 learning_rate = 0.3,
-                                 momentum = 0.9,
-                                 wd = 0.0000,
-                                 initializer=mx.init.Xavier(factor_type="in", magnitude=2.34))
 
+    model = mx.mod.Module(
+        symbol=network,
+        data_names=[x[0] for x in data_train.provide_data],
+        label_names=[y[0] for y in data_train.provide_label],
+        context=[devs]
+    )
 
+    print("Training on {}".format("GPU" if options.gpu else "CPU"))
     metric = NceAuc()
-    model.fit(X = data_train,
-              eval_metric = metric,
-              batch_end_callback = mx.callback.Speedometer(batch_size, 50),)
-
+    model.fit(
+        train_data=data_train,
+        num_epoch=20,
+        optimizer='sgd',
+        optimizer_params={'learning_rate': 0.3, 'momentum': 0.9, 'wd': 0.0000},
+        initializer=mx.init.Xavier(factor_type='in', magnitude=2.34),
+        eval_metric=metric,
+        batch_end_callback=mx.callback.Speedometer(batch_size, 50)
+    )
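
For intuition about the CBOW windowing in wordvec.py's DataIter above, here is a small standalone sketch (not part of this pull request) showing how one window is split into context and target with integer division, matching the `// 2` indexing in the diff.

```
num_label = 5
data = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

i = 0  # window start; the real iterator strides by num_label with a random offset
context = data[i: i + num_label // 2] + data[i + 1 + num_label // 2: i + num_label]
target_word = data[i + num_label // 2]

print(context)      # ['the', 'quick', 'fox', 'jumps']
print(target_word)  # 'brown'
```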
diff --git a/example/nce-loss/wordvec_subwords.py b/example/nce-loss/wordvec_subwords.py
index c8d46a1aeb..45edd1d984 100644
--- a/example/nce-loss/wordvec_subwords.py
+++ b/example/nce-loss/wordvec_subwords.py
@@ -15,20 +15,23 @@
 # specific language governing permissions and limitations
 # under the License.
 
-# pylint:skip-file
+# pylint: disable=invalid-name, missing-docstring, too-many-arguments, deprecated-module
+# pylint: disable=too-many-locals, len-as-condition, too-many-instance-attributes
+from __future__ import print_function
+
+from collections import Counter
 import logging
-import sys, random, time, math
+import math
+from optparse import OptionParser
+import random
+
 import mxnet as mx
 import numpy as np
-from nce import *
-from operator import itemgetter
-from optparse import OptionParser
-from collections import Counter
+from nce import nce_loss_subwords, NceAuc
 
-import logging
-head = head = '%(asctime)-15s %(message)s'
-logging.basicConfig(level=logging.INFO, format=head)
 
+head = '%(asctime)-15s %(message)s'
+logging.basicConfig(level=logging.INFO, format=head)
 
 # ----------------------------------------------------------------------------------------
 EMBEDDING_SIZE = 100
@@ -42,7 +45,7 @@
 
 
 # ----------------------------------------------------------------------------------------
-def get_net(vocab_size, num_input, num_label):
+def get_net(vocab_size, num_input, num_label_):
     data = mx.sym.Variable('data')
     mask = mx.sym.Variable('mask')  # use mask to handle variable-length input.
     label = mx.sym.Variable('label')
@@ -77,7 +80,7 @@ def get_net(vocab_size, num_input, num_label):
                              embed_weight=embed_weight,
                              vocab_size=vocab_size,
                              num_hidden=EMBEDDING_SIZE,
-                             num_label=num_label)
+                             num_label=num_label_)
 
 
 def get_subword_units(token, gram=GRAMS):
@@ -99,7 +102,7 @@ def get_subword_representation(wid, vocab_inv, units_vocab, max_len):
 
 def prepare_subword_units(tks):
     # statistics on units
-    units_vocab = {PADDING_CHAR:1}
+    units_vocab = {PADDING_CHAR: 1}
     max_len = 0
     unit_set = set()
     logging.info('grams: %d', GRAMS)
@@ -115,13 +118,13 @@ def prepare_subword_units(tks):
             continue
         if unit not in units_vocab:
             units_vocab[unit] = len(units_vocab)
-        uid = units_vocab[unit]
+        # uid = units_vocab[unit]
     return units_vocab, max_len
 
 
 def load_data_as_subword_units(name):
     tks = []
-    fread = open(name, 'r')
+    fread = open(name, 'rb')
     logging.info('reading corpus from file...')
     for line in fread:
         line = line.strip().decode('utf-8')
@@ -196,18 +199,18 @@ def provide_label(self):
 
 
 class DataIter(mx.io.DataIter):
-    def __init__(self, fname, batch_size, num_label):
+    def __init__(self, fname, batch_size_, num_label_):
         super(DataIter, self).__init__()
-        self.batch_size = batch_size
+        self.batch_size = batch_size_
         self.data, self.units, self.weights, self.negative_units, self.negative_weights, \
-        self.vocab, self.units_vocab, self.freq, self.max_len = load_data_as_subword_units(fname)
+            self.vocab, self.units_vocab, self.freq, self.max_len = load_data_as_subword_units(fname)
         self.vocab_size = len(self.units_vocab)
-        self.num_label = num_label
-        self.provide_data = [('data', (batch_size, num_label - 1, self.max_len)),
-                             ('mask', (batch_size, num_label - 1, self.max_len, 1))]
-        self.provide_label = [('label', (self.batch_size, num_label, self.max_len)),
-                              ('label_weight', (self.batch_size, num_label)),
-                              ('label_mask', (self.batch_size, num_label, self.max_len, 1))]
+        self.num_label = num_label_
+        self.provide_data = [('data', (batch_size_, num_label_ - 1, self.max_len)),
+                             ('mask', (batch_size_, num_label_ - 1, self.max_len, 1))]
+        self.provide_label = [('label', (self.batch_size, num_label_, self.max_len)),
+                              ('label_weight', (self.batch_size, num_label_)),
+                              ('label_mask', (self.batch_size, num_label_, self.max_len, 1))]
 
     def sample_ne(self):
         # a negative sample.
@@ -225,18 +228,18 @@ def __iter__(self):
         batch_label_weight = []
         start = random.randint(0, self.num_label - 1)
         for i in range(start, len(self.units) - self.num_label - start, self.num_label):
-            context_units = self.units[i: i + self.num_label / 2] + \
-                            self.units[i + 1 + self.num_label / 2: i + self.num_label]
-            context_mask = self.weights[i: i + self.num_label / 2] + \
-                           self.weights[i + 1 + self.num_label / 2: i + self.num_label]
-            target_units = self.units[i + self.num_label / 2]
-            target_word = self.data[i + self.num_label / 2]
+            context_units = self.units[i: i + self.num_label // 2] + \
+                            self.units[i + 1 + self.num_label // 2: i + self.num_label]
+            context_mask = self.weights[i: i + self.num_label // 2] + \
+                           self.weights[i + 1 + self.num_label // 2: i + self.num_label]
+            target_units = self.units[i + self.num_label // 2]
+            target_word = self.data[i + self.num_label // 2]
             if self.freq[target_word] < MIN_COUNT:
                 continue
             indices = self.sample_ne_indices()
             target = [target_units] + [self.negative_units[i] for i in indices]
             target_weight = [1.0] + [0.0 for _ in range(self.num_label - 1)]
-            target_mask = [self.weights[i + self.num_label / 2]] + [self.negative_weights[i] for i in indices]
+            target_mask = [self.weights[i + self.num_label // 2]] + [self.negative_weights[i] for i in indices]
 
             batch_data.append(context_units)
             batch_data_mask.append(context_mask)
@@ -271,6 +274,7 @@ def reset(self):
     parser = OptionParser()
     parser.add_option("-g", "--gpu", action="store_true", dest="gpu", default=False,
                       help="use gpu")
+    options, args = parser.parse_args()
 
     batch_size = BATCH_SIZE
     num_label = NUM_LABEL
@@ -279,20 +283,25 @@ def reset(self):
 
     network = get_net(data_train.vocab_size, num_label - 1, num_label)
 
-    options, args = parser.parse_args()
-    # devs = mx.cpu()
-    devs = [mx.cpu(i) for i in range(4)]
-    if options.gpu == True:
+    devs = mx.cpu()
+    if options.gpu:
         devs = mx.gpu()
-    model = mx.model.FeedForward(ctx=devs,
-                                 symbol=network,
-                                 num_epoch=NUM_EPOCH,
-                                 learning_rate=0.3,
-                                 momentum=0.9,
-                                 wd=0.0000,
-                                 initializer=mx.init.Xavier(factor_type="in", magnitude=2.34))
 
+    model = mx.mod.Module(
+        symbol=network,
+        data_names=[x[0] for x in data_train.provide_data],
+        label_names=[y[0] for y in data_train.provide_label],
+        context=[devs]
+    )
+
+    print("Training on {}".format("GPU" if options.gpu else "CPU"))
     metric = NceAuc()
-    model.fit(X=data_train,
-              eval_metric=metric,
-              batch_end_callback=mx.callback.Speedometer(batch_size, 50), )
+    model.fit(
+        train_data=data_train,
+        num_epoch=NUM_EPOCH,
+        optimizer='sgd',
+        optimizer_params={'learning_rate': 0.3, 'momentum': 0.9, 'wd': 0.0000},
+        initializer=mx.init.Xavier(factor_type='in', magnitude=2.34),
+        eval_metric=metric,
+        batch_end_callback=mx.callback.Speedometer(batch_size, 50)
+    )
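
The subword example relies on breaking each word into letter n-gram units and summing their embeddings, as the slides referenced in the README describe. The repository's own get_subword_units implementation is not shown in this diff, so the following is only an illustrative sketch under that assumption.

```
import numpy as np

def subword_units(word, gram=3, boundary='#'):
    # pad the word with boundary markers and take overlapping character n-grams
    padded = boundary + word + boundary
    return [padded[i:i + gram] for i in range(len(padded) - gram + 1)]

units = subword_units('where')          # ['#wh', 'whe', 'her', 'ere', 're#']
print(units)

# summing the unit embeddings gives the word vector
rng = np.random.RandomState(0)
unit_vocab = {u: i for i, u in enumerate(units)}
embed = rng.randn(len(unit_vocab), 100)
word_vec = sum(embed[unit_vocab[u]] for u in units)
print(word_vec.shape)
```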


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services