Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2021/02/24 06:50:15 UTC

[GitHub] [incubator-mxnet] MrChengmo opened a new issue #19949: DistributeTraining throw "dmlc::Error" when using nn.Embedding(sparse_grad=True)

MrChengmo opened a new issue #19949:
URL: https://github.com/apache/incubator-mxnet/issues/19949


   ## Description
   Hi~ I am trying to use a sparse embedding plus a three-layer DNN to train a recommendation model on the Criteo dataset.
   
   Here is my network:
   ```python
   import math

   import mxnet as mx
   from mxnet.gluon import nn


   class CtrDnn(nn.HybridBlock):
       def __init__(self, sparse_feature_number, sparse_feature_dim,
                    dense_feature_dim, num_field, layer_sizes, **kwargs):
           super(CtrDnn, self).__init__(**kwargs)
           self.sparse_feature_number = sparse_feature_number
           self.sparse_feature_dim = sparse_feature_dim
   
           sizes = [sparse_feature_dim * num_field +
                    dense_feature_dim] + layer_sizes
   
           self.embedding = nn.Embedding(
               sparse_feature_number, sparse_feature_dim, sparse_grad=True)
   
           self.dense1 = nn.Dense(in_units=sizes[0],
                                  units=sizes[1],
                                  activation='relu',
                                  weight_initializer=mx.init.Normal(1.0 / math.sqrt(sizes[1])))
   
           self.dense2 = nn.Dense(in_units=sizes[1],
                                  units=sizes[2],
                                  activation='relu',
                                  weight_initializer=mx.init.Normal(1.0 / math.sqrt(sizes[2])))
   
           self.dense3 = nn.Dense(in_units=sizes[2],
                                  units=sizes[3],
                                  activation='relu',
                                  weight_initializer=mx.init.Normal(1.0 / math.sqrt(sizes[3])))
   
           self.dense4 = nn.Dense(in_units=layer_sizes[-1],
                                  units=2,
                                  weight_initializer=mx.init.Normal(1.0 / math.sqrt(sizes[-1])))
   
       def hybrid_forward(self, F, sparse_inputs, dense_inputs):
           sparse_embs = []
           for s_input in sparse_inputs:
               emb = self.embedding(s_input)
               sparse_embs.append(emb)
   
           for i in range(len(sparse_embs)):
               sparse_embs[i] = F.reshape(
                   sparse_embs[i], (-1, self.sparse_feature_dim))
   
           # concatenate all 26 sparse-field embeddings with the dense features
           dnn_input = F.concat(*sparse_embs, dense_inputs, dim=1)
           layer1 = self.dense1(dnn_input)
           layer2 = self.dense2(layer1)
           layer3 = self.dense3(layer2)
           dnn_output = self.dense4(layer3)
   
           return dnn_output
   ```
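
   The block above can be run without the Criteo data. The following usage sketch is an editor's addition, not part of the original report: it instantiates `CtrDnn` with the hyperparameters used in train.py below and runs one forward pass on random inputs, feeding a Python list of per-field index arrays exactly as train.py does. Shapes and values are arbitrary.

   ```python
   import mxnet as mx
   from mxnet import nd

   # hyperparameters taken from train.py below
   net = CtrDnn(sparse_feature_number=1000001, sparse_feature_dim=10,
                dense_feature_dim=13, num_field=26, layer_sizes=[400, 400, 400])
   net.initialize(ctx=mx.cpu())

   batch_size = 4
   # one (batch_size, 1) index array per sparse field, plus the dense features
   sparse_inputs = [nd.random.randint(0, 1000001, shape=(batch_size, 1)).astype('float32')
                    for _ in range(26)]
   dense_inputs = nd.random.uniform(shape=(batch_size, 13))

   out = net(sparse_inputs, dense_inputs)
   print(out.shape)  # expected: (4, 2)
   ```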
   
   It works well on a single machine, but when I try distributed training with `kv.create("dist_async")`, it throws the error below.
   
   ### Error Message
   
   ```bash
   [06:21:15] src/van.cc:310: Bind to role=worker, ip=192.168.1.2, port=35008, is_recovery=0
   [06:21:15] src/van.cc:257: W[9] is connected to others
   2021-02-24 06:21:15,768 - INFO - File list: ['./train_data/part-0']
   2021-02-24 06:21:15,775 - INFO - File: ./train_data/part-0 has 20000 examples
   2021-02-24 06:21:15,775 - INFO - Total example: 20000
   2021-02-24 06:21:16,346 - INFO - Load Data in memory finish, using time: 0.5777103900909424
   2021-02-24 06:21:16,347 - INFO - Epoch 0 training begin
   [06:21:16] src/operator/tensor/./.././../common/utils.h:473:
   Storage fallback detected:
   Copy from row_sparse storage type on cpu to default storage type on cpu.
   A temporary ndarray with default storage type will be generated in order to perform the copy. This does not affect the correctness of the programme. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning.
   terminate called after throwing an instance of 'dmlc::Error'
     what():  [06:21:16] /home/centos/mxnet/3rdparty/ps-lite/include/ps/kv_app.h:697: Check failed: lens->size() == keys.size() (7626 vs. 2) :
   ```
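
   Editor's note, not part of the original report: the check that fails is in ps-lite's KV layer, which receives the row_sparse gradient produced by `nn.Embedding(sparse_grad=True)`. The following minimal sketch uses only documented KVStore calls (`set_optimizer`, `init`, `push`, `row_sparse_pull`) to exercise the same row_sparse push/pull path, which may help narrow down whether the failure is in the kvstore itself or in the Gluon trainer integration. The key name, sizes, row ids, and optimizer are arbitrary, and the sketch has not been verified against the failing dist_async setup.

   ```python
   import mxnet as mx

   # 'local' runs standalone; 'dist_async' additionally needs a launched
   # scheduler/server/worker cluster (see the launcher sketch further below)
   kv = mx.kv.create('local')
   kv.set_optimizer(mx.optimizer.SGD(learning_rate=0.01))

   num_rows, dim = 1000, 10  # small stand-ins for sparse_feature_number / sparse_feature_dim

   # the "embedding table" stored as row_sparse, as with sparse_grad=True
   kv.init('emb_weight', mx.nd.ones((num_rows, dim)).tostype('row_sparse'))

   # a gradient that touches only a few rows, like a sparse embedding gradient
   row_ids = mx.nd.array([3, 7, 42], dtype='int64')
   grad = mx.nd.sparse.row_sparse_array(
       (mx.nd.ones((3, dim)), row_ids), shape=(num_rows, dim))
   kv.push('emb_weight', grad)

   # pull back only the rows that were touched
   out = mx.nd.sparse.zeros('row_sparse', (num_rows, dim))
   kv.row_sparse_pull('emb_weight', out=out, row_ids=row_ids)
   print(out.data.shape)
   ```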
   
   ## To Reproduce
   
   Here is my train.py
   
   ```python
   import logging
   import os
   import time

   import mxnet as mx
   from mxnet import autograd, gluon, kv, nd, np
   from sklearn import metrics

   # CtrDnn and Reader are defined elsewhere in the author's repository (linked below)

   # minimal logger setup; the author's script may configure logging differently
   logging.basicConfig(level=logging.INFO)
   logger = logging.getLogger(__name__)


   class Train(object):
   
       def run(self):
           # hyper parameters
           epochs = 1
           batch_size = 1000
           sparse_feature_number = 1000001
           sparse_feature_dim = 10
           dense_feature_dim = 13
           num_field = 26
           layer_sizes = [400, 400, 400]
           train_data_path = "./train_data"
           print_step = 5
           distributed_train = True
           cpu_num = int(os.getenv("CPU_NUM", 1))
   
           # create network
           ctx = mx.cpu()
           net = CtrDnn(sparse_feature_number, sparse_feature_dim,
                        dense_feature_dim, num_field, layer_sizes)
           net.initialize(ctx=ctx)
           # net.hybridize()
   
           self.loss = gluon.loss.SoftmaxCrossEntropyLoss()
   
           if distributed_train:
               self.store = kv.create('dist_async')
           else:
               self.store = kv.create('local')
   
           # Load the training data
           reader_start_time = time.time()
   
           file_list = self.get_file_list(train_data_path, distributed_train)
           reader = Reader()
           dataset = reader.load_criteo_dataset(file_list)
           train_data = gluon.data.DataLoader(
               dataset, batch_size, num_workers=cpu_num, last_batch="discard")
           reader_end_time = time.time()
           logger.info("Load Data in memory finish, using time: {}".format(
               reader_end_time - reader_start_time))
   
           if distributed_train:
               trainer = gluon.Trainer(net.collect_params(), 'adam', {
                   'learning_rate': 0.0001, 'lazy_update': True}, kvstore=self.store, update_on_kvstore=True)
           else:
               trainer = gluon.Trainer(net.collect_params(), 'adam', {
                                       'learning_rate': 0.0001}, kvstore=self.store)
   
           for epoch in range(epochs):
               logger.info("Epoch {} training begin".format(epoch))
               epoch_start_time = time.time()
   
               batch_id = 1
               train_run_cost = 0.0
               total_examples = 0
               self.global_score = None
               self.global_label = None
   
               for batch in train_data:
                   train_start = time.time()
                   loss_value = self.train_batch(
                       batch, ctx, net, trainer)
   
                   train_run_cost += (time.time() - train_start)
                   total_examples += batch_size
   
                   batch_id += 1
                   if batch_id % print_step == 0:
                       metric_start = time.time()
                       fpr, tpr, _ = metrics.roc_curve(
                           list(self.global_label.asnumpy()), list(self.global_score.asnumpy()))
                       auc_value = metrics.auc(fpr, tpr)
                       train_run_cost += (time.time() - metric_start)
   
                       metrics_string = "auc: {}, loss: {}".format(
                           auc_value, loss_value)
                       profiler_string = ""
                       profiler_string += "using_time: {} sec ".format(
                           train_run_cost)
                       profiler_string += "avg_batch_cost: {} sec, ".format(
                           format((train_run_cost) / print_step, '.5f'))
                       profiler_string += " ips: {} example/sec ".format(
                           format(total_examples / (train_run_cost), '.5f'))
                       logger.info("Epoch: {}, Batch: {}, {} {}".format(
                           epoch, batch_id, metrics_string, profiler_string))
                       train_run_cost = 0.0
                       total_examples = 0
   
               epoch_end_time = time.time()
               logger.info(
                   "Epoch: {}, using time {} second,".format(
                       epoch, epoch_end_time - epoch_start_time))
   
       def calc_auc(self, label, output):
           output_exp = output.exp()
           partition = output_exp.sum(axis=1, keepdims=True)
           score = output_exp / partition
           score = nd.slice_axis(score, axis=1, begin=1, end=2)
   
           if self.global_score is None:
               # for first time
               self.global_score = score
               self.global_label = label
           else:
               self.global_score = nd.concat(self.global_score, score, dim=0)
               self.global_label = nd.concat(self.global_label, label, dim=0)
   
       def forward_backward(self, network, label, sparse_input, dense_input):
           # Ask autograd to remember the forward pass
           with autograd.record():
               output = network(sparse_input, dense_input)
               losses = self.loss(output, label)
               self.calc_auc(label, output)
   
           losses.backward()
   
           return np.mean(losses.as_np_ndarray())
   
       def train_batch(self, batch_list, context, network, gluon_trainer):
           label = batch_list[0]
           # label = gluon.utils.split_and_load(label, context)
   
           sparse_input = batch_list[1:-1]
   
           dense_input = batch_list[-1]
   
           # Run the forward and backward pass
           loss = self.forward_backward(network, label, sparse_input, dense_input)
   
           # Update the parameters
           this_batch_size = batch_list[0].shape[0]
           gluon_trainer.step(this_batch_size)
   
           return loss
   
       def get_example_num(self, file_list):
           count = 0
           for f in file_list:
               last_count = count
               for _, _ in enumerate(open(f, 'r')):
                   count += 1
               logger.info("File: %s has %s examples" % (f, count - last_count))
           logger.info("Total example: %s" % count)
           return count
   
       def get_file_list(self, data_path, split_file_list=False):
           assert os.path.exists(data_path)
           file_list = [data_path + "/%s" % x for x in os.listdir(data_path)]
           file_list.sort()
           if split_file_list:
               file_list = self.get_file_shard(file_list)
           logger.info("File list: {}".format(file_list))
           self.get_example_num(file_list)
           return file_list
   
       def get_file_shard(self, files):
           if not isinstance(files, list):
               raise TypeError("files should be a list of file need to be read.")
   
           trainer_id = self.store.rank
           trainers = self.store.num_workers
   
           remainder = len(files) % trainers
           blocksize = int(len(files) / trainers)
   
           blocks = [blocksize] * trainers
           for i in range(remainder):
               blocks[i] += 1
   
           trainer_files = [[]] * trainers
           begin = 0
           for i in range(trainers):
               trainer_files[i] = files[begin:begin + blocks[i]]
               begin += blocks[i]
   
           return trainer_files[trainer_id]
   ```
   
   ### Steps to reproduce
   
   I uploaded the complete code here: https://github.com/MrChengmo/MxnetPS-Example/tree/main/dnn
   
   1. Train on a single machine: `python -u train.py`
   2. Simulate distributed training on a single machine: `bash local_cluster.sh` (a rough sketch of such a launcher follows)
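
   `local_cluster.sh` itself is not included in this post; the sketch below is an editor's hypothetical stand-in, not the author's script. It shows the usual way a dist_async cluster is simulated on one machine: one scheduler, one server, and two workers, each running the same training script with the DMLC_* environment variables that ps-lite reads. The host, port, and process counts are arbitrary.

   ```python
   import os
   import subprocess

   # common ps-lite settings read by MXNet's distributed kvstore
   base_env = dict(os.environ,
                   DMLC_PS_ROOT_URI="127.0.0.1",
                   DMLC_PS_ROOT_PORT="9091",
                   DMLC_NUM_SERVER="1",
                   DMLC_NUM_WORKER="2")

   procs = []
   for role, count in [("scheduler", 1), ("server", 1), ("worker", 2)]:
       for _ in range(count):
           env = dict(base_env, DMLC_ROLE=role)
           # scheduler and server processes never reach the training code:
           # MXNet detects their role at import time and runs the server loop,
           # so every role can be given the same training command
           procs.append(subprocess.Popen(["python", "-u", "train.py"], env=env))

   for p in procs:
       p.wait()
   ```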
   
   ## What have you tried to solve it?
   
   1. It works well when `distributed_train=False` is set.
   2. It does not work when `net.hybridize()` is used either.
   
   ## Environment
   
   I run my code in the deepo Docker image, with MXNet version 1.7.0:
   ```bash
   docker pull ufoym/deepo:cpu
   ```



[GitHub] [incubator-mxnet] github-actions[bot] commented on issue #19949: DistributeTraining throw "dmlc::Error" when using nn.Embedding(sparse_grad=True)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #19949:
URL: https://github.com/apache/incubator-mxnet/issues/19949#issuecomment-784836330


   Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue.
   Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly.
   If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on [contributing to MXNet](https://mxnet.apache.org/community/contribute) and our [development guides wiki](https://cwiki.apache.org/confluence/display/MXNET/Developments).



[GitHub] [incubator-mxnet] MrChengmo edited a comment on issue #19949: DistributeTraining throw "dmlc::Error" when using nn.Embedding(sparse_grad=True)

Posted by GitBox <gi...@apache.org>.
MrChengmo edited a comment on issue #19949:
URL: https://github.com/apache/incubator-mxnet/issues/19949#issuecomment-785559171


   > Do you observe the Storage fallback also on single machine? Or does it only occur in the distributed setting?
   
   - Set `kv_store="local"` & `nn.Embedding(sparse_grad=True)` (`distributed_train=False`): works well on a single machine, run with `python -u train.py`.
   - Set `kv_store="dist_async"` & `nn.Embedding(sparse_grad=True)` (`distributed_train=True`): run `bash local_cluster.sh` to simulate distributed training; it throws the error.


[GitHub] [incubator-mxnet] leezu commented on issue #19949: DistributeTraining throw "dmlc::Error" when using nn.Embedding(sparse_grad=True)

Posted by GitBox <gi...@apache.org>.
leezu commented on issue #19949:
URL: https://github.com/apache/incubator-mxnet/issues/19949#issuecomment-785248093


   Do you observe the Storage fallback also on single machine? Or does it only occur in the distributed setting?

