Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/09/17 11:05:26 UTC

[GitHub] liu6381810 opened a new issue #12577: Training with fc and multi-gpu is much slower than single gpu

URL: https://github.com/apache/incubator-mxnet/issues/12577
 
 
   
   ## Description
   Training with multiple GPUs is much slower than training with a single GPU.
   
   ## Environment info (Required)
   Ubuntu 16.04
   Python 2.7
   latest MXNet
   I use the symbol API, not Gluon
   8 × V100 GPUs
   
   
   
   
   ## Details:
   I am training an FM (factorization machine) model for recommender systems. I don't follow the sparse example in MXNet; I use an embedding layer instead.
   With one V100 and batch_size=8192, I get 400,000+ samples per second.
   If I switch to 8 V100s with the same total batch_size=8192, so that each V100 gets a sub-batch of 1024, the speed drops to 18,000+ samples per second.
   So 8 V100s are much slower than one V100.
   I don't think the bottleneck is data I/O, because I can reach 400,000+ samples per second with one V100.
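
   For context, this is roughly how the multi-GPU run is launched (a minimal sketch using the standard mx.mod.Module API; the iterator name and hyperparameters are placeholders, not my exact script):

        import mxnet as mx

        # Bind the symbol across 8 GPUs; Module splits each batch of 8192
        # evenly, so every V100 processes a sub-batch of 1024.
        ctx = [mx.gpu(i) for i in range(8)]
        mod = mx.mod.Module(symbol=self.model,
                            data_names=('feat_index', 'feat_value'),
                            label_names=('label',),
                            context=ctx)
        mod.fit(train_iter,                   # placeholder data iterator
                optimizer='adam',
                optimizer_params={'learning_rate': 0.001},
                kvstore='device',             # aggregate gradients on GPU
                num_epoch=10)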
   
   I also tried stacking 40 fc layers to fit data generated by y = ax + b;
   2 GPUs are again slower than 1 GPU with the same batch_size (see the sketch below).
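
   A minimal sketch of that toy network (the hidden width of 64 is a placeholder, not my exact setting):

        import mxnet as mx

        # 40 stacked fully connected layers fitting y = a*x + b
        net = mx.symbol.Variable('data')
        for i in range(40):
            net = mx.symbol.FullyConnected(net, num_hidden=64, name='fc%d' % i)
            net = mx.symbol.Activation(net, act_type='relu')
        net = mx.symbol.FullyConnected(net, num_hidden=1, name='fc_out')
        # squared-error regression against the scalar target y
        net = mx.symbol.LinearRegressionOutput(net, mx.symbol.Variable('label'))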
   
   
   What could be the reason for this? I would appreciate any advice!
   
   Below is my symbol:
   
        self.feat_index = mx.symbol.Variable(name="feat_index")  # batch * F
        self.feat_value = mx.symbol.Variable(name="feat_value")  # batch * F
        self.label = mx.symbol.Variable(name="label")

        self.weights = self._initialize_weights()

        # feature embeddings, scaled by the feature values
        self.embeddings = mx.symbol.Embedding(self.feat_index,
                                              input_dim=self.feature_size,
                                              output_dim=self.embedding_size,
                                              name="embed1")  # batch * F * K

        feat_value = mx.symbol.reshape(self.feat_value,
                                       shape=(-1, self.field_size, 1))  # batch * F * 1
        self.embeddings = mx.symbol.broadcast_mul(self.embeddings, feat_value)

        # ---------- first order term ----------
        self.y_first_order = mx.symbol.Embedding(self.feat_index,
                                                 input_dim=self.feature_size,
                                                 output_dim=1,
                                                 name="embed2")  # batch * F * 1
        self.y_first_order = mx.symbol.sum(
            mx.symbol.elemwise_mul(self.y_first_order, feat_value),
            axis=2)  # batch * F

        # ---------- second order term ----------
        # square of sum
        self.summed_features_emb = mx.symbol.sum(self.embeddings, axis=1)  # batch * K
        self.summed_features_emb_square = mx.symbol.square(self.summed_features_emb)  # batch * K

        # sum of squares
        self.squared_features_emb = mx.symbol.square(self.embeddings)
        self.squared_sum_features_emb = mx.symbol.sum(self.squared_features_emb, axis=1)  # batch * K

        # 0.5 * ((sum of embeddings)^2 - sum of (embeddings^2))
        self.y_second_order = 0.5 * mx.symbol.elemwise_sub(self.summed_features_emb_square,
                                                           self.squared_sum_features_emb)  # batch * K

        # ---------- final projection ----------
        self.concat_input = mx.symbol.concat(self.y_first_order, self.y_second_order, dim=1)
        self.out = mx.symbol.sum(
            mx.symbol.broadcast_add(
                mx.symbol.dot(self.concat_input, self.weights["concat_projection"]),
                self.weights["concat_bias"]),
            axis=1)

        self.model = mx.symbol.LogisticRegressionOutput(self.out, self.label)
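
   The snippet uses self.weights from _initialize_weights(), which isn't shown above. A plausible sketch of that helper, assuming the shapes implied by the comments (concat_input is batch * (F + K), projected to a single logit):

        def _initialize_weights(self):
            # concat_projection maps (F + K) -> 1; concat_bias is a scalar
            # kept as shape (1, 1) so broadcast_add sees matching ndims
            weights = {}
            weights["concat_projection"] = mx.symbol.Variable(
                "concat_projection",
                shape=(self.field_size + self.embedding_size, 1))
            weights["concat_bias"] = mx.symbol.Variable(
                "concat_bias", shape=(1, 1))
            return weights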

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services