Posted to dev@bluemarlin.apache.org by GitBox <gi...@apache.org> on 2022/02/11 12:25:29 UTC

[GitHub] [incubator-bluemarlin] Bimlesh759-AI opened a new issue #46: [BLUEMARLIN-25] : Multiple GPU support for DIN lookalike model training

Bimlesh759-AI opened a new issue #46:
URL: https://github.com/apache/incubator-bluemarlin/issues/46


   1. The current DIN lookalike model training does not support multiple GPUs. We have two GPUs available, but training always uses only one. During training it should use all available GPUs.
   2. Alternatively, can the script be migrated to TensorFlow 2.0? That version provides APIs for using all available GPUs (see the sketch below).
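
   As a rough illustration of point 2 (a sketch with a placeholder Keras model, not the actual DIN lookalike code), TF 2.x exposes multi-GPU training through tf.distribute.MirroredStrategy:

   import tensorflow as tf

   # Uses every GPU visible to the process by default.
   strategy = tf.distribute.MirroredStrategy()
   print("Number of devices:", strategy.num_replicas_in_sync)

   with strategy.scope():
       # Variables created inside the scope are mirrored across GPUs and
       # gradients are aggregated automatically during fit().
       model = tf.keras.Sequential([
           tf.keras.layers.Dense(64, activation="relu"),
           tf.keras.layers.Dense(1, activation="sigmoid"),
       ])
       model.compile(optimizer="adam", loss="binary_crossentropy")

   # dataset would be a tf.data.Dataset of (features, labels) batches:
   # model.fit(dataset, epochs=10)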


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@bluemarlin.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-bluemarlin] jimmylao commented on issue #46: [BLUEMARLIN-25] : Multiple GPU support for DIN lookalike model training

Posted by GitBox <gi...@apache.org>.
jimmylao commented on issue #46:
URL: https://github.com/apache/incubator-bluemarlin/issues/46#issuecomment-1036274916


   @Bimlesh759-AI 
   In the long run, for parallel training of deep learning models you may want to make use of all available resources, including parallel (multiple GPUs on one machine) as well as distributed (multiple servers, each with GPUs) computing. At the moment there are two options that support both parallel and distributed training:
   1. TensorFlow 2
   2. Uber's open-source project, Horovod
   
   Since the code for the DIN model uses TensorFlow 1.x, some effort is needed for either TF 2 or Horovod.
   For TF 2, the current TF 1.x code needs to be upgraded to TF 2.
   For Horovod, TF 1.x is supported, but you still need to put in some work to wire it in - it is the option used by Amazon SageMaker. A sketch of the usual Horovod recipe follows.
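   
   Roughly, the standard Horovod recipe for TF 1.x graph-mode code looks like the sketch below. The model, batch shape, and step count are stand-ins, not the real DIN graph:
   
   import numpy as np
   import tensorflow as tf
   import horovod.tensorflow as hvd
   
   hvd.init()
   
   # One training process per GPU: pin this process to its local GPU.
   config = tf.ConfigProto()
   config.gpu_options.visible_device_list = str(hvd.local_rank())
   
   # Stand-in model; in practice this would be the existing DIN graph.
   x = tf.placeholder(tf.float32, [None, 16])
   y = tf.placeholder(tf.float32, [None, 1])
   logits = tf.layers.dense(x, 1)
   loss = tf.losses.sigmoid_cross_entropy(y, logits)
   
   global_step = tf.train.get_or_create_global_step()
   opt = tf.train.AdamOptimizer(0.001 * hvd.size())   # scale LR with worker count
   opt = hvd.DistributedOptimizer(opt)                # ring-allreduce the gradients
   train_op = opt.minimize(loss, global_step=global_step)
   
   hooks = [
       hvd.BroadcastGlobalVariablesHook(0),           # sync initial weights from rank 0
       tf.train.StopAtStepHook(last_step=1000),
   ]
   with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
       while not sess.should_stop():
           batch_x = np.random.rand(32, 16).astype(np.float32)             # placeholder batch
           batch_y = np.random.randint(0, 2, (32, 1)).astype(np.float32)
           sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
   
   Each process drives one GPU, so a two-GPU run would be launched with something like: horovodrun -np 2 python train_din.py (train_din.py being a placeholder script name).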
   
   Most importantly, you should compare how much gain (speed-up) each of these approaches actually achieves. Conceptually both should work for parallel training; in practice, the speed-up needs to be measured and compared quantitatively.
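   
   A simple way to make that comparison concrete (my own sketch, not part of the project) is to time a fixed number of training steps and report examples per second for the single-GPU and multi-GPU runs; run_steps, num_steps, and batch_size below are hypothetical placeholders:
   
   import time
   
   def examples_per_second(run_steps, num_steps=200, batch_size=512):
       # run_steps(n) should execute n training steps of whichever setup
       # (single GPU, MirroredStrategy, or Horovod) is being measured.
       run_steps(10)                    # warm-up steps, excluded from timing
       start = time.time()
       run_steps(num_steps)
       elapsed = time.time() - start
       return num_steps * batch_size / elapsed
   
   # speed_up = examples_per_second(multi_gpu_run) / examples_per_second(single_gpu_run)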
   

