Posted to dev@singa.apache.org by "wangwei (JIRA)" <ji...@apache.org> on 2015/12/30 04:38:49 UTC

[jira] [Created] (SINGA-119) Remove job registration before launching the training program

wangwei created SINGA-119:
-----------------------------

             Summary: Remove job registration before launching the training program
                 Key: SINGA-119
                 URL: https://issues.apache.org/jira/browse/SINGA-119
             Project: Singa
          Issue Type: New Feature
            Reporter: wangwei
            Assignee: Sheng Wang


Job registration, including getting the job ID, is necessary for training in a cluster. It is done in the `bin/singa-run.sh` script, before the script connects to each node via ssh to invoke the training program.

In some situations, e.g., a small model or a single node with multiple GPU cards, users do not need to train the model on multiple nodes; many models can be trained in a single process on one node with multiple GPU cards. In this case, it would be better to remove the job registration step to make job launching simpler. For instance, users could start the training by
{code}
./singa -conf examples/cifar10/job.conf
{code}
or via a Python script (see SINGA-81)
{code}
python tool/python/examples/cifar10.py
{code}

The job ID is determined inside the program by cluster_rt.cc, which communicates with the ZooKeeper server. We may later make ZooKeeper an optional dependency for training on a single node, as it is mainly used for generating a unique job ID.
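
To illustrate what an optional registration step might look like, here is a minimal Python sketch. It is not SINGA code: `get_job_id`, `local_job_id`, and the `register` callback are hypothetical names, and the timestamp-plus-PID scheme is only an assumption about how a single-node ID could be produced without a coordination service.

```python
import os
import time
from typing import Callable, Optional


def local_job_id() -> int:
    """Build a job ID without any coordination service (hypothetical scheme)."""
    # Combine the wall-clock second with the low 22 bits of the process ID.
    # This tells apart jobs started on one machine; it is NOT a
    # cluster-wide uniqueness guarantee.
    return (int(time.time()) << 22) | (os.getpid() & 0x3FFFFF)


def get_job_id(register: Optional[Callable[[], int]] = None) -> int:
    """Return a job ID, registering with a cluster service only if one is given.

    `register` stands in for whatever call would obtain an ID from the
    cluster runtime; it is an assumption, not an existing SINGA function.
    """
    if register is not None:
        return register()
    return local_job_id()
```

A single-node run would call `get_job_id()` with no argument, while a cluster run would pass in a callable that performs the registration.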

In the extreme case of a single worker, we do not need to create a server thread. Instead, we can create an Updater instance inside the worker, which updates the parameters locally. This would speed up training on a single GPU card, because gradients and parameters would no longer be transferred between the worker and the server. Currently, we have to transfer the gradients from the worker (GPU memory) to the server (CPU memory), which is time-consuming.
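
The single-worker idea can be sketched roughly as follows. This is a toy Python illustration, not the SINGA implementation: the `SGDUpdater` and `Worker` classes, their methods, and the quadratic objective are all assumptions made for the example. The point it shows is that the worker owns the updater and applies gradients in place, so nothing is sent to a separate server.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


class SGDUpdater:
    """Plain SGD step, applied in place (hypothetical stand-in for an Updater)."""

    def __init__(self, lr: float) -> None:
        self.lr = lr

    def update(self, params: Params, grads: Params) -> None:
        for name, grad in grads.items():
            values = params[name]
            for i, g in enumerate(grad):
                values[i] -= self.lr * g


class Worker:
    """A worker that holds its own updater instead of talking to a server."""

    def __init__(self, params: Params, updater: SGDUpdater) -> None:
        self.params = params
        self.updater = updater

    def compute_gradients(self) -> Params:
        # Toy objective 0.5 * sum(w^2), whose gradient is w itself.
        return {name: list(values) for name, values in self.params.items()}

    def train_step(self) -> None:
        grads = self.compute_gradients()
        # Local update: no gradient or parameter transfer to a server.
        self.updater.update(self.params, grads)


worker = Worker({"w": [1.0, -2.0]}, SGDUpdater(lr=0.5))
worker.train_step()
```

In a real GPU setting the parameters and gradients would both stay in device memory; the lists here only stand in for that.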



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)