Posted to commits@submarine.apache.org by GitBox <gi...@apache.org> on 2020/04/19 16:34:59 UTC

[GitHub] [submarine] wangdatan commented on a change in pull request #261: SUBMARINE-469. Update Submarine Architecture Doc

wangdatan commented on a change in pull request #261: SUBMARINE-469. Update Submarine Architecture Doc
URL: https://github.com/apache/submarine/pull/261#discussion_r410942191
 
 

 ##########
 File path: docs/design/architecture-and-requirements.md
 ##########
 @@ -142,164 +137,211 @@ Following items are charters of Submarine project:
 5) Users can define a list of parameters of a notebook (looks like parameters of the notebook's main function) to allow executing a notebook like a job. (P1)
 6) Different users can collaborate on the same notebook at the same time. (P2)
 
-#### Job
+A running notebook instance is called a notebook session (or a session for short).
 
-Job of Submarine is an executable code section. It could be a shell command, a Python command, a Spark job, a SQL query, a training job (such as Tensorflow), etc. 
+### Experiment
 
-1) Job can be submitted from UI/CLI.
-2) Job can be monitored/managed from UI/CLI.
-3) Job should not bind to one resource management platform (YARN/K8s).
+An experiment in Submarine is an offline task. It could be a shell command, a Python command, a Spark job, a SQL query, or even a workflow. 
 
-#### Training Job
+The primary purpose of experiments in Submarine's context is to run training tasks, offline scoring, etc. However, experiments can be generalized to other tasks as well.
 
-Training job is a special kind of job, which includes Tensorflow, PyTorch, and other different frameworks: 
+Major requirements of experiments: 
 
-1) Allow model engineer, data scientist to run *unmodified* Tensorflow programs on YARN/K8s/Container-cloud. 
-2) Allow jobs easy access data/models in HDFS and other storage. 
-3) Support run distributed Tensorflow jobs with simple configs.
-4) Support run user-specified Docker images.
-5) Support specify GPU and other resources.
-6) Support launch tensorboard (and other equivalents for non-TF frameworks) for training jobs if user specified.
+1) Experiments can be submitted from UI/CLI/SDK.
+2) Experiments can be monitored/managed from UI/CLI/SDK.
+3) Experiments should not bind to one resource management platform (K8s/YARN).
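
To make the first requirement concrete, a submission from the SDK side could look roughly like the sketch below. The payload layout, field names, and endpoint are assumptions for illustration, not Submarine's actual REST schema.

```python
import json

# Hypothetical experiment payload builder. The field names ("meta",
# "environment", "spec") are illustrative assumptions, not the real
# Submarine REST schema.
def build_experiment_spec(name, image, command, resources):
    """Assemble a JSON-serializable payload describing an experiment."""
    return {
        "meta": {"name": name, "namespace": "default"},
        "environment": {"image": image},
        "spec": {
            "Worker": {"replicas": 1, "resources": resources, "cmd": command}
        },
    }

payload = build_experiment_spec(
    name="mnist-demo",
    image="apache/submarine:tf-mnist",
    command="python /opt/mnist.py",
    resources="cpu=1,memory=1024M",
)
# The payload would then be POSTed to the server, e.g.:
#   requests.post("http://<server>/api/v1/experiment", json=payload)
print(json.dumps(payload, indent=2))
```

Because the spec is plain data, the same payload can back the UI, CLI, and SDK paths.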
 
-[TODO] (Need help)
+#### Type of experiments
+
+![](../assets/design/experiments.png)
+
+There are two types of experiments: 
+
+`Adhoc experiments`: a Python/R script, a notebook, or even an ad hoc Tensorflow/PyTorch task, etc. 
+
+`Predefined experiment library`: specialized experiments built from developed libraries such as CTR, BERT, etc. Users are only required to specify a few parameters such as input, output, hyperparameters, etc., instead of worrying about where the training script and its dependencies are located.
+
+#### Adhoc experiment
+
+Requirements:
+
+- Allow running ad hoc scripts.
+- Allow model engineers and data scientists to run Tensorflow/PyTorch programs on YARN/K8s/container clouds. 
+- Allow jobs to easily access data/models in HDFS/S3, etc. 
+- Support running distributed Tensorflow/PyTorch jobs with simple configs.
+- Support running user-specified Docker images.
+- Support specifying GPUs and other resources.
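
To illustrate how these requirements could fit together, an ad hoc distributed experiment might be described by a config like the following (all field names here are hypothetical, not Submarine's actual spec):

```
{
  "name": "distributed-tf-demo",
  "dockerImage": "tf-mnist:latest",
  "input": "hdfs:///user/submarine/data/mnist",
  "roles": {
    "ps": { "replicas": 1, "resources": "cpu=2,memory=4G" },
    "worker": { "replicas": 2, "resources": "cpu=4,memory=8G,gpu=1" }
  }
}
```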
+
+#### Predefined experiment library
+
+Here's an example of a predefined experiment library config to train a DeepFM model: 
+
+```
+{
+  "input": {
+    "train_data": ["hdfs:///user/submarine/data/tr.libsvm"],
+    "valid_data": ["hdfs:///user/submarine/data/va.libsvm"],
+    "test_data": ["hdfs:///user/submarine/data/te.libsvm"],
+    "type": "libsvm"
+  },
+  "output": {
+    "save_model_dir": "hdfs:///user/submarine/deepfm",
+    "metric": "auc"
+  },
+  "training": {
+    "batch_size" : 512,
+    "field_size": 39,
+    "num_epochs": 3,
+    "feature_size": 117581,
+    ...
+  }
+}
+```
+
+Predefined experiment libraries can be shared across users on the same platform. Users can also add new or modify existing predefined experiment libraries via the UI/REST API.
+
+We will also model AutoML and automatic hyperparameter tuning as predefined experiment libraries.
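
Since a predefined experiment library is driven entirely by such a spec, the platform can validate it before scheduling. A minimal sketch, where the section names follow the DeepFM example above but the validation rules themselves are illustrative assumptions:

```python
# Illustrative validation of a predefined experiment spec, using the
# section names from the DeepFM example ("input", "output", "training").
# Which fields are required is an assumption for this sketch.
REQUIRED_FIELDS = {
    "input": {"train_data", "type"},
    "output": {"save_model_dir"},
    "training": set(),  # training parameters vary per library
}

def validate_spec(spec):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for section, required in REQUIRED_FIELDS.items():
        if section not in spec:
            problems.append(f"missing section: {section}")
            continue
        for field in sorted(required - spec[section].keys()):
            problems.append(f"missing field: {section}.{field}")
    return problems

spec = {
    "input": {"train_data": ["hdfs:///user/submarine/data/tr.libsvm"],
              "type": "libsvm"},
    "output": {"save_model_dir": "hdfs:///user/submarine/deepfm",
               "metric": "auc"},
    "training": {"batch_size": 512, "num_epochs": 3},
}
print(validate_spec(spec))  # []
```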
+
+#### Pipeline 
+
+Pipeline is a special kind of experiment:
+
+- A pipeline is a DAG of experiments, and can itself be treated as an experiment. 
+- Users can submit/terminate a pipeline.
+- Pipelines can be created/submitted via UI/API.
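
The DAG structure can be sketched with plain Python (the experiment names and dependencies below are made up; `graphlib` is the standard-library topological sorter, Python 3.9+):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Toy pipeline: a DAG whose nodes are experiments, mapping each
# experiment to the set of experiments it depends on.
pipeline = {
    "preprocess": set(),
    "train": {"preprocess"},   # train runs after preprocess
    "evaluate": {"train"},     # evaluate runs after train
}

# A valid execution order for the pipeline's experiments.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['preprocess', 'train', 'evaluate']
```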
+
+### Environment Profiles
+
+Environment profiles (or environments for short) define a set of libraries and, when Docker is being used, a Docker image, in order to run an experiment or a notebook. 
+
+Docker or VM image (such as AMI: Amazon Machine Images) defines the base layer of the environment. 
+
+On top of that, users can define a set of libraries (such as Python/R) to install.
+
+Users can save different environment configs, which can also be shared across the platform. Environment profiles can be used to run a notebook (e.g. by choosing a different kernel from Jupyter) or an experiment. A predefined experiment library includes which environment to use, so users don't have to choose one.
+
+Environments can be added/listed/deleted/selected through CLI/SDK.
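
For illustration, a saved environment profile might look like the config below (the field names are assumptions for this sketch, not Submarine's actual environment schema):

```
{
  "name": "notebook-tf-env",
  "dockerImage": "ubuntu:20.04",
  "kernelSpec": {
    "channels": ["defaults"],
    "dependencies": ["python=3.7", "tensorflow=2.3"]
  }
}
```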
 
-#### Model Management 
+### Model
 
-After training, there will be model artifacts created. Users should be able to:
+#### Model management 
 
-1) View model metrics.
-2) Save, versioning, tagging model.
-3) Run model verification tasks. 
-4) Run A/B testing, push to production, etc.
+- Model artifacts are generated by experiments or notebooks.
+- A model consists of artifacts from one or multiple files. 
+- Users can choose to save, tag, and version a produced model.
+- Once the model is saved, users can run online serving (an endpoint) or offline scoring with the model.
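
The save/tag/version flow above can be sketched with a toy in-memory registry (this is a stand-in for illustration, not Submarine's actual model registry API):

```python
# Toy in-memory stand-in for the save/tag/version flow described above;
# NOT Submarine's actual model registry API.
class ModelRegistry:
    def __init__(self):
        self._models = {}  # model name -> list of version records

    def save(self, name, artifact_paths, tags=()):
        """Register a new version of a model and return its version number."""
        versions = self._models.setdefault(name, [])
        versions.append({
            "version": len(versions) + 1,
            "artifacts": list(artifact_paths),  # one or multiple files
            "tags": set(tags),
        })
        return versions[-1]["version"]

    def latest(self, name):
        """Return the most recently saved version record for a model."""
        return self._models[name][-1]

registry = ModelRegistry()
v1 = registry.save("deepfm", ["hdfs:///user/submarine/deepfm/model.pb"],
                   tags=["baseline"])
print(v1)  # 1
```

A saved, versioned record like this is what online serving (an endpoint) or offline scoring would later look up.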
+
+#### Model serving (Endpoint)
 
 Review comment:
   Make sense, updating.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services