Posted to commits@singa.apache.org by wa...@apache.org on 2020/04/08 16:51:37 UTC

[singa-doc] branch master updated: archive version 3.0.0.rc1

This is an automated email from the ASF dual-hosted git repository.

wangwei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/singa-doc.git


The following commit(s) were added to refs/heads/master by this push:
     new 0c8dd58  archive version 3.0.0.rc1
0c8dd58 is described below

commit 0c8dd58569aaa2de306f0bda6e58e6735efd79e6
Author: wang wei <wa...@gmail.com>
AuthorDate: Thu Apr 9 00:46:42 2020 +0800

    archive version 3.0.0.rc1
---
 .../versioned_docs/version-3.0.0.rc1/autograd.md   | 267 ++++++++
 .../version-3.0.0.rc1/benchmark-train.md           |  27 +
 .../versioned_docs/version-3.0.0.rc1/build.md      | 529 ++++++++++++++
 .../version-3.0.0.rc1/contribute-code.md           | 118 ++++
 .../version-3.0.0.rc1/contribute-docs.md           | 105 +++
 .../versioned_docs/version-3.0.0.rc1/device.md     |  33 +
 .../versioned_docs/version-3.0.0.rc1/dist-train.md | 427 ++++++++++++
 .../version-3.0.0.rc1/download-singa.md            | 170 +++++
 .../versioned_docs/version-3.0.0.rc1/examples.md   |  57 ++
 .../version-3.0.0.rc1/git-workflow.md              | 131 ++++
 .../versioned_docs/version-3.0.0.rc1/graph.md      | 590 ++++++++++++++++
 .../version-3.0.0.rc1/history-singa.md             |  42 ++
 .../version-3.0.0.rc1/how-to-release.md            | 142 ++++
 .../version-3.0.0.rc1/install-win.md               | 400 +++++++++++
 .../version-3.0.0.rc1/installation.md              | 144 ++++
 .../version-3.0.0.rc1/issue-tracking.md            |  12 +
 .../versioned_docs/version-3.0.0.rc1/mail-lists.md |  16 +
 .../versioned_docs/version-3.0.0.rc1/onnx.md       | 762 +++++++++++++++++++++
 .../releases/RELEASE_NOTES_0.1.0.md                | 153 +++++
 .../releases/RELEASE_NOTES_0.2.0.md                |  82 +++
 .../releases/RELEASE_NOTES_0.3.0.md                |  43 ++
 .../releases/RELEASE_NOTES_1.0.0.md                |  96 +++
 .../releases/RELEASE_NOTES_1.1.0.md                |  57 ++
 .../releases/RELEASE_NOTES_1.2.0.md                |  63 ++
 .../releases/RELEASE_NOTES_2.0.0.md                |  56 ++
 .../versioned_docs/version-3.0.0.rc1/security.md   |  10 +
 .../version-3.0.0.rc1/software-stack.md            | 143 ++++
 .../version-3.0.0.rc1/source-repository.md         |  24 +
 .../versioned_docs/version-3.0.0.rc1/team-list.md  |  59 ++
 .../versioned_docs/version-3.0.0.rc1/tensor.md     | 241 +++++++
 .../version-3.0.0.rc1-sidebars.json                |  34 +
 docs-site/website/versions.json                    |   1 +
 32 files changed, 5034 insertions(+)

diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/autograd.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/autograd.md
new file mode 100644
index 0000000..3ebefa6
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/autograd.md
@@ -0,0 +1,267 @@
+---
+id: version-3.0.0.rc1-autograd
+title: Autograd
+original_id: autograd
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+There are two typical ways to implement autograd: symbolic differentiation, as
+in [Theano](http://deeplearning.net/software/theano/index.html), or reverse-mode
+differentiation, as in
+[PyTorch](https://pytorch.org/docs/stable/notes/autograd.html). SINGA follows
+the PyTorch approach, which records the computation graph and applies backward
+propagation automatically after forward propagation. The autograd algorithm is
+explained in detail
+[here](https://pytorch.org/docs/stable/notes/autograd.html). We explain the
+relevant modules in SINGA and give examples to illustrate the usage.
+
+## Relevant Modules
+
+There are three classes involved in autograd, namely `singa.tensor.Tensor`,
+`singa.autograd.Operation`, and `singa.autograd.Layer`. In the rest of this
+article, we use tensor, operation and layer to refer to an instance of the
+respective class.
+
+### Tensor
+
+Three attributes of Tensor are used by autograd:
+
+- `.creator` is an `Operation` instance. It records the operation that generates
+  the Tensor instance.
+- `.requires_grad` is a boolean variable. It is used to indicate that the
+  autograd algorithm needs to compute the gradient of the tensor (i.e., the
+  owner). For example, during backpropagation, the gradients of the tensors for
+  the weight matrix of a linear layer and the feature maps of a convolution
+  layer (not the bottom layer) should be computed.
+- `.stores_grad` is a boolean variable. It is used to indicate that the gradient
+  of the owner tensor should be stored and output by the backward function. For
+  example, the gradient of the feature maps is computed during backpropagation,
+  but is not included in the output of the backward function.
+
+Programmers can change `requires_grad` and `stores_grad` of a Tensor instance.
+For example, if the latter is set to True, the corresponding gradient is included
+in the output of the backward function. Note that if `stores_grad` is True, then
+`requires_grad` must also be True, but not vice versa.
+
+### Operation
+
+It takes one or more `Tensor` instances as input, and then outputs one or more
+`Tensor` instances. For example, ReLU can be implemented as a specific Operation
+subclass. When an `Operation` instance is called (after instantiation), the
+following two steps are executed:
+
+1. record the source operations, i.e., the `creator`s of the input tensors.
+2. do calculation by calling member function `.forward()`
+
+There are two member functions for the forward and backward passes, i.e.,
+`.forward()` and `.backward()`. They take `Tensor.data` as inputs (the type is
+`CTensor`), and output `CTensor`s. To add a specific operation, a subclass of
+`Operation` should implement its own `.forward()` and `.backward()`. The
+`backward()` function is called by the `backward()` function of autograd
+automatically during backward propagation to compute the gradients of inputs
+(according to the `requires_grad` field).
+
+### Layer
+
+For those operations that require parameters, we package them into a new class,
+`Layer`. For example, convolution operation is wrapped into a convolution layer.
+`Layer` manages (stores) the parameters and calls the corresponding `Operation`s
+to implement the transformation.
+
+## Examples
+
+Multiple examples are provided in the
+[example folder](https://github.com/apache/singa/tree/master/examples/autograd).
+We explain three representative examples here.
+
+### Operation only
+
+The following code implements an MLP model using only Operation instances (no
+Layer instances).
+
+#### Import packages
+
+```python
+from singa.tensor import Tensor
+from singa import autograd
+from singa import opt
+```
+
+#### Create weight matrix and bias vector
+
+The parameter tensors are created with both `requires_grad` and `stores_grad`
+set to `True`.
+
+```python
+w0 = Tensor(shape=(2, 3), requires_grad=True, stores_grad=True)
+w0.gaussian(0.0, 0.1)
+b0 = Tensor(shape=(1, 3), requires_grad=True, stores_grad=True)
+b0.set_value(0.0)
+
+w1 = Tensor(shape=(3, 2), requires_grad=True, stores_grad=True)
+w1.gaussian(0.0, 0.1)
+b1 = Tensor(shape=(1, 2), requires_grad=True, stores_grad=True)
+b1.set_value(0.0)
+```
+
+#### Training
+
+```python
+inputs = Tensor(data=data)  # data matrix
+target = Tensor(data=label) # label vector
+autograd.training = True    # for training
+sgd = opt.SGD(0.05)   # optimizer
+
+for i in range(10):
+    x = autograd.matmul(inputs, w0) # matrix multiplication
+    x = autograd.add_bias(x, b0)    # add the bias vector
+    x = autograd.relu(x)            # ReLU activation operation
+
+    x = autograd.matmul(x, w1)
+    x = autograd.add_bias(x, b1)
+
+    loss = autograd.softmax_cross_entropy(x, target)
+
+    for p, g in autograd.backward(loss):
+        sgd.update(p, g)
+```
+
+### Operation + Layer
+
+The following
+[example](https://github.com/apache/singa/blob/master/examples/autograd/mnist_cnn.py)
+implements a CNN model using layers provided by the autograd module.
+
+#### Create the layers
+
+```python
+conv1 = autograd.Conv2d(1, 32, 3, padding=1, bias=False)
+bn1 = autograd.BatchNorm2d(32)
+pooling1 = autograd.MaxPool2d(3, 1, padding=1)
+conv21 = autograd.Conv2d(32, 16, 3, padding=1)
+conv22 = autograd.Conv2d(32, 16, 3, padding=1)
+bn2 = autograd.BatchNorm2d(32)
+linear = autograd.Linear(32 * 28 * 28, 10)
+pooling2 = autograd.AvgPool2d(3, 1, padding=1)
+```
+
+#### Define the forward function
+
+The operations in the forward pass will be recorded automatically for backward
+propagation.
+
+```python
+def forward(x, t):
+    # x is the input data (a batch of images)
+    # t is the label vector (a batch of integers)
+    y = conv1(x)           # Conv layer
+    y = autograd.relu(y)   # ReLU operation
+    y = bn1(y)             # BN layer
+    y = pooling1(y)        # Pooling Layer
+
+    # two parallel convolution layers
+    y1 = conv21(y)
+    y2 = conv22(y)
+    y = autograd.cat((y1, y2), 1)  # cat operation
+    y = autograd.relu(y)           # ReLU operation
+    y = bn2(y)
+    y = pooling2(y)
+
+    y = autograd.flatten(y)        # flatten operation
+    y = linear(y)                  # Linear layer
+    loss = autograd.softmax_cross_entropy(y, t)  # operation
+    return loss, y
+```
+
+#### Training
+
+```python
+autograd.training = True
+for epoch in range(epochs):
+    for i in range(batch_number):
+        inputs = tensor.Tensor(device=dev, data=x_train[
+                               i * batch_sz:(1 + i) * batch_sz], stores_grad=False)
+        targets = tensor.Tensor(device=dev, data=y_train[
+                                i * batch_sz:(1 + i) * batch_sz], requires_grad=False, stores_grad=False)
+
+        loss, y = forward(inputs, targets) # forward the net
+
+        for p, gp in autograd.backward(loss):  # auto backward
+            sgd.update(p, gp)
+```
+
+### Using the Module API
+
+The following
+[example](https://github.com/apache/singa/blob/master/examples/autograd/cnn_module.py)
+implements a CNN model using the `Module` class (`module.Module`).
+
+#### Define the subclass of Module
+
+Define the model class as a subclass of `Module`. In this way, all operations
+used during the training phase form a computation graph that is analyzed, and
+the operations in the graph are scheduled and executed efficiently. Layers can
+also be included in the module class.
+
+```python
+class MLP(module.Module):  # the model is a subclass of Module
+
+    def __init__(self, optimizer):
+        super(MLP, self).__init__()
+
+        # init the operators, layers and other objects
+        self.w0 = Tensor(shape=(2, 3), requires_grad=True, stores_grad=True)
+        self.w0.gaussian(0.0, 0.1)
+        self.b0 = Tensor(shape=(3,), requires_grad=True, stores_grad=True)
+        self.b0.set_value(0.0)
+
+        self.w1 = Tensor(shape=(3, 2), requires_grad=True, stores_grad=True)
+        self.w1.gaussian(0.0, 0.1)
+        self.b1 = Tensor(shape=(2,), requires_grad=True, stores_grad=True)
+        self.b1.set_value(0.0)
+
+        # init the optimizer
+        self.optimizer = optimizer
+
+    def forward(self, inputs):  # define the forward function
+        x = autograd.matmul(inputs, self.w0)
+        x = autograd.add_bias(x, self.b0)
+        x = autograd.relu(x)
+        x = autograd.matmul(x, self.w1)
+        x = autograd.add_bias(x, self.b1)
+        return x
+
+    def loss(self, out, target): # define the loss function
+        # can use the loss operations provided by SINGA or self-defined function
+        return autograd.softmax_cross_entropy(out, target)
+
+    def optim(self, loss):       # define the optim function
+        # can use the optimizer provided by SINGA or self-defined function
+        return self.optimizer.backward_and_update(loss)
+```
+
+#### Training
+
+```python
+# create a model instance
+model = MLP(sgd)
+# declare what device to train on
+model.on_device(dev)
+# declare execution mode and order
+model.graph(graph, sequential)
+
+for i in range(niters):
+    out = model(inputs)
+    loss = model.loss(out, target)
+    model.optim(loss)
+
+    if i % (niters / 10) == 0 and rank_in_global == 0:
+        print("training loss = ", tensor.to_numpy(loss)[0], flush=True)
+```
+
+### Python API
+
+Refer to the
+[autograd API documentation](https://singa.readthedocs.io/en/latest/docs/autograd.html#module-singa.autograd)
+for more details of the Python API.
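
The bookkeeping described in autograd.md above (the `creator` link, the `requires_grad` / `stores_grad` flags, and the two steps run when an operation is called) can be sketched in plain Python. This is a toy scalar model under stated assumptions, not SINGA code: the `Tensor`, `Operation`, `Mul`, `Add` and `backward` names below are hypothetical stand-ins that only mirror the behaviour the page describes.

```python
# Toy scalar autograd; illustrative only, NOT the SINGA implementation.


class Tensor:
    def __init__(self, data, requires_grad=True, stores_grad=False,
                 creator=None):
        # stores_grad=True implies requires_grad=True, but not vice versa.
        if stores_grad and not requires_grad:
            raise ValueError("stores_grad=True requires requires_grad=True")
        self.data = data
        self.requires_grad = requires_grad
        self.stores_grad = stores_grad
        self.creator = creator  # the operation that produced this tensor


class Operation:
    def __call__(self, *inputs):
        # Step 1: record the inputs (and, through them, their creators).
        self.inputs = inputs
        # Step 2: run forward() on the raw data.
        out = self.forward(*(x.data for x in inputs))
        self.output = Tensor(
            out,
            requires_grad=any(x.requires_grad for x in inputs),
            creator=self,
        )
        return self.output


class Mul(Operation):
    def forward(self, a, b):
        self.a, self.b = a, b
        return a * b

    def backward(self, dy):
        return dy * self.b, dy * self.a


class Add(Operation):
    def forward(self, a, b):
        return a + b

    def backward(self, dy):
        return dy, dy


def backward(y):
    """Return (tensor, gradient) pairs for tensors with stores_grad=True."""
    order, seen = [], set()

    def visit(t):  # post-order walk: producers are listed before consumers
        op = t.creator
        if op is None or id(op) in seen:
            return
        seen.add(id(op))
        for x in op.inputs:
            visit(x)
        order.append(op)

    visit(y)
    grads, stored = {id(y): 1.0}, {}
    for op in reversed(order):  # consumers are processed first
        dy = grads.get(id(op.output))
        if dy is None:
            continue  # nothing downstream needs this gradient
        for x, dx in zip(op.inputs, op.backward(dy)):
            if not x.requires_grad:
                continue  # e.g. input data: no gradient is computed
            grads[id(x)] = grads.get(id(x), 0.0) + dx
            if x.stores_grad:
                stored[id(x)] = x
    return [(x, grads[id(x)]) for x in stored.values()]


x = Tensor(2.0, requires_grad=False)   # input data
w = Tensor(3.0, stores_grad=True)      # parameter
b = Tensor(1.0, stores_grad=True)      # parameter
y = Add()(Mul()(x, w), b)              # y = x * w + b
result = {id(p): g for p, g in backward(y)}
```

The intermediate product `x * w` has `requires_grad=True` but `stores_grad=False`, so its gradient is computed on the way back yet is not part of the returned pairs; only `w` and `b` are, which is the distinction the Tensor section draws.
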
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/benchmark-train.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/benchmark-train.md
new file mode 100644
index 0000000..0fcd34e
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/benchmark-train.md
@@ -0,0 +1,27 @@
+---
+id: version-3.0.0.rc1-benchmark-train
+title: Benchmark for Distributed Training
+original_id: benchmark-train
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+Workload: we use a deep convolutional neural network,
+[ResNet-50](https://github.com/apache/singa/blob/master/examples/autograd/resnet.py),
+as the application. ResNet-50 has 50 convolution layers for image
+classification. It requires 3.8 GFLOPs to pass a single image (of size 224x224)
+through the network.
+
+Hardware: we use p2.8xlarge instances from AWS, each of which has 8 Nvidia Tesla
+K80 GPUs, 96 GB GPU memory in total, 32 vCPU, 488 GB main memory, 10 Gbps
+network bandwidth.
+
+Metric: we measure the time per iteration for different numbers of workers to
+evaluate the scalability of SINGA. The batch size is fixed at 32 per GPU, and a
+synchronous training scheme is applied. As a result, the effective batch size is
+$32N$, where $N$ is the number of GPUs. We compare with a popular open source
+system which uses the parameter server topology. The first GPU is selected as
+the server.
+
+![Benchmark Experiments](assets/benchmark.png) <br/> **Scalability test. Bars
+are for the throughput; lines are for the communication cost.**
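
The relation between the measured time per iteration and the throughput in benchmark-train.md above can be written out as a small calculation. This is a sketch only: the function names are hypothetical, and the 0.5 s timing in the usage line is a made-up placeholder, not a measured result.

```python
PER_GPU_BATCH = 32  # fixed per-GPU batch size stated in the benchmark setup


def effective_batch_size(num_gpus, per_gpu_batch=PER_GPU_BATCH):
    # Synchronous training: every GPU processes one batch per iteration.
    return per_gpu_batch * num_gpus


def throughput(num_gpus, seconds_per_iteration, per_gpu_batch=PER_GPU_BATCH):
    # Images processed per second across all GPUs.
    return effective_batch_size(num_gpus, per_gpu_batch) / seconds_per_iteration


# Placeholder timing (NOT a benchmark result): 8 GPUs, 0.5 s per iteration.
images_per_second = throughput(8, 0.5)
```
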
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/build.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/build.md
new file mode 100644
index 0000000..9358c37
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/build.md
@@ -0,0 +1,529 @@
+---
+id: version-3.0.0.rc1-build
+title: Build SINGA from Source
+original_id: build
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+The source files can be downloaded either as a
+[tar.gz file](https://dist.apache.org/repos/dist/dev/singa/), or as a git repo:
+
+```shell
+$ git clone https://github.com/apache/singa.git
+$ cd singa/
+```
+
+If you want to contribute code to SINGA, refer to
+[contribute-code page](contribute-code.md) for the steps and requirements.
+
+## Use Conda to build SINGA
+
+Conda-build is a building tool that installs the dependent libraries from
+anaconda cloud and executes the building scripts.
+
+To install conda-build (after installing conda)
+
+```shell
+conda install conda-build
+```
+
+### Build CPU Version
+
+To build the CPU version of SINGA
+
+```shell
+conda build tool/conda/singa/
+```
+
+The above commands have been tested on Ubuntu (14.04, 16.04 and 18.04) and macOS
+10.11. Refer to the [Travis-CI page](https://travis-ci.org/apache/singa) for
+more information.
+
+### Build GPU Version
+
+To build the GPU version of SINGA, the building machine must have an Nvidia GPU,
+and the CUDA driver (>= 384.81), CUDA toolkit (>=9) and cuDNN (>=7) must have
+been installed. The following two Docker images provide the building environment:
+
+1. apache/singa:conda-cuda9.0
+2. apache/singa:conda-cuda10.0
+
+Once the building environment is ready, you need to export the CUDA version
+first, and then run conda command to build SINGA
+
+```shell
+export CUDA=x.y  # e.g. 9.0
+conda build tool/conda/singa/
+```
+
+### Post Processing
+
+The location of the generated package file (`.tar.gz`) is shown on the screen.
+The generated package can be installed directly,
+
+```shell
+conda install -c conda-forge --use-local <path to the package file>
+```
+
+or uploaded to anaconda cloud for others to download and install. You need to
+register an account on anaconda for
+[uploading the package](https://docs.anaconda.com/anaconda-cloud/user-guide/getting-started/).
+
+```shell
+conda install anaconda-client
+anaconda login
+anaconda upload -l main <path to the package file>
+```
+
+After uploading the package to the cloud, you can see it on
+[Anaconda Cloud](https://anaconda.org/) website or via the following command
+
+```shell
+conda search -c <anaconda username> singa
+```
+
+Each specific SINGA package is identified by the version and build string. To
+install a specific SINGA package, you need to provide all the information, e.g.,
+
+```shell
+conda install -c <anaconda username> -c conda-forge singa=2.1.0.dev=cpu_py36
+```
+
+To make the installation command simple, you can create the following additional
+packages which depend on the latest CPU and GPU SINGA packages.
+
+```console
+# for singa-cpu
+conda build tool/conda/cpu/  --python=3.6
+conda build tool/conda/cpu/  --python=3.7
+# for singa-gpu
+conda build tool/conda/gpu/  --python=3.6
+conda build tool/conda/gpu/  --python=3.7
+```
+
+Therefore, when you run
+
+```shell
+conda install -c <anaconda username> -c conda-forge singa-xpu
+```
+
+(`xpu` is either 'cpu' or 'gpu'), the corresponding real SINGA package is
+installed as the dependent library.
+
+## Use native tools to build SINGA on Ubuntu
+
+Refer to SINGA
+[Dockerfiles](https://github.com/apache/singa/blob/master/tool/docker/devel/ubuntu/cuda9/Dockerfile#L30)
+for the instructions of installing the dependent libraries on Ubuntu 16.04. You
+can also create a Docker container using the [devel images]() and build SINGA
+inside the container. To build SINGA with GPU, DNNL, Python and unit tests, run
+the following instructions
+
+```shell
+mkdir build    # at the root of singa folder
+cd build
+cmake -DENABLE_TEST=ON -DUSE_CUDA=ON -DUSE_DNNL=ON -DUSE_PYTHON3=ON ..
+make
+cd python
+pip install .
+```
+
+The details of the CMake options are explained in the last section of this page.
+The last command installs the Python package. You can also run
+`pip install -e .`, which creates symlinks instead of copying the Python files
+into the site-packages folder.
+
+If SINGA is compiled with ENABLE_TEST=ON, you can run the unit tests by
+
+```shell
+$ ./bin/test_singa
+```
+
+You can see all the testing cases with testing results. If SINGA passes all
+tests, then you have successfully installed SINGA.
+
+## Use native tools to build SINGA on CentOS 7
+
+Building from source is different on CentOS 7 because the package names
+differ. Follow the instructions given below.
+
+### Installing dependencies
+
+Basic packages/libraries
+
+```shell
+sudo yum install freetype-devel libXft-devel ncurses-devel openblas-devel blas-devel lapack-devel atlas-devel kernel-headers unzip wget pkgconfig zip zlib-devel libcurl-devel cmake curl dh-autoreconf git python-devel glog-devel protobuf-devel
+```
+
+For build-essential
+
+```shell
+sudo yum group install "Development Tools"
+```
+
+For installing swig
+
+```shell
+sudo yum install pcre-devel
+wget http://prdownloads.sourceforge.net/swig/swig-3.0.10.tar.gz
+tar xvzf swig-3.0.10.tar.gz
+cd swig-3.0.10
+./configure --prefix=${RUN}
+make
+make install
+```
+
+For installing gfortran
+
+```shell
+sudo yum install centos-release-scl-rh
+sudo yum --enablerepo=centos-sclo-rh-testing install devtoolset-7-gcc-gfortran
+```
+
+For installing pip and other packages
+
+```shell
+sudo yum install epel-release
+sudo yum install python-pip
+pip install matplotlib numpy pandas scikit-learn pydot
+```
+
+### Installation
+
+Follow steps 1-5 of _Use native tools to build SINGA on Ubuntu_
+
+### Testing
+
+You can run the unit tests by,
+
+```shell
+$ ./bin/test_singa
+```
+
+You can see all the testing cases with testing results. If SINGA passes all
+tests, then you have successfully installed SINGA.
+
+## Compile SINGA on Windows
+
+Instructions for building on Windows with Python support can be found on the
+[install-win page](install-win.md).
+
+## More details about the compilation options
+
+### USE_MODULES (deprecated)
+
+If protobuf and openblas are not installed, you can compile SINGA together with
+them
+
+```shell
+# In SINGA ROOT folder
+$ mkdir build
+$ cd build
+$ cmake -DUSE_MODULES=ON ..
+$ make
+```
+
+CMake will download OpenBLAS and Protobuf (2.6.1) and compile them together
+with SINGA.
+
+You can use `ccmake ..` to configure the compilation options. If some dependent
+libraries are not in the system default paths, you need to export the following
+environment variables
+
+```shell
+export CMAKE_INCLUDE_PATH=<path to the header file folder>
+export CMAKE_LIBRARY_PATH=<path to the lib file folder>
+```
+
+### USE_PYTHON
+
+Option for compiling the Python wrapper for SINGA,
+
+```shell
+$ cmake -DUSE_PYTHON=ON ..
+$ make
+$ cd python
+$ pip install .
+```
+
+### USE_CUDA
+
+Users are encouraged to install the CUDA and
+[cuDNN](https://developer.nvidia.com/cudnn) for running SINGA on GPUs to get
+better performance.
+
+SINGA has been tested over CUDA 9/10, and cuDNN 7. If cuDNN is installed into
+non-system folder, e.g. /home/bob/local/cudnn/, the following commands should be
+executed for cmake and the runtime to find it
+
+```shell
+$ export CMAKE_INCLUDE_PATH=/home/bob/local/cudnn/include:$CMAKE_INCLUDE_PATH
+$ export CMAKE_LIBRARY_PATH=/home/bob/local/cudnn/lib64:$CMAKE_LIBRARY_PATH
+$ export LD_LIBRARY_PATH=/home/bob/local/cudnn/lib64:$LD_LIBRARY_PATH
+```
+
+The cmake options for CUDA and cuDNN should be switched on
+
+```shell
+# Dependent libs are installed already
+$ cmake -DUSE_CUDA=ON ..
+$ make
+```
+
+### USE_DNNL
+
+Users can enable DNNL to enhance the performance of CPU computation.
+
+The installation guide for DNNL can be found
+[here](https://github.com/intel/mkl-dnn#installation).
+
+SINGA has been tested over DNNL v1.1.
+
+To build SINGA with DNNL support:
+
+```shell
+# Dependent libs are installed already
+$ cmake -DUSE_DNNL=ON ..
+$ make
+```
+
+### USE_OPENCL
+
+SINGA uses opencl-headers and viennacl (version 1.7.1 or newer) for OpenCL
+support, which can be installed via
+
+```shell
+# On Ubuntu 16.04
+$ sudo apt-get install opencl-headers libviennacl-dev
+# On Fedora
+$ sudo yum install opencl-headers viennacl
+```
+
+Additionally, you will need the OpenCL Installable Client Driver (ICD) for the
+platforms that you want to run OpenCL on.
+
+- For AMD and nVidia GPUs, the driver package should also install the correct
+  OpenCL ICD.
+- For Intel CPUs and/or GPUs, get the driver from the
+  [Intel website.](https://software.intel.com/en-us/articles/opencl-drivers)
+  Note that the drivers provided on that website only support recent CPUs and
+  Iris GPUs.
+- For older Intel CPUs, you can use the `beignet-opencl-icd` package.
+
+Note that running OpenCL on CPUs is not currently recommended because it is
+slow. Memory transfer is on the order of whole seconds (1000's of ms on CPUs as
+compared to 1's of ms on GPUs).
+
+More information on setting up a working OpenCL environment may be found
+[here](https://wiki.tiker.net/OpenCLHowTo).
+
+If the package version of ViennaCL is not at least 1.7.1, you will need to build
+it from source:
+
+Clone [the repository from here](https://github.com/viennacl/viennacl-dev),
+checkout the `release-1.7.1` tag and build it. Remember to add its directory to
+`PATH` and the built libraries to `LD_LIBRARY_PATH`.
+
+To build SINGA with OpenCL support (tested on SINGA 1.1):
+
+```shell
+$ cmake -DUSE_OPENCL=ON ..
+$ make
+```
+
+### PACKAGE
+
+This setting is used to build the Debian package. Set PACKAGE=ON and build the
+package with make command like this:
+
+```shell
+$ cmake -DPACKAGE=ON ..
+$ make package
+```
+
+## FAQ
+
+- Q: Error from 'import singa'
+
+  A: Please check the detailed error from
+  `python -c "from singa import _singa_wrap"`. Sometimes it is caused by the
+  dependent libraries, e.g. multiple versions of protobuf are installed, cudnn
+  is missing, or the numpy version mismatches. The following steps show the
+  solutions for different cases:
+
+  1. Check the cudnn and cuda. If cudnn is missing or does not match the wheel
+     version, you can download the correct version of cudnn into ~/local/cudnn/
+     and
+
+     ```shell
+     $ echo "export LD_LIBRARY_PATH=/home/<yourname>/local/cudnn/lib64:$LD_LIBRARY_PATH" >> ~/.bashrc
+     ```
+
+  2. If the problem is related to protobuf, you can install protobuf (3.6.1)
+     from source into a local folder, say ~/local/. Decompress the tar file, and
+     then
+
+     ```shell
+     $ ./configure --prefix=/home/<yourname>/local
+     $ make && make install
+     $ echo "export LD_LIBRARY_PATH=/home/<yourname>/local/lib:$LD_LIBRARY_PATH" >> ~/.bashrc
+     $ source ~/.bashrc
+     ```
+
+  3. If it cannot find other libs including python, then create virtual env
+     using `pip` or `conda`;
+
+  4. If it is not caused by the above reasons, go to the folder of
+     `_singa_wrap.so`,
+
+     ```shell
+     $ python
+     >>> import importlib
+     >>> importlib.import_module('_singa_wrap')
+     ```
+
+     Check the error message. For example, if the numpy version mismatches, the
+     error message would be,
+
+     ```shell
+     RuntimeError: module compiled against API version 0xb but this version of numpy is 0xa
+     ```
+
+     Then you need to upgrade the numpy.
+
+* Q: Error from running `cmake ..`, which cannot find the dependent libraries.
+
+  A: If you haven't installed the libraries, install them. If you installed the
+  libraries in a folder that is outside of the system folder, e.g. /usr/local,
+  you need to export the following variables
+
+  ```shell
+  $ export CMAKE_INCLUDE_PATH=<path to your header file folder>
+  $ export CMAKE_LIBRARY_PATH=<path to your lib file folder>
+  ```
+
+- Q: Error from `make`, e.g. the linking phase
+
+  A: If your libraries are in folders other than the system default paths, you
+  need to export the following variables
+
+  ```shell
+  $ export LIBRARY_PATH=<path to your lib file folder>
+  $ export LD_LIBRARY_PATH=<path to your lib file folder>
+  ```
+
+* Q: Error from header files, e.g. 'cblas.h no such file or directory exists'
+
+  A: You need to include the folder of the cblas.h into CPLUS_INCLUDE_PATH,
+  e.g.,
+
+  ```shell
+  $ export CPLUS_INCLUDE_PATH=/opt/OpenBLAS/include:$CPLUS_INCLUDE_PATH
+  ```
+
+* Q: While compiling SINGA, I get the error `SSE2 instruction set not enabled`
+
+  A: You can try the following command:
+
+  ```shell
+  $ make CFLAGS='-msse2' CXXFLAGS='-msse2'
+  ```
+
+* Q: I get `ImportError: cannot import name enum_type_wrapper` from
+  google.protobuf.internal when I try to import .py files.
+
+  A: You need to install the python binding of protobuf, which could be
+  installed via
+
+  ```shell
+  $ sudo apt-get install protobuf
+  ```
+
+  or from source
+
+  ```shell
+  $ cd /PROTOBUF/SOURCE/FOLDER
+  $ cd python
+  $ python setup.py build
+  $ python setup.py install
+  ```
+
+* Q: When I build OpenBLAS from source, I am told that I need a Fortran
+  compiler.
+
+  A: You can compile OpenBLAS by
+
+  ```shell
+  $ make ONLY_CBLAS=1
+  ```
+
+  or install it using
+
+  ```shell
+  $ sudo apt-get install libopenblas-dev
+  ```
+
+* Q: When I build protocol buffer, it reports that `GLIBCXX_3.4.20` not found in
+  `/usr/lib64/libstdc++.so.6`?
+
+  A: This means the linker found libstdc++.so.6 but that library belongs to an
+  older version of GCC than was used to compile and link the program. The
+  program depends on code defined in the newer libstdc++ that belongs to the
+  newer version of GCC, so the linker must be told how to find the newer
+  libstdc++ shared library. The simplest way to fix this is to find the correct
+  libstdc++ and export it to LD_LIBRARY_PATH. For example, if GLIBCXX_3.4.20 is
+  listed in the output of the following command,
+
+        $ strings /usr/local/lib64/libstdc++.so.6|grep GLIBCXX
+
+  then you just set your environment variable as
+
+        $ export LD_LIBRARY_PATH=/usr/local/lib64:$LD_LIBRARY_PATH
+
+* Q: When I build glog, it reports that "src/logging_unittest.cc:83:20: error:
+  ‘gflags’ is not a namespace-name"
+
+  A: You may have installed gflags with a different namespace such as
+  "google", so glog cannot find the 'gflags' namespace. Because gflags is not
+  required to build glog, you can change the configure.ac file to ignore
+  gflags.
+
+        1. cd to glog src directory
+        2. change line 125 of configure.ac  to "AC_CHECK_LIB(gflags, main, ac_cv_have_libgflags=0, ac_cv_have_libgflags=0)"
+        3. autoreconf
+
+  After this, you can build glog again.
+
+* Q: When using a virtual environment, every time I run pip install, it
+  reinstalls numpy. However, that numpy is not used when I `import numpy`
+
+  A: It could be caused by `PYTHONPATH`, which should be set to empty when
+  you are using a virtual environment to avoid conflicts with the path of the
+  virtual environment.
+
+* Q: When compiling PySINGA from source, there is a compilation error because
+  <numpy/objectarray.h> is missing
+
+  A: Please install numpy and export the path of numpy header files as
+
+        $ export CPLUS_INCLUDE_PATH=`python -c "import numpy; print(numpy.get_include())"`:$CPLUS_INCLUDE_PATH
+
+* Q: When I run SINGA on macOS, I get the error "Fatal Python error:
+  PyThreadState_Get: no current thread Abort trap: 6"
+
+  A: This error typically happens when you have multiple versions of Python on
+  your system and you installed SINGA via pip (this problem is resolved for
+  installation via conda), e.g., the one that comes with the OS and the one
+  installed by Homebrew. The Python linked by PySINGA must be the same as the
+  Python interpreter. You can check your interpreter by `which python` and check
+  the Python linked by PySINGA via `otool -L <path to _singa_wrap.so>`. To fix
+  this error, compile SINGA with the correct version of Python. In particular, if
+  you build PySINGA from source, you need to specify the paths when invoking
+  [cmake](http://stackoverflow.com/questions/15291500/i-have-2-versions-of-python-installed-but-cmake-is-using-older-version-how-do)
+
+        $ cmake -DPYTHON_LIBRARY=`python-config --prefix`/lib/libpython2.7.dylib -DPYTHON_INCLUDE_DIR=`python-config --prefix`/include/python2.7/ ..
+
+  If you installed PySINGA from a binary package, e.g., Debian or wheel, then
+  you need to change the Python interpreter, e.g., reset \$PATH so that the
+  path of the correct Python comes first.
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/contribute-code.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/contribute-code.md
new file mode 100644
index 0000000..4e7bbcc
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/contribute-code.md
@@ -0,0 +1,118 @@
+---
+id: version-3.0.0.rc1-contribute-code
+title: How to Contribute Code
+original_id: contribute-code
+---
+
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed [...]
+
+## Coding Style
+
+The SINGA codebase follows the Google Style for both
+[CPP](http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml) and
+[Python](http://google.github.io/styleguide/pyguide.html) code.
+
+A simple way to enforce the Google coding styles is to use the linting and
+formatting tools in the Visual Studio Code editor:
+
+- [C/C++ extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode.cpptools)
+- [Python extension](https://marketplace.visualstudio.com/items?itemName=ms-python.python)
+- [cpplint extension](https://marketplace.visualstudio.com/items?itemName=mine.cpplint)
+- [Clang-Format](https://marketplace.visualstudio.com/items?itemName=xaver.clang-format)
+
+Once the extensions are installed, edit the `settings.json` file.
+
+```json
+{
+  "[cpp]": {
+    "editor.defaultFormatter": "xaver.clang-format"
+  },
+  "cpplint.cpplintPath": "path/to/cpplint",
+
+  "editor.formatOnSave": true,
+  "python.formatting.provider": "yapf",
+  "python.linting.enabled": true,
+  "python.linting.lintOnSave": true,
+  "clang-format.language.cpp.style": "google",
+  "python.formatting.yapfArgs": ["--style", "{based_on_style: google}"]
+}
+```
+
+Depending on your platform, the user settings file is located here:
+
+1. Windows %APPDATA%\Code\User\settings.json
+2. macOS "\$HOME/Library/Application Support/Code/User/settings.json"
+3. Linux "\$HOME/.config/Code/User/settings.json"
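
The three locations above can also be resolved programmatically, which is handy in a setup script. The following is a minimal sketch, not part of SINGA; the function name `vscode_user_settings_path` and its parameters are made up for illustration, and the fallback for a missing `APPDATA` variable is an assumption.

```python
import os
import sys
from pathlib import Path


def vscode_user_settings_path(platform=None, env=None, home=None):
    """Return the default VS Code user settings path for a platform.

    All three parameters are hypothetical helpers for testing; by default
    the current platform, environment and home directory are used.
    """
    platform = sys.platform if platform is None else platform
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    if platform.startswith("win"):
        # %APPDATA%\Code\User\settings.json
        # Assumption: fall back to <home>/AppData/Roaming if APPDATA is unset.
        appdata = env.get("APPDATA", str(home / "AppData" / "Roaming"))
        return Path(appdata) / "Code" / "User" / "settings.json"
    if platform == "darwin":
        # $HOME/Library/Application Support/Code/User/settings.json
        return (home / "Library" / "Application Support" / "Code" / "User"
                / "settings.json")
    # Everything else is treated as Linux:
    # $HOME/.config/Code/User/settings.json
    return home / ".config" / "Code" / "User" / "settings.json"


print(vscode_user_settings_path())
```

Passing `platform`, `env` and `home` explicitly makes it possible to check the result for a platform other than the one you are running on.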
+
+Configurations are specified in the corresponding config files. These tools
+look up configuration files in the root of the project automatically,
+e.g. `.pylintrc`.
+
+#### Tool Installation
+
+Ideally, all contributors use the same versions of the code formatting
+tools (clang-format 9.0.0 and yapf 0.29.0), so that code formatting is
+identical across PRs, which avoids GitHub pull request conflicts.
+
+First, install LLVM 9.0 which provides clang-format version 9.0.0. The download
+page of LLVM is:
+
+- [LLVM](http://releases.llvm.org/download.html#9.0.0)
+
+Second, install cpplint, pylint and yapf
+
+- OSX:
+
+  ```
+  $ sudo pip install cpplint
+  $ which cpplint
+  /path/to/cpplint
+
+  $ pip install yapf==0.29.0
+  $ pip install pylint
+  ```
+
+- Windows: Install Anaconda for package management.
+
+  ```
+  $ pip install cpplint
+  $ where cpplint
+  C:/path/to/cpplint.exe
+
+  $ pip install yapf==0.29.0
+  $ pip install pylint
+  ```
+
+#### Usage
+
+- After the configuration, linting is applied automatically when you edit a
+  source code file. Errors and warnings are listed in the Visual Studio Code
+  `PROBLEMS` panel.
+- To format code, bring up the Command Palette (`Shift+Ctrl+P` on
+  Windows or `Shift+Command+P` on OSX) and type `Format Document`.
+
+#### Submission
+
+You need to fix the format errors before submitting the pull requests.
+
+## Developing Environment
+
+Visual Studio Code is recommended as the editor. Extensions like Python, C/C++,
+Code Spell Checker, autoDocstring, vim, Remote Development could be installed. A
+reference configuration (i.e., `settings.json`) of these extensions is
+[here](https://gist.github.com/nudles/3d23cfb6ffb30ca7636c45fe60278c55).
+
+If you update the CPP code, you need to recompile SINGA
+[from source](./build.md). It is recommended to use the native building tools in
+the `*-devel` Docker images or `conda build`.
+
+If you only update the Python code, you can install SINGA once, and then copy
+the updated Python files to replace those in the Python installation folder,
+
+```shell
+cp python/singa/xx.py  <path to conda>/lib/python3.7/site-packages/singa/
+```
+
+## Workflow
+
+Please refer to the [git workflow page](./git-workflow).
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/contribute-docs.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/contribute-docs.md
new file mode 100644
index 0000000..ded1143
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/contribute-docs.md
@@ -0,0 +1,105 @@
+---
+id: version-3.0.0.rc1-contribute-docs
+title: How to Contribute to Documentation
+original_id: contribute-docs
+---
+
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed [...]
+
+There are two types of documentation, namely markdown files and API usage
+reference. This guideline introduces some tools and instructions for preparing
+the source markdown files and API comments.
+
+The markdown files will be built into HTML pages via
+[Docusaurus](https://docusaurus.io/); the API comments (from the source code)
+will be used to generate API reference pages using Sphinx (for Python) and
+Doxygen (for CPP).
+
+## Markdown Files
+
+Try to follow the
+[Google Documentation style](https://developers.google.com/style). For example,
+
+1. Remove 'please' from an instruction. 'Please click...' VS 'Click ...'.
+2. Follow the
+   [standard capitalization rules](https://owl.purdue.edu/owl/general_writing/mechanics/help_with_capitals.html).
+3. Use 'you' instead of 'we' in the instructions.
+4. Use present tense and avoid 'will'.
+5. Prefer active voice to passive voice.
+
+In addition, to make the documentation consistent,
+
+1. Keep the line short, e.g., length<=80
+2. Use the relative path assuming that we are in the root folder of the repo,
+   e.g., `docs-site/docs` refers to `singa-doc/docs-site/docs`
+3. Highlight the command, path, class function and variable using backticks,
+   e.g., `Tensor`, `singa-doc/docs-site/docs`.
+4. To highlight other terms/concepts, use _graph_ or **graph**
+
+The [prettier tool](https://prettier.io/) used by this project will auto-format
+the code according to the
+[configuration](https://github.com/apache/singa-doc/blob/master/docs-site/.prettierrc)
+when we do `git commit`. For example, it will wrap the text in the markdown file
+to at most 80 characters (except the lines for comments).
+
+When introducing a concept (e.g., the `Tensor` class), provide the overview (the
+purpose and relation to other concepts), APIs and examples. Google colab can be
+used to demonstrate the usage.
+
+Refer to [this page](https://github.com/apache/singa-doc/tree/master/docs-site)
+for the details on how to edit the markdown files and build the website.
+
+## API References
+
+### CPP API
+
+Follow the
+[Google CPP Comments Style](https://google.github.io/styleguide/cppguide.html#Comments).
+
+To generate docs, run "doxygen" from the doc folder (Doxygen >= 1.8 recommended)
+
+### Python API
+
+Follow the
+[Google Python DocString Style](http://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings).
+
+## Visual Studio Code (vscode)
+
+If you use vscode as the editor, the following plugins are useful.
+
+### Docstring Snippet
+
+[autoDocstring](https://marketplace.visualstudio.com/items?itemName=njpwerner.autodocstring)
+generates the docstring of functions, classes, etc. Set the DocString Format
+to `google`.
+
+### Spell Check
+
+[Code Spell Checker](https://marketplace.visualstudio.com/items?itemName=streetsidesoftware.code-spell-checker)
+can be configured to check the comments of the code, or .md and .rst files.
+
+To do spell check only for comments of Python code, add the following snippet
+via `File - Preferences - User Snippets - python.json`
+
+    "cspell check" : {
+    "prefix": "cspell",
+    "body": [
+        "# Directives for doing spell check only for python and c/cpp comments",
+        "# cSpell:includeRegExp #.* ",
+        "# cSpell:includeRegExp (\"\"\"|''')[^\1]*\1",
+        "# cSpell: CStyleComment",
+    ],
+    "description": "# spell check only for python comments"
+    }
+
+To do spell check only for comments of Cpp code, add the following snippet via
+`File - Preferences - User Snippets - cpp.json`
+
+    "cspell check" : {
+    "prefix": "cspell",
+    "body": [
+        "// Directive for doing spell check only for cpp comments",
+        "// cSpell:includeRegExp CStyleComment",
+    ],
+    "description": "# spell check only for cpp comments"
+    }
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/device.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/device.md
new file mode 100644
index 0000000..626ebf9
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/device.md
@@ -0,0 +1,33 @@
+---
+id: version-3.0.0.rc1-device
+title: Device
+original_id: device
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+The Device abstraction represents a hardware device with memory and computing
+units. All [Tensor operations](./tensor) are scheduled by the resident device
+for execution. Tensor memory is also managed by the device's memory manager.
+Therefore, optimizations of memory and execution are implemented in the Device
+class.
+
+## Specific devices
+
+Currently, SINGA has three Device implementations:
+
+1.  CudaGPU for an Nvidia GPU card which runs Cuda code
+2.  CppCPU for a CPU which runs Cpp code
+3.  OpenclGPU for a GPU card which runs OpenCL code
+
+## Example Usage
+
+The following code provides examples of creating devices:
+
+```python
+from singa import device
+cuda = device.create_cuda_gpu_on(0)  # use GPU card of ID 0
+host = device.get_default_device()  # get the default host device (a CppCPU)
+ary1 = device.create_cuda_gpus(2)  # create 2 devices, starting from ID 0
+ary2 = device.create_cuda_gpus([0,2])  # create 2 devices on ID 0 and 2
+```
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/dist-train.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/dist-train.md
new file mode 100644
index 0000000..3fad5fc
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/dist-train.md
@@ -0,0 +1,427 @@
+---
+id: version-3.0.0.rc1-dist-train
+title: Distributed Training
+original_id: dist-train
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+SINGA supports data parallel training across multiple GPUs (on a single node or
+across different nodes). The following figure illustrates the data parallel
+training:
+
+![MPI.png](assets/MPI.png)
+
+In distributed training, each process (called a worker) runs a training script
+over a single GPU. Each process has an individual communication rank. The
+training data is partitioned among the workers and the model is replicated on
+every worker. In each iteration, each worker reads a mini-batch of data (e.g.,
+256 images) from its partition and runs the BackPropagation algorithm to compute
+the gradients of the weights, which are averaged via all-reduce (provided by
+[NCCL](https://developer.nvidia.com/nccl)) for weight update following
+stochastic gradient descent algorithms (SGD).
+
+The all-reduce operation by NCCL can be used to reduce and synchronize the
+gradients from different GPUs. Let's consider the training with 4 GPUs as shown
+below. Once the gradients from the 4 GPUs are calculated, all-reduce will return
+the sum of the gradients over the GPUs and make it available on every GPU. Then
+the averaged gradients can be easily calculated.
+
+![AllReduce.png](assets/AllReduce.png)
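
To make the sum-then-average step concrete, here is a minimal pure-Python simulation. It does not call NCCL or SINGA; the function names `all_reduce_sum` and `average` and the gradient values are made up for illustration. It only shows what every worker holds after the reduction.

```python
def all_reduce_sum(per_worker_grads):
    """Simulate all-reduce (sum): every worker gets the element-wise sum."""
    total = [sum(values) for values in zip(*per_worker_grads)]
    # Each worker receives its own copy of the reduced result.
    return [list(total) for _ in per_worker_grads]


def average(reduced, world_size):
    """Divide the summed gradients by the number of workers."""
    return [[v / world_size for v in grads] for grads in reduced]


# Gradients of a 3-parameter model computed on 4 workers (made-up values).
grads = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [0.0, 4.0, 2.0],
    [4.0, 0.0, 2.0],
]

summed = all_reduce_sum(grads)
averaged = average(summed, world_size=len(grads))

print(summed[0])    # [8.0, 8.0, 8.0]
print(averaged[0])  # [2.0, 2.0, 2.0]
```

After the reduction every worker holds the same values, so applying the same SGD update on each worker keeps the model replicas identical.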
+
+## Usage
+
+SINGA implements a module called `DistOpt` (a subclass of `Opt`) for distributed
+training. It wraps a normal SGD optimizer and calls `Communicator` for gradient
+synchronization. The following example illustrates the usage of `DistOpt` for
+training a CNN model over the MNIST dataset. The source code is available
+[here](https://github.com/apache/singa/blob/master/examples/cnn/), and there is
+a [Colab notebook]() for it.
+
+### Example Code
+
+1. Define the neural network model:
+
+```python
+class CNN:
+    def __init__(self):
+        self.conv1 = autograd.Conv2d(1, 20, 5, padding=0)
+        self.conv2 = autograd.Conv2d(20, 50, 5, padding=0)
+        self.linear1 = autograd.Linear(4 * 4 * 50, 500)
+        self.linear2 = autograd.Linear(500, 10)
+        self.pooling1 = autograd.MaxPool2d(2, 2, padding=0)
+        self.pooling2 = autograd.MaxPool2d(2, 2, padding=0)
+
+    def forward(self, x):
+        y = self.conv1(x)
+        y = autograd.relu(y)
+        y = self.pooling1(y)
+        y = self.conv2(y)
+        y = autograd.relu(y)
+        y = self.pooling2(y)
+        y = autograd.flatten(y)
+        y = self.linear1(y)
+        y = autograd.relu(y)
+        y = self.linear2(y)
+        return y
+
+# create model
+model = CNN()
+```
+
+2. Create the `DistOpt` instance:
+
+```python
+sgd = opt.SGD(lr=0.005, momentum=0.9, weight_decay=1e-5)
+sgd = opt.DistOpt(sgd)
+dev = device.create_cuda_gpu_on(sgd.local_rank)
+```
+
+Here are some explanations concerning some variables in the code:
+
+(i) `dev`
+
+`dev` is the `Device` instance on which the data is loaded and the CNN model
+runs.
+
+(ii) `local_rank`
+
+The local rank is the index of the GPU that the current process uses within its
+node. For example, if you are using a node with 2 GPUs, `local_rank=0` means
+that this process is using the first GPU, while `local_rank=1` means using the
+second GPU. Using MPI or multiprocessing, you can run the same training script
+in every process; the processes differ only in the value of `local_rank`.
+
+(iii) `global_rank`
+
+The global rank identifies the process among all the processes on all the
+nodes you are using. Consider the case where you have 3 nodes and each node has
+two GPUs: `global_rank=0` means the process using the 1st GPU of the 1st node,
+`global_rank=2` means the process using the 1st GPU of the 2nd
+node, and `global_rank=4` means the process using the 1st GPU of the 3rd node.
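
The relation between the two ranks in the example above can be sketched as follows. This assumes that ranks are assigned node by node and that every node has the same number of GPUs, which matches the example but ultimately depends on how the processes are launched. The helper names `global_rank_of` and `locate` are hypothetical.

```python
def global_rank_of(node_index, local_rank, gpus_per_node):
    """Global rank of the process on a given node (all indices zero-based)."""
    return node_index * gpus_per_node + local_rank


def locate(global_rank, gpus_per_node):
    """Return (node_index, local_rank), both zero-based."""
    return divmod(global_rank, gpus_per_node)


# 3 nodes with 2 GPUs each, as in the example above.
for rank in range(6):
    node, local = locate(rank, gpus_per_node=2)
    print(f"global_rank={rank} -> node {node + 1}, GPU {local + 1}")
```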
+
+3. Load and partition the training/validation data:
+
+```python
+def data_partition(dataset_x, dataset_y, global_rank, world_size):
+    data_per_rank = dataset_x.shape[0] // world_size
+    idx_start = global_rank * data_per_rank
+    idx_end = (global_rank + 1) * data_per_rank
+    return dataset_x[idx_start:idx_end], dataset_y[idx_start:idx_end]
+
+train_x, train_y, test_x, test_y = load_dataset()
+train_x, train_y = data_partition(train_x, train_y,
+                                  sgd.global_rank, sgd.world_size)
+test_x, test_y = data_partition(test_x, test_y,
+                                sgd.global_rank, sgd.world_size)
+```
+
+A partition of the dataset is returned for this `dev`.
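
To see how the slicing in `data_partition` behaves, here is a small sketch that applies the same index arithmetic to plain Python lists (hence `len()` rather than `.shape[0]`). The function name `partition_lists` and the toy data are made up for illustration.

```python
def partition_lists(dataset_x, dataset_y, global_rank, world_size):
    """Same index arithmetic as data_partition, for plain lists."""
    data_per_rank = len(dataset_x) // world_size
    idx_start = global_rank * data_per_rank
    idx_end = (global_rank + 1) * data_per_rank
    return dataset_x[idx_start:idx_end], dataset_y[idx_start:idx_end]


samples = list(range(10))          # 10 toy samples
labels = [s % 2 for s in samples]  # toy labels
world_size = 3

parts = [partition_lists(samples, labels, rank, world_size)
         for rank in range(world_size)]
for rank, (xs, ys) in enumerate(parts):
    print(rank, xs, ys)
# 0 [0, 1, 2] [0, 1, 0]
# 1 [3, 4, 5] [1, 0, 1]
# 2 [6, 7, 8] [0, 1, 0]
```

The partitions are disjoint and of equal size. Because of the integer division, the remainder (sample 9 here) is not assigned to any worker.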
+
+4. Initialize and synchronize the model parameters among all workers:
+
+```python
+def synchronize(tensor, dist_opt):
+    dist_opt.all_reduce(tensor.data)
+    tensor /= dist_opt.world_size
+
+# Synchronize the initial parameters
+tx = tensor.Tensor((batch_size, 1, IMG_SIZE, IMG_SIZE), dev, tensor.float32)
+ty = tensor.Tensor((batch_size, num_classes), dev, tensor.int32)
+...
+out = model.forward(tx)
+loss = autograd.softmax_cross_entropy(out, ty)
+for p, g in autograd.backward(loss):
+    synchronize(p, sgd)
+```
+
+Here, `world_size` represents the total number of processes in all the nodes you
+are using for distributed training.
+
+5. Run BackPropagation and distributed SGD
+
+```python
+for epoch in range(max_epoch):
+    for b in range(num_train_batch):
+        x = train_x[idx[b * batch_size: (b + 1) * batch_size]]
+        y = train_y[idx[b * batch_size: (b + 1) * batch_size]]
+        tx.copy_from_numpy(x)
+        ty.copy_from_numpy(y)
+        out = model.forward(tx)
+        loss = autograd.softmax_cross_entropy(out, ty)
+        # do backpropagation and all-reduce
+        sgd.backward_and_update(loss)
+```
+
+### Execution Instruction
+
+There are two ways to launch the training: MPI or Python multiprocessing.
+
+#### Python multiprocessing
+
+It works on a single node with multiple GPUs, where each GPU is one worker.
+
+1. Put all the above training code in a function
+
+```python
+def train_mnist_cnn(nccl_id=None, local_rank=None, world_size=None):
+    ...
+```
+
+2. Create `mnist_multiprocess.py`
+
+```python
+import multiprocessing
+import sys
+
+# Assumed import path for the SWIG wrapper that provides NcclIdHolder
+from singa import singa_wrap as singa
+
+if __name__ == '__main__':
+    # Generate a NCCL ID to be used for collective communication
+    nccl_id = singa.NcclIdHolder()
+
+    # Define the number of GPUs to be used in the training process
+    world_size = int(sys.argv[1])
+
+    # Define and launch the multi-processing
+    process = []
+    for local_rank in range(0, world_size):
+        process.append(multiprocessing.Process(target=train_mnist_cnn,
+                       args=(nccl_id, local_rank, world_size)))
+
+    for p in process:
+        p.start()
+
+Here are some explanations concerning the variables created above:
+
+(i) `nccl_id`
+
+Note that we need to generate a NCCL ID here to be used for collective
+communication, and then pass it to all the processes. The NCCL ID is like a
+ticket: only the processes with this ID can join the all-reduce operation.
+(Later, if we use MPI, passing the NCCL ID is not necessary, because the ID is
+broadcast by MPI in our code automatically.)
+
+(ii) `world_size`
+
+`world_size` is the number of GPUs you would like to use for training.
+
+(iii) `local_rank`
+
+`local_rank` determines the local rank of the process in the distributed
+training and which GPU the process uses. In the code above, a for loop launches
+the training function once per process, with the argument `local_rank`
+iterating from 0 to `world_size - 1`. In this way, different processes use
+different GPUs for training.
+
+The arguments for creating the `DistOpt` instance should be updated as follows
+
+```python
+sgd = opt.DistOpt(sgd, nccl_id=nccl_id, local_rank=local_rank, world_size=world_size)
+```
+
+3. Run `mnist_multiprocess.py`
+
+```sh
+python mnist_multiprocess.py 2
+```
+
+It results in a speedup compared to single-GPU training.
+
+```
+Starting Epoch 0:
+Training loss = 408.909790, training accuracy = 0.880475
+Evaluation accuracy = 0.956430
+Starting Epoch 1:
+Training loss = 102.396790, training accuracy = 0.967415
+Evaluation accuracy = 0.977564
+Starting Epoch 2:
+Training loss = 69.217010, training accuracy = 0.977915
+Evaluation accuracy = 0.981370
+Starting Epoch 3:
+Training loss = 54.248390, training accuracy = 0.982823
+Evaluation accuracy = 0.984075
+Starting Epoch 4:
+Training loss = 45.213406, training accuracy = 0.985560
+Evaluation accuracy = 0.985276
+Starting Epoch 5:
+Training loss = 38.868435, training accuracy = 0.987764
+Evaluation accuracy = 0.986278
+Starting Epoch 6:
+Training loss = 34.078186, training accuracy = 0.989149
+Evaluation accuracy = 0.987881
+Starting Epoch 7:
+Training loss = 30.138697, training accuracy = 0.990451
+Evaluation accuracy = 0.988181
+Starting Epoch 8:
+Training loss = 26.854443, training accuracy = 0.991520
+Evaluation accuracy = 0.988682
+Starting Epoch 9:
+Training loss = 24.039650, training accuracy = 0.992405
+Evaluation accuracy = 0.989083
+```
+
+#### MPI
+
+It works for both single node and multiple nodes as long as there are multiple
+GPUs.
+
+1. Create `mnist_dist.py`
+
+```python
+if __name__ == '__main__':
+    train_mnist_cnn()
+```
+
+2. Generate a hostfile for MPI, e.g. the hostfile below uses 2 processes (i.e.,
+   2 GPUs) on a single node
+
+```txt
+localhost:2
+```
+
+3. Launch the training via `mpiexec`
+
+```sh
+mpiexec --hostfile host_file python mnist_dist.py
+```
+
+It could result in a speedup compared to single-GPU training.
+
+```
+Starting Epoch 0:
+Training loss = 383.969543, training accuracy = 0.886402
+Evaluation accuracy = 0.954327
+Starting Epoch 1:
+Training loss = 97.531479, training accuracy = 0.969451
+Evaluation accuracy = 0.977163
+Starting Epoch 2:
+Training loss = 67.166870, training accuracy = 0.978516
+Evaluation accuracy = 0.980769
+Starting Epoch 3:
+Training loss = 53.369656, training accuracy = 0.983040
+Evaluation accuracy = 0.983974
+Starting Epoch 4:
+Training loss = 45.100403, training accuracy = 0.985777
+Evaluation accuracy = 0.986078
+Starting Epoch 5:
+Training loss = 39.330826, training accuracy = 0.987447
+Evaluation accuracy = 0.987179
+Starting Epoch 6:
+Training loss = 34.655270, training accuracy = 0.988799
+Evaluation accuracy = 0.987780
+Starting Epoch 7:
+Training loss = 30.749735, training accuracy = 0.989984
+Evaluation accuracy = 0.988281
+Starting Epoch 8:
+Training loss = 27.422146, training accuracy = 0.991319
+Evaluation accuracy = 0.988582
+Starting Epoch 9:
+Training loss = 24.548153, training accuracy = 0.992171
+Evaluation accuracy = 0.988682
+```
+
+## Optimizations for Distributed Training
+
+SINGA provides multiple optimization strategies for distributed training to
+reduce the communication cost. Refer to the API for `DistOpt` for the
+configuration of each strategy.
+
+### No Optimizations
+
+```python
+sgd.backward_and_update(loss)
+```
+
+`loss` is the output tensor from the loss function, e.g., cross-entropy for
+classification tasks.
+
+### Half-precision Gradients
+
+```python
+sgd.backward_and_update_half(loss)
+```
+
+It converts each gradient value to 16-bit representation (i.e., half-precision)
+before calling all-reduce.
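
The trade-off behind half-precision gradients can be illustrated with the standard library alone. The sketch below is not how SINGA performs the conversion; it only shows what a 16-bit representation does to payload size and precision, using the `e` (IEEE 754 half-precision) format of Python's `struct` module. The helper names and gradient values are made up.

```python
import struct


def to_half_bytes(values):
    """Pack floats as little-endian IEEE 754 half-precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


def from_half_bytes(buf):
    """Unpack a buffer of half-precision values back into Python floats."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))


grads = [0.1, -1.5, 3.0e-4, 1024.0]  # made-up gradient values

packed = to_half_bytes(grads)
restored = from_half_bytes(packed)
float32_size = len(struct.pack(f"<{len(grads)}f", *grads))

print(len(packed), float32_size)  # 8 16
for original, value in zip(grads, restored):
    print(original, value, abs(original - value))
```

The payload is half the size of the 32-bit one, at the cost of a rounding error on values that half precision cannot represent exactly (0.1 here), while values such as -1.5 and 1024.0 survive unchanged.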
+
+### Partial Synchronization
+
+```python
+sgd.backward_and_partial_update(loss)
+```
+
+In each iteration, every rank does a local SGD update. Then, only a chunk of
+the parameters is averaged for synchronization, which saves communication cost.
+The chunk size is configured when creating the `DistOpt` instance.
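
A minimal pure-Python sketch of the idea follows. It is not SINGA's implementation: the function `partial_sync` is hypothetical, and the assumption that the synchronized chunk advances through the parameters and wraps around from one iteration to the next is made here only for illustration.

```python
def partial_sync(params_per_worker, start, chunk_size):
    """Average params[start:start + chunk_size] across workers, in place.

    Returns the start index of the next chunk (assumption: chunks advance
    through the parameters and wrap around).
    """
    world_size = len(params_per_worker)
    num_params = len(params_per_worker[0])
    end = min(start + chunk_size, num_params)
    for i in range(start, end):
        mean = sum(p[i] for p in params_per_worker) / world_size
        for p in params_per_worker:
            p[i] = mean
    return end % num_params


# Two workers whose parameters have drifted apart after local updates.
params = [[0.0, 0.0, 0.0, 0.0], [4.0, 4.0, 4.0, 4.0]]

start = partial_sync(params, 0, chunk_size=2)
after_first = [list(p) for p in params]
print(after_first)  # [[2.0, 2.0, 0.0, 0.0], [2.0, 2.0, 4.0, 4.0]]

start = partial_sync(params, start, chunk_size=2)
print(params)       # [[2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0]]
```

Each call communicates only `chunk_size` values instead of the whole parameter vector, so the replicas agree on a chunk at a time rather than on everything in every iteration.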
+
+### Gradient Sparsification
+
+```python
+sgd.backward_and_sparse_update(loss)
+```
+
+It applies sparsification schemes to select a subset of gradients for
+all-reduce. There are two schemes:
+
+- The top-K largest elements are selected. `spars` is the portion (0 - 1) of
+  total elements selected.
+
+```python
+sgd.backward_and_sparse_update(loss = loss, spars = spars, topK = True)
+```
+
+- All gradients whose absolute values are larger than the predefined threshold
+  `spars` are selected.
+
+```python
+sgd.backward_and_sparse_update(loss = loss, spars = spars, topK = False)
+```
+
+The hyper-parameters are configured when creating the `DistOpt` instance.
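
The two selection rules can be sketched in plain Python as below. This is an illustration of the schemes, not SINGA's GPU implementation. The function names are hypothetical, ranking top-K by absolute value is an assumption, and keeping at least one element in the top-K case is a choice made for this sketch.

```python
def select_top_k(grads, spars):
    """Keep the fraction `spars` of elements with the largest magnitude.

    Returns (index, value) pairs, which is what a sparse exchange would send.
    """
    k = max(1, int(len(grads) * spars))  # assumption: keep at least one
    order = sorted(range(len(grads)), key=lambda i: abs(grads[i]),
                   reverse=True)
    return [(i, grads[i]) for i in sorted(order[:k])]


def select_by_threshold(grads, spars):
    """Keep every element whose absolute value exceeds the threshold."""
    return [(i, g) for i, g in enumerate(grads) if abs(g) > spars]


grads = [0.05, -0.9, 0.3, -0.01, 0.6, 0.0, -0.2, 0.1]  # made-up values

print(select_top_k(grads, 0.25))         # [(1, -0.9), (4, 0.6)]
print(select_by_threshold(grads, 0.25))  # [(1, -0.9), (2, 0.3), (4, 0.6)]
```

With top-K the number of selected elements is fixed in advance, whereas with a threshold it depends on the gradient values of the current iteration.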
+
+## Implementation
+
+This section is mainly for developers who want to know how the code in the
+distributed training module is implemented.
+
+### C interface for NCCL communicator
+
+The communication layer is implemented in
+[communicator.cc](https://github.com/apache/singa/blob/master/src/io/communicator.cc).
+It uses the NCCL library for collective communication.
+
+There are two constructors for the communicator, one for MPI and another for
+multiprocess.
+
+(i) Constructor using MPI
+
+The constructor first obtains the global rank and the world size, and
+calculates the local rank. Then, rank 0 generates a NCCL ID and broadcasts it to
+every rank. After that, it calls the setup function to initialize the NCCL
+communicator, CUDA streams, and buffers.
+
+(ii) Constructor using Python multiprocess
+
+The constructor first obtains the rank, the world size, and the NCCL ID from the
+input argument. After that, it calls the setup function to initialize the NCCL
+communicator, cuda streams, and buffers.
+
+After the initialization, the communicator provides the all-reduce
+functionality to synchronize the model parameters or gradients. For instance,
+`synch` takes an input tensor and performs all-reduce through the NCCL routine.
+After calling `synch`, it is necessary to call the `wait` function to wait for
+the all-reduce operation to complete.
+
+### Python interface for DistOpt
+
+The Python interface provides a
+[DistOpt](https://github.com/apache/singa/blob/master/python/singa/opt.py) class
+to wrap an
+[optimizer](https://github.com/apache/singa/blob/master/python/singa/opt.py)
+object to perform distributed training based on MPI or multiprocessing. During
+the initialization, it creates a NCCL communicator object (from the C interface
+as mentioned in the subsection above). Then, this communicator object is used
+for every all-reduce operation in DistOpt.
+
+In MPI or multiprocess, each process has an individual rank, which gives
+information of which GPU the individual process is using. The training data is
+partitioned, so that each process can evaluate the sub-gradient based on the
+partitioned training data. Once the sub-gradient is calculated on each
+process, the overall stochastic gradient is obtained by all-reducing the
+sub-gradients evaluated by all processes.
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/download-singa.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/download-singa.md
new file mode 100644
index 0000000..8bea604
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/download-singa.md
@@ -0,0 +1,170 @@
+---
+id: version-3.0.0.rc1-download-singa
+title: Download SINGA
+original_id: download-singa
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+## Verify
+
+To verify the downloaded tar.gz file, download the
+[KEYS](https://www.apache.org/dist/incubator/singa/KEYS) and ASC files and then
+execute the following commands
+
+```shell
+% gpg --import KEYS
+% gpg --verify downloaded_file.asc downloaded_file
+```
+
+You can also check the SHA512 or MD5 values to see if the download is completed.
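
If you prefer to check the SHA512 value from Python, a minimal sketch using only the standard library could look like this. The function names `sha512_of` and `digest_matches` are made up; `expected_hex` is assumed to be the hexadecimal digest copied by hand from the published `.sha512` file, since the layout of that file can differ between releases.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Compute the SHA512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_matches(path, expected_hex):
    """Compare a file's digest with the expected one.

    Whitespace and letter case in `expected_hex` are ignored, because
    published digests are sometimes upper-case or split into groups.
    """
    normalized = "".join(expected_hex.split()).lower()
    return sha512_of(path) == normalized
```

For example, `digest_matches("downloaded_file", "<digest from the .sha512 file>")` returns `True` only when the download is complete and unmodified. This check complements the `gpg --verify` step and does not replace it.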
+
+## Incubating v2.0.0 (20 April 2019):
+
+- [Apache SINGA 2.0.0 (incubating)](http://www.apache.org/dyn/closer.cgi/incubator/singa/2.0.0/apache-singa-incubating-2.0.0.tar.gz)
+  [\[SHA512\]](https://www.apache.org/dist/incubator/singa/2.0.0/apache-singa-incubating-2.0.0.tar.gz.sha512)
+  [\[ASC\]](https://www.apache.org/dist/incubator/singa/2.0.0/apache-singa-incubating-2.0.0.tar.gz.asc)
+- [Release Notes 2.0.0 (incubating)](releases/RELEASE_NOTES_2.0.0.html)
+- New features and major updates,
+  - Enhance autograd (for Convolution networks and recurrent networks)
+  - Support ONNX
+  - Improve the CPP operations via Intel MKL DNN lib
+  - Implement tensor broadcasting
+  - Move Docker images under Apache user name
+  - Update dependent lib versions in conda-build config
+
+## Incubating v1.2.0 (6 June 2018):
+
+- [Apache SINGA 1.2.0 (incubating)](https://archive.apache.org/dist/incubator/singa/1.2.0/apache-singa-incubating-1.2.0.tar.gz)
+  [\[SHA512\]](https://archive.apache.org/dist/incubator/singa/1.2.0/apache-singa-incubating-1.2.0.tar.gz.sha512)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/1.2.0/apache-singa-incubating-1.2.0.tar.gz.asc)
+- [Release Notes 1.2.0 (incubating)](releases/RELEASE_NOTES_1.2.0.html)
+- New features and major updates,
+  - Implement autograd (currently support MLP model)
+  - Upgrade PySinga to support Python 3
+  - Improve the Tensor class with the stride field
+  - Upgrade cuDNN from V5 to V7
+  - Add VGG, Inception V4, ResNet, and DenseNet for ImageNet classification
+  - Create alias for conda packages
+  - Complete documentation in Chinese
+  - Add instructions for running Singa on Windows
+  - Update the compilation, CI
+  - Fix some bugs
+
+## Incubating v1.1.0 (12 February 2017):
+
+- [Apache SINGA 1.1.0 (incubating)](https://archive.apache.org/dist/incubator/singa/1.1.0/apache-singa-incubating-1.1.0.tar.gz)
+  [\[MD5\]](https://archive.apache.org/dist/incubator/singa/1.1.0/apache-singa-incubating-1.1.0.tar.gz.md5)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/1.1.0/apache-singa-incubating-1.1.0.tar.gz.asc)
+- [Release Notes 1.1.0 (incubating)](releases/RELEASE_NOTES_1.1.0.html)
+- New features and major updates,
+  - Create Docker images (CPU and GPU versions)
+  - Create Amazon AMI for SINGA (CPU version)
+  - Integrate with Jenkins for automatically generating Wheel and Debian
+    packages (for installation), and updating the website.
+  - Enhance the FeedForwardNet, e.g., multiple inputs and verbose mode for
+    debugging
+  - Add Concat and Slice layers
+  - Extend CrossEntropyLoss to accept instance with multiple labels
+  - Add image_tool.py with image augmentation methods
+  - Support model loading and saving via the Snapshot API
+  - Compile SINGA source on Windows
+  - Compile mandatory dependent libraries together with SINGA code
+  - Enable Java binding (basic) for SINGA
+  - Add version ID in checkpointing files
+  - Add Rafiki toolkit for providing RESTFul APIs
+  - Add examples pretrained from Caffe, including GoogleNet
+
+## Incubating v1.0.0 (8 September 2016):
+
+- [Apache SINGA 1.0.0 (incubating)](https://archive.apache.org/dist/incubator/singa/1.0.0/apache-singa-incubating-1.0.0.tar.gz)
+  [\[MD5\]](https://archive.apache.org/dist/incubator/singa/1.0.0/apache-singa-incubating-1.0.0.tar.gz.md5)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/1.0.0/apache-singa-incubating-1.0.0.tar.gz.asc)
+- [Release Notes 1.0.0 (incubating)](releases/RELEASE_NOTES_1.0.0.html)
+- New features and major updates,
+  - Tensor abstraction for supporting more machine learning models.
+  - Device abstraction for running on different hardware devices, including CPU,
+    (Nvidia/AMD) GPU and FPGA (to be tested in later versions).
+  - Replace GNU autotool with cmake for compilation.
+  - Support Mac OS
+  - Improve Python binding, including installation and programming
+  - More deep learning models, including VGG and ResNet
+  - More IO classes for reading/writing files and encoding/decoding data
+  - New network communication components directly based on Socket.
+  - Cudnn V5 with Dropout and RNN layers.
+  - Replace website building tool from maven to Sphinx
+  - Integrate Travis-CI
+
+## Incubating v0.3.0 (20 April 2016):
+
+- [Apache SINGA 0.3.0 (incubating)](https://archive.apache.org/dist/incubator/singa/0.3.0/apache-singa-incubating-0.3.0.tar.gz)
+  [\[MD5\]](https://archive.apache.org/dist/incubator/singa/0.3.0/apache-singa-incubating-0.3.0.tar.gz.md5)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/0.3.0/apache-singa-incubating-0.3.0.tar.gz.asc)
+- [Release Notes 0.3.0 (incubating)](releases/RELEASE_NOTES_0.3.0.html)
+- New features and major updates,
+  - Training on GPU cluster enables training of deep learning models over a GPU
+    cluster.
+  - Python wrapper improvement makes it easy to configure the job, including
+    neural net and SGD algorithm.
+  - New SGD updaters are added, including Adam, AdaDelta and AdaMax.
+  - Installation has fewer dependent libraries for single node training.
+  - Heterogeneous training with CPU and GPU.
+  - Support cuDNN V4.
+  - Data prefetching.
+  - Fix some bugs.
+
+## Incubating v0.2.0 (14 January 2016):
+
+- [Apache SINGA 0.2.0 (incubating)](https://archive.apache.org/dist/incubator/singa/0.2.0/apache-singa-incubating-0.2.0.tar.gz)
+  [\[MD5\]](https://archive.apache.org/dist/incubator/singa/0.2.0/apache-singa-incubating-0.2.0.tar.gz.md5)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/0.2.0/apache-singa-incubating-0.2.0.tar.gz.asc)
+- [Release Notes 0.2.0 (incubating)](releases/RELEASE_NOTES_0.2.0.html)
+- New features and major updates,
+  - Training on GPU enables training of complex models on a single node with
+    multiple GPU cards.
+  - Hybrid neural net partitioning supports data and model parallelism at the
+    same time.
+  - Python wrapper makes it easy to configure the job, including neural net and
+    SGD algorithm.
+  - RNN model and BPTT algorithm are implemented to support applications based
+    on RNN models, e.g., GRU.
+  - Cloud software integration includes Mesos, Docker and HDFS.
+  - Visualization of neural net structure and layer information, which is
+    helpful for debugging.
+  - Linear algebra functions and random functions against Blobs and raw data
+    pointers.
+  - New layers, including SoftmaxLayer, ArgSortLayer, DummyLayer, RNN layers and
+    cuDNN layers.
+  - Update Layer class to carry multiple data/grad Blobs.
+  - Extract features and test performance for new data by loading previously
+    trained model parameters.
+  - Add Store class for IO operations.
+
+## Incubating v0.1.0 (8 October 2015):
+
+- [Apache SINGA 0.1.0 (incubating)](https://archive.apache.org/dist/incubator/singa/apache-singa-incubating-0.1.0.tar.gz)
+  [\[MD5\]](https://archive.apache.org/dist/incubator/singa/apache-singa-incubating-0.1.0.tar.gz.md5)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/apache-singa-incubating-0.1.0.tar.gz.asc)
+- [Amazon EC2 image](https://console.aws.amazon.com/ec2/v2/home?region=ap-southeast-1#LaunchInstanceWizard:ami=ami-b41001e6)
+- [Release Notes 0.1.0 (incubating)](releases/RELEASE_NOTES_0.1.0.html)
+- Major features include,
+  - Installation using GNU build utility
+  - Scripts for job management with zookeeper
+  - Programming model based on NeuralNet and Layer abstractions.
+  - System architecture based on Worker, Server and Stub.
+  - Training models from three different model categories, namely, feed-forward
+    models, energy models and RNN models.
+  - Synchronous and asynchronous distributed training frameworks using CPU
+  - Checkpoint and restore
+  - Unit test using gtest
+
+> **Disclaimer**
+>
+> Apache SINGA is an effort undergoing incubation at The Apache Software
+> Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
+> required of all newly accepted projects until a further review indicates that
+> the infrastructure, communications, and decision making process have
+> stabilized in a manner consistent with other successful ASF projects. While
+> incubation status is not necessarily a reflection of the completeness or
+> stability of the code, it does indicate that the project has yet to be fully
+> endorsed by the ASF.
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/examples.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/examples.md
new file mode 100644
index 0000000..9092047
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/examples.md
@@ -0,0 +1,57 @@
+---
+id: version-3.0.0.rc1-examples
+title: Examples
+original_id: examples
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+This page lists example deep learning tasks implemented with SINGA. The source
+code is maintained in the SINGA repo on
+[GitHub](https://github.com/apache/singa/tree/master/examples). Examples that
+run on a CPU or a single GPU using the SINGA Python APIs are also available on
+[Google Colab](https://colab.research.google.com/), so you can run them directly
+on Google Cloud without setting up a local environment. The link to each
+example is given below.
+
+## Image Classification
+
+| Model       | Dataset                           | Links                                                                                                   |
+| ----------- | --------------------------------- | ------------------------------------------------------------------------------------------------------- |
+| Simple CNN  | MNIST, CIFAR10, CIFAR100          | [Colab](https://colab.research.google.com/drive/1fbGUs1AsoX6bU5F745RwQpohP4bHTktq)                      |
+| AlexNet     | ImageNet                          | [Cpp]()                                                                                                 |
+| VGG         | ImageNet                          | [Cpp](), [Python](), [Colab](https://colab.research.google.com/drive/14kxgRKtbjPCKKsDJVNi3AvTev81Gp_Ds) |
+| XceptionNet | MNIST, CIFAR10, CIFAR100          | [Python]()                                                                                              |
+| ResNet      | MNIST, CIFAR10, CIFAR100          | [Python](), [Colab](https://colab.research.google.com/drive/1u1RYefSsVbiP4I-5wiBKHjsT9L0FxLm9)          |
+| MobileNet   | ImageNet                          | [Colab](https://colab.research.google.com/drive/1HsixqJMIpKyEPhkbB8jy7NwNEFEAUWAf)                      |
+
+## Object Detection
+
+| Model       | Dataset    | Links                                                                              |
+| ----------- | ---------- | ---------------------------------------------------------------------------------- |
+| Tiny YOLOv2 | Pascal VOC | [Colab](https://colab.research.google.com/drive/11V4I6cRjIJNUv5ZGsEGwqHuoQEie6b1T) |
+
+## Face and Emotion Recognition
+
+| Model           | Dataset                                                                                                                                                | Links                                                                              |
+| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------- |
+| ArcFace         | Refined MS-Celeb-1M                                                                                                                                    | [Colab](https://colab.research.google.com/drive/1qanaqUKGIDtifdzEzJOHjEj4kYzA9uJC) |
+| Emotion FerPlus | [Facial Expression Recognition Challenge](https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data) | [Colab](https://colab.research.google.com/drive/1XHtBQGRhe58PDi4LGYJzYueWBeWbO23r) |
+
+## Image Generation
+
+| Model | Dataset | Links                                                                              |
+| ----- | ------- | ---------------------------------------------------------------------------------- |
+| GAN   | MNIST   | [Colab](https://colab.research.google.com/drive/1f86MNDW47DJqHoIqWD1tOxcyx2MWys8L) |
+| LSGAN | MNIST   | [Colab](https://colab.research.google.com/drive/1C6jNRf28vnFOI9JVM4lpkJPqxsnhxdol) |
+
+## Machine Comprehension
+
+| Model      | Dataset                                                                   | Links                                                                              |
+| ---------- | ------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
+| Bert-Squad | [SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) | [Colab](https://colab.research.google.com/drive/1kud-lUPjS_u-TkDAzihBTw0Vqr0FjCE-) |
+
+## Misc.
+
+- Restricted Boltzmann Machine over the MNIST dataset, [source](),
+  [Colab](https://colab.research.google.com/drive/19996noGu9JyHHkVmp4edBGu7PJSRQKsd).
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/git-workflow.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/git-workflow.md
new file mode 100644
index 0000000..c956cf3
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/git-workflow.md
@@ -0,0 +1,131 @@
+---
+id: version-3.0.0.rc1-git-workflow
+title: Git Workflow
+original_id: git-workflow
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+## For Developers
+
+1. Fork the [SINGA GitHub repository](https://github.com/apache/singa) to your
+   own GitHub account.
+
+2. Clone the **repo** (short for repository) from your GitHub account and add
+   the Apache repo as the `upstream` remote.
+
+   ```shell
+   git clone https://github.com/<Github account>/singa.git
+   cd singa
+   git remote add upstream https://github.com/apache/singa.git
+   ```
+
+3. Create a new branch (e.g., `feature-foo` or `fixbug-foo`), work on it and
+   commit your code.
+
+   ```shell
+   git checkout dev
+   git checkout -b feature-foo
+   # write your code
+   git add <created/updated files>
+   git commit
+   ```
+
+   The commit message should include:
+
+   - A descriptive Title.
+   - A detailed description. If the commit is to fix a bug, the description
+     should ideally include a short reproduction of the problem. For new
+     features, it may include the motivation/purpose of this new feature.
+
+   If your branch has many small commits, clean them up with an interactive
+   rebase:
+
+   ```shell
+   git rebase -i <commit id>
+   ```
+
+   You can
+   [squash and reword](https://help.github.com/en/articles/about-git-rebase) the
+   commits.
+
+4. While you are working on the code, the `dev` branch of SINGA may have been
+   updated by others. In this case, you need to pull the latest `dev`:
+
+   ```shell
+   git checkout dev
+   git pull upstream dev:dev
+   ```
+
+5. [Rebase](https://git-scm.com/book/en/v2/Git-Branching-Rebasing) `feature-foo`
+   onto the `dev` branch and push commits to your own GitHub account (the new
+   branch). The rebase operation keeps the commit history clean. The
+   following git commands should be executed after committing the current
+   work:
+
+   ```shell
+   git checkout feature-foo
+   git rebase dev
+   git push origin feature-foo:feature-foo
+   ```
+
+   The rebase command does the
+   [following steps](https://git-scm.com/book/en/v2/Git-Branching-Rebasing):
+   "This operation works by going to the common ancestor of the two branches
+   (the one you’re on and the one you’re rebasing onto), getting the diff
+   introduced by each commit of the branch you’re on, saving those diffs to
+   temporary files, resetting the current branch to the same commit as the
+   branch you are rebasing onto, and finally applying each change in turn."
+   Therefore, after executing it, you will still be on the feature branch, but
+   your own commit IDs/hashes will have changed because the diffs are
+   re-committed during the rebase; your branch now has the latest code from the
+   dev branch plus your own changes.
+
+6. Open a pull request (PR) against the dev branch of apache/singa on the GitHub
+   website. If you want to inform other contributors who worked on the same
+   files, you can find the file(s) on GitHub and click "Blame" to see a
+   line-by-line annotation of who changed the code last. Then, you can add
+   @username in the PR description to ping them immediately. Please state that
+   the contribution is your original work and that you license the work to the
+   project under the project's open source license. Further commits (e.g., bug
+   fixes) to your new branch will be added to this pull request automatically by
+   GitHub.
+
+7. Wait for committers to review the PR. During this time, the dev of SINGA may
+   have been updated by others, and then you need to
+   [merge the latest dev](https://docs.fast.ai/dev/git.html#how-to-keep-your-feature-branch-up-to-date)
+   to resolve conflicts. Some people
+   [rebase the PR onto the latest dev](https://github.com/edx/edx-platform/wiki/How-to-Rebase-a-Pull-Request)
+   instead of merging. However, if other developers fetch this PR to add new
+   features and then send a PR, the rebase operation would introduce **duplicate
+   commits** (with different hashes) in the future PR. See
+   [The Golden Rule of Rebasing](https://www.atlassian.com/git/tutorials/merging-vs-rebasing)
+   for the details of when to avoid using rebase. Another simple solution to
+   update the PR (to fix conflicts or commit errors) is to check out a new branch
+   from the latest dev branch of the Apache SINGA repo, copy and paste the
+   updated/added code, then commit and send a new PR.
+
+## For Committers
+
+Committers can merge the pull requests (PRs) into the dev branch of the upstream
+repo. Before merging each PR, the committer should
+
+- check the commit message (content and format)
+- check the changes to existing code. API changes should be recorded
+- check the Travis testing results for code/doc format and unit tests
+
+There are two approaches to merge a pull request:
+
+- On GitHub. Follow the [instructions](https://gitbox.apache.org/setup/) to
+  connect your Apache account with your GitHub account. After that you can
+  directly merge PRs on GitHub.
+- To merge pull request https://github.com/apache/singa/pull/xxx via the command
+  line, execute the following commands:
+
+  ```shell
+  git clone https://github.com/apache/singa.git
+  cd singa
+  git remote add asf https://gitbox.apache.org/repos/asf/singa.git
+  git fetch origin pull/xxx/head:prxxx
+  git checkout dev
+  git merge --no-ff prxxx
+  git push asf dev:dev
+  ```
+
+  Do not use rebase to merge the PR, and disable fast-forward merging
+  (`--no-ff`).
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/graph.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/graph.md
new file mode 100644
index 0000000..6bfe177
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/graph.md
@@ -0,0 +1,590 @@
+---
+id: version-3.0.0.rc1-graph
+title: Computational Graph
+original_id: graph
+---
+
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed [...]
+
+SINGA can buffer operations to create a computational graph (CG). With the
+computational graph, SINGA can schedule the execution of operations as well as
+the memory allocation and release. It makes training more efficient while using
+less memory.
+
+## About Computational Graph
+
+### Introduction
+
+A computational graph represents the flow of computation as a network. It is
+composed of nodes and edges, where nodes represent operations and edges
+represent data. In deep neural networks, nodes are tensor-based operations such
+as convolution and edges are tensors.
+
+Every neural network corresponds to a computational graph. By representing the
+neural network as a computational graph, some optimizations for neural networks
+can be performed on the graph.
+
+### Pipeline
+
+The whole process of using the computational graph to represent the model and
+execute it consists of roughly four steps:
+
+- Write the Python code for the model.
+- Construct the computational graph based on the Python code.
+- Optimize the computational graph.
+- Execute the computational graph efficiently.
+
+The process is similar to compiling. A compiler takes a program written as
+code, translates it into intermediate code, optimizes the intermediate code and
+finally generates code that can be executed efficiently. For neural networks,
+the intermediate code is the computational graph, which can be optimized through
+techniques like common sub-expression elimination. Just as a compiled binary can
+be executed efficiently using multiple threads, so can the computational graph.
+Therefore, some ideas from compiler design can also be applied to the
+optimization of computational graphs.
+
+Figure 1 shows a simple example of going through the entire process.
+
+<img src="assets/GraphPipeline.png" alt="The pipeline of using computational graph" style="zoom:40%;" />
+
+<br/>**Figure 1 - The pipeline of using computational graph**
+
+### An example of MLP
+
+A simple MLP model can be constructed on the Python side by using some APIs of
+SINGA.
+
+```python
+x = autograd.matmul(inputs, w0)
+x = autograd.add_bias(x, b0)
+x = autograd.relu(x)
+x = autograd.matmul(x, w1)
+x = autograd.add_bias(x, b1)
+loss = autograd.softmax_cross_entropy(x, target)
+sgd.backward_and_update(loss)
+```
+
+Once the model is defined, there is a computational graph corresponding to it.
+This graph contains all the calculations that SINGA will perform. Figure 2 shows
+the computational graph corresponding to the MLP model defined above.
+
+![The computational graph of MLP](assets/GraphOfMLP.png)
+
+<br/>**Figure 2 - The computational graph of MLP**
+
+## Features
+
+There are four main components of a computational graph in SINGA, namely (i)
+Computational graph construction, (ii) Lazy allocation, (iii) Automatic
+recycling, (iv) Shared memory. Details are as follows:
+
+- `Computational graph construction`: Construct a computational graph based on
+  the mathematical or deep learning operations, and then run the graph to
+  accomplish the training task. The computational graph also includes operations
+  like communicator.synch and communicator.fusedSynch for the distributed
+  training.
+- `Lazy allocation`: When blocks are allocated, devices do not allocate memory
+  for them immediately. Devices do memory allocation only when an operation uses
+  this block for the first time.
+- `Automatic recycling`: When we are running a graph in an iteration, it
+  automatically deallocates the intermediate tensors which won't be used again
+  in the remaining operations.
+- `Shared memory`: When two operations will never be performed at the same time,
+  the result tensors produced by them can share a piece of memory.
+
+## How to use
+
+- A CNN example.
+
+```Python
+
+class CNN(module.Module):
+
+    def __init__(self, optimizer):
+        super(CNN, self).__init__()
+
+        self.conv1 = autograd.Conv2d(1, 20, 5, padding=0)
+        self.conv2 = autograd.Conv2d(20, 50, 5, padding=0)
+        self.linear1 = autograd.Linear(4 * 4 * 50, 500)
+        self.linear2 = autograd.Linear(500, 10)
+        self.pooling1 = autograd.MaxPool2d(2, 2, padding=0)
+        self.pooling2 = autograd.MaxPool2d(2, 2, padding=0)
+
+        self.optimizer = optimizer
+
+    def forward(self, x):
+        y = self.conv1(x)
+        y = autograd.relu(y)
+        y = self.pooling1(y)
+        y = self.conv2(y)
+        y = autograd.relu(y)
+        y = self.pooling2(y)
+        y = autograd.flatten(y)
+        y = self.linear1(y)
+        y = autograd.relu(y)
+        y = self.linear2(y)
+        return y
+
+    def loss(self, x, ty):
+        return autograd.softmax_cross_entropy(x, ty)
+
+    def optim(self, loss):
+        self.optimizer.backward_and_update(loss)
+
+# initialize other objects
+# ......
+model = CNN(sgd)
+model.train()
+model.on_device(dev)
+model.graph(graph, sequential)
+
+# Train
+for b in range(num_train_batch):
+    # Generate the batch data in this iteration
+    # ......
+
+    # Copy the batch data into input tensors
+    tx.copy_from_numpy(x)
+    ty.copy_from_numpy(y)
+
+    # Train the model
+    out = model(tx)
+    loss = model.loss(out, ty)
+    model.optim(loss)
+```
+
+A Google Colab notebook of this example is available
+[here](https://colab.research.google.com/drive/1fbGUs1AsoX6bU5F745RwQpohP4bHTktq).
+
+- Some settings:
+  [module.py](https://github.com/apache/singa/blob/master/python/singa/module.py)
+  - `training`: whether the neural network defined in the class is used for
+    training or for evaluation.
+  - `graph_mode`: whether the model class defined by users is trained using the
+    computational graph.
+  - `sequential`: whether to execute operations in the graph serially or in BFS
+    order.
+- More examples:
+  - [MLP](https://github.com/apache/singa/blob/master/examples/autograd/mlp_module.py)
+  - [CNN](https://github.com/apache/singa/blob/master/examples/autograd/cnn_module.py)
+  - [ResNet](https://github.com/apache/singa/blob/master/examples/autograd/resnet_module.py)
+
+## Experiments
+
+### Single node
+
+- Experiment settings
+  - Model
+    - Using layer: ResNet50 in
+      [resnet.py](https://github.com/apache/singa/blob/master/examples/autograd/resnet.py)
+    - Using module: ResNet50 in
+      [resnet_module.py](https://github.com/apache/singa/blob/master/examples/autograd/resnet_module.py)
+  - GPU: NVIDIA RTX 2080Ti
+- Notations
+  - `s`: second
+  - `it`: iteration
+  - `Mem`: peak memory usage of a single GPU
+  - `Throughput`: number of images processed per second
+  - `Time`: total time
+  - `Speed`: iterations per second
+  - `Reduction`: the memory usage reduction rate compared with that using layer
+  - `Speedup`: speedup ratio compared with dev branch
+- Result
+  <table style="text-align: center">
+      <tr>
+          <th style="text-align: center">Batchsize</th>
+          <th style="text-align: center">Cases</th>
+          <th style="text-align: center">Mem(MB)</th>
+          <th style="text-align: center">Time(s)</th>
+          <th style="text-align: center">Speed(it/s)</th>
+          <th style="text-align: center">Throughput</th>
+          <th style="text-align: center">Reduction</th>
+          <th style="text-align: center">Speedup</th>
+      </tr>
+      <tr>
+          <td rowspan="4">16</td>
+          <td nowrap>layer</td>
+          <td>4975</td>
+          <td>14.1952</td>
+          <td>14.0893</td>
+          <td>225.4285</td>
+          <td>0.00%</td>
+          <td>1.0000</td>
+      </tr>
+      <tr>
+          <td nowrap>module:disable graph</td>
+          <td>4995</td>
+          <td>14.1264</td>
+          <td>14.1579</td>
+          <td>226.5261</td>
+          <td>-0.40%</td>
+          <td>1.0049</td>
+      </tr>
+      <tr>
+          <td nowrap>module:enable graph, bfs</td>
+          <td>3283</td>
+          <td>13.7438</td>
+          <td>14.5520</td>
+          <td>232.8318</td>
+          <td>34.01%</td>
+          <td>1.0328</td>
+      </tr>
+      <tr>
+          <td nowrap>module:enable graph, serial</td>
+          <td>3265</td>
+          <td>13.7420</td>
+          <td>14.5540</td>
+          <td>232.8635</td>
+          <td>34.37%</td>
+          <td>1.0330</td>
+      </tr>
+      <tr>
+          <td rowspan="4">32</td>
+          <td nowrap>layer</td>
+          <td>10119</td>
+          <td>13.4587</td>
+          <td>7.4302</td>
+          <td>237.7649</td>
+          <td>0.00%</td>
+          <td>1.0000</td>
+      </tr>
+      <tr>
+          <td nowrap>module:disable graph</td>
+          <td>10109</td>
+          <td>13.2952</td>
+          <td>7.5315</td>
+          <td>240.6875</td>
+          <td>0.10%</td>
+          <td>1.0123</td>
+      </tr>
+      <tr>
+          <td nowrap>module:enable graph, bfs</td>
+          <td>6839</td>
+          <td>13.1059</td>
+          <td>7.6302</td>
+          <td>244.1648</td>
+          <td>32.41%</td>
+          <td>1.0269</td>
+      </tr>
+      <tr>
+          <td nowrap>module:enable graph, serial</td>
+          <td>6845</td>
+          <td>13.0489</td>
+          <td>7.6635</td>
+          <td>245.2312</td>
+          <td>32.35%</td>
+          <td>1.0314</td>
+      </tr>
+  </table>
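+
+The derived columns can be reproduced from the raw measurements. The sketch
+below is not part of SINGA; the function names are made up for illustration,
+and the formulas are assumptions inferred from the table, taking the `layer`
+row as the baseline (which is consistent with its 0.00% and 1.0000 entries).
+
+```python
+def reduction(mem_mb, baseline_mem_mb):
+    """Memory reduction rate relative to the baseline, as a fraction."""
+    return 1.0 - mem_mb / baseline_mem_mb
+
+
+def speedup(time_s, baseline_time_s):
+    """Speedup ratio relative to the baseline run."""
+    return baseline_time_s / time_s
+
+
+def throughput(speed_it_per_s, batch_size, num_gpus=1):
+    """Images processed per second."""
+    return speed_it_per_s * batch_size * num_gpus
+
+
+# Batch size 16: "layer" (baseline) vs "module:enable graph, bfs".
+print(f"{reduction(3283, 4975):.2%}")        # 34.01%
+print(f"{speedup(13.7438, 14.1952):.4f}")    # 1.0328
+print(f"{throughput(14.5520, 16):.2f}")      # 232.83
+```
+
+The results match the table up to rounding of the reported speed.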
+
+### Multi processes
+
+- Experiment settings
+  - Model
+    - using Layer: ResNet50 in
+      [resnet_dist.py](https://github.com/apache/singa/blob/master/examples/autograd/resnet_dist.py)
+    - using Module: ResNet50 in
+      [resnet_module.py](https://github.com/apache/singa/blob/master/examples/autograd/resnet_module.py)
+  - GPU: NVIDIA RTX 2080Ti \* 2
+  - MPI: two MPI processes on one node
+- Notations: the same as above
+- Result
+  <table style="text-align: center">
+      <tr>
+          <th style="text-align: center">Batchsize</th>
+          <th style="text-align: center">Cases</th>
+          <th style="text-align: center">Mem(MB)</th>
+          <th style="text-align: center">Time(s)</th>
+          <th style="text-align: center">Speed(it/s)</th>
+          <th style="text-align: center">Throughput</th>
+          <th style="text-align: center">Reduction</th>
+          <th style="text-align: center">Speedup</th>
+      </tr>
+      <tr>
+          <td rowspan="4">16</td>
+          <td nowrap>layer</td>
+          <td>5439</td>
+          <td>17.3323</td>
+          <td>11.5391</td>
+          <td>369.2522</td>
+          <td>0.00%</td>
+          <td>1.0000</td>
+      </tr>
+      <tr>
+          <td nowrap>module:disable graph</td>
+          <td>5427</td>
+          <td>17.8232</td>
+          <td>11.2213</td>
+          <td>359.0831</td>
+          <td>0.22%</td>
+          <td>0.9725</td>
+      </tr>
+      <tr>
+          <td nowrap>module:enable graph, bfs</td>
+          <td>3389</td>
+          <td>18.2310</td>
+          <td>10.9703</td>
+          <td>351.0504</td>
+          <td>37.69%</td>
+          <td>0.9507</td>
+      </tr>
+      <tr>
+          <td nowrap>module:enable graph, serial</td>
+          <td>3437</td>
+          <td>17.0389</td>
+          <td>11.7378</td>
+          <td>375.6103</td>
+          <td>36.81%</td>
+          <td>1.0172</td>
+      </tr>
+      <tr>
+          <td rowspan="4">32</td>
+          <td nowrap>layer</td>
+          <td>10547</td>
+          <td>14.8635</td>
+          <td>6.7279</td>
+          <td>430.5858</td>
+          <td>0.00%</td>
+          <td>1.0000</td>
+      </tr>
+      <tr>
+          <td nowrap>module:disable graph</td>
+          <td>10503</td>
+          <td>14.7746</td>
+          <td>6.7684</td>
+          <td>433.1748</td>
+          <td>0.42%</td>
+          <td>1.0060</td>
+      </tr>
+      <tr>
+          <td nowrap>module:enable graph, bfs</td>
+          <td>6935</td>
+          <td>14.8553</td>
+          <td>6.7316</td>
+          <td>430.8231</td>
+          <td>34.25%</td>
+          <td>1.0006</td>
+      </tr>
+      <tr>
+          <td nowrap>module:enable graph, serial</td>
+          <td>7027</td>
+          <td>14.3271</td>
+          <td>6.9798</td>
+          <td>446.7074</td>
+          <td>33.37%</td>
+          <td>1.0374</td>
+      </tr>
+  </table>
+
+### Conclusion
+
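+- With the graph disabled, using the module has little effect on training time
+  or memory usage compared with using layers.
+- With the graph enabled, peak memory usage drops by roughly 32% to 38% in
+  these experiments. Training time improves by at most about 4%; in one case
+  (two processes, batch size 16, BFS order) it is about 5% slower.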
+
+## Implementation
+
+### Computational graph construction
+
+- `Buffer the operations`: Use delayed execution: run the forward and backward
+  propagation once without actually performing the operations, and buffer all
+  the operations and the tensors read or written by each operation. Take matmul
+  for example.
+
+  ```python
+  # user calls an api to do matmul on two tensors
+  x = autograd.matmul(inputs, w0)
+
+  # Python code inside the api
+  singa.Mult(inputs, w)
+  ```
+
+  ```c++
+  // the backend platform
+  // pass the specific execution function of the operation
+  // and the tensors it will read and write during the calculation to the device.
+  C->device()->Exec(
+      [a, A, b, B, CRef](Context *ctx) mutable {
+          GEMV<DType, Lang>(a, A, B, b, &CRef, ctx);
+      },
+      read_blocks, {C->block()});
+  ```
+
+- `Build nodes and edges`: Build the nodes and edges of the operations passed to
+  the device and add them into the computational graph. Since the scheduler is
+  only told which blocks each operation reads and writes, and some tensors share
+  the same blocks, the scheduler splits one edge into multiple edges to ensure
+  that the constructed graph is a directed acyclic graph.
+
+- `Analyze the graph`: Calculate dependencies between all the operations to
+  decide the order of execution. The system analyzes the same graph only once.
+  If new operations are added to the graph, the graph is re-analyzed.
+
+- `Run graph`: Execute all the operations in the order determined by the
+  analysis to update all the parameters. Memory for tensors is allocated and
+  deallocated according to the schedule to save memory.
+
+- `Module`: Provide a module class on the Python side so that users can use this
+  feature more conveniently.
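+
+The buffering and analysis steps can be illustrated with a small Python sketch.
+This is not SINGA's scheduler: `build_dependencies` and `bfs_order` are
+hypothetical helpers, blocks are represented by plain strings, and only
+read-after-write dependencies are tracked, which is a simplification.
+
+```python
+from collections import deque
+
+
+def build_dependencies(ops):
+    """ops: list of (name, read_blocks, write_blocks) in buffering order.
+
+    Returns {op index: set of op indices it must wait for}. An operation
+    depends on the last writer of every block it reads.
+    """
+    last_writer = {}
+    deps = {i: set() for i in range(len(ops))}
+    for i, (_, reads, writes) in enumerate(ops):
+        for block in reads:
+            if block in last_writer:
+                deps[i].add(last_writer[block])
+        for block in writes:
+            last_writer[block] = i
+    return deps
+
+
+def bfs_order(deps):
+    """Kahn's algorithm: repeatedly emit operations with no pending inputs."""
+    pending = {i: len(d) for i, d in deps.items()}
+    children = {i: [] for i in deps}
+    for i, d in deps.items():
+        for j in d:
+            children[j].append(i)
+    queue = deque(i for i in sorted(pending) if pending[i] == 0)
+    order = []
+    while queue:
+        i = queue.popleft()
+        order.append(i)
+        for c in children[i]:
+            pending[c] -= 1
+            if pending[c] == 0:
+                queue.append(c)
+    return order
+
+
+# The first layers of the MLP example, described by the blocks they touch.
+ops = [
+    ("matmul", ["inputs", "w0"], ["x0"]),
+    ("add_bias", ["x0", "b0"], ["x1"]),
+    ("relu", ["x1"], ["x2"]),
+    ("matmul", ["x2", "w1"], ["x3"]),
+]
+deps = build_dependencies(ops)
+print([ops[i][0] for i in bfs_order(deps)])
+```
+
+For a pure chain like this one, the BFS order and the serial (buffering) order
+coincide; they differ only when the graph has independent branches.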
+
+### Lazy allocation
+
+- When a device needs to create a new block, it passes only itself (the device)
+  to that block, instead of allocating a piece of memory from the mempool and
+  passing the pointer to that block.
+- When a block is accessed for the first time, the device corresponding to the
+  block allocates memory and then accesses it.
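+
+A minimal Python sketch of the idea follows. The `Device` and `Block` classes
+here are hypothetical stand-ins written for illustration, not SINGA's C++
+classes of the same names.
+
+```python
+class Device:
+    """Hypothetical device that hands out memory only on demand."""
+
+    def __init__(self):
+        self.allocated_bytes = 0
+
+    def malloc(self, size):
+        self.allocated_bytes += size
+        return bytearray(size)
+
+
+class Block:
+    """A block remembers its device and size but owns no memory at first."""
+
+    def __init__(self, device, size):
+        self.device = device
+        self.size = size
+        self._data = None  # nothing allocated yet
+
+    def data(self):
+        # Allocate on first access, then reuse the same buffer.
+        if self._data is None:
+            self._data = self.device.malloc(self.size)
+        return self._data
+
+
+dev = Device()
+block = Block(dev, 1024)
+print(dev.allocated_bytes)  # 0: creating the block allocates nothing
+block.data()[0] = 1
+print(dev.allocated_bytes)  # 1024: memory appears on first access
+```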
+
+### Automatic recycling
+
+- When calculating dependencies between the operations during graph
+  construction, the reference count of tensors can also be calculated.
+- When an operation is completed, the scheduler decreases the reference count of
+  the tensors that the operation used.
+- If a tensor's reference count reaches zero, the tensor won't be accessed by
+  later operations, so its memory can be recycled.
+- The program tracks the usage of each block. If a block is used on the Python
+  side, it will not be recycled, which is convenient for debugging on the
+  Python side.
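+
+The reference-counting rule can be sketched in a few lines of Python. This is
+an illustration under simplifying assumptions, not SINGA's implementation:
+`run_with_recycling` is a hypothetical helper, blocks are plain strings, and
+the `keep` argument stands in for blocks that are still used on the Python
+side.
+
+```python
+from collections import Counter
+
+
+def run_with_recycling(ops, keep=()):
+    """ops: list of (name, read_blocks, write_blocks) in execution order.
+
+    Returns a list of (op name, blocks freed right after it). Blocks in
+    `keep` are never freed.
+    """
+    # Reference count = number of reads of a block that are still to come.
+    ref_count = Counter(b for _, reads, _ in ops for b in reads)
+    freed_log = []
+    for name, reads, _writes in ops:
+        freed = []
+        for block in reads:
+            ref_count[block] -= 1
+            if ref_count[block] == 0 and block not in keep:
+                freed.append(block)
+        freed_log.append((name, freed))
+    return freed_log
+
+
+ops = [
+    ("matmul", ["inputs", "w0"], ["x0"]),
+    ("add_bias", ["x0", "b0"], ["x1"]),
+    ("relu", ["x1"], ["x2"]),
+]
+for name, freed in run_with_recycling(ops, keep={"inputs", "w0", "b0"}):
+    print(name, freed)
+```
+
+Here the intermediate results `x0` and `x1` are released as soon as their only
+reader has finished, while the inputs and parameters are kept.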
+
+### Shared memory
+
+- Once the kernel function of an operation has been added to the default CUDA
+  stream, the scheduler can immediately free the memory of any tensors that are
+  no longer needed after that operation, without waiting for the calculation to
+  complete. This is safe because the platform currently uses the default CUDA
+  stream for all calculations, so subsequent operations are never performed at
+  the same time as the current operation. Later tensors can therefore share the
+  same memory with these tensors.
+- Use a mempool to manage the GPU memory. The scheduler returns the memory used
+  by tensors to the mempool, and later tensors request memory from the mempool.
+  The mempool finds the most suitable blocks returned by previous tensors so
+  that later tensors share as much memory as possible.
+
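One way to picture how a mempool lets later tensors reuse returned memory is the toy pool below. It assumes that "most suitable" means best fit (the smallest returned block that is large enough); the source does not specify the policy, and the `MemPool` class and its methods are invented for the example.

```python
class MemPool:
    """Toy pool that reuses the smallest returned block that is large enough."""

    def __init__(self):
        self.free_sizes = []       # sizes of blocks returned to the pool
        self.fresh_bytes = 0       # bytes that had to be newly allocated

    def malloc(self, size):
        candidates = [s for s in self.free_sizes if s >= size]
        if candidates:
            best = min(candidates)         # best fit among returned blocks
            self.free_sizes.remove(best)
            return best                    # size of the block handed out
        self.fresh_bytes += size           # nothing reusable: allocate anew
        return size

    def free(self, block_size):
        self.free_sizes.append(block_size)
```

If blocks of 100 and 40 bytes have been returned, a request for 30 bytes reuses the 40-byte block and a request for 90 bytes reuses the 100-byte block, so no new memory is needed for either.
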
+## How to add a new operation
+
+For new operations to be included in the computational graph, they should be
+submitted to the device. The Device class on the C++ side adds these operations
+to the computational graph, and the scheduler schedules them automatically.
+
+#### Requirements
+
+Submitting an operation has the following requirements.
+
+- Pass in the function that the operation executes and the data blocks that
+  the operation reads and writes.
+
+- For the function of the operation: all variables used in the lambda
+  expression need to be captured according to the following rules.
+
+  - `capture by value`: If the variable is a local variable or will be
+    immediately released (e.g. intermediate tensors). If not captured by value,
+    these variables will be destroyed after buffering, before the deferred
+    calculation runs. Buffering is just a way to defer real calculations.
+  - `capture by reference`: If the variable is recorded on the Python side or
+    is a global variable (e.g. the parameter W and the ConvHandle in the Conv2d
+    class).
+
+  - `mutable`: The lambda expression should be marked `mutable` if a variable
+    captured by value is modified inside the expression.
+
+#### Example
+
+- Python side:
+  [\_Conv2d](https://github.com/apache/singa/blob/dev/python/singa/autograd.py#L1191)
+  records x, W, b and handle in the class.
+
+```python
+class _Conv2d(Operation):
+
+    def __init__(self, handle, odd_padding=(0, 0, 0, 0)):
+        super(_Conv2d, self).__init__()
+        self.handle = handle  # record handle
+        self.odd_padding = odd_padding
+        if self.odd_padding != (0, 0, 0, 0):
+            self.re_new_handle = True
+
+    def forward(self, x, W, b=None):
+        # other code
+        # ......
+
+        if training:
+            if self.handle.bias_term:
+                self.inputs = (x, W, b) # record x, W, b
+            else:
+                self.inputs = (x, W)
+
+        # other code
+        # ......
+
+        if (type(self.handle) != singa.ConvHandle):
+            return singa.GpuConvForward(x, W, b, self.handle)
+        else:
+            return singa.CpuConvForward(x, W, b, self.handle)
+
+    def backward(self, dy):
+        if (type(self.handle) != singa.ConvHandle):
+            dx = singa.GpuConvBackwardx(dy, self.inputs[1], self.inputs[0],
+                                        self.handle)
+            dW = singa.GpuConvBackwardW(dy, self.inputs[0], self.inputs[1],
+                                        self.handle)
+            db = singa.GpuConvBackwardb(
+                dy, self.inputs[2],
+                self.handle) if self.handle.bias_term else None
+        else:
+            dx = singa.CpuConvBackwardx(dy, self.inputs[1], self.inputs[0],
+                                        self.handle)
+            dW = singa.CpuConvBackwardW(dy, self.inputs[0], self.inputs[1],
+                                        self.handle)
+            db = singa.CpuConvBackwardb(
+                dy, self.inputs[2],
+                self.handle) if self.handle.bias_term else None
+        if self.odd_padding != (0, 0, 0, 0):
+            dx = utils.handle_odd_pad_bwd(dx, self.odd_padding)
+
+        if db:
+            return dx, dW, db
+
+        else:
+            return dx, dW
+```
+
+- C++ side:
+  [convolution.cc](https://github.com/apache/singa/blob/dev/src/model/operation/convolution.cc)
+
+```c++
+Tensor GpuConvBackwardx(const Tensor &dy, const Tensor &W, const Tensor &x,
+                        const CudnnConvHandle &cch) {
+  CHECK_EQ(dy.device()->lang(), kCuda);
+
+  Tensor dx;
+  dx.ResetLike(x);
+
+  dy.device()->Exec(
+      /*
+       * dx is a local variable so it's captured by value
+       * dy is an intermediate tensor and isn't recorded on the python side
+       * W is an intermediate tensor but it's recorded on the python side
+       * cch is a variable and it's recorded on the python side
+       */
+      [dx, dy, &W, &cch](Context *ctx) mutable {
+        Block *wblock = W.block(), *dyblock = dy.block(), *dxblock = dx.block();
+        float alpha = 1.f, beta = 0.f;
+        cudnnConvolutionBackwardData(
+            ctx->cudnn_handle, &alpha, cch.filter_desc, wblock->data(),
+            cch.y_desc, dyblock->data(), cch.conv_desc, cch.bp_data_alg,
+            cch.workspace.block()->mutable_data(),
+            cch.workspace_count * sizeof(float), &beta, cch.x_desc,
+            dxblock->mutable_data());
+      },
+      {dy.block(), W.block()}, {dx.block(), cch.workspace.block()});
+      /* the lambda expression reads the blocks of tensor dy and W
+       * and writes the blocks of tensor dx and cch.workspace
+       */
+
+  return dx;
+}
+```
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/history-singa.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/history-singa.md
new file mode 100644
index 0000000..1584e99
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/history-singa.md
@@ -0,0 +1,42 @@
+---
+id: version-3.0.0.rc1-history-singa
+title: History of SINGA
+original_id: history-singa
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+## History
+
+SINGA was initiated by the DB System Group at National University of Singapore
+in 2014, in collaboration with the database group of Zhejiang University. Please
+cite the following two papers if you use SINGA in your research:
+
+- B.C. Ooi, K.-L. Tan, S. Wang, W. Wang, Q. Cai, G. Chen, J. Gao, Z. Luo, A. K.
+  H. Tung, Y. Wang, Z. Xie, M. Zhang, and K. Zheng.
+  [SINGA: A distributed deep learning platform](http://www.comp.nus.edu.sg/~ooibc/singaopen-mm15.pdf).
+  ACM Multimedia (Open Source Software Competition) 2015
+
+- W. Wang, G. Chen, T. T. A. Dinh, B. C. Ooi, K.-L. Tan, J. Gao, and S. Wang.
+  [SINGA: putting deep learning in the hands of multimedia users](http://www.comp.nus.edu.sg/~ooibc/singa-mm15.pdf).
+  ACM Multimedia 2015.
+
+Rafiki is a submodule of SINGA. Please cite the following paper if you use
+Rafiki in your research:
+
+- Wei Wang, Jinyang Gao, Meihui Zhang, Sheng Wang, Gang Chen, Teck Khim Ng, Beng
+  Chin Ooi, Jie Shao, Moaz Reyad.
+  [Rafiki: Machine Learning as an Analytics Service System](http://www.vldb.org/pvldb/vol12/p128-wang.pdf).
+  [VLDB 2019](http://vldb.org/2019/)
+  ([BibTex](https://dblp.org/rec/bib2/journals/pvldb/WangWGZCNOS18.bib)).
+
+Companies like [NetEase](http://tech.163.com/17/0602/17/CLUL016I00098GJ5.html),
+[yzBigData](http://www.yzbigdata.com/en/index.html),
+[Shentilium](https://shentilium.com/), [Foodlg](http://www.foodlg.com/) and
+[Medilot](https://medilot.com/technologies) are using SINGA for their
+applications.
+
+## License
+
+SINGA is released under
+[Apache License Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/how-to-release.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/how-to-release.md
new file mode 100644
index 0000000..916b67b
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/how-to-release.md
@@ -0,0 +1,142 @@
+---
+id: version-3.0.0.rc1-how-to-release
+title: How to Prepare a Release
+original_id: how-to-release
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+This is a guide for the
+[release preparation process](http://www.apache.org/dev/release-publishing.html)
+in SINGA.
+
+1. Select a release manager. The release manager (RM) is the coordinator for the
+   release process. It is the RM's signature (.asc) that is uploaded together
+   with the release. The RM generates a KEY (RSA 4096-bit) and uploads it to a
+   public key server. The RM needs to get the key endorsed (signed) by other
+   Apache users, to be connected to the web of trust. The RM should first ask
+   the mentor to help sign the key.
+   [How to generate the key](http://www.apache.org/dev/release-signing.html)?
+
+2. Check license. [FAQ](https://www.apache.org/legal/src-headers.html#faq-docs);
+   [SINGA Issue](https://issues.apache.org/jira/projects/SINGA/issues/SINGA-447)
+
+   - The codebase does not include third-party code which is not compatible with
+     APL;
+   - The dependencies are compatible with APL. GNU-like licenses are NOT
+     compatible;
+   - All source files written by us MUST include the Apache license header:
+     http://www.apache.org/legal/src-headers.html. There's a script in there
+     which helps propagate the header to all files.
+   - Update the LICENSE file. If we include any third-party code in the release
+     package which is not APL, we must state it at the end of the NOTICE file.
+
+3. Bump the version. Check code and documentation
+
+   - The build process is error-free.
+   - Unit tests are included (as much as possible)
+   - Conda packages run without errors.
+   - The online documentation on the Apache website is up to date.
+
+4. Prepare the RELEASE_NOTES file. Include the following items: Introduction,
+   Features, Bugs (link to JIRA or GitHub PR), Changes, Dependency list,
+   Incompatibility issues. Follow this
+   [example](http://commons.apache.org/proper/commons-digester/commons-digester-3.0/RELEASE-NOTES.txt).
+
+5. Prepare the DISCLAIMER file. Modify it from the
+   [template](http://incubator.apache.org/guides/branding.html#disclaimers).
+
+6. Package the release candidate. The release should be packaged into
+   apache-singa-VERSION.tar.gz. The release should not include any binary
+   files, including git files. Upload the release for
+   [staging](https://dist.apache.org/repos/dist/dev/VERSION/). The tar file,
+   signature, KEY and SHA256 checksum file should be included. MD5 is no longer
+   used. The policy is
+   [here](http://www.apache.org/dev/release-distribution#sigs-and-sums).
+
+   - apache-singa-VERSION.tar.gz
+   - KEY
+   - XX.asc
+   - .SHA256
+
+7. Call for vote by sending an email
+
+   ```
+   To: dev@singa.apache.org
+   Subject: [VOTE] Release apache-singa-X.Y.Z (release candidate N)
+
+   Hi all,
+
+   I have created a build for Apache SINGA X.Y.Z, release candidate N.
+   The artifacts to be voted on are located here:  xxxx
+   The hashes of the artifacts are as follows: xxx
+   Release artifacts are signed with the following key: xxx
+   Please vote on releasing this package. The vote is open for at least
+   72 hours and passes if a majority of at least three +1 votes are cast.
+
+   [ ] +1 Release this package as Apache SINGA X.Y.Z
+   [ ] 0 I don't feel strongly about it, but I'm okay with the release
+   [ ] -1 Do not release this package because...
+
+   Here is my vote:
+   +1
+   ```
+
+8. Wait at least 48 hours for test responses. Any PMC member, committer or
+   contributor can test the features of the release and give feedback. Everyone
+   should check these before voting +1. If the vote passes, then send the
+   result email. Otherwise, repeat from the beginning.
+
+   ```
+   To: dev@singa.apache.org
+   Subject: [RESULT] [VOTE] Release apache-singa-X.Y.Z (release candidate N)
+
+   Thanks to everyone who has voted and given their comments.
+   The tally is as follows.
+
+   N binding +1s:
+   <names>
+
+   N non-binding +1s:
+   <names>
+
+   No 0s or -1s.
+
+   I am delighted to announce that the proposal to release
+   Apache SINGA X.Y.Z has passed.
+   ```
+
+9. Upload the package for
+   [distribution](http://www.apache.org/dev/release-publishing.html#distribution)
+   to https://dist.apache.org/repos/dist/release/VERSION/.
+
+10. Update the Download page of the SINGA website. The tar.gz file MUST be
+    downloaded from a mirror, using the closer.cgi script; other artifacts MUST
+    be downloaded from the main Apache site. More details
+    [here](http://www.apache.org/dev/release-download-pages.html). Some feedback
+    we got during the previous releases: "Download pages must only link to
+    formal releases, so must not include links to GitHub.", "Links to KEYS, sigs
+    and hashes must not use dist.apache.org; instead use
+    https://www.apache.org/dist/singa/...;", "Also you only need one KEYS link,
+    and there should be a description of how to use KEYS + sig or hash to verify
+    the downloads."
+
+11. Remove the RC tag and compile the conda packages.
+
+12. Publish the release information.
+
+    ```
+    To: announce@apache.org, dev@singa.apache.org
+    Subject: [ANNOUNCE] Apache SINGA X.Y.Z released
+
+    We are pleased to announce that SINGA X.Y.Z is released.
+
+    SINGA is a general distributed deep learning platform
+    for training big deep learning models over large datasets.
+    The release is available at: http://singa.apache.org/downloads.html
+    The main features of this release include XXX
+    We look forward to hearing your feedback, suggestions,
+    and contributions to the project.
+
+    On behalf of the SINGA team, {SINGA Team Member Name}
+    ```
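
The SHA256 checksum file required in step 6 can be produced with a short script such as the sketch below. It uses only Python's standard `hashlib` and `os` modules. The helper names, the `.sha256` file suffix and the `<digest>  <filename>` line layout are assumptions for illustration; follow the linked Apache distribution policy for the authoritative naming.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that large tarballs do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(path):
    """Write `<path>.sha256` (assumed naming) and return the written line."""
    line = f"{sha256_of(path)}  {os.path.basename(path)}\n"
    with open(path + ".sha256", "w") as f:
        f.write(line)
    return line
```

A voter can run `sha256_of` on the downloaded tarball and compare the result with the published digest before casting a vote.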
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/install-win.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/install-win.md
new file mode 100644
index 0000000..72c7c53
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/install-win.md
@@ -0,0 +1,400 @@
+---
+id: version-3.0.0.rc1-install-win
+title: Build SINGA on Windows
+original_id: install-win
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+The process of building SINGA from source on Microsoft Windows has four parts:
+install dependencies, build SINGA source, (optionally) install the Python module
+and (optionally) run the unit tests.
+
+## Install Dependencies
+
+You may create a folder for building the dependencies.
+
+The dependencies are:
+
+- Compiler and IDE
+  - Visual Studio. The community edition is free and can be used to build SINGA.
+    https://www.visualstudio.com/
+- CMake
+  - Can be downloaded from http://cmake.org/
+  - Make sure the path to cmake executable is in the system path, or use full
+    path when calling cmake.
+- SWIG
+
+  - Can be downloaded from http://swig.org/
+  - Make sure the path to swig executable is in the system path, or use full
+    path when calling swig. Use a recent version such as 3.0.12.
+
+- Protocol Buffers
+  - Download a suitable version such as 2.6.1:
+    https://github.com/google/protobuf/releases/tag/v2.6.1 .
+  - Download both protobuf-2.6.1.zip and protoc-2.6.1-win32.zip .
+  - Extract both of them in dependencies folder. Add the path to protoc
+    executable to the system path, or use full path when calling it.
+  - Open the Visual Studio solution which can be found in the vsprojects folder.
+  - Change the build settings to Release and x64.
+  - Build the libprotobuf project.
+- OpenBLAS
+
+  - Download a suitable source version such as 0.2.20 from
+    http://www.openblas.net
+  - Extract the source in the dependencies folder.
+  - If you don't have Perl installed, download a perl environment such as
+    Strawberry Perl (http://strawberryperl.com/)
+  - Build the Visual Studio solution by running this command in the source
+    folder:
+
+  ```bash
+  cmake -G "Visual Studio 15 2017 Win64"
+  ```
+
+  - Open the Visual Studio solution and change the build settings to Release and
+    x64.
+  - Build libopenblas project
+
+- Google glog
+  - Download a suitable version such as 0.3.5 from
+    https://github.com/google/glog/releases
+  - Extract the source in the dependencies folder.
+  - Open the Visual Studio solution.
+  - Change the build settings to Release and x64.
+  - Build libglog project
+
+## Build SINGA source
+
+- Download SINGA source code
+- Compile the protobuf files:
+
+  - Go to the src/proto folder
+
+  ```shell
+  mkdir python_out
+  protoc.exe *.proto --python_out python_out
+  ```
+
+- Generate swig interfaces for C++ and Python: go to src/api
+
+  ```shell
+  swig -python -c++ singa.i
+  ```
+
+- Generate the Visual Studio solution for SINGA: go to the SINGA source code root folder
+
+  ```shell
+  mkdir build
+  cd build
+  ```
+
+- Call cmake and add the paths in your system similar to the following example:
+
+  ```shell
+  cmake -G "Visual Studio 15 2017 Win64" ^
+    -DGLOG_INCLUDE_DIR="D:/WinSinga/dependencies/glog-0.3.5/src/windows" ^
+    -DGLOG_LIBRARIES="D:/WinSinga/dependencies/glog-0.3.5/x64/Release" ^
+    -DCBLAS_INCLUDE_DIR="D:/WinSinga/dependencies/openblas-0.2.20/lapack-netlib/CBLAS/include" ^
+    -DCBLAS_LIBRARIES="D:/WinSinga/dependencies/openblas-0.2.20/lib/RELEASE" ^
+    -DProtobuf_INCLUDE_DIR="D:/WinSinga/dependencies/protobuf-2.6.1/src" ^
+    -DProtobuf_LIBRARIES="D:/WinSinga/dependencies/protobuf-2.6.1/vsprojects/x64/Release" ^
+    -DProtobuf_PROTOC_EXECUTABLE="D:/WinSinga/dependencies/protoc-2.6.1-win32/protoc.exe" ^
+    ..
+  ```
+
+- Open the generated solution in Visual Studio
+- Change the build settings to Release and x64
+- Add the singa_wrap.cxx file from src/api to the singa_objects project
+- In the singa_objects project, open Additional Include Directories.
+- Add Python include path
+- Add numpy include path
+- Add protobuf include path
+- In the preprocessor definitions of the singa_objects project, add USE_GLOG
+- Build singa_objects project
+
+- In singa project:
+
+  - Add singa_wrap.obj to Object Libraries
+  - Change target name to \_singa_wrap
+  - Change target extension to .pyd
+  - Change configuration type to Dynamic Library (.dll)
+  - Go to Additional Library Directories and add the path to python, openblas,
+    protobuf and glog libraries
+  - Go to Additional Dependencies and add libopenblas.lib, libglog.lib and
+    libprotobuf.lib
+
+- Build the singa project
+
+## Install Python module
+
+- Change `_singa_wrap.so` to `_singa_wrap.pyd` in build/python/setup.py
+- Copy the files in `src/proto/python_out` to `build/python/singa/proto`
+
+- Optionally create and activate a virtual environment:
+
+  ```shell
+  mkdir SingaEnv
+  virtualenv SingaEnv
+  SingaEnv\Scripts\activate
+  ```
+
+- Go to the build/python folder and run:
+
+  ```shell
+  python setup.py install
+  ```
+
+- Make \_singa_wrap.pyd, libglog.dll and libopenblas.dll available by adding
+  them to the path or by copying them to singa package folder in the python
+  site-packages
+
+- Verify that SINGA is installed by running:
+
+  ```shell
+  python -c "from singa import tensor"
+  ```
+
+A video tutorial for the build process can be found here:
+
+[![youtube video](https://img.youtube.com/vi/cteER7WeiGk/0.jpg)](https://www.youtube.com/watch?v=cteER7WeiGk)
+
+## Run Unit Tests
+
+- In the test folder, generate the Visual Studio solution:
+
+  ```shell
+  cmake -G "Visual Studio 15 2017 Win64"
+  ```
+
+- Open the generated solution in Visual Studio.
+
+- Change the build settings to Release and x64.
+
+- Build glog project.
+
+- In test_singa project:
+
+  - Add USE_GLOG to the Preprocessor Definitions.
+  - In Additional Include Directories, add path of GLOG_INCLUDE_DIR,
+    CBLAS_INCLUDE_DIR and Protobuf_INCLUDE_DIR which were used in step 2 above.
+    Add also build and build/include folders.
+  - Goto Additional Library Directories and add the path to openblas, protobuf
+    and glog libraries. Add also build/src/singa_objects.dir/Release.
+  - Goto Additional Dependencies and add libopenblas.lib, libglog.lib and
+    libprotobuf.lib. Fix the names of the two libraries: gtest.lib and
+    singa_objects.lib.
+
+- Build test_singa project.
+
+- Make libglog.dll and libopenblas.dll available by adding them to the path or
+  by copying them to test/release folder
+
+- The unit tests can be executed
+
+  - From the command line:
+
+  ```shell
+  test_singa.exe
+  ```
+
+  - From Visual Studio:
+    - right click on the test_singa project and choose 'Set as StartUp Project'.
+    - from the Debug menu, choose 'Start Without Debugging'
+
+A video tutorial for running the unit tests can be found here:
+
+[![youtube video](https://img.youtube.com/vi/393gPtzMN1k/0.jpg)](https://www.youtube.com/watch?v=393gPtzMN1k)
+
+## Build GPU support with CUDA
+
+In this section, we will extend the previous steps to enable GPU.
+
+### Install Dependencies
+
+In addition to the dependencies in section 1 above, we will need the following:
+
+- CUDA
+
+  Download a suitable version such as 9.1 from
+  https://developer.nvidia.com/cuda-downloads . Make sure to install the Visual
+  Studio integration module.
+
+- cuDNN
+
+  Download a suitable version such as 7.1 from
+  https://developer.nvidia.com/cudnn
+
+- cnmem:
+
+  - Download the latest version from https://github.com/NVIDIA/cnmem
+  - Build the Visual Studio solution:
+
+  ```shell
+  cmake -G "Visual Studio 15 2017 Win64"
+  ```
+
+  - Open the generated solution in Visual Studio.
+  - Change the build settings to Release and x64.
+  - Build the cnmem project.
+
+### Build SINGA source
+
+- Call cmake and add the paths in your system similar to the following example:
+  ```shell
+  cmake -G "Visual Studio 15 2017 Win64" ^
+    -DGLOG_INCLUDE_DIR="D:/WinSinga/dependencies/glog-0.3.5/src/windows" ^
+    -DGLOG_LIBRARIES="D:/WinSinga/dependencies/glog-0.3.5/x64/Release" ^
+    -DCBLAS_INCLUDE_DIR="D:/WinSinga/dependencies/openblas-0.2.20/lapack-netlib/CBLAS/include" ^
+    -DCBLAS_LIBRARIES="D:/WinSinga/dependencies/openblas-0.2.20/lib/RELEASE" ^
+    -DProtobuf_INCLUDE_DIR="D:/WinSinga/dependencies/protobuf-2.6.1/src" ^
+    -DProtobuf_LIBRARIES="D:/WinSinga/dependencies/protobuf-2.6.1/vsprojects/x64/Release" ^
+    -DProtobuf_PROTOC_EXECUTABLE="D:/WinSinga/dependencies/protoc-2.6.1-win32/protoc.exe" ^
+    -DCUDNN_INCLUDE_DIR=D:\WinSinga\dependencies\cudnn-9.1-windows10-x64-v7.1\cuda\include ^
+    -DCUDNN_LIBRARIES=D:\WinSinga\dependencies\cudnn-9.1-windows10-x64-v7.1\cuda\lib\x64 ^
+    -DSWIG_DIR=D:\WinSinga\dependencies\swigwin-3.0.12 ^
+    -DSWIG_EXECUTABLE=D:\WinSinga\dependencies\swigwin-3.0.12\swig.exe ^
+    -DUSE_CUDA=YES ^
+    -DCUDNN_VERSION=7 ^
+    ..
+  ```
+
+* Generate swig interfaces for C++ and Python: Goto src/api
+
+  ```shell
+  swig -python -c++ singa.i
+  ```
+
+* Open the generated solution in Visual Studio
+
+* Change the build settings to Release and x64
+
+#### Building singa_objects
+
+- Add the singa_wrap.cxx file from src/api to the singa_objects project
+- In the singa_objects project, open Additional Include Directories.
+- Add Python include path
+- Add numpy include path
+- Add protobuf include path
+- Add include path for CUDA, cuDNN and cnmem
+- In the preprocessor definitions of the singa_objects project, add USE_GLOG,
+  USE_CUDA and USE_CUDNN. Remove DISABLE_WARNINGS.
+- Build singa_objects project
+
+#### Building singa-kernel
+
+- Create a new Visual Studio project of type "CUDA 9.1 Runtime". Give it a name
+  such as singa-kernel.
+- The project comes with an initial file called kernel.cu. Remove this file from
+  the project.
+- Add this file: src/core/tensor/math_kernel.cu
+- In the project settings:
+
+  - Set Platform Toolset to "Visual Studio 2015 (v140)"
+  - Set Configuration Type to "Static Library (.lib)"
+  - In the Include Directories, add build/include.
+
+- Build singa-kernel project
+
+#### Building singa
+
+- In singa project:
+
+  - Add singa_wrap.obj to Object Libraries
+  - Change target name to \_singa_wrap
+  - Change target extension to .pyd
+  - Change configuration type to Dynamic Library (.dll)
+  - Go to Additional Library Directories and add the path to python, openblas,
+    protobuf and glog libraries
+  - Also add the library path to singa-kernel, cnmem, cuda and cudnn.
+  - Go to Additional Dependencies and add libopenblas.lib, libglog.lib and
+    libprotobuf.lib.
+  - Also add: singa-kernel.lib, cnmem.lib, cudnn.lib, cuda.lib, cublas.lib,
+    curand.lib and cudart.lib.
+
+- Build the singa project
+
+### Install Python module
+
+- Change \_singa_wrap.so to \_singa_wrap.pyd in build/python/setup.py
+- Copy the files in src/proto/python_out to build/python/singa/proto
+
+- Optionally create and activate a virtual environment:
+
+  ```shell
+  mkdir SingaEnv
+  virtualenv SingaEnv
+  SingaEnv\Scripts\activate
+  ```
+
+- Go to the build/python folder and run:
+
+  ```shell
+  python setup.py install
+  ```
+
+- Make \_singa_wrap.pyd, libglog.dll, libopenblas.dll, cnmem.dll, CUDA Runtime
+  (e.g. cudart64_91.dll) and cuDNN (e.g. cudnn64_7.dll) available by adding them
+  to the path or by copying them to singa package folder in the python
+  site-packages
+
+- Verify that SINGA is installed by running:
+
+  ```shell
+  python -c "from singa import device; dev = device.create_cuda_gpu()"
+  ```
+
+A video tutorial for this part can be found here:
+
+[![youtube video](https://img.youtube.com/vi/YasKVjRtuDs/0.jpg)](https://www.youtube.com/watch?v=YasKVjRtuDs)
+
+### Run Unit Tests
+
+- In the test folder, generate the Visual Studio solution:
+
+  ```shell
+  cmake -G "Visual Studio 15 2017 Win64"
+  ```
+
+- Open the generated solution in Visual Studio, or add the project to the singa
+  solution that was created in step 5.2
+
+- Change the build settings to Release and x64.
+
+- Build glog project.
+
+- In test_singa project:
+
+  - Add USE_GLOG; USE_CUDA; USE_CUDNN to the Preprocessor Definitions.
+  - In Additional Include Directories, add path of GLOG_INCLUDE_DIR,
+    CBLAS_INCLUDE_DIR and Protobuf_INCLUDE_DIR which were used in step 5.2
+    above. Add also build, build/include, CUDA and cuDNN include folders.
+  - Goto Additional Library Directories and add the path to openblas, protobuf
+    and glog libraries. Add also build/src/singa_objects.dir/Release,
+    singa-kernel, cnmem, CUDA and cuDNN library paths.
+  - Goto Additional Dependencies and add libopenblas.lib; libglog.lib;
+    libprotobuf.lib; cnmem.lib; cudnn.lib; cuda.lib; cublas.lib; curand.lib;
+    cudart.lib; singa-kernel.lib. Fix the names of the two libraries: gtest.lib
+    and singa_objects.lib.
+
+* Build test_singa project.
+
+* Make libglog.dll, libopenblas.dll, cnmem.dll, cudart64_91.dll and
+  cudnn64_7.dll available by adding them to the path or by copying them to
+  test/release folder
+
+* The unit tests can be executed
+
+  - From the command line:
+
+    ```shell
+    test_singa.exe
+    ```
+
+  - From Visual Studio:
+    - right click on the test_singa project and choose 'Set as StartUp Project'.
+    - from the Debug menu, choose 'Start Without Debugging'
+
+A video tutorial for running the unit tests can be found here:
+
+[![youtube video](https://img.youtube.com/vi/YOjwtrvTPn4/0.jpg)](https://www.youtube.com/watch?v=YOjwtrvTPn4)
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/installation.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/installation.md
new file mode 100644
index 0000000..710ebd3
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/installation.md
@@ -0,0 +1,144 @@
+---
+id: version-3.0.0.rc1-installation
+title: Installation
+original_id: installation
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+## From Conda
+
+Conda is a package manager for Python, C++ and other packages.
+
+Currently, SINGA has conda packages for Linux and MacOSX.
+[Miniconda3](https://conda.io/miniconda.html) is recommended for use with SINGA.
+After installing miniconda, execute one of the following commands to install
+SINGA.
+
+1. CPU only
+   [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ntkhi-Z6XTR8WYPXiLwujHd2dOm0772V)
+
+```shell
+$ conda install -c nusdbsystem -c conda-forge singa-cpu
+```
+
+2. GPU with CUDA and cuDNN (CUDA driver >=384.81 is required)
+   [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1do_TLJe18IthLOnBOsHCEe-FFPGk1sPJ)
+
+```shell
+$ conda install -c nusdbsystem -c conda-forge singa-gpu
+```
+
+3. Install a specific version of SINGA. The following command lists all the
+   available SINGA packages.
+
+```shell
+$ conda search -c nusdbsystem singa
+
+Loading channels: done
+# Name                       Version           Build  Channel
+singa                      2.1.0.dev        cpu_py36  nusdbsystem
+singa                      2.1.0.dev        cpu_py37  nusdbsystem
+```
+
+<!--- > Please note that using the nightly built images is not recommended except for SINGA development and testing. Using stable releases is recommended. -->
+
+The following command installs a specific version of SINGA:
+
+```shell
+$ conda install -c nusdbsystem -c conda-forge singa=X.Y.Z.dev=cpu_py37
+```
+
+If there is no error message from
+
+```shell
+$ python -c "from singa import tensor"
+```
+
+then SINGA is installed successfully.
+
+## Using Docker
+
+Install Docker on your local host machine following the
+[instructions](https://docs.docker.com/install/). Add your user into the
+[docker group](https://docs.docker.com/install/linux/linux-postinstall/) to run
+docker commands without `sudo`.
+
+1. CPU-only.
+
+```shell
+$ docker run -it apache/singa:X.Y.Z-cpu-ubuntu16.04 /bin/bash
+```
+
+2. With GPU enabled. Install
+   [Nvidia-Docker](https://github.com/NVIDIA/nvidia-docker) after installing
+   Docker.
+
+```shell
+$ nvidia-docker run -it apache/singa:X.Y.Z-cuda9.0-cudnn7.4.2-ubuntu16.04 /bin/bash
+```
+
+3. For the complete list of SINGA Docker images (tags), visit the
+   [docker hub site](https://hub.docker.com/r/apache/singa/). For each docker
+   image, the tag is named as
+
+```shell
+version-(cpu|gpu)[-devel]
+```
+
+| Tag       | Description                      | Example value                                                                                                                                                             |
+| --------- | -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `version` | SINGA version                    | '2.0.0-rc0', '2.0.0', '1.2.0'                                                                                                                                             |
+| `cpu`     | the image cannot run on GPUs     | 'cpu'                                                                                                                                                                     |
+| `gpu`     | the image can run on Nvidia GPUs | 'gpu', or 'cudax.x-cudnnx.x' e.g., 'cuda10.0-cudnn7.3'                                                                                                                    |
+| `devel`   | indicator for development        | if absent, SINGA Python package is installed for runtime only; if present, the building environment is also created, you can recompile SINGA from source at '/root/singa' |
+| `OS`      | indicate OS version number       | 'ubuntu16.04', 'ubuntu18.04'                                                                                                                                              |
+
+## From source
+
+You can [build and install SINGA](build.md) from the source code using native
+building tools or conda-build, on local host OS or in a Docker container.
+
+## FAQ
+
+- Q: Error from `from singa import tensor`
+
+  A: Check the detailed error from
+
+  ```shell
+  python -c "from singa import _singa_wrap"
+  # go to the folder of _singa_wrap.so
+  ldd <path to _singa_wrap.so>
+  python
+  >>> import importlib
+  >>> importlib.import_module('_singa_wrap')
+  ```
+
+  The folder of `_singa_wrap.so` is typically something like
+  `~/miniconda3/lib/python3.7/site-packages/singa`. Normally, the error is
+  caused by mismatched or missing dependent libraries, e.g. cuDNN or
+  protobuf. The solution is to create a new virtual environment and install
+  SINGA in that environment, e.g.,
+
+  ```shell
+  conda create -n singa
+  conda activate singa
+  conda install -c nusdbsystem -c conda-forge singa-cpu
+  ```
+
+- Q: When using a virtual environment, every time I install SINGA, numpy is
+  reinstalled. However, that numpy is not the one used when I run
+  `import numpy`.
+
+  A: It could be caused by the `PYTHONPATH` environment variable, which should
+  be empty when you are using a virtual environment to avoid conflicts with the
+  paths of the virtual environment.
+
+- Q: When I run SINGA on Mac OS X, I get the error "Fatal Python error:
+  PyThreadState_Get: no current thread Abort trap: 6"
+
+  A: This error typically happens when you have multiple versions of Python on
+  your system, e.g., the one that comes with the OS and the one installed by
+  Homebrew. The Python linked by SINGA must be the same as the Python
+  interpreter. You can check your interpreter with `which python` and check the
+  Python linked by SINGA via `otool -L <path to _singa_wrap.so>`. This problem
+  should be resolved if SINGA is installed via conda.
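+
+For the environment-related questions above, the following is a minimal sketch
+(not part of SINGA) that reports which interpreter is running and whether
+`PYTHONPATH` is set. The function name `describe_python_environment` is
+hypothetical; the code uses only the Python standard library.
+
+```python
+import os
+import sys
+
+
+def describe_python_environment():
+    """Collect the facts that matter when two Python installations conflict."""
+    raw = os.environ.get("PYTHONPATH", "")
+    return {
+        # absolute path of the interpreter that is executing this script
+        "interpreter": sys.executable,
+        # root of the installation or virtual environment in use
+        "prefix": sys.prefix,
+        # extra search paths injected through the environment variable
+        "pythonpath": [p for p in raw.split(os.pathsep) if p],
+    }
+
+
+if __name__ == "__main__":
+    for key, value in describe_python_environment().items():
+        print(key, "=", value)
+```
+
+If `pythonpath` is not empty inside a virtual environment, the listed
+directories may shadow the packages installed in that environment.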
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/issue-tracking.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/issue-tracking.md
new file mode 100644
index 0000000..57b3883
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/issue-tracking.md
@@ -0,0 +1,12 @@
+---
+id: version-3.0.0.rc1-issue-tracking
+title: Issue Tracking
+original_id: issue-tracking
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+SINGA used [JIRA](https://issues.apache.org/jira/browse/singa) to manage issues
+including bugs, new features and discussions.
+
+We are now moving to [GitHub Issues](https://github.com/apache/singa/issues).
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/mail-lists.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/mail-lists.md
new file mode 100644
index 0000000..900df27
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/mail-lists.md
@@ -0,0 +1,16 @@
+---
+id: version-3.0.0.rc1-mail-lists
+title: Project Mailing Lists
+original_id: mail-lists
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+These are the mailing lists that have been established for this project. For
+each list, there are subscribe, unsubscribe, and archive links.
+
+| Name        | Post                                 | Subscribe                                                        | Unsubscribe                                                          | Archive                                                                             |
+| ----------- | ------------------------------------ | ---------------------------------------------------------------- | -------------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
+| Development | <de...@singa.incubator.apache.org>     | [Subscribe](mailto:dev-subscribe@singa.incubator.apache.org)     | [Unsubscribe](mailto:dev-unsubscribe@singa.incubator.apache.org)     | [mail-archives.apache.org](http://mail-archives.apache.org/mod_mbox/singa-dev/)     |
+| Commits     | <co...@singa.incubator.apache.org> | [Subscribe](mailto:commits-subscribe@singa.incubator.apache.org) | [Unsubscribe](mailto:commits-unsubscribe@singa.incubator.apache.org) | [mail-archives.apache.org](http://mail-archives.apache.org/mod_mbox/singa-commits/) |
+| Security    | <se...@singa.apache.org>          | private                                                          | private                                                              | private                                                                             |
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/onnx.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/onnx.md
new file mode 100644
index 0000000..8533693
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/onnx.md
@@ -0,0 +1,762 @@
+---
+id: version-3.0.0.rc1-onnx
+title: ONNX
+original_id: onnx
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+ONNX is an open format for representing machine learning models, which makes it
+possible to transfer trained models between different deep learning
+frameworks. We have integrated the main functionality of ONNX into SINGA, and
+several basic operators are supported. More operators are being
+developed.
+
+The
+[ONNX version](https://github.com/onnx/onnx/blob/master/docs/Versioning.md)
+supported by SINGA is:
+
+| ONNX version | File format version | Opset version ai.onnx | Opset version ai.onnx.ml | Opset version ai.onnx.training |
+| ------------ | ------------------- | --------------------- | ------------------------ | ------------------------------ |
+| 1.6.0        | 6                   | 11                    | 2                        | -                              |
+
+## General usage
+
+SINGA's ONNX support covers the basic functionality. The following tutorials
+show the general usage:
+
+### Loading an ONNX Model into SINGA
+
+This part introduces how to import an ONNX model and prepare a SINGA model from
+it. After you load an ONNX model with `onnx.load`, you need to update the
+model's batch size, since most models use a placeholder to represent the batch
+size. The `update_batch_size` function below is an example. You only need to
+update the batch size of the input and output; the shapes of the inner tensors
+will be inferred automatically.
+
+Then, you can prepare the SINGA model by using `sonnx.prepare`. This function
+iterates over and translates all the nodes in the ONNX model's graph into SINGA
+operators, loads all stored weights and infers each intermediate tensor's shape.
+For the device used, please refer to the `device` section.
+
+```python3
+import onnx
+from singa import device
+from singa import sonnx
+
+def update_batch_size(onnx_model, batch_size):
+    model_input = onnx_model.graph.input[0]
+    model_input.type.tensor_type.shape.dim[0].dim_value = batch_size
+    model_output = onnx_model.graph.output[0]
+    model_output.type.tensor_type.shape.dim[0].dim_value = batch_size
+    return onnx_model
+
+
+model_path = "PATH/To/ONNX/MODEL"
+onnx_model = onnx.load(model_path)
+
+# set batch size
+onnx_model = update_batch_size(onnx_model, 1)
+
+# prepare the model
+dev = device.create_cuda_gpu()
+sg_ir = sonnx.prepare(onnx_model, device=dev)
+```
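+
+The `update_batch_size` function above only touches the first input and the
+first output. For a model with several inputs or outputs, a variant along the
+following lines may help. It is a sketch, not part of SINGA: the helper name
+`update_all_batch_sizes` and the tiny `Identity` model are made up for
+illustration, and it assumes that the first dimension of every non-weight input
+and output is the batch dimension.
+
+```python
+from onnx import TensorProto, helper
+
+
+def update_all_batch_sizes(onnx_model, batch_size):
+    # Older ONNX files may also list the weights as graph inputs; skip them.
+    weight_names = {init.name for init in onnx_model.graph.initializer}
+    graph = onnx_model.graph
+    for value_info in list(graph.input) + list(graph.output):
+        if value_info.name in weight_names:
+            continue
+        dims = value_info.type.tensor_type.shape.dim
+        if len(dims) > 0:
+            # replaces a symbolic placeholder such as "N" with a fixed value
+            dims[0].dim_value = batch_size
+    return onnx_model
+
+
+# build a tiny model whose batch dimension is the placeholder "N"
+x = helper.make_tensor_value_info("x", TensorProto.FLOAT, ["N", 3])
+y = helper.make_tensor_value_info("y", TensorProto.FLOAT, ["N", 3])
+node = helper.make_node("Identity", ["x"], ["y"])
+tiny_model = helper.make_model(helper.make_graph([node], "tiny", [x], [y]))
+
+tiny_model = update_all_batch_sizes(tiny_model, 4)
+print(tiny_model.graph.input[0].type.tensor_type.shape.dim[0].dim_value)
+```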
+
+### Inference with a SINGA model
+
+After you load and prepare a SINGA model, you can run inference by calling
+`sg_ir.run` as in the following code. The input and output must be SINGA
+`Tensor` instances. Since the SINGA model returns its outputs as a list, if
+there is only one output you just take the first element, as the `forward`
+method of the `Infer` class does.
+
+```python3
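+from singa import tensor
+
+
+class Infer:
+
+    def __init__(self, sg_ir):
+        self.sg_ir = sg_ir
+
+    def forward(self, x):
+        # run returns a list of outputs; this model has a single output
+        return self.sg_ir.run([x])[0]
+
+
+# get_dataset is a placeholder for your own data loading code
+data = get_dataset()
+x = tensor.Tensor(device=dev, data=data)
+
+model = Infer(sg_ir)
+y = model.forward(x)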
+```
+
+### Saving an ONNX Model from SINGA
+
+Given a SINGA model, you can export it as an ONNX model as follows, where `x`
+and `y` are the input and output tensors of a forward pass:
+
+```python3
+sonnx.to_onnx([x], [y])
+```
+
+### Re-training an ONNX model
+
+You can also re-train an ONNX model after you load it into SINGA, as in the
+following code. Please note that you should set `tens.requires_grad = True` and
+`tens.stores_grad = True` on all tensors of the SINGA model so that they can
+store gradients.
+
+```python3
+class Infer:
+
+    def __init__(self, sg_ir):
+        self.sg_ir = sg_ir
+        for idx, tens in sg_ir.tensor_map.items():
+            # allow the tensors to be updated
+            tens.requires_grad = True
+            tens.stores_grad = True
+
+    def forward(self, x):
+        return self.sg_ir.run([x])[0]
+
+# enable training mode so that the autograd operators update their gradients
+autograd.training = True
+model = Infer(sg_ir)
+
+# then train the model as usual
+```
+
+### Transfer-learning an ONNX model
+
+You can also append some layers to the end of an ONNX model to do transfer
+learning, as follows. The `last_layers` argument cuts the ONNX model to the
+layers in [0, last_layers]. You can then append more layers as in a normal
+SINGA model.
+
+```python3
+class Trans:
+
+    def __init__(self, sg_ir, last_layers):
+        self.sg_ir = sg_ir
+        self.last_layers = last_layers
+        self.append_linear1 = autograd.Linear(500, 128, bias=False)
+        self.append_linear2 = autograd.Linear(128, 32, bias=False)
+        self.append_linear3 = autograd.Linear(32, 10, bias=False)
+
+    def forward(self, x):
+        y = self.sg_ir.run([x], last_layers=self.last_layers)[0]
+        y = self.append_linear1(y)
+        y = autograd.relu(y)
+        y = self.append_linear2(y)
+        y = autograd.relu(y)
+        y = self.append_linear3(y)
+        y = autograd.relu(y)
+        return y
+
+# enable training mode so that the autograd operators update their gradients
+autograd.training = True
+model = Trans(sg_ir, -1)
+
+# then train the model as usual
+```
+
+## Example: ONNX mnist on SINGA
+
+This part introduces the usage of SINGA ONNX with the MNIST example. This
+section shows how to export, load, run inference on, re-train, and
+transfer-learn the MNIST model. You can try this part
+[here](https://colab.research.google.com/drive/1-YOfQqqw3HNhS8WpB8xjDQYutRdUdmCq).
+
+### Load dataset
+
+First, import the necessary libraries and define some auxiliary
+functions for downloading and preprocessing the dataset:
+
+```python
+import os
+import urllib.request
+import gzip
+import numpy as np
+import codecs
+
+from singa import device
+from singa import tensor
+from singa import opt
+from singa import autograd
+from singa import sonnx
+import onnx
+
+
+def load_dataset():
+    train_x_url = 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz'
+    train_y_url = 'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz'
+    valid_x_url = 'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz'
+    valid_y_url = 'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
+    train_x = read_image_file(check_exist_or_download(train_x_url)).astype(
+        np.float32)
+    train_y = read_label_file(check_exist_or_download(train_y_url)).astype(
+        np.float32)
+    valid_x = read_image_file(check_exist_or_download(valid_x_url)).astype(
+        np.float32)
+    valid_y = read_label_file(check_exist_or_download(valid_y_url)).astype(
+        np.float32)
+    return train_x, train_y, valid_x, valid_y
+
+
+def check_exist_or_download(url):
+
+    download_dir = '/tmp/'
+
+    name = url.rsplit('/', 1)[-1]
+    filename = os.path.join(download_dir, name)
+    if not os.path.isfile(filename):
+        print("Downloading %s" % url)
+        urllib.request.urlretrieve(url, filename)
+    return filename
+
+
+def read_label_file(path):
+    with gzip.open(path, 'rb') as f:
+        data = f.read()
+        assert get_int(data[:4]) == 2049
+        length = get_int(data[4:8])
+        parsed = np.frombuffer(data, dtype=np.uint8, offset=8).reshape(
+            (length))
+        return parsed
+
+
+def get_int(b):
+    return int(codecs.encode(b, 'hex'), 16)
+
+
+def read_image_file(path):
+    with gzip.open(path, 'rb') as f:
+        data = f.read()
+        assert get_int(data[:4]) == 2051
+        length = get_int(data[4:8])
+        num_rows = get_int(data[8:12])
+        num_cols = get_int(data[12:16])
+        parsed = np.frombuffer(data, dtype=np.uint8, offset=16).reshape(
+            (length, 1, num_rows, num_cols))
+        return parsed
+
+
+def to_categorical(y, num_classes):
+    y = np.array(y, dtype="int")
+    n = y.shape[0]
+    categorical = np.zeros((n, num_classes))
+    categorical[np.arange(n), y] = 1
+    categorical = categorical.astype(np.float32)
+    return categorical
+```
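+
+The magic numbers 2049 and 2051 checked above come from the IDX file format
+used by MNIST: the fourth byte of the big-endian magic number is the number of
+dimensions, and each dimension size follows as a 4-byte big-endian integer. The
+sketch below illustrates this with the standard library only; the function name
+`parse_idx_header` is hypothetical and is not used elsewhere in this tutorial.
+
+```python
+import struct
+
+
+def parse_idx_header(data):
+    """Return (magic, dims) decoded from the first bytes of an IDX file."""
+    # same value as get_int(data[:4]) in the code above
+    magic = int.from_bytes(data[:4], byteorder="big")
+    # the last byte of the magic number is the number of dimensions
+    num_dims = data[3]
+    dims = struct.unpack(">" + "I" * num_dims, data[4:4 + 4 * num_dims])
+    return magic, dims
+
+
+# a hand-made header shaped like the one of the MNIST training images file
+header = struct.pack(">IIII", 2051, 60000, 28, 28)
+print(parse_idx_header(header))  # (2051, (60000, 28, 28))
+```
+
+A label file has a single dimension, which is why its magic number is 2049
+(`0x0801`) instead of 2051 (`0x0803`).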
+
+### MNIST model
+
+Then you can define a class called **CNN** to construct the MNIST model, which
+consists of several convolution, pooling, fully connected and relu layers. You
+can also define a function to calculate the **accuracy** of the result. Finally,
+you can define a **train** and a **test** function to handle the training and
+prediction process.
+
+```python
+class CNN:
+    def __init__(self):
+        self.conv1 = autograd.Conv2d(1, 20, 5, padding=0)
+        self.conv2 = autograd.Conv2d(20, 50, 5, padding=0)
+        self.linear1 = autograd.Linear(4 * 4 * 50, 500, bias=False)
+        self.linear2 = autograd.Linear(500, 10, bias=False)
+        self.pooling1 = autograd.MaxPool2d(2, 2, padding=0)
+        self.pooling2 = autograd.MaxPool2d(2, 2, padding=0)
+
+    def forward(self, x):
+        y = self.conv1(x)
+        y = autograd.relu(y)
+        y = self.pooling1(y)
+        y = self.conv2(y)
+        y = autograd.relu(y)
+        y = self.pooling2(y)
+        y = autograd.flatten(y)
+        y = self.linear1(y)
+        y = autograd.relu(y)
+        y = self.linear2(y)
+        return y
+
+
+def accuracy(pred, target):
+    y = np.argmax(pred, axis=1)
+    t = np.argmax(target, axis=1)
+    a = y == t
+    return np.array(a, "int").sum() / float(len(t))
+
+
+def train(model,
+          x,
+          y,
+          epochs=1,
+          batch_size=64,
+          dev=device.get_default_device()):
+    batch_number = x.shape[0] // batch_size
+
+    for i in range(epochs):
+        for b in range(batch_number):
+            l_idx = b * batch_size
+            r_idx = (b + 1) * batch_size
+
+            x_batch = tensor.Tensor(device=dev, data=x[l_idx:r_idx])
+            target_batch = tensor.Tensor(device=dev, data=y[l_idx:r_idx])
+
+            output_batch = model.forward(x_batch)
+            # onnx_model = sonnx.to_onnx([x_batch], [y])
+            # print('The model is:\n{}'.format(onnx_model))
+
+            loss = autograd.softmax_cross_entropy(output_batch, target_batch)
+            accuracy_rate = accuracy(tensor.to_numpy(output_batch),
+                                     tensor.to_numpy(target_batch))
+
+            sgd = opt.SGD(lr=0.001)
+            for p, gp in autograd.backward(loss):
+                sgd.update(p, gp)
+            sgd.step()
+
+            if b % 1e2 == 0:
+                print("acc %6.2f, loss %6.2f" %
+                      (accuracy_rate, tensor.to_numpy(loss)[0]))
+    print("training completed")
+    return x_batch, output_batch
+
+def test(model, x, y, batch_size=64, dev=device.get_default_device()):
+    batch_number = x.shape[0] // batch_size
+
+    result = 0
+    for b in range(batch_number):
+        l_idx = b * batch_size
+        r_idx = (b + 1) * batch_size
+
+        x_batch = tensor.Tensor(device=dev, data=x[l_idx:r_idx])
+        target_batch = tensor.Tensor(device=dev, data=y[l_idx:r_idx])
+
+        output_batch = model.forward(x_batch)
+        result += accuracy(tensor.to_numpy(output_batch),
+                           tensor.to_numpy(target_batch))
+
+    print("testing acc %6.2f" % (result / batch_number))
+```
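+
+Both `train` and `test` compute `batch_number` with integer division, so the
+samples left over after the last full batch are never visited. The sketch below
+reproduces only the index arithmetic to make this visible; the generator name
+`batch_ranges` is hypothetical.
+
+```python
+def batch_ranges(num_samples, batch_size):
+    """Yield the (l_idx, r_idx) pairs used to slice the data into batches."""
+    for b in range(num_samples // batch_size):
+        yield b * batch_size, (b + 1) * batch_size
+
+
+ranges = list(batch_ranges(10, 4))
+print(ranges)              # [(0, 4), (4, 8)]
+print(10 - ranges[-1][1])  # 2 samples are left out
+```
+
+With the 60000 MNIST training images and the default batch size of 64, this
+gives 937 batches and leaves out the last 32 images of each epoch.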
+
+### Train mnist model and export it to onnx
+
+Now, you can train the MNIST model and export it to ONNX by calling the
+**sonnx.to_onnx** function.
+
+```python
+def make_onnx(x, y):
+    return sonnx.to_onnx([x], [y])
+
+# create device
+dev = device.create_cuda_gpu()
+#dev = device.get_default_device()
+# create model
+model = CNN()
+# load data
+train_x, train_y, valid_x, valid_y = load_dataset()
+# normalization
+train_x = train_x / 255
+valid_x = valid_x / 255
+train_y = to_categorical(train_y, 10)
+valid_y = to_categorical(valid_y, 10)
+# do training
+autograd.training = True
+x, y = train(model, train_x, train_y, dev=dev)
+onnx_model = make_onnx(x, y)
+# print('The model is:\n{}'.format(onnx_model))
+
+# Save the ONNX model
+model_path = os.path.join('/', 'tmp', 'mnist.onnx')
+onnx.save(onnx_model, model_path)
+print('The model is saved.')
+```
+
+### Inference
+
+After you export the ONNX model, you can find a file called **mnist.onnx** in
+the '/tmp' directory; this model can therefore be imported by other libraries.
+Now, if you want to import this ONNX model into SINGA again and run inference
+on the validation dataset, you can define a class called **Infer**. The
+forward function of Infer will be called by the test function to run inference
+on the validation dataset. Note that you should set `autograd.training` to
+**False** to stop the autograd operators from updating their gradients.
+
+When importing the ONNX model, first call **onnx.load** to load the ONNX
+model. Then feed the ONNX model into **sonnx.prepare** to parse it and
+initialize a SINGA model (**sg_ir** in the code). The sg_ir contains a SINGA
+graph, and you can run one step of inference by feeding input to its run
+function.
+
+```python
+class Infer:
+    def __init__(self, sg_ir):
+        self.sg_ir = sg_ir
+        for idx, tens in sg_ir.tensor_map.items():
+            # allow the tensors to be updated
+            tens.requires_grad = True
+            tens.stores_grad = True
+            sg_ir.tensor_map[idx] = tens
+
+    def forward(self, x):
+        # run one step of inference by feeding the input
+        return self.sg_ir.run([x])[0]
+
+# load the ONNX model
+onnx_model = onnx.load(model_path)
+sg_ir = sonnx.prepare(onnx_model, device=dev) # parse and initiate to a singa model
+
+# inference
+autograd.training = False
+print('The inference result is:')
+test(Infer(sg_ir), valid_x, valid_y, dev=dev)
+```
+
+### Re-training
+
+Suppose that, after importing the model, you want to re-train it. You can
+define a function called **re_train**. Before calling this re_train function,
+you should set `autograd.training` to **True** so that the autograd operators
+update their gradients. After the training finishes, set it to **False**
+again before calling the test function to run inference.
+
+```python
+def re_train(sg_ir,
+             x,
+             y,
+             epochs=1,
+             batch_size=64,
+             dev=device.get_default_device()):
+    batch_number = x.shape[0] // batch_size
+
+    new_model = Infer(sg_ir)
+
+    for i in range(epochs):
+        for b in range(batch_number):
+            l_idx = b * batch_size
+            r_idx = (b + 1) * batch_size
+
+            x_batch = tensor.Tensor(device=dev, data=x[l_idx:r_idx])
+            target_batch = tensor.Tensor(device=dev, data=y[l_idx:r_idx])
+
+            output_batch = new_model.forward(x_batch)
+
+            loss = autograd.softmax_cross_entropy(output_batch, target_batch)
+            accuracy_rate = accuracy(tensor.to_numpy(output_batch),
+                                     tensor.to_numpy(target_batch))
+
+            sgd = opt.SGD(lr=0.01)
+            for p, gp in autograd.backward(loss):
+                sgd.update(p, gp)
+            sgd.step()
+
+            if b % 1e2 == 0:
+                print("acc %6.2f, loss %6.2f" %
+                      (accuracy_rate, tensor.to_numpy(loss)[0]))
+    print("re-training completed")
+    return new_model
+
+# load the ONNX model
+onnx_model = onnx.load(model_path)
+sg_ir = sonnx.prepare(onnx_model, device=dev)
+
+# re-training
+autograd.training = True
+new_model = re_train(sg_ir, train_x, train_y, dev=dev)
+autograd.training = False
+test(new_model, valid_x, valid_y, dev=dev)
+```
+
+### Transfer learning
+
+Finally, if you want to do transfer learning, you can define a class called
+**Trans** to append some layers after the ONNX model. For demonstration, the
+code only appends several linear (fully connected) and relu layers after the
+ONNX model. You can define a transfer_learning function to handle the training
+process of the transfer-learning model. The `autograd.training` flag is set in
+the same way as in the previous section.
+
+```python
+class Trans:
+    def __init__(self, sg_ir, last_layers):
+        self.sg_ir = sg_ir
+        self.last_layers = last_layers
+        self.append_linear1 = autograd.Linear(500, 128, bias=False)
+        self.append_linear2 = autograd.Linear(128, 32, bias=False)
+        self.append_linear3 = autograd.Linear(32, 10, bias=False)
+
+    def forward(self, x):
+        y = self.sg_ir.run([x], last_layers=self.last_layers)[0]
+        y = self.append_linear1(y)
+        y = autograd.relu(y)
+        y = self.append_linear2(y)
+        y = autograd.relu(y)
+        y = self.append_linear3(y)
+        y = autograd.relu(y)
+        return y
+
+def transfer_learning(sg_ir,
+                      x,
+                      y,
+                      epochs=1,
+                      batch_size=64,
+                      dev=device.get_default_device()):
+    batch_number = x.shape[0] // batch_size
+
+    trans_model = Trans(sg_ir, -1)
+
+    for i in range(epochs):
+        for b in range(batch_number):
+            l_idx = b * batch_size
+            r_idx = (b + 1) * batch_size
+
+            x_batch = tensor.Tensor(device=dev, data=x[l_idx:r_idx])
+            target_batch = tensor.Tensor(device=dev, data=y[l_idx:r_idx])
+            output_batch = trans_model.forward(x_batch)
+
+            loss = autograd.softmax_cross_entropy(output_batch, target_batch)
+            accuracy_rate = accuracy(tensor.to_numpy(output_batch),
+                                     tensor.to_numpy(target_batch))
+
+            sgd = opt.SGD(lr=0.07)
+            for p, gp in autograd.backward(loss):
+                sgd.update(p, gp)
+            sgd.step()
+
+            if b % 1e2 == 0:
+                print("acc %6.2f, loss %6.2f" %
+                      (accuracy_rate, tensor.to_numpy(loss)[0]))
+    print("transfer-learning completed")
+    return trans_model
+
+# load the ONNX model
+onnx_model = onnx.load(model_path)
+sg_ir = sonnx.prepare(onnx_model, device=dev)
+
+# transfer-learning
+autograd.training = True
+new_model = transfer_learning(sg_ir, train_x, train_y, dev=dev)
+autograd.training = False
+test(new_model, valid_x, valid_y, dev=dev)
+```
+
+## ONNX model zoo
+
+The [ONNX Model Zoo](https://github.com/onnx/models) is a collection of
+pre-trained, state-of-the-art models in the ONNX format contributed by community
+members. SINGA now supports several CV and NLP models. More models are
+going to be supported soon.
+
+### Image Classification
+
+This collection of models takes images as input and classifies the major
+objects in the images into 1000 object categories such as keyboard, mouse,
+pencil, and many animals.
+
+| Model Class                                                                                    | Reference                                          | Description                                                                                                                                                                              | Link                                                                                                                                                    |
+| ---------------------------------------------------------------------------------------------- | -------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| <b>[MobileNet](https://github.com/onnx/models/tree/master/vision/classification/mobilenet)</b> | [Sandler et al.](https://arxiv.org/abs/1801.04381) | Light-weight deep neural network best suited for mobile and embedded vision applications. <br>Top-5 error from paper - ~10%                                                              | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1HsixqJMIpKyEPhkbB8jy7NwNEFEAUWAf) |
+| <b>[ResNet18](https://github.com/onnx/models/tree/master/vision/classification/resnet)</b>     | [He et al.](https://arxiv.org/abs/1512.03385)      | A CNN model (up to 152 layers). Uses shortcut connections to achieve higher accuracy when classifying images. <br> Top-5 error from paper - ~3.6%                                        | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1u1RYefSsVbiP4I-5wiBKHjsT9L0FxLm9) |
+| <b>[VGG16](https://github.com/onnx/models/tree/master/vision/classification/vgg)</b>           | [Simonyan et al.](https://arxiv.org/abs/1409.1556) | Deep CNN model (up to 19 layers). Similar to AlexNet but uses multiple smaller kernel-sized filters that provide more accuracy when classifying images. <br>Top-5 error from paper - ~8% | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/14kxgRKtbjPCKKsDJVNi3AvTev81Gp_Ds) |
+
+### Object Detection
+
+Object detection models detect the presence of multiple objects in an image and
+segment out areas of the image where the objects are detected.
+
+| Model Class                                                                                                       | Reference                                             | Description                                                                                                                        | Link                                                                                                                                                    |
+| ----------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| <b>[Tiny YOLOv2](https://github.com/onnx/models/tree/master/vision/object_detection_segmentation/tiny_yolov2)</b> | [Redmon et al.](https://arxiv.org/pdf/1612.08242.pdf) | A real-time CNN for object detection that detects 20 different classes. A smaller version of the more complex full YOLOv2 network. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/11V4I6cRjIJNUv5ZGsEGwqHuoQEie6b1T) |
+
+### Face Analysis
+
+Face detection models identify and/or recognize human faces and emotions in
+given images.
+
+| Model Class                                                                                               | Reference                                          | Description                                                                                                                         | Link                                                                                                                                                    |
+| --------------------------------------------------------------------------------------------------------- | -------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| <b>[ArcFace](https://github.com/onnx/models/tree/master/vision/body_analysis/arcface)</b>                 | [Deng et al.](https://arxiv.org/abs/1801.07698)    | A CNN based model for face recognition which learns discriminative features of faces and produces embeddings for input face images. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qanaqUKGIDtifdzEzJOHjEj4kYzA9uJC) |
+| <b>[Emotion FerPlus](https://github.com/onnx/models/tree/master/vision/body_analysis/emotion_ferplus)</b> | [Barsoum et al.](https://arxiv.org/abs/1608.01041) | Deep CNN for emotion recognition trained on images of faces.                                                                        | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1XHtBQGRhe58PDi4LGYJzYueWBeWbO23r) |
+
+### Machine Comprehension
+
+This subset of natural language processing models answers questions about a
+given context paragraph.
+
+| Model Class                                                                                           | Reference                                             | Description                                                                     | Link                                                                                                                                                    |
+| ----------------------------------------------------------------------------------------------------- | ----------------------------------------------------- | ------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| <b>[BERT-Squad](https://github.com/onnx/models/tree/master/text/machine_comprehension/bert-squad)</b> | [Devlin et al.](https://arxiv.org/pdf/1810.04805.pdf) | This model answers questions based on the context of the given input paragraph. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1kud-lUPjS_u-TkDAzihBTw0Vqr0FjCE-) |
+
+## Supported operators
+
+The following operators are supported:
+
+- Conv
+- Relu
+- Constant
+- MaxPool
+- AveragePool
+- Softmax
+- Sigmoid
+- Add
+- MatMul
+- BatchNormalization
+- Concat
+- Flatten
+- Gemm
+- Reshape
+- Sum
+- Cos
+- Cosh
+- Sin
+- Sinh
+- Tan
+- Tanh
+- Acos
+- Acosh
+- Asin
+- Asinh
+- Atan
+- Atanh
+- Selu
+- Elu
+- Equal
+- Less
+- Sign
+- Div
+- Sub
+- Sqrt
+- Log
+- Greater
+- HardSigmoid
+- Identity
+- Softplus
+- Softsign
+- Mean
+- Pow
+- Clip
+- PRelu
+- Mul
+- Transpose
+- Max
+- Min
+- Shape
+- And
+- Or
+- Xor
+- Not
+- Neg
+- Reciprocal
+- LeakyRelu
+- GlobalAveragePool
+- ConstantOfShape
+- Dropout
+- ReduceSum
+- ReduceMean
+- Squeeze
+- Unsqueeze
+- Slice
+- Ceil
+- Split
+- Gather
+- Tile
+- NonZero
+- Cast
+- OneHot
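+
One way to use the list above is to check a model's operator types against it before conversion. The following is a minimal sketch and not part of SINGA: `SUPPORTED_OPS` is an abbreviated copy of the list, and `unsupported_ops` is a hypothetical helper name.

```python
# Abbreviated copy of the supported operator list above (assumption: extend as needed).
SUPPORTED_OPS = {"Conv", "Relu", "MaxPool", "Gemm", "Reshape", "Softmax"}


def unsupported_ops(op_types):
    """Return the sorted operator types that are not in SUPPORTED_OPS."""
    return sorted(set(op_types) - SUPPORTED_OPS)


# Hypothetical operator types taken from some model.
print(unsupported_ops(["Conv", "Relu", "LSTM", "Gemm", "Resize"]))
# prints ['LSTM', 'Resize']
```

With a loaded ONNX model, the operator types could be gathered with `[node.op_type for node in model.graph.node]`.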
+
+### Special comments for ONNX backend
+
+- Conv, MaxPool and AveragePool
+
+  The input must have a 1d (`N*C*H`) or 2d (`N*C*H*W`) shape, and `dilation`
+  must be 1.
+
+- BatchNormalization
+
+  `epsilon` is 1e-05 and cannot be changed.
+
+- Cast
+
+  Only float32 and int32 are supported; other types are cast to one of these
+  two types.
+
+- Squeeze and Unsqueeze
+
+  If you encounter errors when you `Squeeze` or `Unsqueeze` between `Tensor` and
+  Scalar, please report them to us.
+
+- Empty tensor
+
+  An empty tensor is illegal in SINGA.
+
+## Implementation
+
+The SINGA ONNX code is located at `python/singa/sonnx.py`. There are three main
+classes: `SingaFrontend`, `SingaBackend` and `SingaRep`. `SingaFrontend`
+translates a SINGA model to an ONNX model; `SingaBackend` translates an ONNX
+model to a `SingaRep` object, which stores all SINGA operators and tensors (in
+this doc, "tensor" means a SINGA `Tensor`); `SingaRep` can be run like a SINGA
+model.
+
+### SingaFrontend
+
+The entry function of `SingaFrontend` is `singa_to_onnx_model`, which is also
+called `to_onnx`. `singa_to_onnx_model` creates the ONNX model, and it also
+creates an ONNX graph by using `singa_to_onnx_graph`.
+
+`singa_to_onnx_graph` accepts the output of the model and recursively iterates
+over the SINGA model's graph from the output to collect all operators into a
+queue. The input and intermediate tensors, i.e., trainable weights, of the SINGA
+model are picked up at the same time. The input is stored in
+`onnx_model.graph.input`; the output is stored in `onnx_model.graph.output`; and
+the trainable weights are stored in `onnx_model.graph.initializer`.
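+
The backward walk from the model output can be sketched as a depth-first traversal. This is an illustration under assumptions, not SINGA's code: `Op` and `collect_operators` are hypothetical names, and the real implementation may order the queue differently.

```python
from collections import deque


class Op:
    """Hypothetical stand-in for an operator node; not a SINGA class."""

    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)  # operators that produce this operator's inputs


def collect_operators(output_op):
    """Walk backwards from the output operator; return operators in execution order."""
    ordered, seen = deque(), set()

    def visit(op):
        if id(op) in seen:
            return
        seen.add(id(op))
        for producer in op.inputs:
            visit(producer)
        ordered.append(op)  # appended only after all of its producers

    visit(output_op)
    return ordered


conv = Op("conv")
relu = Op("relu", [conv])
fc = Op("fc", [relu])
print([op.name for op in collect_operators(fc)])
# prints ['conv', 'relu', 'fc']
```

The `seen` set keeps an operator that feeds several consumers from being queued more than once.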
+
+Then the SINGA operators in the queue are translated to ONNX operators one by
+one. `_rename_operators` defines the operator name mapping between SINGA and
+ONNX. `_special_operators` defines which function is used to translate the
+operator.
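+
The combination of a rename table and a table of special-case functions can be sketched as a dictionary dispatch. All table entries and function names below are hypothetical and chosen only for illustration; the real tables are defined in the SINGA source.

```python
# Hypothetical tables for illustration; the real entries are defined in SINGA.
RENAME_OPERATORS = {"ReLU": "Relu", "Dense": "Gemm"}


def translate_conv(op):
    """Hypothetical special-case translator that also copies an attribute."""
    return {"op_type": "Conv", "kernel_shape": op["kernel"]}


SPECIAL_OPERATORS = {"Conv2d": translate_conv}


def translate(op):
    """Translate one operator dict, preferring a special handler over a plain rename."""
    handler = SPECIAL_OPERATORS.get(op["type"])
    if handler is not None:
        return handler(op)
    return {"op_type": RENAME_OPERATORS.get(op["type"], op["type"])}


queue = [{"type": "Conv2d", "kernel": [3, 3]}, {"type": "ReLU"}, {"type": "Flatten"}]
print([translate(op)["op_type"] for op in queue])
# prints ['Conv', 'Relu', 'Flatten']
```

Operators that need neither a rename nor special handling fall through with their name unchanged.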
+
+In addition, some operators in SINGA are defined differently from ONNX, that
+is, ONNX regards some attributes of SINGA operators as inputs, so
+`_unhandled_operators` defines which function handles each of these special
+operators.
+
+Since the bool type is regarded as int32 in SINGA, `_bool_operators` defines the
+operators to be changed to the bool type.
+
+### SingaBackend
+
+The entry function of `SingaBackend` is `prepare`, which checks the version of
+the ONNX model and then calls `_onnx_model_to_singa_net`.
+
+The purpose of `_onnx_model_to_singa_net` is to get SINGA tensors and operators.
+The tensors are stored in a dictionary keyed by their ONNX names, and the
+operators are stored in a queue in the form of
+`namedtuple('SingaOps', ['name', 'op', 'handle', 'forward'])`. For each
+operator, `name` is its ONNX node name; `op` is the ONNX node; `forward` is the
+SINGA operator's forward function; `handle` is prepared for some special
+operators such as Conv and Pooling, which have a `handle` object.
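+
The queue entry described above can be tried out in isolation. In this sketch the `namedtuple` definition is the one quoted in the text, while the plain dict used for `op` and the `relu_forward` function are hypothetical stand-ins for an ONNX node and a SINGA forward function.

```python
from collections import namedtuple

# Definition quoted from the text above.
SingaOps = namedtuple('SingaOps', ['name', 'op', 'handle', 'forward'])


def relu_forward(values):
    """Hypothetical stand-in for a SINGA operator's forward function."""
    return [max(v, 0.0) for v in values]


# A plain dict stands in for the ONNX node; handle is None because this
# operator needs no handle object.
entry = SingaOps(name="relu_1", op={"op_type": "Relu"}, handle=None,
                 forward=relu_forward)

print(entry.name, entry.forward([-1.0, 2.0]))
# prints relu_1 [0.0, 2.0]
```

Because the entry is a tuple, it can also be unpacked positionally, for example `name, op, handle, forward = entry`.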
+
+The first step of `_onnx_model_to_singa_net` is to call `_init_graph_parameter`
+to get all tensors within the model. For trainable weights, it can initialize a
+SINGA `Tensor` from `onnx_model.graph.initializer`. Please note that the weights
+may also be stored within the graph's input or in an ONNX node called
+`Constant`; SINGA can handle these as well.
+
+Though all weights are stored within the ONNX model, only the shape and type of
+the model's input are known, not its values. So SINGA supports two ways to
+initialize the input: (1) generate a random tensor from its shape and type, or
+(2) allow the user to assign the input. The first way works fine for most
+models; however, for some models such as BERT, the matrix indices cannot be
+randomly generated, otherwise errors will occur.
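+
The two ways of initializing the input can be sketched as follows. This is a simplified illustration, not SINGA's implementation: `init_input` is a hypothetical helper, a flat Python list stands in for a tensor, and the dtype strings are assumptions.

```python
import random


def init_input(shape, dtype, user_value=None, seed=0):
    """Return user_value if given, else random values matching shape and dtype."""
    if user_value is not None:
        return user_value  # way 2: the user assigns the input
    rng = random.Random(seed)
    size = 1
    for dim in shape:
        size *= dim
    if dtype == "int32":
        return [rng.randint(0, 9) for _ in range(size)]
    return [rng.random() for _ in range(size)]  # way 1: random values


# Random values are enough when only the shape and type matter.
print(len(init_input((2, 3), "float32")))
# prints 6

# Index-like inputs should be supplied by the user so that they stay valid.
print(init_input((4,), "int32", user_value=[0, 2, 1, 0]))
# prints [0, 2, 1, 0]
```

Supplying the value explicitly is the safer choice whenever the input is used to index into another tensor.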
+
+Then, `_onnx_model_to_singa_net` iterates over all nodes within the ONNX graph
+to translate them to SINGA operators. As in the frontend, `_rename_operators`
+defines the operator name mapping between SINGA and ONNX, and
+`_special_operators` defines which function is used to translate the operator.
+`_run_node` runs the generated SINGA operator on its input tensors and stores
+its output tensors for use by later operators.
+
+Finally, this class returns a `SingaRep` object and stores all SINGA tensors and
+operators within it.
+
+### SingaRep
+
+`SingaRep` stores all SINGA tensors and operators. `run` accepts the input of
+the model and runs the SINGA operators one by one, following the operator
+queue. The user can use `last_layers` to run the model only up to the last few
+layers. Set `all_outputs` to `False` to get only the final output, or `True` to
+also get all the intermediate outputs.
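+
Running an operator queue with these two options can be sketched in plain Python. This is not the `SingaRep.run` implementation: the tuple layout of `ops` is hypothetical, and treating `last_layers` as the end of a slice over the queue is an assumption made for this sketch.

```python
def run(ops, inputs, last_layers=None, all_outputs=False):
    """Run operators in queue order.

    ops: list of (name, input_names, forward) tuples (hypothetical layout).
    inputs: dict mapping input names to values.
    last_layers: slice end over the queue (assumption); None runs everything.
    """
    tensors = dict(inputs)  # name -> value, shared by later operators
    produced = []
    for name, input_names, forward in ops[:last_layers]:
        tensors[name] = forward(*[tensors[n] for n in input_names])
        produced.append(name)
    if all_outputs:
        return {n: tensors[n] for n in produced}
    return tensors[produced[-1]]


ops = [
    ("double", ["x"], lambda v: v * 2),
    ("inc", ["double"], lambda v: v + 1),
]

print(run(ops, {"x": 3}))                    # prints 7
print(run(ops, {"x": 3}, last_layers=-1))    # prints 6
print(run(ops, {"x": 3}, all_outputs=True))  # prints {'double': 6, 'inc': 7}
```

Stopping before the final operators is useful when only an intermediate feature is needed, for example when a new output layer is attached for fine-tuning.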
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_0.1.0.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_0.1.0.md
new file mode 100644
index 0000000..8ede5f7
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_0.1.0.md
@@ -0,0 +1,153 @@
+---
+id: version-3.0.0.rc1-RELEASE_NOTES_0.1.0
+title: singa-incubating-0.1.0 Release Notes
+original_id: RELEASE_NOTES_0.1.0
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+SINGA is a general distributed deep learning platform for training big deep
+learning models over large datasets. It is designed with an intuitive
+programming model based on the layer abstraction. SINGA supports a wide variety
+of popular deep learning models.
+
+This release includes the following features:
+
+- Job management
+  - [SINGA-3](https://issues.apache.org/jira/browse/SINGA-3) Use Zookeeper to
+    check stopping (finish) time of the system
+  - [SINGA-16](https://issues.apache.org/jira/browse/SINGA-16) Runtime Process
+    id Management
+  - [SINGA-25](https://issues.apache.org/jira/browse/SINGA-25) Setup glog output
+    path
+  - [SINGA-26](https://issues.apache.org/jira/browse/SINGA-26) Run distributed
+    training in a single command
+  - [SINGA-30](https://issues.apache.org/jira/browse/SINGA-30) Enhance
+    easy-to-use feature and support concurrent jobs
+  - [SINGA-33](https://issues.apache.org/jira/browse/SINGA-33) Automatically
+    launch a number of processes in the cluster
+  - [SINGA-34](https://issues.apache.org/jira/browse/SINGA-34) Support external
+    zookeeper service
+  - [SINGA-38](https://issues.apache.org/jira/browse/SINGA-38) Support
+    concurrent jobs
+  - [SINGA-39](https://issues.apache.org/jira/browse/SINGA-39) Avoid ssh in
+    scripts for single node environment
+  - [SINGA-43](https://issues.apache.org/jira/browse/SINGA-43) Remove
+    Job-related output from workspace
+  - [SINGA-56](https://issues.apache.org/jira/browse/SINGA-56) No automatic
+    launching of zookeeper service
+  - [SINGA-73](https://issues.apache.org/jira/browse/SINGA-73) Refine the
+    selection of available hosts from host list
+
+* Installation with GNU Auto tool
+  - [SINGA-4](https://issues.apache.org/jira/browse/SINGA-4) Refine
+    thirdparty-dependency installation
+  - [SINGA-13](https://issues.apache.org/jira/browse/SINGA-13) Separate
+    intermediate files of compilation from source files
+  - [SINGA-17](https://issues.apache.org/jira/browse/SINGA-17) Add root
+    permission within thirdparty/install.
+  - [SINGA-27](https://issues.apache.org/jira/browse/SINGA-27) Generate python
+    modules for proto objects
+  - [SINGA-53](https://issues.apache.org/jira/browse/SINGA-53) Add lmdb
+    compiling options
+  - [SINGA-62](https://issues.apache.org/jira/browse/SINGA-62) Remove building
+    scripts and auxiliary files
+  - [SINGA-67](https://issues.apache.org/jira/browse/SINGA-67) Add singatest
+    into build targets
+
+- Distributed training
+  - [SINGA-7](https://issues.apache.org/jira/browse/SINGA-7) Implement shared
+    memory Hogwild algorithm
+  - [SINGA-8](https://issues.apache.org/jira/browse/SINGA-8) Implement
+    distributed Hogwild
+  - [SINGA-19](https://issues.apache.org/jira/browse/SINGA-19) Slice large Param
+    objects for load-balance
+  - [SINGA-29](https://issues.apache.org/jira/browse/SINGA-29) Update NeuralNet
+    class to enable layer partition type customization
+  - [SINGA-24](https://issues.apache.org/jira/browse/SINGA-24) Implement
+    Downpour training framework
+  - [SINGA-32](https://issues.apache.org/jira/browse/SINGA-32) Implement
+    AllReduce training framework
+  - [SINGA-57](https://issues.apache.org/jira/browse/SINGA-57) Improve
+    Distributed Hogwild
+
+* Training algorithms for different model categories
+  - [SINGA-9](https://issues.apache.org/jira/browse/SINGA-9) Add Support for
+    Restricted Boltzmann Machine (RBM) model
+  - [SINGA-10](https://issues.apache.org/jira/browse/SINGA-10) Add Support for
+    Recurrent Neural Networks (RNN)
+
+- Checkpoint and restore
+  - [SINGA-12](https://issues.apache.org/jira/browse/SINGA-12) Support
+    Checkpoint and Restore
+
+* Unit test
+  - [SINGA-64](https://issues.apache.org/jira/browse/SINGA-64) Add the test
+    module for utils/common
+
+- Programming model
+  - [SINGA-36](https://issues.apache.org/jira/browse/SINGA-36) Refactor job
+    configuration, driver program and scripts
+  - [SINGA-37](https://issues.apache.org/jira/browse/SINGA-37) Enable users to
+    set parameter sharing in model configuration
+  - [SINGA-54](https://issues.apache.org/jira/browse/SINGA-54) Refactor job
+    configuration to move fields in ModelProto out
+  - [SINGA-55](https://issues.apache.org/jira/browse/SINGA-55) Refactor main.cc
+    and singa.h
+  - [SINGA-61](https://issues.apache.org/jira/browse/SINGA-61) Support user
+    defined classes
+  - [SINGA-65](https://issues.apache.org/jira/browse/SINGA-65) Add an example of
+    writing user-defined layers
+
+* Other features
+  - [SINGA-6](https://issues.apache.org/jira/browse/SINGA-6) Implement
+    thread-safe singleton
+  - [SINGA-18](https://issues.apache.org/jira/browse/SINGA-18) Update API for
+    displaying performance metric
+  - [SINGA-77](https://issues.apache.org/jira/browse/SINGA-77) Integrate with
+    Apache RAT
+
+Some bugs are fixed during the development of this release
+
+- [SINGA-2](https://issues.apache.org/jira/browse/SINGA-2) Check failed:
+  zsock_connect
+- [SINGA-5](https://issues.apache.org/jira/browse/SINGA-5) Server early
+  terminate when zookeeper singa folder is not initially empty
+- [SINGA-15](https://issues.apache.org/jira/browse/SINGA-15) Fix a bug from
+  ConnectStub function which gets stuck for connecting layer*dealer*
+- [SINGA-22](https://issues.apache.org/jira/browse/SINGA-22) Cannot find
+  openblas library when it is installed in default path
+- [SINGA-23](https://issues.apache.org/jira/browse/SINGA-23) Libtool version
+  mismatch error.
+- [SINGA-28](https://issues.apache.org/jira/browse/SINGA-28) Fix a bug from
+  topology sort of Graph
+- [SINGA-42](https://issues.apache.org/jira/browse/SINGA-42) Issue when loading
+  checkpoints
+- [SINGA-44](https://issues.apache.org/jira/browse/SINGA-44) A bug when
+  resetting metric values
+- [SINGA-46](https://issues.apache.org/jira/browse/SINGA-46) Fix a bug in
+  updater.cc to scale the gradients
+- [SINGA-47](https://issues.apache.org/jira/browse/SINGA-47) Fix a bug in data
+  layers that leads to out-of-memory when group size is too large
+- [SINGA-48](https://issues.apache.org/jira/browse/SINGA-48) Fix a bug in
+  trainer.cc that assigns the same NeuralNet instance to workers from diff
+  groups
+- [SINGA-49](https://issues.apache.org/jira/browse/SINGA-49) Fix a bug in
+  HandlePutMsg func that sets param fields to invalid values
+- [SINGA-66](https://issues.apache.org/jira/browse/SINGA-66) Fix bugs in
+  Worker::RunOneBatch function and ClusterProto
+- [SINGA-79](https://issues.apache.org/jira/browse/SINGA-79) Fix bug in
+  singatool that can not parse -conf flag
+
+Features planned for the next release
+
+- [SINGA-11](https://issues.apache.org/jira/browse/SINGA-11) Start SINGA using
+  Mesos
+- [SINGA-31](https://issues.apache.org/jira/browse/SINGA-31) Extend Blob to
+  support xpu (cpu or gpu)
+- [SINGA-35](https://issues.apache.org/jira/browse/SINGA-35) Add random number
+  generators
+- [SINGA-40](https://issues.apache.org/jira/browse/SINGA-40) Support sparse
+  Param update
+- [SINGA-41](https://issues.apache.org/jira/browse/SINGA-41) Support single node
+  single GPU training
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_0.2.0.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_0.2.0.md
new file mode 100644
index 0000000..14c7fa1
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_0.2.0.md
@@ -0,0 +1,82 @@
+---
+id: version-3.0.0.rc1-RELEASE_NOTES_0.2.0
+title: singa-incubating-0.2.0 Release Notes
+original_id: RELEASE_NOTES_0.2.0
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+SINGA is a general distributed deep learning platform for training big deep
+learning models over large datasets. It is designed with an intuitive
+programming model based on the layer abstraction. SINGA supports a wide variety
+of popular deep learning models.
+
+This release includes the following **major features**:
+
+- [Training on GPU](../docs/gpu.html) enables training of complex models on a
+  single node with multiple GPU cards.
+- [Hybrid neural net partitioning](../docs/hybrid.html) supports data and model
+  parallelism at the same time.
+- [Python wrapper](../docs/python.html) makes it easy to configure the job,
+  including neural net and SGD algorithm.
+- [RNN model and BPTT algorithm](../docs/general-rnn.html) are implemented to
+  support applications based on RNN models, e.g., GRU.
+- [Cloud software integration](../docs/distributed-training.md) includes Mesos,
+  Docker and HDFS.
+
+**More details** are listed as follows:
+
+- Programming model
+  - [SINGA-80] New Blob Level and Address Level Math Operation Interface
+  - [SINGA-82] Refactor input layers using data store abstraction
+  - [SINGA-87] Replace exclude field to include field for layer configuration
+  - [SINGA-110] Add Layer member datavec* and gradvec*
+  - [SINGA-120] Implemented GRU and BPTT (BPTTWorker)
+
+* Neuralnet layers
+  - [SINGA-91] Add SoftmaxLayer and ArgSortLayer
+  - [SINGA-106] Add dummy layer for test purpose
+  - [SINGA-120] Implemented GRU and BPTT (GRULayer and OneHotLayer)
+
+- GPU training support
+  - [SINGA-100] Implement layers using CUDNN for GPU training
+  - [SINGA-104] Add Context Class
+  - [SINGA-105] Update GNU make files for compiling cuda related code
+  - [SINGA-98] Add Support for AlexNet ImageNet Classification Model
+
+* Model/Hybrid partition
+  - [SINGA-109] Refine bridge layers
+  - [SINGA-111] Add slice, concate and split layers
+  - [SINGA-113] Model/Hybrid Partition Support
+
+- Python binding
+  - [SINGA-108] Add Python wrapper to singa
+
+* Predict-only mode
+  - [SINGA-85] Add functions for extracting features and test new data
+
+- Integrate with third-party tools
+  - [SINGA-11] Start SINGA on Apache Mesos
+  - [SINGA-78] Use Doxygen to generate documentation
+  - [SINGA-89] Add Docker support
+
+* Unit test
+  - [SINGA-95] Add make test after building
+
+- Other improvements
+  - [SINGA-84] Header Files Rearrange
+  - [SINGA-93] Remove the asterisk in the log tcp://169.254.12.152:\*:49152
+  - [SINGA-94] Move call to google::InitGoogleLogging() from Driver::Init() to
+    main()
+  - [SINGA-96] Add Momentum to Cifar10 Example
+  - [SINGA-101] Add ll (ls -l) command in .bashrc file when using docker
+  - [SINGA-114] Remove short logs in tmp directory
+  - [SINGA-115] Print layer debug information in the neural net graph file
+  - [SINGA-118] Make protobuf LayerType field id easy to assign
+  - [SINGA-97] Add HDFS Store
+
+* Bugs fixed
+  - [SINGA-85] Fix compilation errors in examples
+  - [SINGA-90] Miscellaneous trivial bug fixes
+  - [SINGA-107] Error from loading pre-trained params for training stacked RBMs
+  - [SINGA-116] Fix a bug in InnerProductLayer caused by weight matrix sharing
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_0.3.0.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_0.3.0.md
new file mode 100644
index 0000000..7c9fcde
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_0.3.0.md
@@ -0,0 +1,43 @@
+---
+id: version-3.0.0.rc1-RELEASE_NOTES_0.3.0
+title: singa-incubating-0.3.0 Release Notes
+original_id: RELEASE_NOTES_0.3.0
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+SINGA is a general distributed deep learning platform for training big deep
+learning models over large datasets. It is designed with an intuitive
+programming model based on the layer abstraction. SINGA supports a wide variety
+of popular deep learning models.
+
+This release includes the following features:
+
+- GPU Support
+
+  - [SINGA-131] Implement and optimize hybrid training using both CPU and GPU
+  - [SINGA-136] Support cuDNN v4
+  - [SINGA-134] Extend SINGA to run over a GPU cluster
+  - [SINGA-157] Change the priority of cudnn library and install libsingagpu.so
+
+- Remove Dependences
+
+  - [SINGA-156] Remove the dependency on ZMQ for single process training
+  - [SINGA-155] Remove zookeeper for single-process training
+
+- Python Binding
+
+  - [SINGA-126] Python Binding for Interactive Training
+
+- Other Improvements
+
+  - [SINGA-80] New Blob Level and Address Level Math Operation Interface
+  - [SINGA-130] Data Prefetching
+  - [SINGA-145] New SGD based optimization Updaters: AdaDelta, Adam, AdamMax
+
+- Bugs Fixed
+  - [SINGA-148] Race condition between Worker threads and Driver
+  - [SINGA-150] Mesos Docker container failed
+  - [SINGA-141] Undesired Hash collision when locating process id to worker…
+  - [SINGA-149] Docker build fail
+  - [SINGA-143] The compilation cannot detect libsingagpu.so file
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_1.0.0.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_1.0.0.md
new file mode 100644
index 0000000..bf12f0d
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_1.0.0.md
@@ -0,0 +1,96 @@
+---
+id: version-3.0.0.rc1-RELEASE_NOTES_1.0.0
+title: singa-incubating-1.0.0 Release Notes
+original_id: RELEASE_NOTES_1.0.0
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+SINGA is a general distributed deep learning platform for training big deep
+learning models over large datasets. It is designed with an intuitive
+programming model based on the layer abstraction. SINGA supports a wide variety
+of popular deep learning models.
+
+This release includes the following features:
+
+- Core abstractions including Tensor and Device
+  - [SINGA-207] Update Tensor functions for matrices
+  - [SINGA-205] Enable slice and concatenate operations for Tensor objects
+  - [SINGA-197] Add CNMem as a submodule in lib/
+  - [SINGA-196] Rename class Blob to Block
+  - [SINGA-194] Add a Platform singleton
+  - [SINGA-175] Add memory management APIs and implement a subclass using CNMeM
+  - [SINGA-173] OpenCL Implementation
+  - [SINGA-171] Create CppDevice and CudaDevice
+  - [SINGA-168] Implement Cpp Math functions APIs
+  - [SINGA-162] Overview of features for V1.x
+  - [SINGA-165] Add cross-platform timer API to singa
+  - [SINGA-167] Add Tensor Math function APIs
+  - [SINGA-166] light built-in logging for making glog optional
+  - [SINGA-164] Add the base Tensor class
+
+* IO components for file read/write, network and data pre-processing
+  - [SINGA-233] New communication interface
+  - [SINGA-215] Implement Image Transformation for Image Pre-processing
+  - [SINGA-214] Add LMDBReader and LMDBWriter for LMDB
+  - [SINGA-213] Implement Encoder and Decoder for CSV
+  - [SINGA-211] Add TextFileReader and TextFileWriter for CSV files
+  - [SINGA-210] Enable checkpoint and resume for v1.0
+  - [SINGA-208] Add DataIter base class and a simple implementation
+  - [SINGA-203] Add OpenCV detection for cmake compilation
+  - [SINGA-202] Add reader and writer for binary file
+  - [SINGA-200] Implement Encoder and Decoder for data pre-processing
+
+- Module components including layer classes, training algorithms and Python
+  binding
+  - [SINGA-235] Unify the engines for cudnn and singa layers
+  - [SINGA-230] OpenCL Convolution layer and Pooling layer
+  - [SINGA-222] Fixed bugs in IO
+  - [SINGA-218] Implementation for RNN CUDNN version
+  - [SINGA-204] Support the training of feed-forward neural nets
+  - [SINGA-199] Implement Python classes for SGD optimizers
+  - [SINGA-198] Change Layer::Setup API to include input Tensor shapes
+  - [SINGA-193] Add Python layers
+  - [SINGA-192] Implement optimization algorithms for SINGA v1 (nesterov,
+    adagrad, rmsprop)
+  - [SINGA-191] Add "autotune" for CudnnConvolution Layer
+  - [SINGA-190] Add prelu layer and flatten layer
+  - [SINGA-189] Generate python outputs of proto files
+  - [SINGA-188] Add Dense layer
+  - [SINGA-187] Add popular parameter initialization methods
+  - [SINGA-186] Create Python Tensor class
+  - [SINGA-184] Add Cross Entropy loss computation
+  - [SINGA-183] Add the base classes for optimizer, constraint and regularizer
+  - [SINGA-180] Add Activation layer and Softmax layer
+  - [SINGA-178] Add Convolution layer and Pooling layer
+  - [SINGA-176] Add loss and metric base classes
+  - [SINGA-174] Add Batch Normalization layer and Local Response Normalization
+    layer.
+  - [SINGA-170] Add Dropout layer and CudnnDropout layer.
+  - [SINGA-169] Add base Layer class for V1.0
+
+* Examples
+
+  - [SINGA-232] Alexnet on Imagenet
+  - [SINGA-231] Batch-normalized VGG model for cifar-10
+  - [SINGA-228] Add Cpp Version of Convolution and Pooling layer
+  - [SINGA-227] Add Split and Merge Layer and add ResNet Implementation
+
+* Documentation
+
+  - [SINGA-239] Transfer documentation files of v0.3.0 to github
+  - [SINGA-238] RBM on mnist
+  - [SINGA-225] Documentation for installation and Cifar10 example
+  - [SINGA-223] Use Sphinx to create the website
+
+* Tools for compilation and some utility code
+  - [SINGA-229] Complete install targets
+  - [SINGA-221] Support for Travis-CI
+  - [SINGA-217] build python package with setup.py
+  - [SINGA-216] add jenkins for CI support
+  - [SINGA-212] Disable the compilation of libcnmem if USE_CUDA is OFF
+  - [SINGA-195] Channel for sending training statistics
+  - [SINGA-185] Add CBLAS and GLOG detection for singav1
+  - [SINGA-181] Add NVCC supporting for .cu files
+  - [SINGA-177] Add fully cmake supporting for the compilation of singa_v1
+  - [SINGA-172] Add CMake supporting for Cuda and Cudnn libs
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_1.1.0.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_1.1.0.md
new file mode 100644
index 0000000..2df093d
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_1.1.0.md
@@ -0,0 +1,57 @@
+---
+id: version-3.0.0.rc1-RELEASE_NOTES_1.1.0
+title: singa-incubating-1.1.0 Release Notes
+original_id: RELEASE_NOTES_1.1.0
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+SINGA is a general distributed deep learning platform for training big deep
+learning models over large datasets.
+
+This release includes the following features:
+
+- Core components
+
+  - [SINGA-296] Add sign and to_host function for pysinga tensor module
+
+- Model components
+
+  - [SINGA-254] Implement Adam for V1
+  - [SINGA-264] Extend the FeedForwardNet to accept multiple inputs
+  - [SINGA-267] Add spatial mode in batch normalization layer
+  - [SINGA-271] Add Concat and Slice layers
+  - [SINGA-275] Add Cross Entropy Loss for multiple labels
+  - [SINGA-278] Convert trained caffe parameters to singa
+  - [SINGA-287] Add memory size check for cudnn convolution
+
+- Utility functions and CI
+
+  - [SINGA-242] Compile all source files into a single library.
+  - [SINGA-244] Separating swig interface and python binding files
+  - [SINGA-246] Imgtool for image augmentation
+  - [SINGA-247] Add windows support for singa
+  - [SINGA-251] Implement image loader for pysinga
+  - [SINGA-252] Use the snapshot methods to dump and load models for pysinga
+  - [SINGA-255] Compile mandatory dependent libraries together with SINGA code
+  - [SINGA-259] Add maven pom file for building java classes
+  - [SINGA-261] Add version ID into the checkpoint files
+  - [SINGA-266] Add Rafiki python toolkits
+  - [SINGA-273] Improve license and contributions
+  - [SINGA-284] Add python unittest into Jenkins and link static libs into whl
+    file
+  - [SINGA-280] Jenkins CI support
+  - [SINGA-288] Publish wheel of PySINGA generated by Jenkins to public servers
+
+- Documentation and usability
+
+  - [SINGA-263] Create Amazon Machine Image
+  - [SINGA-268] Add IPython notebooks to the documentation
+  - [SINGA-276] Create docker images
+  - [SINGA-289] Update SINGA website automatically using Jenkins
+  - [SINGA-295] Add an example of image classification using GoogleNet
+
+- Bugs fixed
+  - [SINGA-245] float as the first operand can not multiply with a tensor object
+  - [SINGA-293] Bug from compiling PySINGA on Mac OS X with multiple version of
+    Python
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_1.2.0.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_1.2.0.md
new file mode 100644
index 0000000..de1873e
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_1.2.0.md
@@ -0,0 +1,63 @@
+---
+id: version-3.0.0.rc1-RELEASE_NOTES_1.2.0
+title: singa-incubating-1.2.0 Release Notes
+original_id: RELEASE_NOTES_1.2.0
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+SINGA is a general distributed deep learning platform for training big deep
+learning models over large datasets.
+
+This release includes the following features:
+
+- Core components
+
+  - [SINGA-290] Upgrade to Python 3
+  - [SINGA-341] Added stride functionality to tensors for CPP
+  - [SINGA-347] Create a function that supports einsum
+  - [SINGA-351] Added stride support and cudnn codes to cuda
+
+- Model components
+
+  - [SINGA-300] Add residual networks for imagenet classification
+  - [SINGA-312] Rename layer parameters
+  - [SINGA-313] Add L2 norm layer
+  - [SINGA-315] Reduce memory footprint by Python generator for parameter
+  - [SINGA-316] Add SigmoidCrossEntropy
+  - [SINGA-324] Extend RNN layer to accept variant seq length across batches
+  - [SINGA-326] Add Inception V4 for ImageNet classification
+  - [SINGA-328] Add VGG models for ImageNet classification
+  - [SINGA-329] Support layer freezing during training (fine-tuning)
+  - [SINGA-346] Update cudnn from V5 to V7
+  - [SINGA-349] Create layer operations for autograd
+  - [SINGA-363] Add DenseNet for Imagenet classification
+
+- Utility functions and CI
+
+  - [SINGA-274] Improve Debian packaging with CPack
+  - [SINGA-303] Create conda packages
+  - [SINGA-337] Add test cases for code
+  - [SINGA-348] Support autograd MLP Example
+  - [SINGA-345] Update Jenkins and fix bugs in compilation
+  - [SINGA-354] Update travis scripts to use conda-build for all platforms
+  - [SINGA-358] Consolidated RUN steps and cleaned caches in Docker containers
+  - [SINGA-359] Create alias for conda packages
+
+- Documentation and usability
+
+  - [SINGA-223] Fix side navigation menu in the website
+  - [SINGA-294] Add instructions to run CUDA unit tests on Windows
+  - [SINGA-305] Add jupyter notebooks for SINGA V1 tutorial
+  - [SINGA-319] Fix link errors on the index page
+  - [SINGA-352] Complete SINGA documentation in Chinese version
+  - [SINGA-361] Add git instructions for contributors and committers
+
+- Bugs fixed
+  - [SINGA-330] fix openblas building on i7 7700k
+  - [SINGA-331] Fix the bug of tensor division operation
+  - [SINGA-350] Error from python3 test
+  - [SINGA-356] Error using travis tool to build SINGA on mac os
+  - [SINGA-363] Fix some bugs in imagenet examples
+  - [SINGA-368] Fix the bug in Cifar10 examples
+  - [SINGA-369] the errors of examples in testing
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_2.0.0.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_2.0.0.md
new file mode 100644
index 0000000..e941954
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/releases/RELEASE_NOTES_2.0.0.md
@@ -0,0 +1,56 @@
+---
+id: version-3.0.0.rc1-RELEASE_NOTES_2.0.0
+title: singa-incubating-2.0.0 Release Notes
+original_id: RELEASE_NOTES_2.0.0
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+SINGA is a general distributed deep learning platform for training big deep
+learning models over large datasets.
+
+This release includes the following features:
+
+- Core components
+
+  - [SINGA-434] Support tensor broadcasting
+  - [SINGA-370] Improvement to tensor reshape and various misc. changes related
+    to SINGA-341 and 351
+
+- Model components
+
+  - [SINGA-333] Add support for Open Neural Network Exchange (ONNX) format
+  - [SINGA-385] Add new python module for optimizers
+  - [SINGA-394] Improve the CPP operations via Intel MKL DNN lib
+  - [SINGA-425] Add 3 operators, Abs(), Exp() and leakyrelu(), for Autograd
+  - [SINGA-410] Add two functions, set_params() and get_params(), for Autograd
+    Layer class
+  - [SINGA-383] Add Separable Convolution for autograd
+  - [SINGA-388] Develop some RNN layers by calling tiny operations like matmul,
+    addbias.
+  - [SINGA-382] Implement concat operation for autograd
+  - [SINGA-378] Implement maxpooling operation and its related functions for
+    autograd
+  - [SINGA-379] Implement batchnorm operation and its related functions for
+    autograd
+
+- Utility functions and CI
+
+  - [SINGA-432] Update dependent lib versions in conda-build config
+  - [SINGA-429] Update docker images for latest cuda and cudnn
+  - [SINGA-428] Move Docker images under Apache user name
+
+- Documentation and usability
+  - [SINGA-395] Add documentation for autograd APIs
+  - [SINGA-344] Add a GAN example
+  - [SINGA-390] Update installation.md
+  - [SINGA-384] Implement ResNet using autograd API
+  - [SINGA-352] Complete SINGA documentation in Chinese version
+
+- Bugs fixed
+  - [SINGA-431] Unit Test failed - Tensor Transpose
+  - [SINGA-422] ModuleNotFoundError: No module named "\_singa_wrap"
+  - [SINGA-418] Unsupportive type 'long' in python3.
+  - [SINGA-409] Basic `singa-cpu` import throws error
+  - [SINGA-408] Unsupportive function definition in python3
+  - [SINGA-380] Fix bugs from Reshape
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/security.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/security.md
new file mode 100644
index 0000000..7f1e3ab
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/security.md
@@ -0,0 +1,10 @@
+---
+id: version-3.0.0.rc1-security
+title: Security
+original_id: security
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+Users can report security vulnerabilities to
+[SINGA Security Team Mail List](mailto:security@singa.apache.org)
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/software-stack.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/software-stack.md
new file mode 100644
index 0000000..625d154
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/software-stack.md
@@ -0,0 +1,143 @@
+---
+id: version-3.0.0.rc1-software-stack
+title: Software Stack
+original_id: software-stack
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+SINGA's software stack includes two major levels: the low-level backend classes
+and the Python interface level. Figure 1 illustrates them together with the
+hardware. The backend components provide basic data structures for deep
+learning models, hardware abstractions for scheduling and executing operations,
+and communication components for distributed training. The Python interface
+wraps some CPP data structures and provides additional high-level classes for
+neural network training, which makes it convenient to implement complex neural
+network models. Next, we introduce the software stack in a bottom-up manner.
+
+![SINGA V3 software stack](assets/singav3-sw.png) <br/> **Figure 1 - SINGA V3
+software stack.**
+
+## Low-level Backend
+
+### Device
+
+Each `Device` instance, i.e., a device, is created against one hardware device,
+e.g. a GPU or a CPU. `Device` manages the memory of the data structures, and
+schedules the operations for execution, e.g., on CUDA streams or CPU threads.
+Depending on the hardware and its programming language, SINGA has implemented
+the following specific device classes:
+
+- **CudaGPU** represents an Nvidia GPU card. The execution units are the CUDA
+  streams.
+- **CppCPU** represents a normal CPU. The execution units are the CPU threads.
+- **OpenclGPU** represents a normal GPU card from either Nvidia or AMD. The
+  execution units are the CommandQueues. Given that OpenCL is compatible with
+  many hardware devices, e.g. FPGA and ARM, the OpenclGPU has the potential to
+  be extended for other devices.
+
+### Tensor
+
+The `Tensor` class represents a multi-dimensional array, which stores model
+variables, e.g., the input images and feature maps from the convolution layer.
+Each `Tensor` instance (i.e. a tensor) is allocated on a device, which manages
+the memory of the tensor and schedules the (computation) operations against
+tensors. Most machine learning algorithms can be expressed using the (dense or
+sparse) tensor abstraction and its operations. Therefore, SINGA is able to run
+a wide range of models, including deep learning models and other traditional
+machine learning models.
+
+### Operator
+
+There are two types of operators against tensors: linear algebra operators like
+matrix multiplication, and neural network specific operators like convolution
+and pooling. The linear algebra operators are provided as `Tensor` functions and
+are implemented separately for different hardware devices:
+
+- CppMath (tensor_math_cpp.h) implements the tensor operations using Cpp for
+  CppCPU
+- CudaMath (tensor_math_cuda.h) implements the tensor operations using CUDA for
+  CudaGPU
+- OpenclMath (tensor_math_opencl.h) implements the tensor operations using
+  OpenCL for OpenclGPU
+
+The neural network specific operators are also implemented separately, e.g.,
+
+- GpuConvForward (convolution.h) implements the forward function of convolution
+  via CuDNN on Nvidia GPU.
+- CpuConvForward (convolution.h) implements the forward function of convolution
+  using CPP on CPU.
+
+Typically, users create a `Device` instance and use it to create multiple
+`Tensor` instances. When users call the Tensor functions or neural network
+operations, the corresponding implementation for the resident device will be
+invoked. In other words, the implementation of operators is transparent to users.
+
+The Tensor and Device abstractions are extensible to support a wide range of
+hardware devices using different programming languages. A new hardware device
+can be supported by adding a new Device subclass and the corresponding
+implementation of the operators.
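+
+The dispatch described above can be sketched in a few lines of plain Python.
+This is only an illustration under assumptions: the `KERNELS` table, the
+`register` decorator, the `lang` attribute and the toy `Tensor` class are
+hypothetical names, not SINGA's actual API, and lists stand in for device
+memory.
+
+```python
+# Hypothetical sketch of per-device operator dispatch; not SINGA's real code.
+class Device:
+    """Stands in for a hardware device."""
+    lang = None  # key used to look up operator implementations
+
+
+class CppCPU(Device):
+    lang = "cpp"
+
+
+class CudaGPU(Device):
+    lang = "cuda"
+
+
+# One table of implementations per backend (assumed layout).
+KERNELS = {"cpp": {}, "cuda": {}}
+
+
+def register(lang, name):
+    """Register a function as the `name` operator for backend `lang`."""
+    def wrap(fn):
+        KERNELS[lang][name] = fn
+        return fn
+    return wrap
+
+
+@register("cpp", "add")
+def add_cpp(a, b):
+    return [x + y for x, y in zip(a, b)]
+
+
+@register("cuda", "add")
+def add_cuda(a, b):
+    # A real backend would launch a GPU kernel here; this is a placeholder.
+    return [x + y for x, y in zip(a, b)]
+
+
+class Tensor:
+    def __init__(self, data, device):
+        self.data, self.device = list(data), device
+
+    def __add__(self, other):
+        # Operands must reside on the same device.
+        assert self.device is other.device
+        kernel = KERNELS[self.device.lang]["add"]
+        return Tensor(kernel(self.data, other.data), self.device)
+
+
+cpu = CppCPU()
+c = Tensor([1.0, 2.0], cpu) + Tensor([3.0, 4.0], cpu)
+print(c.data)
+```
+
+In this sketch, supporting a new device amounts to adding a `Device` subclass
+and registering its operator implementations; the caller's `a + b` code does
+not change.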
+
+Optimizations in terms of speed and memory are done by the `Scheduler` and
+`MemPool` of the `Device`. For example, the `Scheduler` creates a
+[computational graph](./graph) according to the dependency of the operators.
+Then it can optimize the execution order of the operators for parallelism and
+memory sharing.
+
+### Communicator
+
+`Communicator` supports [distributed training](./dist-train). It implements
+the communication protocols using sockets, MPI and NCCL. Typically, users only
+need to call the high-level APIs like `put()` and `get()` for sending and
+receiving tensors. Communication optimization for the topology, message size,
+etc. is done internally.
+
+## Python Interface
+
+All the backend components are exposed as Python modules via SWIG. In addition,
+the following classes are added to support the implementation of complex neural
+networks.
+
+### Opt
+
+`Opt` and its subclasses implement the methods (such as SGD) for updating model
+parameter values using parameter gradients. A subclass [DistOpt](./dist-train)
+synchronizes the gradients across the workers for distributed training by
+calling methods from `Communicator`.
+
+### Operator
+
+`Operator` wraps multiple functions implemented using the Tensor or neural
+network operators from the backend. For example, the forward function and
+backward function of `ReLU` together compose the `ReLU` operator.
+
+### Layer
+
+`Layer` and its subclasses wrap the operators with parameters. For instance,
+convolution and linear operators have weight and bias parameters. The
+parameters are maintained by the corresponding `Layer` class.
+
+### Autograd
+
+[Autograd](./autograd) implements
+[reverse-mode automatic differentiation](https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation)
+by recording the execution of the forward functions of the operators and calling
+the backward functions automatically in the reverse order. All functions can be
+buffered by the `Scheduler` to create a [computational graph](./graph) for
+efficiency and memory optimization.
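+
+To make the idea concrete, here is a minimal pure-Python sketch of reverse-mode
+differentiation on scalars. It is not SINGA's implementation: the `Var`, `Mul`
+and `Add` classes and the `backward` function are hypothetical names chosen for
+illustration. Each forward call records its inputs, and `backward` walks the
+recorded operators in reverse order.
+
+```python
+# Hypothetical scalar autograd sketch; names are illustrative only.
+class Var:
+    def __init__(self, value, creator=None):
+        self.value = value
+        self.creator = creator  # operator whose forward produced this value
+        self.grad = 0.0
+
+
+class Mul:
+    def forward(self, a, b):
+        self.inputs = (a, b)  # recorded for the backward pass
+        return Var(a.value * b.value, creator=self)
+
+    def backward(self, dy):
+        a, b = self.inputs
+        return (dy * b.value, dy * a.value)
+
+
+class Add:
+    def forward(self, a, b):
+        self.inputs = (a, b)
+        return Var(a.value + b.value, creator=self)
+
+    def backward(self, dy):
+        return (dy, dy)
+
+
+def backward(y):
+    """Call the backward functions in reverse topological order."""
+    order, seen = [], set()
+
+    def visit(v):
+        if id(v) in seen:
+            return
+        seen.add(id(v))
+        if v.creator is not None:
+            for x in v.creator.inputs:
+                visit(x)
+        order.append(v)  # inputs are appended before their outputs
+
+    visit(y)
+    y.grad = 1.0
+    for v in reversed(order):
+        if v.creator is not None:
+            grads = v.creator.backward(v.grad)
+            for x, g in zip(v.creator.inputs, grads):
+                x.grad += g  # accumulate when a value is used more than once
+
+
+x, w, b = Var(3.0), Var(2.0), Var(1.0)
+y = Add().forward(Mul().forward(x, w), b)  # y = x * w + b
+backward(y)
+print(y.value, x.grad, w.grad, b.grad)
+```
+
+For `y = x * w + b`, the gradient with respect to `x` is `w`, with respect to
+`w` is `x`, and with respect to `b` is 1, which is what the sketch computes.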
+
+### Module
+
+`Module` provides an easy interface to implement new network models. You just
+need to inherit `Module` and define the forward propagation of the model by
+creating and calling the layers or operators. `Module` will do autograd and
+update the parameters via `Opt` automatically when training data is fed into it.
+
+### ONNX
+
+To support ONNX, SINGA implements a [sonnx](./onnx) module, which includes:
+
+- SingaFrontend for saving a SINGA model in ONNX format.
+- SingaBackend for loading an ONNX format model into SINGA for training and
+  inference.
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/source-repository.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/source-repository.md
new file mode 100644
index 0000000..56fa732
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/source-repository.md
@@ -0,0 +1,24 @@
+---
+id: version-3.0.0.rc1-source-repository
+title: Source Repository
+original_id: source-repository
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+This project uses [Git](http://git-scm.com/) to manage its source code.
+Instructions on Git use can be found at http://git-scm.com/documentation .
+
+## Repository
+
+The following is a link to the online source repository.
+
+- https://gitbox.apache.org/repos/asf?p=singa.git
+
+There is a GitHub mirror at
+
+- https://github.com/apache/singa
+
+The code can be cloned from either repo, e.g.,
+
+    git clone https://github.com/apache/singa.git
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/team-list.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/team-list.md
new file mode 100644
index 0000000..b0166e8
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/team-list.md
@@ -0,0 +1,59 @@
+---
+id: version-3.0.0.rc1-team-list
+title: The SINGA Team
+original_id: team-list
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+A successful project requires many people to play many roles. Some members write
+code or documentation, while others are valuable as testers, submitting patches
+and suggestions.
+
+The SINGA community has developers mainly from National University of Singapore,
+Zhejiang University, NetEase, Osaka University, yzBigData, etc.
+
+## PMC
+
+| Name          | Email                  | Organization                                  |
+| ------------- | ---------------------- | --------------------------------------------- |
+| Gang Chen     | cg@apache.org          | Zhejiang University                           |
+| Anh Dinh      | dinhtta@apache.org     | Singapore University of Technology and Design |
+| Ted Dunning   | tdunning@apache.org    | Apache Software Foundation                    |
+| Jinyang Gao   | jinyang@apache.org     | DAMO Academy, Alibaba Group                   |
+| Alan Gates    | gates@apache.org       | Apache Software Foundation                    |
+| Zhaojing Luo  | zhaojing@apache.org    | National University of Singapore              |
+| Thejas Nair   | thejas@apache.org      | Apache Software Foundation                    |
+| Beng Chin Ooi | ooibc@apache.org       | National University of Singapore              |
+| Moaz Reyad    | moaz@apache.org        | Université Grenoble Alpes                     |
+| Kian-Lee Tan  | tankianlee@apache.org  | National University of Singapore              |
+| Wei Wang      | wangwei@apache.org     | National University of Singapore              |
+| Meihui Zhang  | meihuizhang@apache.org | Beijing Institute of Technology               |
+| Kaiping Zheng | kaiping@apache.org     | National University of Singapore              |
+
+## Committers
+
+| Name         | Email                   | Organization                     |
+| ------------ | ----------------------- | -------------------------------- |
+| Chonho Lee   | chonho@apache.org       | Osaka University                 |
+| Sheng Wang   | wangsh@apache.org       | DAMO Academy, Alibaba Group      |
+| Wanqi Xue    | xuewanqi@apache.org     | Nanyang Technological University |
+| Xiangrui Cai | caixr@apache.org        | Nankai University                |
+| Sai Ho Yeung | chrishkchris@apache.org | National University of Singapore |
+
+## Contributors
+
+| Name               | Email                        | Organization                                  |
+| ------------------ | ---------------------------- | --------------------------------------------- |
+| Haibo Chen         | hzchenhaibo@corp.netease.com | NetEase                                       |
+| Xin Ji             | vincent.j.xin@gmail.com      | Visenze, Singapore                            |
+| Anthony K. H. Tung | atung@comp.nus.edu.sg        | National University of Singapore              |
+| Ji Wang            | wangji@mzhtechnologies.com   | Hangzhou MZH Technologies                     |
+| Yuan Wang          | wangyuan@corp.netease.com    | NetEase                                       |
+| Wenfeng Wu         | dcswuw@gmail.com             | Freelancer, China                             |
+| Chang Yao          | yaochang2009@gmail.com       | Hangzhou MZH Technologies                     |
+| Shicheng Chen      | chengsc@comp.nus.edu.sg      | National University of Singapore              |
+| Joddiy Zhang       | joddiyzhang@gmail.com        | National University of Singapore              |
+| Shicong Lin        | dcslin@nus.edu.sg            | National University of Singapore              |
+| Kaiyuan Yang       | yangky@comp.nus.edu.sg       | National University of Singapore              |
+| Rulin Xing         | xjdkcsq3@gmail.com           | Huazhong University of Science and Technology |
diff --git a/docs-site/website/versioned_docs/version-3.0.0.rc1/tensor.md b/docs-site/website/versioned_docs/version-3.0.0.rc1/tensor.md
new file mode 100644
index 0000000..f7f0e1d
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.0.0.rc1/tensor.md
@@ -0,0 +1,241 @@
+---
+id: version-3.0.0.rc1-tensor
+title: Tensor
+original_id: tensor
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agree [...]
+
+Each Tensor instance is a multi-dimensional array allocated on a specific Device
+instance. Tensor instances store variables and provide linear algebra operations
+over different types of hardware devices without user awareness. Note that users
+need to make sure the tensor operands are allocated on the same device, except
+for copy functions.
+
+## Tensor Usage
+
+### Create Tensor
+
+```python
+>>> import numpy as np
+>>> from singa import tensor
+>>> tensor.from_numpy( np.asarray([[1, 0, 0], [0, 1, 0]], dtype=np.float32) )
+[[1. 0. 0.]
+ [0. 1. 0.]]
+```
+
+### Convert to numpy
+
+```python
+>>> a = np.asarray([[1, 0, 0], [0, 1, 0]], dtype=np.float32)
+>>> tensor.from_numpy(a)
+[[1. 0. 0.]
+ [0. 1. 0.]]
+>>> tensor.to_numpy(tensor.from_numpy(a))
+array([[1., 0., 0.],
+       [0., 1., 0.]], dtype=float32)
+```
+
+### Tensor Methods
+
+```python
+>>> t = tensor.from_numpy(a)
+>>> t.transpose([1,0])
+[[1. 0.]
+ [0. 1.]
+ [0. 0.]]
+```
+
+### Tensor Arithmetic Methods
+
+`tensor` arithmetic is evaluated in real time: each expression returns its result
+immediately.
+
+```python
+>>> t + 1
+[[2. 1. 1.]
+ [1. 2. 1.]]
+>>> t / 5
+[[0.2 0.  0. ]
+ [0.  0.2 0. ]]
+```
+
+### Tensor Functions
+
+Functions in module `singa.tensor` return a new `tensor` object after applying
+the transformation defined in the function.
+
+```python
+>>> tensor.log(t+1)
+[[0.6931472 0.        0.       ]
+ [0.        0.6931472 0.       ]]
+```
+
+### Tensor on Different Devices
+
+A `tensor` is created on the host (CPU) by default; it can also be created on
+a different hardware device by specifying the `device`. A `tensor` can be moved
+between `device`s via the `to_device()` function.
+
+```python
+>>> from singa import device
+>>> x = tensor.Tensor((2, 3), device.create_cuda_gpu())
+>>> x.gaussian(1,1)
+>>> x
+[[1.531889   1.0128608  0.12691343]
+ [2.1674204  3.083676   2.7421203 ]]
+>>> # move to host
+>>> x.to_device(device.get_default_device())
+```
+
+### Simple Neural Network Example
+
+```python
+from singa import autograd
+from singa import device
+from singa import opt
+from singa import tensor
+
+
+class MLP:
+    """Two linear layers applied back to back."""
+
+    def __init__(self):
+        self.linear1 = autograd.Linear(3, 4)
+        self.linear2 = autograd.Linear(4, 5)
+
+    def forward(self, x):
+        y = self.linear1(x)
+        return self.linear2(y)
+
+
+def train(model, x, t, dev, epochs=10):
+    # Create the optimizer once and reuse it for every epoch.
+    sgd = opt.SGD()
+    for i in range(epochs):
+        y = model.forward(x)
+        loss = autograd.mse_loss(y, t)
+        print("loss: ", loss)
+        # autograd.backward yields (parameter, gradient) pairs.
+        for p, gp in autograd.backward(loss):
+            sgd.update(p, gp)
+        sgd.step()
+    print("training completed")
+
+
+if __name__ == "__main__":
+    autograd.training = True  # training mode, so gradients can be computed
+    model = MLP()
+    dev = device.get_default_device()
+    # Random inputs and targets drawn from a Gaussian (mean 1, std 1).
+    x = tensor.Tensor((2, 3), dev)
+    t = tensor.Tensor((2, 5), dev)
+    x.gaussian(1, 1)
+    t.gaussian(1, 1)
+    train(model, x, t, dev)
+```
+
+Output:
+
+```
+loss:  [4.917431]
+loss:  [2.5147934]
+loss:  [2.0670078]
+loss:  [1.9179827]
+loss:  [1.8192691]
+loss:  [1.7269677]
+loss:  [1.6308627]
+loss:  [1.52674]
+loss:  [1.4122975]
+loss:  [1.2866782]
+training completed
+```
+
+## Tensor Implementation
+
+The previous section shows the general usage of `Tensor`; the implementation
+under the hood is covered below. First, the design of the Python and C++
+tensors is introduced. The later part explains how the frontend (Python)
+and backend (C++) are connected and how to extend them.
+
+### Python Tensor
+
+Python class `Tensor`, defined in `python/singa/tensor.py`, provides high level
+tensor manipulations for implementing deep learning operations (via
+[autograd](./autograd)), as well as data management by end users.
+
+It primarily works by wrapping C++ tensor methods, both arithmetic
+(e.g. `sum`) and non-arithmetic (e.g. `reshape`). Some advanced
+arithmetic operations, e.g. `tensordot`, are implemented purely with the Python
+tensor API. The flexible Python Tensor APIs make it easy to implement
+complex neural network operations.
+
+### C++ Tensor
+
+C++ class `Tensor`, defined in `include/singa/core/tensor.h`, primarily manages
+the memory that holds the data, and provides low level APIs for tensor
+manipulation. Also, it provides various arithmetic methods (e.g. `matmul`) by
+wrapping different backends (CUDA, BLAS, cuBLAS, etc.).
+
+#### Execution Context and Memory Block
+
+Two important concepts or data structures for `Tensor` are the execution context
+`device`, and the memory block `Block`.
+
+Each `Tensor` is physically stored on and managed by a hardware device,
+representing the execution context (CPU, GPU). Tensor math calculations are
+executed on the device.
+
+Tensor data is stored in a `Block` instance, defined in
+`include/singa/core/common.h`. `Block` owns the underlying data, while tensors
+own the metadata describing the tensor, like `shape` and `strides`.
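+
+The split between data and metadata can be illustrated with a small pure-Python
+sketch. It is a simplified model under assumptions, not the C++ implementation:
+the `Block` and `Tensor` classes below, the `at` helper and the row-major stride
+computation are written for illustration only.
+
+```python
+# Hypothetical model of the Block/Tensor split; not SINGA's C++ classes.
+class Block:
+    """Owns the flat data buffer."""
+
+    def __init__(self, data):
+        self.data = list(data)
+
+
+class Tensor:
+    """Holds only metadata (shape, strides) and a reference to a Block."""
+
+    def __init__(self, block, shape, strides=None):
+        self.block = block
+        self.shape = tuple(shape)
+        if strides is None:
+            # Row-major strides: the last dimension is contiguous.
+            strides, step = [], 1
+            for dim in reversed(self.shape):
+                strides.insert(0, step)
+                step *= dim
+        self.strides = tuple(strides)
+
+    def at(self, *index):
+        # Map a multi-dimensional index to an offset in the flat buffer.
+        offset = sum(i * s for i, s in zip(index, self.strides))
+        return self.block.data[offset]
+
+    def transpose(self):
+        # Only the metadata is reversed; the Block is shared, not copied.
+        return Tensor(self.block, self.shape[::-1], self.strides[::-1])
+
+
+t = Tensor(Block([1, 0, 0, 0, 1, 0]), (2, 3))
+tt = t.transpose()
+print(tt.shape, tt.strides, tt.at(1, 1), tt.block is t.block)
+```
+
+In this model, a transposed view reuses the same buffer and differs only in its
+`shape` and `strides`, which is the kind of operation that stride metadata makes
+cheap.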
+
+#### Tensor Math Backends
+
+To leverage the efficient math libraries provided by different backend
+hardware devices, SINGA has one set of implementations of Tensor functions for
+each supported backend.
+
+- `tensor_math_cpp.h` implements operations using Cpp (with CBLAS) for CppCPU
+  devices.
+- `tensor_math_cuda.h` implements operations using Cuda (with cuBLAS) for
+  CudaGPU devices.
+- `tensor_math_opencl.h` implements operations using OpenCL for OpenclGPU
+  devices.
+
+### Exposing C++ APIs to Python
+
+[SWIG](http://www.swig.org/) is a tool that can automatically convert C++ APIs
+into Python APIs. SINGA uses SWIG to expose the C++ APIs to Python. Several
+files are generated by SWIG, including `python/singa/singa_wrap.py`. The Python
+modules (e.g., `tensor`, `device` and `autograd`) import this module to call
+the C++ APIs for implementing the Python classes and functions.
+
+```python
+from singa import tensor
+
+t = tensor.Tensor(shape=(2, 3))
+```
+
+For example, when a Python `Tensor` instance is created as above, the `Tensor`
+class implementation creates an instance of the `Tensor` class defined in
+`singa_wrap.py`, which corresponds to the C++ `Tensor` class. For clarity, the
+`Tensor` class in `singa_wrap.py` is referred to as `CTensor` in `tensor.py`.
+
+```python
+# in tensor.py
+from . import singa_wrap as singa
+
+CTensor = singa.Tensor
+```
+
+### Create New Tensor Functions
+
+With the groundwork set by the previous description, extending tensor functions
+could be done easily in a bottom up manner. For math operations, the steps are:
+
+- Declare the new API in `tensor.h`
+- Generate code using the predefined macro in `tensor.cc`; refer to
+  `GenUnaryTensorFn(Abs);` as an example.
+- Declare the template method/function in `tensor_math.h`
+- Do the real implementation at least for CPU (`tensor_math_cpp.h`) and
+  GPU (`tensor_math_cuda.h`)
+- Expose the API via SWIG by adding it into `src/api/core_tensor.i`
+- Define the Python Tensor API in `tensor.py` by calling the automatically
+  generated function in `singa_wrap.py`
+- Write unit tests where appropriate
+
+## Python API
+
+_work in progress_
+
+## CPP API
+
+_work in progress_
diff --git a/docs-site/website/versioned_sidebars/version-3.0.0.rc1-sidebars.json b/docs-site/website/versioned_sidebars/version-3.0.0.rc1-sidebars.json
new file mode 100644
index 0000000..982c16a
--- /dev/null
+++ b/docs-site/website/versioned_sidebars/version-3.0.0.rc1-sidebars.json
@@ -0,0 +1,34 @@
+{
+  "version-3.0.0.rc1-docs": {
+    "Getting Started": [
+      "version-3.0.0.rc1-installation",
+      "version-3.0.0.rc1-software-stack",
+      "version-3.0.0.rc1-examples"
+    ],
+    "Guides": [
+      "version-3.0.0.rc1-device",
+      "version-3.0.0.rc1-tensor",
+      "version-3.0.0.rc1-autograd",
+      "version-3.0.0.rc1-graph",
+      "version-3.0.0.rc1-dist-train"
+    ],
+    "Development": [
+      "version-3.0.0.rc1-download-singa",
+      "version-3.0.0.rc1-build",
+      "version-3.0.0.rc1-contribute-code",
+      "version-3.0.0.rc1-contribute-docs",
+      "version-3.0.0.rc1-how-to-release",
+      "version-3.0.0.rc1-git-workflow"
+    ]
+  },
+  "version-3.0.0.rc1-community": {
+    "Community": [
+      "version-3.0.0.rc1-source-repository",
+      "version-3.0.0.rc1-mail-lists",
+      "version-3.0.0.rc1-issue-tracking",
+      "version-3.0.0.rc1-security",
+      "version-3.0.0.rc1-team-list",
+      "version-3.0.0.rc1-history-singa"
+    ]
+  }
+}
diff --git a/docs-site/website/versions.json b/docs-site/website/versions.json
index 3aea034..0c4a16f 100644
--- a/docs-site/website/versions.json
+++ b/docs-site/website/versions.json
@@ -1,3 +1,4 @@
 [
+  "3.0.0.rc1",
   "2.0.0"
 ]