Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/09/19 09:41:06 UTC

[GitHub] lebeg closed pull request #12596: [v1.3.x] Fix armv7 hard float

lebeg closed pull request #12596: [v1.3.x] Fix armv7 hard float
URL: https://github.com/apache/incubator-mxnet/pull/12596

This is a pull request from a forked repository. Because GitHub does not show the original diff once such a pull request is closed, it is reproduced below for provenance:

diff --git a/3rdparty/tvm b/3rdparty/tvm
index 290226e1c9a..90db723d287 160000
--- a/3rdparty/tvm
+++ b/3rdparty/tvm
@@ -1 +1 @@
-Subproject commit 290226e1c9adbb3e598f9ed9184018df1c12be33
+Subproject commit 90db723d287509705dcc93fde7ab0df380b9a4e5
diff --git a/Jenkinsfile b/Jenkinsfile
index 50b86ec7190..0e4aa199a6c 100644
--- a/Jenkinsfile
+++ b/Jenkinsfile
@@ -173,12 +173,12 @@ core_logic: {
         }
       }
     },
-    'CPU: Clang 5': {
+    'CPU: Clang 6': {
       node(NODE_LINUX_CPU) {
-        ws('workspace/build-cpu-clang50') {
+        ws('workspace/build-cpu-clang60') {
           timeout(time: max_time, unit: 'MINUTES') {
             utils.init_git()
-            utils.docker_run('ubuntu_cpu', 'build_ubuntu_cpu_clang50', false)
+            utils.docker_run('ubuntu_cpu', 'build_ubuntu_cpu_clang60', false)
           }
         }
       }
@@ -194,13 +194,13 @@ core_logic: {
         }
       }
     },
-    'CPU: Clang 5 MKLDNN': {
+    'CPU: Clang 6 MKLDNN': {
       node(NODE_LINUX_CPU) {
-        ws('workspace/build-cpu-mkldnn-clang50') {
+        ws('workspace/build-cpu-mkldnn-clang60') {
           timeout(time: max_time, unit: 'MINUTES') {
             utils.init_git()
-            utils.docker_run('ubuntu_cpu', 'build_ubuntu_cpu_clang50_mkldnn', false)
-            utils.pack_lib('mkldnn_cpu_clang5', mx_mkldnn_lib)
+            utils.docker_run('ubuntu_cpu', 'build_ubuntu_cpu_clang60_mkldnn', false)
+            utils.pack_lib('mkldnn_cpu_clang6', mx_mkldnn_lib)
           }
         }
       }
@@ -363,16 +363,16 @@ core_logic: {
         }
       }
     },
-    // 'ARMv7':{
-    //   node(NODE_LINUX_CPU) {
-    //     ws('workspace/build-ARMv7') {
-    //       timeout(time: max_time, unit: 'MINUTES') {
-    //         utils.init_git()
-    //         utils.docker_run('armv7', 'build_armv7', false)
-    //       }
-    //     }
-    //   }
-    // },
+    'ARMv7':{
+      node(NODE_LINUX_CPU) {
+        ws('workspace/build-ARMv7') {
+          timeout(time: max_time, unit: 'MINUTES') {
+            utils.init_git()
+            utils.docker_run('armv7', 'build_armv7', false)
+          }
+        }
+      }
+    },
     'ARMv6':{
       node(NODE_LINUX_CPU) {
         ws('workspace/build-ARMv6') {
diff --git a/NEWS.md b/NEWS.md
index 461bb6d2d15..b9770caab95 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,5 +1,196 @@
 MXNet Change Log
 ================
+## 1.3.0
+
+### New Features - Gluon RNN layers are now HybridBlocks
+- In this release, Gluon RNN layers such as `gluon.rnn.RNN`, `gluon.rnn.LSTM`, and `gluon.rnn.GRU` become `HybridBlock`s as part of the [gluon.rnn improvements project](https://github.com/apache/incubator-mxnet/projects/11) (#11482).
+- This is the result of newly available fused RNN operators added for CPU: LSTM ([#10104](https://github.com/apache/incubator-mxnet/pull/10104)), vanilla RNN ([#11399](https://github.com/apache/incubator-mxnet/pull/11399)), GRU ([#10311](https://github.com/apache/incubator-mxnet/pull/10311)).
+- Many dynamic networks based on Gluon RNN layers can now be completely hybridized, exported, and used for inference in other language bindings such as R and Scala.
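+
+A minimal sketch of what this enables (not part of the original notes; shapes are illustrative and assume MXNet 1.3):
+
+```python
+import mxnet as mx
+from mxnet import gluon
+
+# gluon.rnn.LSTM is now a HybridBlock, so it can be hybridized and exported
+lstm = gluon.rnn.LSTM(hidden_size=20)
+lstm.initialize()
+lstm.hybridize()  # compile the layer into a static graph
+out = lstm(mx.nd.random.uniform(shape=(5, 3, 10)))  # (seq_len, batch, input_size)
+print(out.shape)  # (5, 3, 20)
+```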
+
+### MKL-DNN improvements
+- Introduced more MKL-DNN functionality:
+  - Added support for more activation functions: "sigmoid", "tanh", and "softrelu". ([#10336](https://github.com/apache/incubator-mxnet/pull/10336))
+  - Added debugging functionality: result check ([#12069](https://github.com/apache/incubator-mxnet/pull/12069)) and backend switch ([#12058](https://github.com/apache/incubator-mxnet/pull/12058)).
+
+### New Features - Gluon Model Zoo Pre-trained Models
+- Gluon Vision Model Zoo now provides MobileNetV2 pre-trained models (#10879) in addition to
+  AlexNet, DenseNet, Inception V3, MobileNetV1, ResNet V1 and V2, SqueezeNet 1.0 and 1.1, and VGG
+  pretrained models.
+- Updated pre-trained models provide state-of-the-art performance on all ResNet V1, ResNet V2, and VGG models (vgg16, vgg19, vgg16_bn, vgg19_bn) (#11327, #11860, #11830).
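+
+A quick sketch of loading one of the pre-trained models (weights are downloaded on first use; assumes the `mobilenet_v2_1_0` factory function in the Gluon model zoo):
+
+```python
+from mxnet.gluon.model_zoo import vision
+
+# MobileNetV2 with width multiplier 1.0, pre-trained on ImageNet
+net = vision.mobilenet_v2_1_0(pretrained=True)
+```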
+
+### New Features - Clojure package (experimental)
+- MXNet now supports the Clojure programming language. The MXNet Clojure package brings flexible and efficient GPU computing and state-of-the-art deep learning to Clojure. It enables you to write seamless tensor/matrix computations with multiple GPUs in Clojure, and lets you construct and customize state-of-the-art deep learning models and apply them to tasks such as image classification and data science challenges. ([#11205](https://github.com/apache/incubator-mxnet/pull/11205))
+- Check out the examples and API documentation [here](http://mxnet.incubator.apache.org/api/clojure/index.html).
+
+### New Features - Synchronized Cross-GPU Batch Norm (experimental)
+- Gluon now supports Synchronized Batch Normalization (#11502).
+- This enables stable training on large-scale networks with high memory consumption such as FCN for image segmentation.
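+
+An illustrative sketch (assuming the `SyncBatchNorm` block in `gluon.contrib.nn` and a hypothetical two-device setup):
+
+```python
+import mxnet as mx
+from mxnet.gluon.contrib.nn import SyncBatchNorm
+
+# Batch statistics are synchronized across the given number of devices
+net = mx.gluon.nn.HybridSequential()
+net.add(mx.gluon.nn.Conv2D(16, kernel_size=3))
+net.add(SyncBatchNorm(num_devices=2))  # hypothetical device count
+```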
+
+### New Features - Sparse Tensor Support for Gluon (experimental)
+- Sparse gradient support is added to `gluon.nn.Embedding`. Set `sparse_grad=True` to enable when constructing the Embedding block. ([#10924](https://github.com/apache/incubator-mxnet/pull/10924))
+- Gluon Parameter now supports "row_sparse" storage type, which reduces communication cost and memory consumption for multi-GPU training of large models. `gluon.contrib.nn.SparseEmbedding` is an example empowered by this. ([#11001](https://github.com/apache/incubator-mxnet/pull/11001), [#11429](https://github.com/apache/incubator-mxnet/pull/11429))
+- Gluon HybridBlock now supports hybridization with sparse operators ([#11306](https://github.com/apache/incubator-mxnet/pull/11306)).
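+
+A minimal sketch of the `sparse_grad` option (names and shapes are illustrative):
+
+```python
+import mxnet as mx
+from mxnet import gluon
+
+# sparse_grad=True makes the embedding emit row_sparse gradients,
+# so only the rows that were actually looked up are updated
+embed = gluon.nn.Embedding(input_dim=1000, output_dim=16, sparse_grad=True)
+embed.initialize()
+vecs = embed(mx.nd.array([1, 42, 7]))
+print(vecs.shape)  # (3, 16)
+```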
+
+### New Features - Control flow operators (experimental)
+- This is the first step towards optimizing dynamic neural networks with variable computation graphs, by adding symbolic and imperative control flow operators. [Proposal](https://cwiki.apache.org/confluence/display/MXNET/Optimize+dynamic+neural+network+models+with+control+flow+operators).
+- New operators introduced: foreach([#11531](https://github.com/apache/incubator-mxnet/pull/11531)), while_loop([#11566](https://github.com/apache/incubator-mxnet/pull/11566)), cond([#11760](https://github.com/apache/incubator-mxnet/pull/11760)).
+
+### New Features - Scala API Improvements (experimental)
+- Improved MXNet Scala API usability ([#10660](https://github.com/apache/incubator-mxnet/pull/10660), [#10787](https://github.com/apache/incubator-mxnet/pull/10787), [#10991](https://github.com/apache/incubator-mxnet/pull/10991)).
+- The new `Symbol.api` and `NDArray.api` namespaces bring a new set of functions with complete definitions for all arguments.
+- Please see this [Type safe API design document](https://cwiki.apache.org/confluence/display/MXNET/Scala+Type-safe+API+Design+Doc) for more details.
+
+### New Features - Rounding GPU Memory Pool for dynamic networks with variable-length inputs and outputs (experimental)
+- MXNet now supports a new memory pool type for GPU memory (#11041).
+- Unlike the default memory pool, which requires an exact size match to reuse released memory chunks, this new pool uses exponential-linear rounding so that similarly sized memory chunks can all be reused. This is better suited to workloads with dynamic-shape inputs and outputs. Set the environment variable `MXNET_GPU_MEM_POOL_TYPE=Round` to enable it.
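+
+For example (a sketch; setting the variable before importing MXNet is one way to ensure it is seen before the first GPU allocation):
+
+```python
+import os
+os.environ['MXNET_GPU_MEM_POOL_TYPE'] = 'Round'
+
+import mxnet as mx  # subsequent GPU allocations use the rounding pool
+```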
+
+### New Features - Topology-aware AllReduce (experimental)
+- This feature uses trees to perform Reduce and Broadcast operations, applying the idea of minimum spanning trees to build a binary-tree Reduce communication pattern. This topology-aware approach reduces the single-machine communication limitations of methods such as parameter server and NCCL ring reduction. It is an experimental feature ([#11591](https://github.com/apache/incubator-mxnet/pull/11591)).
+- The implementation follows the paper [Optimal message scheduling for aggregation](https://www.sysml.cc/doc/178.pdf).
+- Set the environment variable `MXNET_KVSTORE_USETREE=1` to enable it.
+
+### New Features - Export MXNet models to ONNX format (experimental)
+- MXNet models can now be exported to ONNX format ([#11213](https://github.com/apache/incubator-mxnet/pull/11213)). Currently, MXNet supports ONNX v1.2.1. [API documentation](http://mxnet.incubator.apache.org/api/python/contrib/onnx.html).
+- Check out this [tutorial](http://mxnet.incubator.apache.org/tutorials/onnx/export_mxnet_to_onnx.html), which shows how to use the MXNet-to-ONNX exporter APIs. Exported models are saved as ONNX protobuf, so they can be imported into other frameworks for inference.
+
+### New Features - TensorRT Runtime Integration (experimental)
+- [TensorRT](https://developer.nvidia.com/tensorrt) provides significant acceleration of model inference on NVIDIA GPUs compared to running the full graph in MXNet using unfused GPU operators. In addition to faster fp32 inference, TensorRT optimizes fp16 inference and is capable of int8 inference (provided the quantization steps are performed). Besides increasing throughput, TensorRT significantly reduces inference latency, especially for small batches.
+- This release introduces runtime integration of TensorRT into MXNet to accelerate inference. ([#11325](https://github.com/apache/incubator-mxnet/pull/11325))
+- Currently, it lives in the contrib package.
+
+### New Examples - Scala
+- Refurbished Scala examples with improved APIs, documentation, and CI test coverage. ([#11753](https://github.com/apache/incubator-mxnet/pull/11753), [#11621](https://github.com/apache/incubator-mxnet/pull/11621))
+- All Scala examples now have:
+  - No blocking bugs
+  - A good README to start with
+  - Type-safe API usage inside
+  - CI monitoring on each PR run
+
+### Maintenance - Flaky Tests improvement effort
+- Fixed 130 flaky tests on CI. Tracked progress of the project [here](https://github.com/apache/incubator-mxnet/projects/9).
+- Added a flakiness checker (#11572)
+
+### Maintenance - MXNet Model Backwards Compatibility Checker
+- This tool ([#11626](https://github.com/apache/incubator-mxnet/pull/11626)) helps in ensuring consistency and sanity while performing inference on the latest version of MXNet using models trained on older versions of MXNet.
+- It helps detect changes that break backwards compatibility earlier in the development cycle, contributing to healthy and stable MXNet releases.
+
+### Maintenance - Integrated testing for "the Straight Dope"
+- ["Deep Learning - The Straight Dope"](http://gluon.mxnet.io) is a deep learning book based on Apache MXNet Gluon, with contributions from many Gluon users.
+- Testing of this book is now integrated into the nightly tests.
+
+### Bug-fixes
+- Fix gperftools/jemalloc and lapack warning bug. (#11110)
+- Fix mkldnn performance regression + improve test logging (#11262)
+- Fix row_sparse_param.save() (#11266)
+- Fix trainer init_kvstore (#11266)
+- Fix axis Bug in MKLDNN Softmax (#11335)
+- Fix 'AttributeError: '_thread._local' object has no attribute 'value'' on distributed processing applications (#11332)
+- Fix recordfile dataset with multi worker (#11370)
+- Manually check node existence in CachedOp (#11545)
+- Javadoc fix (#11239)
+- Fix bugs in MKLDNN operators to handle the kAddTo request (#11129)
+- Fix InferStorage for sparse fallback in FullyConnected (#11498)
+- Fix batchnorm problem with sparse matrices when fix_gamma=True (#11656)
+- Fix rnn layer save (#11776)
+- Fix BucketSentenceIter bug related to #11430 (#11580)
+- Fix for _backward_softsign activation (#11827)
+- Fix a bug in CachedOp. (#11675)
+- Fix quantization divide by zero errors (#11833)
+- Refactor R optimizers to fix memory leak (#11374)
+- Avoid use of troublesome cudnnFind() results when grad_req='add' (#11338)
+- Fix shared memory with gluon dataloader, add option pin_memory (#11908)
+- Fix quantized graph pass bug (#11937)
+- Fix MXPredReshape in the c_predict_api (#11493)
+- Fix the topk regression issue (#12197)
+- Fix image-classification example and add missing optimizers w/ momentum support (#11826)
+
+### Performance Improvements
+- Added static allocation and static shape support for Gluon `HybridBlock` (#11320)
+- Fix RecordIO augmentation speed (#11474)
+- Improve sparse pull performance for gluon trainer (#11429)
+- CTC operator performance improvement from HawkAaron/MXNet-CTC (#11834)
+- Improve performance of broadcast ops backward pass (#11252)
+- Improved numerical stability as a result of using stable L2 norm (#11573)
+- Accelerated topk performance on both GPU and CPU (#12085, #10997; this changes the behavior of topk when NaN values occur in the input)
+- Support for dot(dns, csr) = dns and dot(dns, csr.T) = dns on CPU ([#11113](https://github.com/apache/incubator-mxnet/pull/11113))
+- Performance improvement for Batch Dot on CPU from mshadow ([mshadow PR#342](https://github.com/dmlc/mshadow/pull/342))
+
+### API Changes
+- Allow Scala users to specify data/label names for NDArrayIter (#11256)
+- Allow user to define unknown token symbol to rnn encode_sentences() (#10461)
+- Added count_include_pad argument for Avg Pooling (#11021)
+- Add standard ResNet data augmentation for ImageRecordIter (#11027)
+- Add seed_aug parameter for ImageRecordIter to fix random seed for default augmentation (#11247)
+- Add support for accepting MXNet NDArrays in ColorNormalizeAug (#11606)
+- Enhancement of take operator (#11326)
+- Add temperature parameter in Softmax operator (#11466)
+- Add support for 1D inputs in leaky relu (#11850)
+- Add verify_ssl option to gluon.utils.download (#11546)
+
+### Other features
+- Added ccache reporting to CI (#11322)
+- Restructure dockcross dockerfiles to fix caching (#11302)
+- Added tests for MKLDNN backward operators  (#11232)
+- Add elemwise_add/sub between rsp and rsp on GPU (#11179)
+- Add clip_global_norm(row_sparse_grad) (#11266)
+- Add subgraph storage type inference to CachedOp  (#11306)
+- Enable support for dense weight and sparse grad Adagrad updates (#11355)
+- Added Histogram Operator (#10931)
+- Added Matthews correlation coefficient to metrics (#10524)
+- Added support for add_n(dense, csr, dense) = dense on CPU & GPU (#11330)
+- Added support for add_n(any combination longer than 4 with at least one dense storage) = dense on CPU & GPU (#11330)
+- L1 Normalization (#11229)
+- Add support for int64 data type in CSVIter (#11446)
+- Add test for new int64 type in CSVIter (#11499)
+- Add sample ratio for ROI Align (#11145)
+- Shape and Size Operator (#10889)
+- Add HybridSequentialRNNCell, which can be nested in HybridBlock (#11003)
+- Support for many unary functions on CSR matrices (#11559)
+- Added NDArrayCollector to dispose intermediate allocated NDArrays automatically (#11751)
+- Added the diag() operator (#11643)
+- Added broadcast_like operator (#11820)
+- Allow Partial shape infer for Slice (#11406)
+- Added support to profile kvstore server during distributed training  (#11215)
+- Add function for GPU Memory Query to C API (#12083)
+- Generalized reshape_like operator to be more flexible (#11928)
+- Add support for selu activation function (#12059)
+- Add support for accepting NDArray as input to Module predict API (#12166)
+- Add DataDesc type for the Scala Package (#11844)
+
+### Usability Improvements
+- Added NDArray auto-collector for Scala (#11751, #12232)
+- Added docs for mx.initializer.Constant (#10637)
+- Added build-from-source instructions for Windows (#11276)
+- Added a tutorial explaining how to use the profiler (#11274)
+- Added two tutorials on Learning Rate Schedules (#11296)
+- Added a tutorial for mixed precision training with float16 (#10391)
+- Create CPP test for concat MKLDNN operator (#11371)
+- Update large word language model example (#11405)
+- MNIST Examples for Scala new API (#11250)
+- Updated installation info to have latest packages and more clarity (#11503)
+- GAN MNIST Examples for Scala new API (#11547)
+- Added Learning Rate Finder tutorial (#11304)
+- Fix Installation instructions for R bindings on Linux systems. (#11590)
+- Integration Test for Scala (#11596)
+- Documentation enhancement for optimizers (#11657)
+- Update rcnn example (#11373)
+- Gluon ModelZoo, Gluon examples for Perl APIs (#11642)
+- Fix R installation in CI (#11761,#11755, #11768, #11805, #11954, #11976)
+- CNN Examples for Scala new API (#11292)
+- Custom Operator Example for Scala (#11401)
+- Added detailed doc about global pool layers in Gluon (#11832)
+- Updated MultiTask example to use new infer api (#11605)
+- Added logistic regression tutorial (#11651)
+- Added Support for integer type in ImageIter (#11864)
+- Added depth_to_space and space_to_depth operators (#11587)
+- Increased operator support for ONNX to MXNet importer (#11856)
+- Add linux and macos MKLDNN Building Instruction (#11049)
+- Add download utility for Scala APIs (#11866)
+- Improving documentation and error messages for Async distributed training with Gluon (#11910)
+- Added NeuralStyle Example for Scala (#11621)
+
+For more information and examples, see [full release notes](https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.3.0+Release+Notes)
+
 ## 1.2.0
 ### New Features - Added Scala Inference APIs
 - Implemented new [Scala Inference APIs](https://cwiki.apache.org/confluence/display/MXNET/MXNetScalaInferenceAPI) which offer an easy-to-use, Scala Idiomatic and thread-safe high level APIs for performing predictions with deep learning models trained with MXNet (#9678). Implemented a new ImageClassifier class which provides APIs for classification tasks on a Java BufferedImage using a pre-trained model you provide (#10054). Implemented a new ObjectDetector class which provides APIs for object and boundary detections on a Java BufferedImage using a pre-trained model you provide (#10229).
@@ -20,7 +211,7 @@ MXNet Change Log
 - Added support for distributed mixed precision training with FP16. It supports storing of master copy of weights in float32 with the multi_precision mode of optimizers (#10183). Improved speed of float16 operations on x86 CPU by 8 times through F16C instruction set. Added support for more operators to work with FP16 inputs (#10125, #10078, #10169). Added a tutorial on using mixed precision with FP16 (#10391).
 
 ### New Features - Added Profiling Enhancements
-- Enhanced built-in profiler to support native Intel:registered: VTune:tm: Amplifier objects such as Task, Frame, Event, Counter and Marker from both C++ and Python -- which is also visible in the Chrome tracing view(#8972). Added Runtime tracking of symbolic and imperative operators as well as memory and API calls. Added Tracking and dumping of aggregate profiling data. Profiler also no longer affects runtime performance when not in use. 
+- Enhanced built-in profiler to support native Intel:registered: VTune:tm: Amplifier objects such as Task, Frame, Event, Counter and Marker from both C++ and Python -- which is also visible in the Chrome tracing view(#8972). Added Runtime tracking of symbolic and imperative operators as well as memory and API calls. Added Tracking and dumping of aggregate profiling data. Profiler also no longer affects runtime performance when not in use.
 
 ### Breaking Changes
 - Changed Namespace for MXNet scala from `ml.dmlc.mxnet` to `org.apache.mxnet` (#10284).
@@ -51,7 +242,7 @@ MXNet Change Log
 - Fixed a bug that was causing training metrics to be printed as NaN sometimes (#10437).
 - Fixed a crash with non positive reps for tile ops (#10417).
 
-### Performance Improvements 
+### Performance Improvements
 - On average, after the MKL-DNN change, the inference speed of MXNet + MKLDNN outperforms MXNet + OpenBLAS by a factor of 32, outperforms MXNet + MKLML by 82% and outperforms MXNet + MKLML with the experimental flag by 8%. The experiments were run for the image classification example, for different networks and different batch sizes.
 - Improved sparse SGD, sparse AdaGrad and sparse Adam optimizer speed on GPU by 30x (#9561, #10312, #10293, #10062).
 - Improved `sparse.retain` performance on CPU by 2.5x (#9722)
@@ -156,7 +347,7 @@ For more information and examples, see [full release notes](https://cwiki.apache
 - Added `axis` argument to `SequenceLast`, `SequenceMask` and `SequenceReverse` operators (#9306)
 - Added `lazy_update` option for standard `SGD` & `Adam` optimizer with `row_sparse` gradients (#9468, #9189)
 - Added `select` option in `Block.collect_params` to support regex (#9348)
-- Added support for (one-to-one and sequence-to-one) inference on explicit unrolled RNN models in R (#9022) 
+- Added support for (one-to-one and sequence-to-one) inference on explicit unrolled RNN models in R (#9022)
 ### Deprecations
 - The Scala API name space is still called `ml.dmlc`. The name space is likely to be changed in a future release to `org.apache` and might break existing applications and scripts (#9579, #9324)
 ### Performance Improvements
@@ -202,10 +393,10 @@ For more information and examples, see [full release notes](https://cwiki.apache
   - MXNet now compiles and runs on NVIDIA Jetson TX2 boards with GPU acceleration.
   - You can install the python MXNet package on a Jetson board by running - `$ pip install mxnet-jetson-tx2`.
 ### New Features - Sparse Tensor Support [General Availability]
-  - Added more sparse operators: `contrib.SparseEmbedding`, `sparse.sum` and `sparse.mean`. 
+  - Added more sparse operators: `contrib.SparseEmbedding`, `sparse.sum` and `sparse.mean`.
   - Added `asscipy()` for easier conversion to scipy.
   - Added `check_format()` for sparse ndarrays to check if the array format is valid.
-### Bug-fixes  
+### Bug-fixes
   - Fixed a[-1] indexing doesn't work on `NDArray`.
   - Fixed `expand_dims` if axis < 0.
   - Fixed a bug that causes topk to produce incorrect result on large arrays.
@@ -217,9 +408,9 @@ For more information and examples, see [full release notes](https://cwiki.apache
 ### Doc Updates
   - Added a security best practices document under FAQ section.
   - Fixed License Headers including restoring copyright attributions.
-  - Documentation updates. 
+  - Documentation updates.
   - Links for viewing source.
- 
+
  For more information and examples, see [full release notes](https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.0+Release+Notes)
 
 
@@ -227,15 +418,15 @@ For more information and examples, see [full release notes](https://cwiki.apache
 ### Bug-fixes
   - Added GPU support for the `syevd` operator which ensures that there is GPU support for all linalg-operators.
   - Bugfix for `syevd` on CPU such that it works for `float32`.
-  - Fixed API call when `OMP_NUM_THREADS` environment variable is set. 
+  - Fixed API call when `OMP_NUM_THREADS` environment variable is set.
   - Fixed `MakeNonlossGradNode` bug.
-  - Fixed bug related to passing `dtype` to `array()`. 
+  - Fixed bug related to passing `dtype` to `array()`.
   - Fixed some minor bugs for sparse distributed training.
-  - Fixed a bug on `Slice` accessing uninitialized memory in `param.begin` in the file `matrix_op-inl.h`. 
+  - Fixed a bug on `Slice` accessing uninitialized memory in `param.begin` in the file `matrix_op-inl.h`.
   - Fixed `gluon.data.RecordFileDataset`.
   - Fixed a bug that caused `autograd` to crash on some networks.
-  
-  
+
+
 ## 0.12.0
 ### Performance
   - Added full support for NVIDIA Volta GPU Architecture and CUDA 9. Training CNNs is up to 3.5x faster than Pascal when using float16 precision.
@@ -243,7 +434,7 @@ For more information and examples, see [full release notes](https://cwiki.apache
   - Improved ImageRecordIO image loading performance and added indexed RecordIO support.
   - Added better openmp thread management to improve CPU performance.
 ### New Features - Gluon
-  - Added enhancements to the Gluon package, a high-level interface designed to be easy to use while keeping most of the flexibility of low level API. Gluon supports both imperative and symbolic programming, making it easy to train complex models imperatively with minimal impact on performance. Neural networks (and other machine learning models) can be defined and trained with `gluon.nn` and `gluon.rnn` packages. 
+  - Added enhancements to the Gluon package, a high-level interface designed to be easy to use while keeping most of the flexibility of low level API. Gluon supports both imperative and symbolic programming, making it easy to train complex models imperatively with minimal impact on performance. Neural networks (and other machine learning models) can be defined and trained with `gluon.nn` and `gluon.rnn` packages.
   - Added new loss functions - `SigmoidBinaryCrossEntropyLoss`, `CTCLoss`, `HuberLoss`, `HingeLoss`, `SquaredHingeLoss`, `LogisticLoss`, `TripletLoss`.
   - `gluon.Trainer` now allows reading and setting learning rate with `trainer.learning_rate` property.
   - Added API `HybridBlock.export` for exporting gluon models to MXNet format.
@@ -256,7 +447,7 @@ For more information and examples, see [full release notes](https://cwiki.apache
   - Added `mx.autograd.grad` and experimental second order gradient support (most operators don't support second order gradient yet).
   - Autograd now supports cross-device graphs. Use `x.copyto(mx.gpu(i))` and `x.copyto(mx.cpu())` to do computation on multiple devices.
 ### New Features - Sparse Tensor Support
-  - Added support for sparse matrices. 
+  - Added support for sparse matrices.
   - Added limited cpu support for two sparse formats in `Symbol` and `NDArray` - `CSRNDArray` and `RowSparseNDArray`.
   - Added a sparse dot product operator and many element-wise sparse operators.
   - Added a data iterator for sparse data input - `LibSVMIter`.
@@ -266,7 +457,7 @@ For more information and examples, see [full release notes](https://cwiki.apache
   - Added limited support for fancy indexing, which allows you to very quickly access and modify complicated subsets of an array's values. `x[idx_arr0, idx_arr1, ..., idx_arrn]` is now supported. Features such as combining and slicing are planned for the next release. Checkout master to get a preview.
   - Random number generators in `mx.nd.random.*` and `mx.sym.random.*` now support both CPU and GPU.
   - `NDArray` and `Symbol` now supports "fluent" methods. You can now use `x.exp()` etc instead of `mx.nd.exp(x)` or `mx.sym.exp(x)`.
-  - Added `mx.rtc.CudaModule` for writing and running CUDA kernels from python. 
+  - Added `mx.rtc.CudaModule` for writing and running CUDA kernels from python.
   - Added `multi_precision` option to optimizer for easier float16 training.
   - Better support for IDE auto-completion. IDEs like PyCharm can now correctly parse mxnet operators.
 ### API Changes
@@ -314,14 +505,14 @@ For more information and examples, see [full release notes](https://cwiki.apache
 
 
 ## 0.10.0
-- Overhauled documentation for commonly used Python APIs, Installation instructions, Tutorials, HowTos and MXNet Architecture.  
-- Updated mxnet.io for improved readability.  
-- Pad operator now support reflection padding.  
-- Fixed a memory corruption error in threadedengine.  
-- Added CTC loss layer to contrib package. See mx.contrib.sym.ctc_loss.  
-- Added new sampling operators for several distributions (normal,uniform,gamma,exponential,negative binomial).  
+- Overhauled documentation for commonly used Python APIs, Installation instructions, Tutorials, HowTos and MXNet Architecture.
+- Updated mxnet.io for improved readability.
+- Pad operator now support reflection padding.
+- Fixed a memory corruption error in threadedengine.
+- Added CTC loss layer to contrib package. See mx.contrib.sym.ctc_loss.
+- Added new sampling operators for several distributions (normal,uniform,gamma,exponential,negative binomial).
 - Added documentation for experimental RNN APIs.
- 
+
 ## 0.9.3
 - Move symbolic API to NNVM @tqchen
   - Most front-end C API are backward  compatible
diff --git a/README.md b/README.md
index 3d570ee1f77..23b9d329d1d 100644
--- a/README.md
+++ b/README.md
@@ -33,6 +33,7 @@ How to Contribute
 
 What's New
 ----------
+* [Version 1.3.0 Release](https://github.com/apache/incubator-mxnet/releases/tag/1.3.0) - MXNet 1.3.0 Release.
 * [Version 1.2.0 Release](https://github.com/apache/incubator-mxnet/releases/tag/1.2.0) - MXNet 1.2.0 Release.
 * [Version 1.1.0 Release](https://github.com/apache/incubator-mxnet/releases/tag/1.1.0) - MXNet 1.1.0 Release.
 * [Version 1.0.0 Release](https://github.com/apache/incubator-mxnet/releases/tag/1.0.0) - MXNet 1.0.0 Release.
diff --git a/ci/docker/Dockerfile.build.armv7 b/ci/docker/Dockerfile.build.armv7
index 6316270f9cf..9a23a5dbefe 100755
--- a/ci/docker/Dockerfile.build.armv7
+++ b/ci/docker/Dockerfile.build.armv7
@@ -18,7 +18,7 @@
 #
 # Dockerfile to build MXNet for ARMv7
 
-FROM dockcross/linux-armv7
+FROM mxnetci/dockcross-linux-armv7:09182018
 
 ENV ARCH armv7l
 ENV HOSTCC gcc
diff --git a/ci/docker/install/ubuntu_clang.sh b/ci/docker/install/ubuntu_clang.sh
index 39a5600ce9d..40761716933 100755
--- a/ci/docker/install/ubuntu_clang.sh
+++ b/ci/docker/install/ubuntu_clang.sh
@@ -21,11 +21,11 @@
 # the whole docker cache for the image
 
 set -ex
-# Install clang 3.9 (the same version as in XCode 8.*) and 5.0 (latest major release)
+# Install clang 3.9 (the same version as in XCode 8.*) and 6.0 (latest major release)
 wget -O - http://apt.llvm.org/llvm-snapshot.gpg.key | apt-key add - && \
     apt-add-repository "deb http://apt.llvm.org/xenial/ llvm-toolchain-xenial-3.9 main" && \
-    apt-add-repository "deb http://apt.llvm.org/xenial/ llvm-toolchain-xenial-5.0 main" && \
+    apt-add-repository "deb http://apt.llvm.org/xenial/ llvm-toolchain-xenial-6.0 main" && \
     apt-get update && \
-    apt-get install -y clang-3.9 clang-5.0 && \
+    apt-get install -y clang-3.9 clang-6.0 && \
     clang-3.9 --version && \
-    clang-5.0 --version
+    clang-6.0 --version
diff --git a/ci/docker/install/ubuntu_tvm.sh b/ci/docker/install/ubuntu_tvm.sh
index 4f5cb4251ad..d7e093ac7c8 100755
--- a/ci/docker/install/ubuntu_tvm.sh
+++ b/ci/docker/install/ubuntu_tvm.sh
@@ -25,7 +25,7 @@ cd tvm
 # This is a stable tag that supports the MXNet TVM bridge.
 # We use this since support for the MXNet bridge was just checked
 # into master and there is not yet a version tag
-git checkout 30eaf463e34d7c301357c31a010945d11df16537
+git checkout 1c97eaf622095700d045ffef2320fb21911b485e
 
 cp make/config.mk .
 echo USE_CUDA=1 >> config.mk
diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
index 35311396e34..1e38ec48e6c 100755
--- a/ci/docker/runtime_functions.sh
+++ b/ci/docker/runtime_functions.sh
@@ -349,11 +349,11 @@ build_ubuntu_cpu_clang39() {
         -j$(nproc)
 }
 
-build_ubuntu_cpu_clang50() {
+build_ubuntu_cpu_clang60() {
     set -ex
 
-    export CXX=clang++-5.0
-    export CC=clang-5.0
+    export CXX=clang++-6.0
+    export CC=clang-6.0
 
     build_ccache_wrappers
 
@@ -381,11 +381,11 @@ build_ubuntu_cpu_clang39_mkldnn() {
         -j$(nproc)
 }
 
-build_ubuntu_cpu_clang50_mkldnn() {
+build_ubuntu_cpu_clang60_mkldnn() {
     set -ex
 
-    export CXX=clang++-5.0
-    export CC=clang-5.0
+    export CXX=clang++-6.0
+    export CC=clang-6.0
 
     build_ccache_wrappers
 
diff --git a/docs/api/python/contrib/onnx.md b/docs/api/python/contrib/onnx.md
index d7c34ec1e01..f8210ad6a00 100644
--- a/docs/api/python/contrib/onnx.md
+++ b/docs/api/python/contrib/onnx.md
@@ -22,10 +22,9 @@ This document describes all the ONNX-MXNet APIs.
 .. autosummary::
     :nosignatures:
 
-    mxnet.contrib.onnx.import_model
-    mxnet.contrib.onnx.get_model_metadata
-    mxnet.contrib.onnx.import_to_gluon
-    mxnet.contrib.onnx.export_model
+    mxnet.contrib.onnx.onnx2mx.import_model
+    mxnet.contrib.onnx.onnx2mx.import_to_gluon
+    mxnet.contrib.onnx.mx2onnx.export_model
 ```
 
 ## ONNX Tutorials
@@ -33,8 +32,9 @@ This document describes all the ONNX-MXNet APIs.
 ```eval_rst
 .. toctree::
    :maxdepth: 1
-   
+
    /tutorials/onnx/super_resolution.md
+   /tutorials/onnx/export_mxnet_to_onnx.md
    /tutorials/onnx/inference_on_onnx_model.md
    /tutorials/onnx/fine_tuning_gluon.md
 ```
@@ -42,7 +42,7 @@ This document describes all the ONNX-MXNet APIs.
 ## ONNX Examples
 
 * Face Recognition with [ArcFace](https://github.com/onnx/models/tree/master/models/face_recognition/ArcFace)
-* Image Classification with [MobileNet](https://github.com/onnx/models/tree/master/models/image_classification/mobilenet), [ResNet](https://github.com/onnx/models/tree/master/models/image_classification/resnet), [SqueezeNet](https://github.com/onnx/models/tree/master/models/image_classification/squeezenet), [VGG](https://github.com/onnx/models/tree/master/models/image_classification/vgg) 
+* Image Classification with [MobileNet](https://github.com/onnx/models/tree/master/models/image_classification/mobilenet), [ResNet](https://github.com/onnx/models/tree/master/models/image_classification/resnet), [SqueezeNet](https://github.com/onnx/models/tree/master/models/image_classification/squeezenet), [VGG](https://github.com/onnx/models/tree/master/models/image_classification/vgg)
 
 ## API Reference
 
@@ -50,11 +50,12 @@ This document describes all the ONNX-MXNet APIs.
 
 ```eval_rst
 
-.. automodule:: mxnet.contrib.onnx.import_model
-.. automodule:: mxnet.contrib.onnx.get_model_metadata
-.. automodule:: mxnet.contrib.onnx.import_to_gluon
-.. automodule:: mxnet.contrib.onnx.export_model
-
+.. automodule:: mxnet.contrib.onnx.onnx2mx.import_model
+    :members: import_model, get_model_metadata
+.. automodule:: mxnet.contrib.onnx.onnx2mx.import_to_gluon
+    :members: import_to_gluon
+.. automodule:: mxnet.contrib.onnx.mx2onnx.export_model
+    :members: export_model
 ```
 
 <script>auto_index("api-reference");</script>
diff --git a/docs/api/python/ndarray/sparse.md b/docs/api/python/ndarray/sparse.md
index 85d33b193a6..2ade059a70c 100644
--- a/docs/api/python/ndarray/sparse.md
+++ b/docs/api/python/ndarray/sparse.md
@@ -16,7 +16,7 @@ This document lists the routines of the *n*-dimensional sparse array package:
 ```
 
 The `CSRNDArray` and `RowSparseNDArray` API, defined in the `ndarray.sparse` package, provides
-imperative sparse tensor operations on **CPU**.
+imperative sparse tensor operations.
 
 A `CSRNDArray` inherits from `NDArray`, and represents a two-dimensional, fixed-size array in compressed sparse row format.
 
@@ -63,16 +63,13 @@ A detailed tutorial is available at
 
 ```eval_rst
 
-.. note:: ``mxnet.ndarray.sparse.RowSparseNDArray`` and ``mxnet.ndarray.sparse.CSRNDArray`` DO NOT support the ``mxnet.gluon`` high-level interface yet.
-
 .. note:: ``mxnet.ndarray.sparse`` is similar to ``mxnet.ndarray`` in some aspects. But the differences are not negligible. For instance:
 
-   - Only a subset of operators in ``mxnet.ndarray`` have specialized implementations in ``mxnet.ndarray.sparse``.
-     Operators such as Convolution and broadcasting do not have sparse implementations yet.
+   - Only a subset of operators in ``mxnet.ndarray`` have efficient sparse implementations in ``mxnet.ndarray.sparse``.
+   - If an operator does not occur in the ``mxnet.ndarray.sparse`` namespace, it does not have an efficient sparse implementation yet. If sparse inputs are passed to such an operator, it will convert the inputs to the dense format and fall back to the already available dense implementation.
    - The storage types (``stype``) of sparse operators' outputs depend on the storage types of inputs.
      By default the operators not available in ``mxnet.ndarray.sparse`` infer "default" (dense) storage type for outputs.
      Please refer to the [API Reference](#api-reference) section for further details on specific operators.
-   - GPU support for ``mxnet.ndarray.sparse`` is experimental. Only a few sparse operators are supported on GPU such as ``sparse.dot``.
 
 .. note:: ``mxnet.ndarray.sparse.CSRNDArray`` is similar to ``scipy.sparse.csr_matrix`` in some aspects. But they differ in a few aspects:
 
@@ -559,7 +556,6 @@ We summarize the interface for each class in the following sections.
     sgd_update
     sgd_mom_update
     adam_update
-    ftrl_update
     adagrad_update
 ```
 
diff --git a/docs/api/python/symbol/sparse.md b/docs/api/python/symbol/sparse.md
index d26ba07853d..cd8272cedd7 100644
--- a/docs/api/python/symbol/sparse.md
+++ b/docs/api/python/symbol/sparse.md
@@ -16,7 +16,7 @@ This document lists the routines of the sparse symbolic expression package:
 ```
 
 The `Sparse Symbol` API, defined in the `symbol.sparse` package, provides
-sparse neural network graphs and auto-differentiation on CPU.
+sparse neural network graphs and auto-differentiation.
 
 The storage type of a variable is specified by the `stype` attribute of the variable.
 The storage type of a symbolic expression is inferred based on the storage types of the variables and the operators.
@@ -43,12 +43,11 @@ array([ 1.,  1.],
 .. note:: most operators provided in ``mxnet.symbol.sparse`` are similar to those in
    ``mxnet.symbol`` although there are few differences:
 
-   - Only a subset of operators in ``mxnet.symbol`` have specialized implementations in ``mxnet.symbol.sparse``.
-     Operators such as reduction and broadcasting do not have sparse implementations yet.
+   - Only a subset of operators in ``mxnet.symbol`` have efficient sparse implementations in ``mxnet.symbol.sparse``.
+   - If an operator does not occur in the ``mxnet.symbol.sparse`` namespace, it does not have an efficient sparse implementation yet. If sparse inputs are passed to such an operator, it will convert the inputs to the dense format and fall back to the already available dense implementation.
    - The storage types (``stype``) of sparse operators' outputs depend on the storage types of inputs.
      By default the operators not available in ``mxnet.symbol.sparse`` infer "default" (dense) storage type for outputs.
      Please refer to the API reference section for further details on specific operators.
-   - GPU support for ``mxnet.symbol.sparse`` is experimental.
 
 ```
 
diff --git a/docs/tutorials/control_flow/ControlFlowTutorial.md b/docs/tutorials/control_flow/ControlFlowTutorial.md
new file mode 100644
index 00000000000..9e4c66f8521
--- /dev/null
+++ b/docs/tutorials/control_flow/ControlFlowTutorial.md
@@ -0,0 +1,388 @@
+# Hybridize Gluon models with control flow
+
+MXNet currently provides three control flow operators: `cond`, `foreach` and `while_loop`. Like other MXNet operators, they all have a version for NDArray and a version for Symbol. These two versions have exactly the same semantics. We can take advantage of this and use them in Gluon to hybridize models.
+
+In this tutorial, we use a few examples to demonstrate the use of control flow operators in Gluon and show how a model that requires control flow is hybridized.
+
+## Preparing to run the code
+
+
+```python
+import mxnet as mx
+from mxnet.gluon import HybridBlock
+```
+
+## foreach
+`foreach` is a for loop that iterates over the first dimension of the input data (it can be an array or a list of arrays). It is defined with the following signature:
+
+```python
+foreach(body, data, init_states, name) => (outputs, states)
+```
+
+It runs the Python function defined in `body` for every slice from the input arrays. The signature of the `body` function is defined as follows:
+
+```python
+body(data, states) => (outputs, states)
+```
+
+The inputs of the `body` function have two parts: `data` is a slice of an array (if there is only one input array in `foreach`) or a list of slices (if there are a list of input arrays); `states` are the arrays from the previous iteration. The outputs of the `body` function also have two parts: `outputs` is an array or a list of arrays; `states` is the computation states of the current iteration. `outputs` from all iterations are concatenated as the outputs of `foreach`.
+
+The following pseudocode illustrates the execution of `foreach`.
+
+```python
+def foreach(body, data, init_states):
+    states = init_states
+    outs = []
+
+    for i in range(data.shape[0]):
+        s = data[i]
+        out, states = body(s, states)
+        outs.append(out)
+    outs = mx.nd.stack(*outs)
+    return outs, states
+```
+
+### Example 1: `foreach` works like map
+`foreach` can work like the map function of a functional language. In this case, the states of `foreach` can be an empty list, which means the loop doesn't carry computation states across iterations.
+
+In this example, we use `foreach` to increase the value of each element of an array by one.
+
+
+```python
+data = mx.nd.arange(5)
+print(data)
+```
+
+    
+    [ 0.  1.  2.  3.  4.]
+    <NDArray 5 @cpu(0)>
+
+
+
+```python
+def add1(data, _):
+    return data + 1, []
+
+class Map(HybridBlock):
+    def hybrid_forward(self, F, data):
+        out, _ = F.contrib.foreach(add1, data, [])
+        return out
+    
+map_layer = Map()
+out = map_layer(data)
+print(out)
+```
+
+    
+    [[ 1.]
+     [ 2.]
+     [ 3.]
+     [ 4.]
+     [ 5.]]
+    <NDArray 5x1 @cpu(0)>
+
+
+We can hybridize the block and run the computation again. It should generate the same result.
+
+
+```python
+map_layer.hybridize()
+out = map_layer(data)
+print(out)
+```
+
+    
+    [[ 1.]
+     [ 2.]
+     [ 3.]
+     [ 4.]
+     [ 5.]]
+    <NDArray 5x1 @cpu(0)>
+
+
+### Example 2: `foreach` works like scan
+`foreach` can work like the scan function of a functional language. In this case, the output of the Python function is an empty list.
+
+
+```python
+def sum(data, state):
+    return [], state + data
+
+class Scan(HybridBlock):
+    def hybrid_forward(self, F, data):
+        _, state = F.contrib.foreach(sum, data, F.zeros((1)))
+        return state
+scan_layer = Scan()
+state = scan_layer(data)
+print(data)
+print(state)
+```
+
+    
+    [ 0.  1.  2.  3.  4.]
+    <NDArray 5 @cpu(0)>
+    
+    [ 10.]
+    <NDArray 1 @cpu(0)>
+
+
+
+```python
+scan_layer.hybridize()
+state = scan_layer(data)
+print(state)
+```
+
+    
+    [ 10.]
+    <NDArray 1 @cpu(0)>
+
+
+### Example 3: `foreach` with both outputs and states
+This is probably the most common use case of `foreach`. We extend the previous scan example and return both output and states.
+
+
+```python
+def sum(data, state):
+    return state + data, state + data
+
+class ScanV2(HybridBlock):
+    def hybrid_forward(self, F, data):
+        out, state = F.contrib.foreach(sum, data, F.zeros((1)))
+        return out, state
+scan_layer = ScanV2()
+out, state = scan_layer(data)
+print(out)
+print(state)
+```
+
+    
+    [[  0.]
+     [  1.]
+     [  3.]
+     [  6.]
+     [ 10.]]
+    <NDArray 5x1 @cpu(0)>
+    
+    [ 10.]
+    <NDArray 1 @cpu(0)>
+
+
+
+```python
+scan_layer.hybridize()
+out, state = scan_layer(data)
+print(out)
+print(state)
+```
+
+    
+    [[  0.]
+     [  1.]
+     [  3.]
+     [  6.]
+     [ 10.]]
+    <NDArray 5x1 @cpu(0)>
+    
+    [ 10.]
+    <NDArray 1 @cpu(0)>
+
+
+### Example 4: use `foreach` to run an RNN on a variable-length sequence
+Previous examples illustrate `foreach` with simple use cases. Here we show an example of processing variable-length sequences with `foreach`. The same idea is used by `dynamic_rnn` in TensorFlow for processing variable-length sequences.
+
+
+```python
+class DynamicRNNLayer(HybridBlock):
+    def __init__(self, cell, prefix=None, params=None):
+        super(DynamicRNNLayer, self).__init__(prefix=prefix, params=params)
+        self.cell = cell
+    def hybrid_forward(self, F, inputs, begin_state, valid_length):
+        states = begin_state
+        zeros = []
+        for s in states:
+            zeros.append(F.zeros_like(s))
+        # the last state is the iteration number.
+        states.append(F.zeros((1)))
+        def loop_body(inputs, states):
+            cell_states = states[:-1]
+            # Get the iteration number from the states.
+            iter_no = states[-1]
+            out, new_states = self.cell(inputs, cell_states)
+            # Copy the old state if we have reached the end of a sequence.
+            for i, state in enumerate(cell_states):
+                new_states[i] = F.where(F.broadcast_greater(valid_length, iter_no),
+                                        new_states[i], state)
+            new_states.append(iter_no + 1)
+            return out, new_states
+
+        outputs, states = F.contrib.foreach(loop_body, inputs, states)
+        outputs = F.SequenceMask(outputs, sequence_length=valid_length,
+                                 use_sequence_length=True, axis=0)
+        # the last state is the iteration number. We don't need it.
+        return outputs, states[:-1]
+
+
+seq_len = 10
+batch_size = 2
+input_size = 5
+hidden_size = 6
+
+rnn_data = mx.nd.normal(loc=0, scale=1, shape=(seq_len, batch_size, input_size))
+init_states = [mx.nd.normal(loc=0, scale=1, shape=(batch_size, hidden_size)) for i in range(2)]
+valid_length = mx.nd.round(mx.nd.random.uniform(low=1, high=10, shape=(batch_size))) 
+
+lstm = DynamicRNNLayer(mx.gluon.rnn.LSTMCell(hidden_size))
+lstm.initialize()
+res, states = lstm(rnn_data, [x for x in init_states], valid_length)
+
+lstm.hybridize()
+res, states = lstm(rnn_data, [x for x in init_states], valid_length)
+```
+
+## while_loop
+`while_loop` defines a while loop. It has the following signature:
+
+```python
+while_loop(cond, body, loop_vars, max_iterations, name) => (outputs, states)
+```
+
+Instead of running over the first dimension of an array, `while_loop` checks a condition function in every iteration and runs a `body` function for computation. The signature of the `body` function is defined as follows:
+
+```python
+body(state1, state2, ...) => (outputs, states)
+```
+
+The inputs of the `body` function in `while_loop` are a little different from those in `foreach`. It has a variable number of input arguments: each input argument is a loop variable, and the number of arguments is determined by the number of loop variables. The outputs of the `body` function also have two parts: `outputs` is an array or a list of arrays; `states` are loop variables and will be passed to the next iteration as inputs of `body`. Like `foreach`, both `outputs` and `states` can be an empty list. `outputs` from all iterations are concatenated as the outputs of `while_loop`.
+
+### Example 5: scan with while_loop
+`while_loop` is more general than `foreach`. We can also use it to iterate over an array and sum all of its values together. In this example, instead of summing over the entire array, we only sum over the first 4 elements.
+
+**Note**: the output arrays of the current implementation of `while_loop` are determined by `max_iterations`. As such, even though the while loop in this example runs 4 iterations, it still outputs an array of 5 elements; the last element in the output array is filled with an arbitrary value.
+
+
+```python
+class ScanV2(HybridBlock):
+    def hybrid_forward(self, F, data):
+        def sum(state, i):
+            s = state + data[i]
+            return s, [s, i + 1]
+
+        def sum_cond(state, i):
+            return i < 4
+
+        out, state = F.contrib.while_loop(sum_cond, sum,
+                                          [F.zeros((1)), F.zeros((1))], max_iterations=5)
+        return out, state
+scan_layer = ScanV2()
+out, state = scan_layer(data)
+print(out)
+print(state)
+```
+
+    
+    [[ 0.]
+     [ 1.]
+     [ 3.]
+     [ 6.]
+     [ 0.]]
+    <NDArray 5x1 @cpu(0)>
+    [
+    [ 6.]
+    <NDArray 1 @cpu(0)>, 
+    [ 4.]
+    <NDArray 1 @cpu(0)>]
+
+
+## cond
+`cond` defines an if condition. It has the following signature:
+
+```python
+cond(pred, then_func, else_func, name)
+```
+
+`cond` checks `pred`, which is a symbol or an NDArray with one element. If its value is true, it calls `then_func`. Otherwise, it calls `else_func`. The signature of `then_func` and `else_func` are as follows:
+
+```python
+func() => [outputs]
+```
+
+`cond` requires that all outputs from `then_func` and `else_func` have the same number of Symbols/NDArrays, with the same shapes and data types.
+
+### Example 6: skip RNN computation with cond
+Example 4 shows how to process a batch with sequences of different lengths. It performs computation for all steps but discards some of the computation results.
+
+In this example, we show how to skip computation after we have reached the end of a sequence, whose length is indicated by `length`. The code below only works for a batch with one sequence.
+
+
+```python
+class SkipRNNCell(HybridBlock):
+    def __init__(self, cell, prefix=None, params=None):
+        super(SkipRNNCell, self).__init__(prefix=prefix, params=params)
+        self.cell = cell
+    def hybrid_forward(self, F, i, length, data, states):
+        def run_rnn():
+            return self.cell(data, states)
+
+        def copy_states():
+            return F.zeros_like(data), states
+        out, state = F.contrib.cond(i < length, run_rnn, copy_states)
+        return out, state
+
+class RNNLayer(HybridBlock):
+    def __init__(self, cell, prefix=None, params=None):
+        super(RNNLayer, self).__init__(prefix=prefix, params=params)
+        self.cell = SkipRNNCell(cell)
+    def hybrid_forward(self, F, length, data, init_states):
+        def body(data, states):
+            i = states[0]
+            out, states = self.cell(i, length, data, states[1])
+            return out, [i + 1, states]
+        out, state = F.contrib.foreach(body, data, [F.zeros((1)), init_states])
+        return out, state
+
+
+seq_len = 5
+batch_size = 1
+input_size = 3
+hidden_size = 3
+
+rnn_data = mx.nd.normal(loc=0, scale=1, shape=(seq_len, batch_size, input_size))
+init_states = [mx.nd.normal(loc=0, scale=1, shape=(batch_size, hidden_size)) for i in range(2)]
+
+cell = mx.gluon.rnn.LSTMCell(hidden_size)
+layer = RNNLayer(cell)
+layer.initialize()
+
+out, states = layer(mx.nd.array([3]), rnn_data, init_states)
+print(rnn_data)
+print(out)
+```
+
+    
+    [[[-1.25296438  0.387312   -0.41055229]]
+    
+     [[ 1.28453672  0.21001032 -0.08666432]]
+    
+     [[ 1.46422136 -1.30581355  0.9344402 ]]
+    
+     [[ 0.5380863  -0.16038011  0.84187603]]
+    
+     [[-1.00553632  3.13221502 -0.4358989 ]]]
+    <NDArray 5x1x3 @cpu(0)>
+    
+    [[[-0.02620504  0.1605694   0.29636264]]
+    
+     [[-0.00474182  0.08719197  0.17757624]]
+    
+     [[ 0.00631597  0.04674901  0.12468992]]
+    
+     [[ 0.          0.          0.        ]]
+    
+     [[ 0.          0.          0.        ]]]
+    <NDArray 5x1x3 @cpu(0)>
+
+
+<!-- INSERT SOURCE DOWNLOAD BUTTONS -->
diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md
index ae0851425be..1b32333bded 100644
--- a/docs/tutorials/index.md
+++ b/docs/tutorials/index.md
@@ -96,6 +96,7 @@ Select API:&nbsp;
     * [Fine-Tuning a pre-trained ImageNet model with a new dataset](/faq/finetune.html)
     * [Large-Scale Multi-Host Multi-GPU Image Classification](/tutorials/vision/large_scale_classification.html)
     * [Importing an ONNX model into MXNet](/tutorials/onnx/super_resolution.html)
+    * [Hybridize Gluon models with control flows](/tutorials/control_flow/ControlFlowTutorial.html)
 * API Guides
     * Core APIs
         * NDArray
diff --git a/docs/tutorials/onnx/export_mxnet_to_onnx.md b/docs/tutorials/onnx/export_mxnet_to_onnx.md
new file mode 100644
index 00000000000..a9c03bed8b1
--- /dev/null
+++ b/docs/tutorials/onnx/export_mxnet_to_onnx.md
@@ -0,0 +1,134 @@
+
+# Exporting MXNet model to ONNX format
+
+[Open Neural Network Exchange (ONNX)](https://github.com/onnx/onnx) provides an open source format for AI models. It defines an extensible computation graph model, as well as definitions of built-in operators and standard data types.
+
+In this tutorial, we will show how you can save MXNet models to the ONNX format.
+
+MXNet-ONNX operators coverage and features are updated regularly. Visit the [ONNX operator coverage](https://cwiki.apache.org/confluence/display/MXNET/ONNX+Operator+Coverage) page for the latest information.
+
+Specifically, we will learn how to use the MXNet-to-ONNX exporter on pre-trained models.
+
+## Prerequisites
+
+To run this tutorial, you will need the following Python modules installed:
+- [MXNet >= 1.3.0](http://mxnet.incubator.apache.org/install/index.html)
+- [onnx]( https://github.com/onnx/onnx#installation) v1.2.1 (follow the install guide)
+
+*Note:* The MXNet-ONNX importer and exporter follow version 7 of the ONNX operator set, which comes with ONNX v1.2.1.
+
+
+```python
+import mxnet as mx
+import numpy as np
+from mxnet.contrib import onnx as onnx_mxnet
+import logging
+logging.basicConfig(level=logging.INFO)
+```
+
+## Downloading a model from the MXNet model zoo
+
+We download the pre-trained ResNet-18 [ImageNet](http://www.image-net.org/) model from the [MXNet Model Zoo](http://data.mxnet.io/models/imagenet/).
+We will also download the synset file to map predictions to labels.
+
+```python
+# Download the pre-trained ResNet model (json and params) by running the following code.
+path='http://data.mxnet.io/models/imagenet/'
+[mx.test_utils.download(path+'resnet/18-layers/resnet-18-0000.params'),
+ mx.test_utils.download(path+'resnet/18-layers/resnet-18-symbol.json'),
+ mx.test_utils.download(path+'synset.txt')]
+```
+
+Now we have the ResNet-18 symbol, params, and synset files on disk.
+
+## MXNet to ONNX exporter API
+
+Let us describe MXNet's `export_model` API.
+
+```python
+help(onnx_mxnet.export_model)
+```
+
+```python
+Help on function export_model in module mxnet.contrib.onnx.mx2onnx.export_model:
+
+export_model(sym, params, input_shape, input_type=<type 'numpy.float32'>, onnx_file_path=u'model.onnx', verbose=False)
+    Exports the MXNet model file, passed as a parameter, into ONNX model.
+    Accepts both symbol,parameter objects as well as json and params filepaths as input.
+    Operator support and coverage - https://cwiki.apache.org/confluence/display/MXNET/ONNX
+    
+    Parameters
+    ----------
+    sym : str or symbol object
+        Path to the json file or Symbol object
+    params : str or symbol object
+        Path to the params file or params dictionary. (Including both arg_params and aux_params)
+    input_shape : List of tuple
+        Input shape of the model e.g [(1,3,224,224)]
+    input_type : data type
+        Input data type e.g. np.float32
+    onnx_file_path : str
+        Path where to save the generated onnx file
+    verbose : Boolean
+        If true will print logs of the model conversion
+    
+    Returns
+    -------
+    onnx_file_path : str
+        Onnx file path
+```
+
+`export_model` API can accept the MXNet model in one of the following two ways.
+
+1. MXNet sym, params objects:
+    * This is useful if we are training a model. At the end of training, we just need to invoke the `export_model` function and provide sym and params objects as inputs with other attributes to save the model in ONNX format.
+2. MXNet's exported json and params files:
+    * This is useful if we have pre-trained models and we want to convert them to ONNX format.
+
+Since we have downloaded pre-trained model files, we will use the `export_model` API by passing the path for symbol and params files.
+
+## How to use MXNet to ONNX exporter API
+
+We will use the downloaded pre-trained model files (sym, params) and define input variables.
+
+```python
+# Downloaded input symbol and params files
+sym = './resnet-18-symbol.json'
+params = './resnet-18-0000.params'
+
+# Standard Imagenet input - 3 channels, 224*224
+input_shape = (1,3,224,224)
+
+# Path of the output file
+onnx_file = './mxnet_exported_resnet18.onnx'
+```
+
+We have defined the input parameters required for the `export_model` API. Now we are ready to convert the MXNet model into ONNX format.
+
+```python
+# Invoke export model API. It returns path of the converted onnx model
+converted_model_path = onnx_mxnet.export_model(sym, params, [input_shape], np.float32, onnx_file)
+```
+
+This API returns the path of the converted model, which you can later use to import the model into other frameworks.
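+
+As a quick round-trip check, the exported file can be imported back into MXNet with the importer from the same `contrib` package. A minimal sketch (the `_rt` suffixes are illustrative):
+
+```python
+from mxnet.contrib import onnx as onnx_mxnet
+
+# import_model returns the Symbol plus the arg/aux parameter dicts
+sym_rt, arg_params_rt, aux_params_rt = onnx_mxnet.import_model(converted_model_path)
+print(sym_rt.list_outputs())
+```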
+
+## Check validity of ONNX model
+
+Now we can check the validity of the converted ONNX model using the ONNX checker tool. The tool will validate the model by checking whether the content is a valid protobuf:
+
+```python
+from onnx import checker
+import onnx
+
+# Load onnx model
+model_proto = onnx.load(converted_model_path)
+
+# Check if converted ONNX protobuf is valid
+checker.check_graph(model_proto.graph)
+```
+
+If the converted protobuf does not conform to the ONNX proto specification, the checker will raise errors; in this case it passes successfully.
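+
+Beyond the graph-level check, `onnx.checker` also provides `check_model`, which additionally validates model-level metadata such as the IR version. An optional extra step using the `model_proto` loaded above:
+
+```python
+# Validate the whole model, not just the graph
+checker.check_model(model_proto)
+```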
+
+This confirms that the exported model protobuf is valid. Now the model is ready to be imported into other frameworks for inference!
+
+<!-- INSERT SOURCE DOWNLOAD BUTTONS -->
diff --git a/docs/tutorials/sparse/csr.md b/docs/tutorials/sparse/csr.md
index c2842ac16bd..0aede1ab431 100644
--- a/docs/tutorials/sparse/csr.md
+++ b/docs/tutorials/sparse/csr.md
@@ -512,9 +512,7 @@ Note that in the file the column indices are expected to be sorted in ascending
 
 ### GPU Support
 
-By default, `CSRNDArray` operators are executed on CPU. In MXNet, GPU support for `CSRNDArray` is experimental with only a few sparse operators such as [dot](https://mxnet.incubator.apache.org/api/python/ndarray/sparse.html#mxnet.ndarray.sparse.dot).
-
-To create a `CSRNDArray` on a GPU, we need to explicitly specify the context:
+By default, `CSRNDArray` operators are executed on CPU. To create a `CSRNDArray` on a GPU, we need to explicitly specify the context:
 
 **Note** If a GPU is not available, an error will be reported in the following section. In order to execute it on a CPU, set `gpu_device` to `mx.cpu()`.
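 
 For example, a minimal sketch of placing a `CSRNDArray` on an explicit context (the CPU fallback is illustrative and assumes the `mx.context.num_gpus()` helper introduced in MXNet 1.3):
 
 ```python
 import mxnet as mx
 
 # Fall back to CPU when no GPU is present
 gpu_device = mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu()
 
 # Explicitly place the CSRNDArray on the chosen context
 a = mx.nd.sparse.zeros('csr', (2, 3), ctx=gpu_device)
 print(a.context)
 ```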
 
diff --git a/docs/tutorials/sparse/row_sparse.md b/docs/tutorials/sparse/row_sparse.md
index c4cab75df54..27cc0d3d903 100644
--- a/docs/tutorials/sparse/row_sparse.md
+++ b/docs/tutorials/sparse/row_sparse.md
@@ -541,12 +541,7 @@ Note that only [mxnet.optimizer.SGD](https://mxnet.incubator.apache.org/api/pyth
 
 ### GPU Support
 
-By default, RowSparseNDArray operators are executed on CPU. In MXNet, GPU support for RowSparseNDArray is limited
-to a few sparse operators such as [sgd_update](https://mxnet.incubator.apache.org/api/python/ndarray/sparse.html#mxnet.ndarray.sparse.sgd_update),
-[dot](https://mxnet.incubator.apache.org/api/python/ndarray/sparse.html#mxnet.ndarray.sparse.dot) and
-[Embedding](https://mxnet.incubator.apache.org/api/python/ndarray/ndarray.html#mxnet.ndarray.Embedding).
-
-To create a RowSparseNDArray on gpu, we need to explicitly specify the context:
+By default, RowSparseNDArray operators are executed on CPU. To create a RowSparseNDArray on a GPU, we need to explicitly specify the context:
 
 **Note** If a GPU is not available, an error will be reported in the following section. In order to execute it on a CPU, set gpu_device to mx.cpu().
 
diff --git a/docs/tutorials/sparse/train.md b/docs/tutorials/sparse/train.md
index 7472fcd14ca..fde4c0e6552 100644
--- a/docs/tutorials/sparse/train.md
+++ b/docs/tutorials/sparse/train.md
@@ -314,7 +314,7 @@ assert metric.get()[1] < 1, "Achieved MSE (%f) is larger than expected (1.0)" %
 
 ### Training the model with multiple machines or multiple devices
 
-To train a sparse model with multiple machines, you need to call `prepare` before `forward`, or `save_checkpoint`.
+Distributed training with `row_sparse` weights and gradients is supported in MXNet and significantly reduces communication cost for large models. To train a sparse model with multiple machines, you need to call `prepare` before `forward` or `save_checkpoint`.
 Please refer to the example in [mxnet/example/sparse/linear_classification](https://github.com/apache/incubator-mxnet/tree/master/example/sparse/linear_classification)
 for more details.
 
diff --git a/python/mxnet/contrib/onnx/mx2onnx/export_model.py b/python/mxnet/contrib/onnx/mx2onnx/export_model.py
index 0dbfdc1d7b9..33292bf664a 100644
--- a/python/mxnet/contrib/onnx/mx2onnx/export_model.py
+++ b/python/mxnet/contrib/onnx/mx2onnx/export_model.py
@@ -18,7 +18,7 @@
 # coding: utf-8
 #pylint: disable-msg=too-many-arguments
 
-"""export function"""
+"""Exports an MXNet model to the ONNX model format"""
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function
diff --git a/python/mxnet/contrib/onnx/onnx2mx/import_model.py b/python/mxnet/contrib/onnx/onnx2mx/import_model.py
index 4e4d7863755..e190c3bdadc 100644
--- a/python/mxnet/contrib/onnx/onnx2mx/import_model.py
+++ b/python/mxnet/contrib/onnx/onnx2mx/import_model.py
@@ -16,7 +16,7 @@
 # under the License.
 
 # coding: utf-8
-"""import function"""
+"""Functions for importing ONNX models to MXNet and for checking metadata"""
 # pylint: disable=no-member
 
 from .import_onnx import GraphProto
@@ -72,6 +72,7 @@ def get_model_metadata(model_file):
             'output_tensor_data' : <list of tuples representing the shape of the output
                                     of the model>
         }
+
     """
     graph = GraphProto()
     try:
diff --git a/scala-package/core/src/main/scala/org/apache/mxnet/NDArrayCollector.scala b/scala-package/core/src/main/scala/org/apache/mxnet/NDArrayCollector.scala
index 3952b73cfb0..0b7f9af705f 100644
--- a/scala-package/core/src/main/scala/org/apache/mxnet/NDArrayCollector.scala
+++ b/scala-package/core/src/main/scala/org/apache/mxnet/NDArrayCollector.scala
@@ -133,6 +133,10 @@ class NDArrayCollector private(private val autoDispose: Boolean = true,
    * If the return type of scope is <em>NDArray</em> or <em>NDArrayFuncReturn</em>,
    * it is smart enough NOT to collect or dispose the returned NDArray. <br />
    * However in other cases, it is users' responsibility NOT to leak allocated NDArrays outside.
+   * <br />
+   * We may switch to a try-with-resources statement (backed by AutoCloseable in Java 1.7+)
+   * and deprecate this method later; hence it is marked as Experimental.
+   *
    * @param codeBlock code block to be executed within the scope.
    * @tparam T return type of the function <em>codeBlock</em>.
    * @return The result of function <em>codeBlock</em>.
diff --git a/scala-package/core/src/main/scala/org/apache/mxnet/annotation/Experimental.scala b/scala-package/core/src/main/scala/org/apache/mxnet/annotation/Experimental.scala
index 147d651fb04..d63194d48bc 100644
--- a/scala-package/core/src/main/scala/org/apache/mxnet/annotation/Experimental.scala
+++ b/scala-package/core/src/main/scala/org/apache/mxnet/annotation/Experimental.scala
@@ -21,7 +21,7 @@ import java.lang.annotation.{ElementType, Retention, Target, _}
 
 /**
   * Experimental: there is a comparably high chance that
-  * the API will undergo some kind of changes
+  * the API will be changed or removed.
   */
 @Retention(RetentionPolicy.RUNTIME)
 @Target(Array(ElementType.TYPE, ElementType.FIELD, ElementType.METHOD, ElementType.PARAMETER,
diff --git a/src/operator/tensor/control_flow_op.h b/src/operator/tensor/control_flow_op.h
index 94e65109c35..e9aa9f63fae 100644
--- a/src/operator/tensor/control_flow_op.h
+++ b/src/operator/tensor/control_flow_op.h
@@ -189,6 +189,7 @@ inline bool WhereOpShape(const nnvm::NodeAttrs& attrs,
     return true;
   } else if ((*in_attrs)[0].ndim() == 1) {
     CHECK_EQ((*in_attrs)[0].Size(), static_cast<size_t>(tshape[0]));
+    return true;
   }
   return false;
 }
diff --git a/tests/nightly/straight_dope/test_notebooks_single_gpu.py b/tests/nightly/straight_dope/test_notebooks_single_gpu.py
index a60498c8786..5eeb52f516e 100644
--- a/tests/nightly/straight_dope/test_notebooks_single_gpu.py
+++ b/tests/nightly/straight_dope/test_notebooks_single_gpu.py
@@ -35,11 +35,13 @@
     'chapter02_supervised-learning/environment',
     'chapter03_deep-neural-networks/kaggle-gluon-kfold',
     'chapter04_convolutional-neural-networks/deep-cnns-alexnet',  # > 10 mins.
+    'chapter05_recurrent-neural-networks/rnns-gluon', # > 10 mins.
     'chapter06_optimization/gd-sgd-scratch',  # Overflow warning is intended.
     'chapter06_optimization/gd-sgd-gluon',  # Overflow warning is intended.
     'chapter07_distributed-learning/multiple-gpus-scratch',
     'chapter07_distributed-learning/multiple-gpus-gluon',
     'chapter07_distributed-learning/training-with-multiple-machines',
+    'chapter08_computer-vision/visual-question-answer', # > 10 mins.
     'chapter11_recommender-systems/intro-recommender-systems',  # Early draft, non-working.
     'chapter12_time-series/intro-forecasting-gluon',
     'chapter12_time-series/intro-forecasting-2-gluon',
@@ -176,9 +178,6 @@ def test_lstm_scratch(self):
     def test_gru_scratch(self):
         assert _test_notebook('chapter05_recurrent-neural-networks/gru-scratch')
 
-    def test_rnns_gluon(self):
-        assert _test_notebook('chapter05_recurrent-neural-networks/rnns-gluon')
-
     # Chapter 6
 
     def test_optimization_intro(self):
@@ -228,9 +227,6 @@ def test_object_detection(self):
     def test_fine_tuning(self):
         assert _test_notebook('chapter08_computer-vision/fine-tuning')
 
-    def test_visual_question_answer(self):
-        assert _test_notebook('chapter08_computer-vision/visual-question-answer')
-
     # Chapter 9
 
     def test_tree_lstm(self):
diff --git a/tests/python/unittest/test_operator.py b/tests/python/unittest/test_operator.py
index e1e5c9e61c2..5e5e956691f 100644
--- a/tests/python/unittest/test_operator.py
+++ b/tests/python/unittest/test_operator.py
@@ -4507,6 +4507,14 @@ def test_invalid_shape():
                                              y=mx.nd.array([[8,9],[10,11],[12,13]]),
                                              condition=mx.nd.array([1,0])), MXNetError)
 
+    def test_1d_cond():
+        cond = mx.nd.array([1, 0, 1])
+        x = mx.nd.array([[2, 3], [4, 5], [6, 7]])
+        y = mx.nd.array([[7, 8], [9, 10], [10, 11]])
+        expect_out = np.array([[2, 3], [9, 10], [6, 7]])
+        out = mx.nd.where(cond, x, y).asnumpy()
+        assert (expect_out == out).all()
+
     test_where_helper((5, 9), True)
     test_where_helper((5, 9), False)
     test_where_helper((5, 7, 9), True)
@@ -4518,6 +4526,7 @@ def test_invalid_shape():
     test_where_numeric_gradient((5, 7, 9), True)
     test_where_numeric_gradient((5, 7, 9), False)
     test_invalid_shape()
+    test_1d_cond()
 
 @with_seed()
 def test_new_softmax():
diff --git a/tests/tutorials/test_tutorials.py b/tests/tutorials/test_tutorials.py
index 22d00c181b6..503df017ffe 100644
--- a/tests/tutorials/test_tutorials.py
+++ b/tests/tutorials/test_tutorials.py
@@ -124,6 +124,9 @@ def test_nlp_cnn():
 def test_onnx_super_resolution():
     assert _test_tutorial_nb('onnx/super_resolution')
 
+def test_onnx_export_mxnet_to_onnx():
+    assert _test_tutorial_nb('onnx/export_mxnet_to_onnx')
+
 def test_onnx_fine_tuning_gluon():
     assert _test_tutorial_nb('onnx/fine_tuning_gluon')
 
@@ -180,3 +183,6 @@ def test_vision_large_scale_classification():
 
 def test_vision_cnn_visualization():
     assert _test_tutorial_nb('vision/cnn_visualization')
+
+def test_control_flow():
+    assert _test_tutorial_nb('control_flow/ControlFlowTutorial')


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services