Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/07/02 22:50:20 UTC

[GitHub] [incubator-mxnet] thomelane commented on a change in pull request #15396: [TUTORIAL] Gluon and Sparse NDArray

thomelane commented on a change in pull request #15396: [TUTORIAL] Gluon and Sparse NDArray
URL: https://github.com/apache/incubator-mxnet/pull/15396#discussion_r299715527
 
 

 ##########
 File path: docs/tutorials/sparse/train_gluon.md
 ##########
 @@ -0,0 +1,469 @@
+<!--- Licensed to the Apache Software Foundation (ASF) under one -->
+<!--- or more contributor license agreements.  See the NOTICE file -->
+<!--- distributed with this work for additional information -->
+<!--- regarding copyright ownership.  The ASF licenses this file -->
+<!--- to you under the Apache License, Version 2.0 (the -->
+<!--- "License"); you may not use this file except in compliance -->
+<!--- with the License.  You may obtain a copy of the License at -->
+
+<!---   http://www.apache.org/licenses/LICENSE-2.0 -->
+
+<!--- Unless required by applicable law or agreed to in writing, -->
+<!--- software distributed under the License is distributed on an -->
+<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
+<!--- KIND, either express or implied.  See the License for the -->
+<!--- specific language governing permissions and limitations -->
+<!--- under the License. -->
+
+
+# Sparse NDArrays with Gluon
+
+When working on machine learning problems, you may encounter situations where the input data is sparse (i.e. the majority of values are zero). One example of this is in recommendation systems. You could have millions of user and product features, but only a few of these features are present for each sample. Without special treatment, the sheer magnitude of the feature space can lead to out-of-memory situations and cause significant slowdowns when training and making predictions.
+
+MXNet supports a number of sparse storage types (often called 'stype' for short) for these situations. In this tutorial, we'll start by generating some sparse data, writing it to disk in the LibSVM format, and then reading it back using the [`LibSVMIter`](https://mxnet.incubator.apache.org/api/python/io/io.html) for training. We use the Gluon API to train the model and leverage sparse storage types such as [`CSRNDArray`](https://mxnet.incubator.apache.org/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray) and [`RowSparseNDArray`](https://mxnet.incubator.apache.org/api/python/ndarray/sparse.html?highlight=rowsparsendarray#mxnet.ndarray.sparse.RowSparseNDArray) to maximise performance and memory efficiency.
+
+
+```python
+import mxnet as mx
+import numpy as np
+import time
+```
+
+### Generating Sparse Data
+
+You will most likely have a sparse dataset in mind already if you're reading this tutorial, but let's create a dummy dataset to use in the examples that follow. Using `rand_ndarray` we will generate 1000 samples, each with 1,000,000 features, of which 99.999% will be zero (i.e. 10 non-zero features per sample). We take this as our input data for training and calculate a label based on an arbitrary rule: whether a sample's feature sum is higher than the average feature sum across all samples.
+
+
+```python
+num_samples = 1000
+num_features = 1000000
+data = mx.test_utils.rand_ndarray((num_samples, num_features), stype='csr', density=0.00001)
+# generate label: 1 if a row's sum is above the average row sum, 0 otherwise.
+row_sum = data.sum(axis=1)
+label = row_sum > row_sum.mean()
+```
+
+
+```python
+print(type(data))
+print(data[:10].asnumpy())
+print('{:,.0f} elements'.format(np.prod(data.shape)))
+print('{:,.0f} non-zero elements'.format(data.data.size))
+```
+
+    <class 'mxnet.ndarray.sparse.CSRNDArray'>
+    [[0. 0. 0. ... 0. 0. 0.]
+     [0. 0. 0. ... 0. 0. 0.]
+     [0. 0. 0. ... 0. 0. 0.]
+     ...
+     [0. 0. 0. ... 0. 0. 0.]
+     [0. 0. 0. ... 0. 0. 0.]
+     [0. 0. 0. ... 0. 0. 0.]]
+    1,000,000,000 elements
+    10,000 non-zero elements
+
+
+Our storage type is CSR (Compressed Sparse Row), the ideal format for two-dimensional data where each row is sparse. See [this in-depth tutorial](https://mxnet.incubator.apache.org/versions/master/tutorials/sparse/csr.html) for more information. Just to confirm the generation process ran correctly, we can see that the vast majority of values are indeed zero.
+
+One of the first questions to ask is how much memory is saved by storing this data in a [`CSRNDArray`](https://mxnet.incubator.apache.org/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray) versus a standard [`NDArray`](https://mxnet.incubator.apache.org/versions/master/api/python/ndarray/sparse.html?highlight=ndarray#module-mxnet.ndarray). Since sparse arrays are constructed from several components (e.g. `data`, `indices` and `indptr`), we define a function called `get_nbytes` to calculate the number of bytes an array takes in memory. We compare the same data stored in a standard [`NDArray`](https://mxnet.incubator.apache.org/versions/master/api/python/ndarray/sparse.html?highlight=ndarray#module-mxnet.ndarray) (obtained with `data.tostype('default')`) to the [`CSRNDArray`](https://mxnet.incubator.apache.org/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray).
+
+
+```python
+def get_nbytes(array):
+    # bytes used = number of elements * bytes per element
+    fn = lambda a: a.size * np.dtype(a.dtype).itemsize
+    if isinstance(array, mx.ndarray.sparse.CSRNDArray):
+        return fn(array.data) + fn(array.indices) + fn(array.indptr)
+    elif isinstance(array, mx.ndarray.sparse.RowSparseNDArray):
+        return fn(array.data) + fn(array.indices)
+    elif isinstance(array, mx.ndarray.NDArray):
+        return fn(array)
+    else:
+        raise TypeError('{} not supported'.format(type(array)))
+```
+
+
+```python
+print('NDArray:', get_nbytes(data.tostype('default'))/1000000, 'MBs')
+print('CSRNDArray:', get_nbytes(data)/1000000, 'MBs')
+```
+
+    NDArray: 4000.0 MBs
+    CSRNDArray: 0.128008 MBs
+
+
+Given the extremely high sparsity of the data, we observe a huge memory saving here: 0.13 MBs versus 4000 MBs, i.e. ~30,000 times smaller! You can experiment with the amount of sparsity and see how the two storage types compare (a sketch of such an experiment is given below). As the number of non-zero values increases, this difference shrinks. And once the density exceeds ~1/3, the sparse storage type actually takes *more* memory than the dense one, so use it wisely.
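+
+As a rough sketch (reusing the `get_nbytes` helper defined above; the shape and density values here are arbitrary choices), such a density sweep could look like this:
+
+```python
+# Compare dense vs CSR memory usage as the density increases.
+for density in [0.00001, 0.001, 0.1, 0.5]:
+    arr = mx.test_utils.rand_ndarray((1000, 10000), stype='csr', density=density)
+    dense_mb = get_nbytes(arr.tostype('default')) / 1000000
+    sparse_mb = get_nbytes(arr) / 1000000
+    print('density={}: dense={} MBs, sparse={:.2f} MBs'.format(density, dense_mb, sparse_mb))
+```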
+
+### Writing Sparse Data
+
+Since there is such a large size difference between dense and sparse storage formats here, we ideally want to store the data on disk in a sparse storage format too. MXNet supports a format called LibSVM and has a data iterator called [`LibSVMIter`](https://mxnet.incubator.apache.org/api/python/io/io.html?highlight=libsvmiter) specifically for data formatted this way.
+
+A LibSVM file has a row for each sample, and each row starts with the label: in this case `0.0` or `1.0`, since we have a classification task. After the label comes a variable number of space-separated `key:value` pairs, where the key is the column/feature index and the value is that feature's value. When working with your own sparse data in a custom format, you should try to convert it to LibSVM. As an illustration, a single sample with label `1.0` and three non-zero features (hypothetical indices and values) would be serialized as the following line:
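+
+```
+1.0 1035:0.5 20034:0.25 899001:0.75
+```
+
+We define a `save_as_libsvm` function to save the `data` ([`CSRNDArray`](https://mxnet.incubator.apache.org/versions/master/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray)) and `label` (`NDArray`) to disk in LibSVM format.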
+
+
+```python
+def save_as_libsvm(filepath, data, label):
+    with open(filepath, 'w') as openfile:
+        for row_idx in range(data.shape[0]):
+            data_sample = data[row_idx]
+            label_sample = label[row_idx]
+            col_idxs = data_sample.indices.asnumpy().tolist()
+            values = data_sample.data.asnumpy().tolist()
+            label_str = str(label_sample.asscalar())
+            value_strs = ['{}:{}'.format(idx, value) for idx, value in zip(col_idxs, values)]
+            value_str = " ".join(value_strs)
+            sample_str = '{} {}\n'.format(label_str, value_str)
+            openfile.write(sample_str)
+```
+
+
+```python
+filepath = 'dataset.libsvm'
+save_as_libsvm(filepath, data, label)
+```
+
+We have now written the `data` and `label` to disk, and can inspect the first 10 lines of the file:
+
+
+```python
+with open(filepath, 'r') as openfile:
+    lines = [openfile.readline() for _ in range(10)]
+for line in lines:
+    print(line[:80] + '...' if len(line) > 80 else line)
+```
+
+    0.0 35454:0.22486156225204468 80954:0.39130592346191406 81941:0.1988530308008194...
+    1.0 37029:0.5980494618415833 52916:0.15797750651836395 71623:0.32251599431037903...
+    1.0 89962:0.47770974040031433 216426:0.21326342225074768 271027:0.18589609861373...
+    1.0 7071:0.9432336688041687 81664:0.7788773775100708 117459:0.8166475296020508 4...
+    0.0 380966:0.16906292736530304 394363:0.7987179756164551 458442:0.56873309612274...
+    0.0 89361:0.9099966287612915 141813:0.5927085280418396 282489:0.293381005525589 ...
+    0.0 150427:0.4747847020626068 169376:0.2603490948677063 179377:0.237988427281379...
+    0.0 49774:0.2822582423686981 91245:0.5794865489006042 102970:0.7004560232162476 ...
+    1.0 97133:0.0024336236529052258 109855:0.9895315766334534 116765:0.2465638816356...
+    0.0 803440:0.4020800292491913
+    
+
+
+Some storage overhead is introduced by serializing the data as characters (with spaces and colons): `dataset.libsvm` is ~250 KBs, while the original `data` and `label` took 132 KBs combined. Compared with the 4 GB dense `NDArray`, though, this isn't a huge issue.
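+
+As a quick check (a minimal sketch; the exact size will vary from run to run because the data is random), you can measure the file size directly:
+
+```python
+import os
+
+print('{:,.1f} KBs'.format(os.path.getsize(filepath) / 1000))
+```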
+
+### Reading Sparse Data
+
+Using [`LibSVMIter`](https://mxnet.incubator.apache.org/api/python/io/io.html?highlight=libsvmiter), we can quickly and easily load batches of data ready for training. Although Gluon [`Dataset`](https://mxnet.incubator.apache.org/versions/master/api/python/gluon/data.html?highlight=dataset#mxnet.gluon.data.Dataset)s can be written to return sparse arrays, Gluon [`DataLoader`](https://mxnet.incubator.apache.org/versions/master/api/python/gluon/data.html?highlight=dataloader#mxnet.gluon.data.DataLoader)s currently convert each sample to dense before stacking them up to create a batch. As a result, [`LibSVMIter`](https://mxnet.incubator.apache.org/api/python/io/io.html?highlight=libsvmiter) is the recommended way to load sparse data in batches.
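+
+As a minimal sketch (assuming the `filepath` and `num_features` variables from earlier; the batch size of 16 is an arbitrary choice), loading the file in batches could look like this:
+
+```python
+batch_size = 16
+data_iter = mx.io.LibSVMIter(data_libsvm=filepath, data_shape=(num_features,),
+                             label_shape=(1,), batch_size=batch_size)
+for batch in data_iter:
+    # each batch holds a CSRNDArray of shape (batch_size, num_features)
+    print(batch.data[0].stype, batch.data[0].shape)
+    break
+```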
 
 Review comment:
  Works on the main website; it doesn't work on beta, but that should be fixed rather than worked around.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services