Posted to commits@mxnet.apache.org by zh...@apache.org on 2018/01/18 04:18:27 UTC

[incubator-mxnet] branch master updated: Glossary takes token_indexer and token_embedding in its constructor (#9471)

This is an automated email from the ASF dual-hosted git repository.

zhasheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-mxnet.git


The following commit(s) were added to refs/heads/master by this push:
     new 49846f0  Glossary takes token_indexer and token_embedding in its constructor (#9471)
49846f0 is described below

commit 49846f02dae06878cf7145e7b6dbdce7aa2a8219
Author: Aston Zhang <as...@amazon.com>
AuthorDate: Wed Jan 17 20:18:24 2018 -0800

    Glossary takes token_indexer and token_embedding in its constructor (#9471)
    
    * Glossary takes token_indexer and token_embedding in its constructor
    
    * Fix dead links
    
    * update docstr
---
 docs/api/python/index.md                   |   9 +
 docs/api/python/text/text.md               | 455 +++++++++++++++++++++++++++++
 python/mxnet/contrib/text/embedding.py     |  25 +-
 python/mxnet/contrib/text/glossary.py      |  58 ++--
 python/mxnet/contrib/text/indexer.py       |   4 +-
 tests/python/unittest/test_contrib_text.py |  30 +-
 6 files changed, 521 insertions(+), 60 deletions(-)

diff --git a/docs/api/python/index.md b/docs/api/python/index.md
index 75ff186..7a3ad7c 100644
--- a/docs/api/python/index.md
+++ b/docs/api/python/index.md
@@ -98,6 +98,15 @@ imported by running:
    io/io.md
 ```
 
+## Text API
+
+```eval_rst
+.. toctree::
+   :maxdepth: 1
+
+   text/text.md
+```
+
 ## Image API
 
 ```eval_rst
diff --git a/docs/api/python/text/text.md b/docs/api/python/text/text.md
new file mode 100644
index 0000000..3b70b76
--- /dev/null
+++ b/docs/api/python/text/text.md
@@ -0,0 +1,455 @@
+# Text API
+
+## Overview
+
+The mxnet.contrib.text APIs refer to classes and functions related to text data
+processing, such as building indices, loading pre-trained embedding vectors for
+text tokens, and storing them in the `mxnet.ndarray.NDArray` format.
+
+```eval_rst
+.. warning:: This package contains experimental APIs and may change in the near future.
+```
+
+This document lists the text APIs in mxnet:
+
+```eval_rst
+.. autosummary::
+    :nosignatures:
+
+    mxnet.contrib.text.glossary
+    mxnet.contrib.text.embedding
+    mxnet.contrib.text.indexer
+    mxnet.contrib.text.utils
+```
+
+All the code demonstrated in this document assumes that the following modules
+or packages are imported.
+
+```python
+>>> from mxnet import gluon
+>>> from mxnet import nd
+>>> from mxnet.contrib import text
+>>> import collections
+
+```
+
+### Look up pre-trained word embeddings for indexed words
+
+As a common use case, let us look up pre-trained word embedding vectors for
+indexed words in just a few lines of code. To begin with, we can create a
+fastText word embedding object by specifying the embedding name `fasttext` and
+the pre-trained file `wiki.simple.vec`.
+
+```python
+>>> fasttext_simple = text.embedding.TokenEmbedding.create('fasttext',
+...     pretrained_file_name='wiki.simple.vec')
+
+```
+
+Suppose that we have a simple text data set as a string. We can count word
+frequencies in the data set.
+
+```python
+>>> text_data = " hello world \n hello nice world \n hi world \n"
+>>> counter = text.utils.count_tokens_from_str(text_data)
+
+```
+
+The obtained `counter` has key-value pairs whose keys are words and values are
+word frequencies. Suppose that we want to build indices for all the keys in
+`counter` and load the fastText word embedding defined above for all such
+indexed words. First, we need a `TokenIndexer` object with `counter` as its
+argument.
+
+```python
+>>> token_indexer = text.indexer.TokenIndexer(counter)
+
+```
+
+Then, we can create a Glossary object by specifying `token_indexer` and `fasttext_simple` as its
+arguments.
+
+```python
+>>> glossary = text.glossary.Glossary(token_indexer, fasttext_simple)
+
+```
+
+Now we are ready to look up the fastText word embedding vectors for indexed
+words.
+
+```python
+>>> glossary.get_vecs_by_tokens(['hello', 'world'])
+
+[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
+    ...
+   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
+ [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
+    ...
+   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
+<NDArray 2x300 @cpu(0)>
+
+```
+
+### Use `glossary` in `gluon`
+
+To demonstrate how to use a glossary with the loaded word embedding in the
+`gluon` package, let us first obtain indices of the words 'hello' and 'world'.
+
+```python
+>>> glossary.to_indices(['hello', 'world'])
+[2, 1]
+
+```
+
+We can obtain the vector representations of the words 'hello' and 'world' by
+passing their indices (2 and 1) to an `mxnet.gluon.nn.Embedding` layer whose
+weight is set to `glossary.idx_to_vec`.
+
+```python
+>>> layer = gluon.nn.Embedding(len(glossary), glossary.vec_len)
+>>> layer.initialize()
+>>> layer.weight.set_data(glossary.idx_to_vec)
+>>> layer(nd.array([2, 1]))
+
+[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
+    ...
+   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
+ [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
+    ...
+   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
+<NDArray 2x300 @cpu(0)>
+
+```
+
+
+## Glossary
+
+A glossary provides indexing and embedding for text tokens. Each indexed token
+in a glossary is associated with an embedding vector. Such embedding vectors
+can be loaded from externally hosted or custom pre-trained token embedding
+files, such as via instances of
+[`TokenEmbedding`](#mxnet.contrib.text.embedding.TokenEmbedding). The input
+`TokenIndexer` is built from a counter whose keys are the candidate tokens to
+be indexed; such a counter may be obtained via
+[`count_tokens_from_str`](#mxnet.contrib.text.utils.count_tokens_from_str).
+
+```eval_rst
+.. currentmodule:: mxnet.contrib.text.glossary
+.. autosummary::
+    :nosignatures:
+
+    Glossary
+```
+
+To get all the valid names for pre-trained embeddings and files, we can use
+[`TokenEmbedding.get_embedding_and_pretrained_file_names`](#mxnet.contrib.text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names).
+
+```python
+>>> text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names()
+{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt',
+'glove.6B.200d.txt', 'glove.6B.300d.txt', 'glove.840B.300d.txt',
+'glove.twitter.27B.25d.txt', 'glove.twitter.27B.50d.txt',
+'glove.twitter.27B.100d.txt', 'glove.twitter.27B.200d.txt'],
+'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec']}
+
+```
+
+To begin with, we can create a fastText word embedding object by specifying the
+embedding name `fasttext` and the pre-trained file `wiki.simple.vec`.
+
+```python
+>>> fasttext_simple = text.embedding.TokenEmbedding.create('fasttext',
+...     pretrained_file_name='wiki.simple.vec')
+
+```
+
+Suppose that we have a simple text data set as a string. We can count word
+frequencies in the data set.
+
+```python
+>>> text_data = " hello world \n hello nice world \n hi world \n"
+>>> counter = text.utils.count_tokens_from_str(text_data)
+
+```
+
+The obtained `counter` has key-value pairs whose keys are words and values are
+word frequencies. Suppose that we want to build indices for the 2 most frequent
+keys in `counter` and load the fastText word embedding defined above for these
+2 words.
+
+```python
+>>> token_indexer = text.indexer.TokenIndexer(counter, most_freq_count=2)
+>>> glossary = text.glossary.Glossary(token_indexer, fasttext_simple)
+
+```
+
+Now we are ready to look up the fastText word embedding vectors for indexed
+words.
+
+```python
+>>> glossary.get_vecs_by_tokens(['hello', 'world'])
+
+[[  3.95669997e-01   2.14540005e-01  -3.53889987e-02  -2.42990002e-01
+    ...
+   -7.54180014e-01  -3.14429998e-01   2.40180008e-02  -7.61009976e-02]
+ [  1.04440004e-01  -1.08580001e-01   2.72119999e-01   1.32990003e-01
+    ...
+   -3.73499990e-01   5.67310005e-02   5.60180008e-01   2.90190000e-02]]
+<NDArray 2x300 @cpu(0)>
+
+```
+
+We can also access properties such as `token_to_idx` (mapping tokens to
+indices), `idx_to_token` (mapping indices to tokens), and `vec_len`
+(length of each embedding vector).
+
+```python
+>>> glossary.token_to_idx
+{'<unk>': 0, 'world': 1, 'hello': 2}
+>>> glossary.idx_to_token
+['<unk>', 'world', 'hello']
+>>> len(glossary)
+3
+>>> glossary.vec_len
+300
+
+```
+
+If a token is unknown to `glossary`, its embedding vector is initialized
+according to the default specification in `fasttext_simple` (all elements are
+0).
+
+```python
+
+>>> glossary.get_vecs_by_tokens('unknownT0kEN')
+
+[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
+  ...
+  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
+<NDArray 300 @cpu(0)>
+
+```
+
+## Text token embedding
+
+A text token embedding indexes text tokens and associates each indexed token
+with an embedding vector loaded from a pre-trained embedding file. Such token
+embeddings can be used by instances of
+[`Glossary`](#mxnet.contrib.text.glossary.Glossary).
+
+To load token embeddings from an externally hosted pre-trained token embedding
+file, such as those of GloVe and FastText, use
+[`TokenEmbedding.create(embedding_name, pretrained_file_name)`](#mxnet.contrib.text.embedding.TokenEmbedding.create).
+To get all the available `embedding_name` and `pretrained_file_name`, use
+[`TokenEmbedding.get_embedding_and_pretrained_file_names()`](#mxnet.contrib.text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names).
+
+Alternatively, to load embedding vectors from a custom pre-trained text token
+embedding file, use [`CustomEmbedding`](#mxnet.contrib.text.embedding.CustomEmbedding).
+
+
+```eval_rst
+.. currentmodule:: mxnet.contrib.text.embedding
+.. autosummary::
+    :nosignatures:
+
+    TokenEmbedding
+    GloVe
+    FastText
+    CustomEmbedding
+```
+
+To get all the valid names for pre-trained embeddings and files, we can use
+[`TokenEmbedding.get_embedding_and_pretrained_file_names`](#mxnet.contrib.text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names).
+
+```python
+>>> text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names()
+{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt',
+'glove.6B.200d.txt', 'glove.6B.300d.txt', 'glove.840B.300d.txt',
+'glove.twitter.27B.25d.txt', 'glove.twitter.27B.50d.txt',
+'glove.twitter.27B.100d.txt', 'glove.twitter.27B.200d.txt'],
+'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec']}
+
+```
+
+To begin with, we can create a GloVe word embedding object by specifying the
+embedding name `glove` and the pre-trained file `glove.6B.50d.txt`. The
+argument `init_unknown_vec` specifies the default vector representation for any
+unknown token.
+
+```python
+>>> glove_6b_50d = text.embedding.TokenEmbedding.create('glove',
+...     pretrained_file_name='glove.6B.50d.txt', init_unknown_vec=nd.zeros)
+
+```
+
+We can access properties such as `token_to_idx` (mapping tokens to indices),
+`idx_to_token` (mapping indices to tokens), `vec_len` (length of each embedding
+vector), and `unknown_token` (representation of any unknown token, default
+value is '<unk>').
+
+```python
+>>> glove_6b_50d.token_to_idx['hi']
+11084
+>>> glove_6b_50d.idx_to_token[11084]
+'hi'
+>>> glove_6b_50d.vec_len
+50
+>>> glove_6b_50d.unknown_token
+'<unk>'
+
+```
+
+If the unknown token representation '<unk>' is encountered in the pre-trained
+token embedding file, index 0 of the property `idx_to_vec` maps to the
+embedding vector loaded from the file for that token; otherwise, index 0 maps
+to the default vector specified via `init_unknown_vec` (set to `nd.zeros`
+here). Since the pre-trained file does not have a vector for '<unk>', index 0
+maps to an additional token '<unk>', and the number of tokens in the embedding
+is 400,001.
+
+
+```python
+>>> len(glove_6b_50d)
+400001
+>>> glove_6b_50d.idx_to_vec[0]
+
+[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
+  ...
+  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
+<NDArray 50 @cpu(0)>
+>>> glove_6b_50d.get_vecs_by_tokens('unknownT0kEN')
+
+[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
+  ...
+  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
+<NDArray 50 @cpu(0)>
+>>> glove_6b_50d.get_vecs_by_tokens(['unknownT0kEN', 'unknownT0kEN'])
+
+[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
+   ...
+   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
+ [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
+   ...
+   0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
+<NDArray 2x50 @cpu(0)>
+
+```
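+
+As mentioned earlier, embedding vectors can also be loaded from a custom
+pre-trained file via
+[`CustomEmbedding`](#mxnet.contrib.text.embedding.CustomEmbedding). The sketch
+below is illustrative rather than exact session output: it assumes a
+hypothetical file `my_embedding.txt` whose lines each hold a token followed by
+its vector elements, separated by the delimiter passed as `elem_delim`.
+
+```python
+>>> # Assume `my_embedding.txt` contains the two lines:
+>>> #     hello 0.1 0.2
+>>> #     world 0.3 0.4
+>>> my_embed = text.embedding.CustomEmbedding('my_embedding.txt', elem_delim=' ')
+>>> my_embed.vec_len
+2
+>>> my_embed.get_vecs_by_tokens('hello')
+
+[ 0.1  0.2]
+<NDArray 2 @cpu(0)>
+
+```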
+
+
+### Implement a new text token embedding
+
+To implement a new token embedding, create a subclass of
+[`TokenEmbedding`](#mxnet.contrib.text.embedding.TokenEmbedding).
+Also add ``@TokenEmbedding.register`` before this class. See
+[`embedding.py`](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/contrib/text/embedding.py)
+for examples.
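+
+The following minimal sketch mirrors the docstring example in `embedding.py`;
+the class name `MyTokenEmbed` and the file name `my_pretrain_file` are
+hypothetical, and a real subclass would load its embedding vectors in
+`__init__` instead of `pass`.
+
+```python
+>>> # Hypothetical subclass used only to show the registration pattern.
+>>> @text.embedding.TokenEmbedding.register
+... class MyTokenEmbed(text.embedding.TokenEmbedding):
+...     def __init__(self, pretrained_file_name='my_pretrain_file'):
+...         pass
+>>> embed = text.embedding.TokenEmbedding.create('MyTokenEmbed')
+>>> print(type(embed))
+<class '__main__.MyTokenEmbed'>
+
+```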
+
+
+## Text token indexer
+
+The text token indexer builds indices for text tokens. Such indexed tokens can
+be used by instances of [`TokenEmbedding`](#mxnet.contrib.text.embedding.TokenEmbedding)
+and [`Glossary`](#mxnet.contrib.text.glossary.Glossary). The input
+counter, whose keys are the candidate tokens to be indexed, may be obtained via
+[`count_tokens_from_str`](#mxnet.contrib.text.utils.count_tokens_from_str).
+
+
+```eval_rst
+.. currentmodule:: mxnet.contrib.text.indexer
+.. autosummary::
+    :nosignatures:
+
+    TokenIndexer
+```
+
+Suppose that we have a simple text data set as a string. We can count word
+frequencies in the data set.
+
+```python
+>>> text_data = " hello world \n hello nice world \n hi world \n"
+>>> counter = text.utils.count_tokens_from_str(text_data)
+
+```
+
+The obtained `counter` has key-value pairs whose keys are words and values are
+word frequencies. Suppose that we want to build indices for the 2 most frequent
+keys in `counter` with the unknown token representation '<UnK>' and a reserved
+token '<pad>'.
+
+```python
+>>> token_indexer = text.indexer.TokenIndexer(counter, most_freq_count=2,
+...     unknown_token='<UnK>', reserved_tokens=['<pad>'])
+
+```
+
+We can access properties such as `token_to_idx` (mapping tokens to indices),
+`idx_to_token` (mapping indices to tokens), `unknown_token` (representation of
+any unknown token), and `reserved_tokens` (the list of reserved tokens).
+
+```python
+>>> token_indexer.token_to_idx
+{'<UnK>': 0, '<pad>': 1, 'world': 2, 'hello': 3}
+>>> token_indexer.idx_to_token
+['<UnK>', '<pad>', 'world', 'hello']
+>>> token_indexer.unknown_token
+'<UnK>'
+>>> token_indexer.reserved_tokens
+['<pad>']
+>>> len(token_indexer)
+4
+```
+
+Besides the specified unknown token '<UnK>' and the reserved token '<pad>', the
+2 most frequent words 'world' and 'hello' are also indexed.
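+
+The indexer maps between tokens and indices in both directions via `to_indices`
+and `to_tokens`; any token it has not indexed maps to the index of the unknown
+token. The outputs below are what we would expect given the mapping shown
+above.
+
+```python
+>>> token_indexer.to_indices(['hello', 'unknownT0kEN'])
+[3, 0]
+>>> token_indexer.to_tokens([3, 2, 0])
+['hello', 'world', '<UnK>']
+
+```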
+
+
+
+## Text utilities
+
+The following functions provide utilities for text data processing.
+
+```eval_rst
+.. currentmodule:: mxnet.contrib.text.utils
+.. autosummary::
+    :nosignatures:
+
+    count_tokens_from_str
+```
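+
+As a quick illustration of `count_tokens_from_str`, the sample string used
+earlier would give the word counts below; the outputs shown are what we would
+expect rather than verbatim session output.
+
+```python
+>>> text_data = " hello world \n hello nice world \n hi world \n"
+>>> counter = text.utils.count_tokens_from_str(text_data)
+>>> counter['world'], counter['hello'], counter['hi'], counter['nice']
+(3, 2, 1, 1)
+
+```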
+
+
+
+
+## API Reference
+
+<script type="text/javascript" src='../../_static/js/auto_module_index.js'></script>
+
+```eval_rst
+
+.. automodule:: mxnet.contrib.text.glossary
+.. autoclass:: mxnet.contrib.text.glossary.Glossary
+    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
+
+.. automodule:: mxnet.contrib.text.embedding
+.. autoclass:: mxnet.contrib.text.embedding.TokenEmbedding
+    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens, register, create, get_embedding_and_pretrained_file_names
+.. autoclass:: mxnet.contrib.text.embedding.GloVe
+    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
+.. autoclass:: mxnet.contrib.text.embedding.FastText
+    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
+.. autoclass:: mxnet.contrib.text.embedding.CustomEmbedding
+    :members: get_vecs_by_tokens, update_token_vectors, to_indices, to_tokens
+
+.. automodule:: mxnet.contrib.text.indexer
+.. autoclass:: mxnet.contrib.text.indexer.TokenIndexer
+    :members: to_indices, to_tokens
+
+.. automodule:: mxnet.contrib.text.utils
+    :members: count_tokens_from_str
+
+```
+<script>auto_index("api-reference");</script>
\ No newline at end of file
diff --git a/python/mxnet/contrib/text/embedding.py b/python/mxnet/contrib/text/embedding.py
index adba867..54635f1 100644
--- a/python/mxnet/contrib/text/embedding.py
+++ b/python/mxnet/contrib/text/embedding.py
@@ -45,7 +45,7 @@ class TokenEmbedding(indexer.TokenIndexer):
     `TokenEmbedding.get_embedding_and_pretrained_file_names()`.
 
     Alternatively, to load embedding vectors from a custom pre-trained token embedding file, use
-    :class:`~mxnet.text.embedding.CustomEmbedding`.
+    :class:`~mxnet.contrib.text.embedding.CustomEmbedding`.
 
     For every unknown token, if its representation `self.unknown_token` is encountered in the
     pre-trained token embedding file, index 0 of `self.idx_to_vec` maps to the pre-trained token
@@ -56,7 +56,7 @@ class TokenEmbedding(indexer.TokenIndexer):
     first-encountered token embedding vector will be loaded and the rest will be skipped.
 
     For the same token, its index and embedding vector may vary across different instances of
-    :class:`~mxnet.text.embedding.TokenEmbedding`.
+    :class:`~mxnet.contrib.text.embedding.TokenEmbedding`.
 
 
     Properties
@@ -298,16 +298,16 @@ class TokenEmbedding(indexer.TokenIndexer):
 
 
         Once an embedding is registered, we can create an instance of this embedding with
-        :func:`~mxnet.text.embedding.TokenEmbedding.create`.
+        :func:`~mxnet.contrib.text.embedding.TokenEmbedding.create`.
 
 
         Examples
         --------
-        >>> @mxnet.text.embedding.TokenEmbedding.register
-        ... class MyTextEmbed(mxnet.text.embedding.TokenEmbedding):
+        >>> @mxnet.contrib.text.embedding.TokenEmbedding.register
+        ... class MyTextEmbed(mxnet.contrib.text.embedding.TokenEmbedding):
         ...     def __init__(self, pretrained_file_name='my_pretrain_file'):
         ...         pass
-        >>> embed = mxnet.text.embedding.TokenEmbedding.create('MyTokenEmbed')
+        >>> embed = mxnet.contrib.text.embedding.TokenEmbedding.create('MyTokenEmbed')
         >>> print(type(embed))
         <class '__main__.MyTokenEmbed'>
         """
@@ -317,13 +317,13 @@ class TokenEmbedding(indexer.TokenIndexer):
 
     @staticmethod
     def create(embedding_name, **kwargs):
-        """Creates an instance of :class:`~mxnet.text.embedding.TokenEmbedding`.
+        """Creates an instance of :class:`~mxnet.contrib.text.embedding.TokenEmbedding`.
 
 
         Creates a token embedding instance by loading embedding vectors from an externally hosted
         pre-trained token embedding file, such as those of GloVe and FastText. To get all the valid
         `embedding_name` and `pretrained_file_name`, use
-        `mxnet.text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names()`.
+        `mxnet.contrib.text.embedding.TokenEmbedding.get_embedding_and_pretrained_file_names()`.
 
 
         Parameters
@@ -334,7 +334,7 @@ class TokenEmbedding(indexer.TokenIndexer):
 
         Returns
         -------
-        :class:`~mxnet.text.glossary.TokenEmbedding`:
+        :class:`~mxnet.contrib.text.glossary.TokenEmbedding`:
             A token embedding instance that loads embedding vectors from an externally hosted
             pre-trained token embedding file.
         """
@@ -367,8 +367,8 @@ class TokenEmbedding(indexer.TokenIndexer):
 
         To load token embedding vectors from an externally hosted pre-trained token embedding file,
         such as those of GloVe and FastText, one should use
-        `mxnet.text.embedding.TokenEmbedding.create(embedding_name, pretrained_file_name)`. This
-        method returns all the valid names of `pretrained_file_name` for the specified
+        `mxnet.contrib.text.embedding.TokenEmbedding.create(embedding_name, pretrained_file_name)`.
+        This method returns all the valid names of `pretrained_file_name` for the specified
         `embedding_name`. If `embedding_name` is set to None, this method returns all the valid
         names of `embedding_name` with associated `pretrained_file_name`.
 
@@ -386,7 +386,8 @@ class TokenEmbedding(indexer.TokenIndexer):
             for the specified token embedding name (`embedding_name`). If the text embeding name is
             set to None, returns a dict mapping each valid token embedding name to a list of valid
             pre-trained files (`pretrained_file_name`). They can be plugged into
-            `mxnet.text.embedding.TokenEmbedding.create(embedding_name, pretrained_file_name)`.
+            `mxnet.contrib.text.embedding.TokenEmbedding.create(embedding_name,
+            pretrained_file_name)`.
         """
 
         text_embedding_reg = registry.get_registry(TokenEmbedding)
diff --git a/python/mxnet/contrib/text/glossary.py b/python/mxnet/contrib/text/glossary.py
index 2fd46a3..40f3258 100644
--- a/python/mxnet/contrib/text/glossary.py
+++ b/python/mxnet/contrib/text/glossary.py
@@ -16,12 +16,14 @@
 # under the License.
 
 # coding: utf-8
+# pylint: disable=super-init-not-called
 
 """Index text tokens and load their embeddings."""
 from __future__ import absolute_import
 from __future__ import print_function
 
 from . import embedding
+from . import indexer
 from ... import ndarray as nd
 
 
@@ -31,35 +33,16 @@ class Glossary(embedding.TokenEmbedding):
 
     For each indexed token in a glossary, an embedding vector will be associated with it. Such
     embedding vectors can be loaded from externally hosted or custom pre-trained token embedding
-    files, such as via instances of :class:`~mxnet.text.embedding.TokenEmbedding`.
+    files, such as via instances of :class:`~mxnet.contrib.text.embedding.TokenEmbedding`.
 
 
     Parameters
     ----------
-    counter : collections.Counter or None, default None
-        Counts text token frequencies in the text data. Its keys will be indexed according to
-        frequency thresholds such as `most_freq_count` and `min_freq`. Keys of `counter`,
-        `unknown_token`, and values of `reserved_tokens` must be of the same hashable type.
-        Examples: str, int, and tuple.
+    token_indexer : :class:`~mxnet.contrib.text.indexer.TokenIndexer`
+        It contains the indexed tokens to load, where each token is associated with an index.
     token_embeddings : instance or list of :class:`~TokenEmbedding`
         One or multiple pre-trained token embeddings to load. If it is a list of multiple
         embeddings, these embedding vectors will be concatenated for each token.
-    most_freq_count : None or int, default None
-        The maximum possible number of the most frequent tokens in the keys of `counter` that can be
-        indexed. Note that this argument does not count any token from `reserved_tokens`. If this
-        argument is None or larger than its largest possible value restricted by `counter` and
-        `reserved_tokens`, this argument becomes positive infinity.
-    min_freq : int, default 1
-        The minimum frequency required for a token in the keys of `counter` to be indexed.
-    unknown_token : hashable object, default '<unk>'
-        The representation for any unknown token. In other words, any unknown token will be indexed
-        as the same representation. Keys of `counter`, `unknown_token`, and values of
-        `reserved_tokens` must be of the same hashable type. Examples: str, int, and tuple.
-    reserved_tokens : list of hashable objects or None, default None
-        A list of reserved tokens that will always be indexed, such as special symbols representing
-        padding, beginning of sentence, and end of sentence. It cannot contain `unknown_token`, or
-        duplicate reserved tokens. Keys of `counter`, `unknown_token`, and values of
-        `reserved_tokens` must be of the same hashable type. Examples: str, int, and tuple.
 
 
     Properties
@@ -80,23 +63,30 @@ class Glossary(embedding.TokenEmbedding):
         embedding vector. The largest valid index maps to the initialized embedding vector for every
         reserved token, such as an unknown_token token and a padding token.
     """
-    def __init__(self, counter, token_embeddings, most_freq_count=None, min_freq=1,
-                 unknown_token='<unk>', reserved_tokens=None):
+    def __init__(self, token_indexer, token_embeddings):
+
+        # Sanity checks.
+        assert isinstance(token_indexer, indexer.TokenIndexer), \
+            'The argument `token_indexer` must be an instance of ' \
+            'mxnet.contrib.text.indexer.TokenIndexer.'
 
         if not isinstance(token_embeddings, list):
             token_embeddings = [token_embeddings]
 
-        # Sanity checks.
         for embed in token_embeddings:
             assert isinstance(embed, embedding.TokenEmbedding), \
-                'The parameter `token_embeddings` must be an instance or a list of instances ' \
-                'of `mxnet.text.embedding.TextEmbed` whose embedding vectors will be loaded or ' \
-                'concatenated-then-loaded to map to the indexed tokens.'
-
-        # Index tokens from keys of `counter` and reserved tokens.
-        super(Glossary, self).__init__(counter=counter, most_freq_count=most_freq_count,
-                                       min_freq=min_freq, unknown_token=unknown_token,
-                                       reserved_tokens=reserved_tokens)
+                'The argument `token_embeddings` must be an instance or a list of instances ' \
+                'of `mxnet.contrib.text.embedding.TextEmbedding` whose embedding vectors will be' \
+                'loaded or concatenated-then-loaded to map to the indexed tokens.'
+
+        # Index tokens.
+        self._token_to_idx = token_indexer.token_to_idx.copy() \
+            if token_indexer.token_to_idx is not None else None
+        self._idx_to_token = token_indexer.idx_to_token[:] \
+            if token_indexer.idx_to_token is not None else None
+        self._unknown_token = token_indexer.unknown_token
+        self._reserved_tokens = token_indexer.reserved_tokens[:] \
+            if token_indexer.reserved_tokens is not None else None
 
         # Set _idx_to_vec so that indices of tokens from keys of `counter` are
         # associated with token embedding vectors from `token_embeddings`.
@@ -109,7 +99,7 @@ class Glossary(embedding.TokenEmbedding):
         Parameters
         ----------
         token_embeddings : an instance or a list of instances of
-            :class:`~mxnet.text.embedding.TokenEmbedding`
+            :class:`~mxnet.contrib.text.embedding.TokenEmbedding`
             One or multiple pre-trained token embeddings to load. If it is a list of multiple
             embeddings, these embedding vectors will be concatenated for each token.
         """
diff --git a/python/mxnet/contrib/text/indexer.py b/python/mxnet/contrib/text/indexer.py
index 409dfb0..1add7cf 100644
--- a/python/mxnet/contrib/text/indexer.py
+++ b/python/mxnet/contrib/text/indexer.py
@@ -32,8 +32,8 @@ class TokenIndexer(object):
 
 
     Build indices for the unknown token, reserved tokens, and input counter keys. Indexed tokens can
-    be used by instances of :class:`~mxnet.text.embedding.TokenEmbedding`, such as instances of
-    :class:`~mxnet.text.glossary.Glossary`.
+    be used by instances of :class:`~mxnet.contrib.text.embedding.TokenEmbedding`, such as instances
+    of :class:`~mxnet.contrib.text.glossary.Glossary`.
 
 
     Parameters
diff --git a/tests/python/unittest/test_contrib_text.py b/tests/python/unittest/test_contrib_text.py
index 99423aa..dc0e7bc 100644
--- a/tests/python/unittest/test_contrib_text.py
+++ b/tests/python/unittest/test_contrib_text.py
@@ -422,8 +422,9 @@ def test_glossary_with_one_embed():
 
     counter = Counter(['a', 'b', 'b', 'c', 'c', 'c', 'some_word$'])
 
-    g1 = text.glossary.Glossary(counter, my_embed, most_freq_count=None, min_freq=1,
-                                unknown_token='<unk>', reserved_tokens=['<pad>'])
+    i1 = text.indexer.TokenIndexer(counter, most_freq_count=None, min_freq=1, unknown_token='<unk>',
+                                   reserved_tokens=['<pad>'])
+    g1 = text.glossary.Glossary(i1, my_embed)
 
     assert g1.token_to_idx == {'<unk>': 0, '<pad>': 1, 'c': 2, 'b': 3, 'a': 4, 'some_word$': 5}
     assert g1.idx_to_token == ['<unk>', '<pad>', 'c', 'b', 'a', 'some_word$']
@@ -546,8 +547,9 @@ def test_glossary_with_two_embeds():
 
     counter = Counter(['a', 'b', 'b', 'c', 'c', 'c', 'some_word$'])
 
-    g1 = text.glossary.Glossary(counter, [my_embed1, my_embed2], most_freq_count=None, min_freq=1,
-                                unknown_token='<unk>', reserved_tokens=None)
+    i1 = text.indexer.TokenIndexer(counter, most_freq_count=None, min_freq=1, unknown_token='<unk>',
+                                   reserved_tokens=None)
+    g1 = text.glossary.Glossary(i1, [my_embed1, my_embed2])
 
     assert g1.token_to_idx == {'<unk>': 0, 'c': 1, 'b': 2, 'a': 3, 'some_word$': 4}
     assert g1.idx_to_token == ['<unk>', 'c', 'b', 'a', 'some_word$']
@@ -599,8 +601,9 @@ def test_glossary_with_two_embeds():
     my_embed4 = text.embedding.CustomEmbedding(pretrain_file_path4, elem_delim,
                                                unknown_token='<unk2>')
 
-    g2 = text.glossary.Glossary(counter, [my_embed3, my_embed4], most_freq_count=None, min_freq=1,
-                                unknown_token='<unk>', reserved_tokens=None)
+    i2 = text.indexer.TokenIndexer(counter, most_freq_count=None, min_freq=1, unknown_token='<unk>',
+                                   reserved_tokens=None)
+    g2 = text.glossary.Glossary(i2, [my_embed3, my_embed4])
     assert_almost_equal(g2.idx_to_vec.asnumpy(),
                         np.array([[1.1, 1.2, 1.3, 1.4, 1.5,
                                    0.11, 0.12, 0.13, 0.14, 0.15],
@@ -614,8 +617,9 @@ def test_glossary_with_two_embeds():
                                    0.11, 0.12, 0.13, 0.14, 0.15]])
                         )
 
-    g3 = text.glossary.Glossary(counter, [my_embed3, my_embed4], most_freq_count=None, min_freq=1,
-                                unknown_token='<unk1>', reserved_tokens=None)
+    i3 = text.indexer.TokenIndexer(counter, most_freq_count=None, min_freq=1,
+                                   unknown_token='<unk1>', reserved_tokens=None)
+    g3 = text.glossary.Glossary(i3, [my_embed3, my_embed4])
     assert_almost_equal(g3.idx_to_vec.asnumpy(),
                         np.array([[1.1, 1.2, 1.3, 1.4, 1.5,
                                    0.11, 0.12, 0.13, 0.14, 0.15],
@@ -629,8 +633,9 @@ def test_glossary_with_two_embeds():
                                    0.11, 0.12, 0.13, 0.14, 0.15]])
                         )
 
-    g4 = text.glossary.Glossary(counter, [my_embed3, my_embed4],most_freq_count=None, min_freq=1,
-                                unknown_token='<unk2>', reserved_tokens=None)
+    i4 = text.indexer.TokenIndexer(counter, most_freq_count=None, min_freq=1,
+                                   unknown_token='<unk2>', reserved_tokens=None)
+    g4 = text.glossary.Glossary(i4, [my_embed3, my_embed4])
     assert_almost_equal(g4.idx_to_vec.asnumpy(),
                         np.array([[1.1, 1.2, 1.3, 1.4, 1.5,
                                    0.11, 0.12, 0.13, 0.14, 0.15],
@@ -646,8 +651,9 @@ def test_glossary_with_two_embeds():
 
     counter2 = Counter(['b', 'b', 'c', 'c', 'c', 'some_word$'])
 
-    g5 = text.glossary.Glossary(counter2, [my_embed3, my_embed4], most_freq_count=None, min_freq=1,
-                                unknown_token='a', reserved_tokens=None)
+    i5 = text.indexer.TokenIndexer(counter2, most_freq_count=None, min_freq=1, unknown_token='a',
+                                   reserved_tokens=None)
+    g5 = text.glossary.Glossary(i5, [my_embed3, my_embed4])
     assert g5.token_to_idx == {'a': 0, 'c': 1, 'b': 2, 'some_word$': 3}
     assert g5.idx_to_token == ['a', 'c', 'b', 'some_word$']
     assert_almost_equal(g5.idx_to_vec.asnumpy(),
