Posted to commits@mxnet.apache.org by jx...@apache.org on 2018/01/30 06:25:43 UTC
[incubator-mxnet] branch master updated: Fix skipping error in docstr and API docs (#9626)
This is an automated email from the ASF dual-hosted git repository.
jxie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-mxnet.git
The following commit(s) were added to refs/heads/master by this push:
new 5e0a0b0 Fix skipping error in docstr and API docs (#9626)
5e0a0b0 is described below
commit 5e0a0b0bd54cdeb92321f958bf964ddc8aca94e9
Author: Aston Zhang <22...@users.noreply.github.com>
AuthorDate: Mon Jan 29 22:25:39 2018 -0800
Fix skipping error in docstr and API docs (#9626)
* Fix skipping error in docstr
* update
---
docs/api/python/contrib/text.md | 28 ++++++++++++++--------------
python/mxnet/contrib/text/embedding.py | 8 ++++----
python/mxnet/contrib/text/vocab.py | 2 +-
3 files changed, 19 insertions(+), 19 deletions(-)
diff --git a/docs/api/python/contrib/text.md b/docs/api/python/contrib/text.md
index f203a11..8bd67d2 100644
--- a/docs/api/python/contrib/text.md
+++ b/docs/api/python/contrib/text.md
@@ -138,11 +138,11 @@ data set.
The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
Suppose that we want to build indices for the 2 most frequent keys in `counter` with the unknown
-token representation '<UnK>' and a reserved token '<pad>'.
+token representation '&lt;unk&gt;' and a reserved token '&lt;pad&gt;'.
```python
->>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2, unknown_token='<UnK>',
-... reserved_tokens=['<pad>'])
+>>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2, unknown_token='&lt;unk&gt;',
+... reserved_tokens=['&lt;pad&gt;'])
```
@@ -153,18 +153,18 @@ of any unknown token) and `reserved_tokens`.
```python
>>> my_vocab.token_to_idx
-{'<UnK>': 0, '<pad>': 1, 'world': 2, 'hello': 3}
+{'&lt;unk&gt;': 0, '&lt;pad&gt;': 1, 'world': 2, 'hello': 3}
>>> my_vocab.idx_to_token
-['<UnK>', '<pad>', 'world', 'hello']
+['&lt;unk&gt;', '&lt;pad&gt;', 'world', 'hello']
>>> my_vocab.unknown_token
-'<UnK>'
+'&lt;unk&gt;'
>>> my_vocab.reserved_tokens
-['<pad>']
+['&lt;pad&gt;']
>>> len(my_vocab)
4
```
-Besides the specified unknown token '<UnK>' and reserved_token '<pad>' are indexed, the 2 most
+Besides the specified unknown token '&lt;unk&gt;' and reserved_token '&lt;pad&gt;' are indexed, the 2 most
frequent words 'world' and 'hello' are also indexed.
@@ -259,9 +259,9 @@ We can also access properties such as `token_to_idx` (mapping tokens to indices)
```python
>>> my_embedding.token_to_idx
-{'<unk>': 0, 'world': 1, 'hello': 2}
+{'&lt;unk&gt;': 0, 'world': 1, 'hello': 2}
>>> my_embedding.idx_to_token
-['<unk>', 'world', 'hello']
+['&lt;unk&gt;', 'world', 'hello']
>>> len(my_embedding)
3
>>> my_embedding.vec_len
@@ -302,7 +302,7 @@ word embedding file, we do not need to specify any vocabulary.
We can access properties such as `token_to_idx` (mapping tokens to indices), `idx_to_token` (mapping
indices to tokens), `vec_len` (length of each embedding vector), and `unknown_token` (representation
-of any unknown token, default value is '<unk>').
+of any unknown token, default value is '&lt;unk&gt;').
```python
>>> my_embedding.token_to_idx['nice']
@@ -312,15 +312,15 @@ of any unknown token, default value is '<unk>').
>>> my_embedding.vec_len
300
>>> my_embedding.unknown_token
-'<unk>'
+'&lt;unk&gt;'
```
-For every unknown token, if its representation '<unk>' is encountered in the pre-trained token
+For every unknown token, if its representation '&lt;unk&gt;' is encountered in the pre-trained token
embedding file, index 0 of property `idx_to_vec` maps to the pre-trained token embedding vector
loaded from the file; otherwise, index 0 of property `idx_to_vec` maps to the default token
embedding vector specified via `init_unknown_vec` (set to nd.zeros here). Since the pre-trained file
-does not have a vector for the token '<unk>', index 0 has to map to an additional token '<unk>' and
+does not have a vector for the token '&lt;unk&gt;', index 0 has to map to an additional token '&lt;unk&gt;' and
the number of tokens in the embedding is 111,052.
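For readers following along, the documented flow patched above can be reproduced end to end. A minimal sketch, assuming an MXNet build that ships the `mxnet.contrib.text` API shown in this diff; expected outputs appear as comments:

```python
from mxnet.contrib import text

# Count word frequencies from a whitespace-separated corpus string.
counter = text.utils.count_tokens_from_str(' hello world \n hello nice world \n hi world \n')

# Index the 2 most frequent words, plus the unknown and reserved tokens.
my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2, unknown_token='<unk>',
                                 reserved_tokens=['<pad>'])

print(my_vocab.token_to_idx)   # {'<unk>': 0, '<pad>': 1, 'world': 2, 'hello': 3}
print(my_vocab.idx_to_token)   # ['<unk>', '<pad>', 'world', 'hello']
print(len(my_vocab))           # 4

# 'nice' was not indexed, so it maps to the unknown token's index.
print(my_vocab.to_indices(['hello', 'nice']))  # [3, 0]
```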
diff --git a/python/mxnet/contrib/text/embedding.py b/python/mxnet/contrib/text/embedding.py
index 4fc6aac..961fbb0 100644
--- a/python/mxnet/contrib/text/embedding.py
+++ b/python/mxnet/contrib/text/embedding.py
@@ -646,12 +646,12 @@ class CustomEmbedding(_TokenEmbedding):
This is to load embedding vectors from a user-defined pre-trained text embedding file.
- Denote by '<ed>' the argument `elem_delim`. Denote by <v_ij> the j-th element of the token
- embedding vector for <token_i>, the expected format of a custom pre-trained token embedding file
+ Denote by '[ed]' the argument `elem_delim`. Denote by [v_ij] the j-th element of the token
+ embedding vector for [token_i], the expected format of a custom pre-trained token embedding file
is:
- '<token_1><ed><v_11><ed><v_12><ed>...<ed><v_1k>\\\\n<token_2><ed><v_21><ed><v_22><ed>...<ed>
- <v_2k>\\\\n...'
+ '[token_1][ed][v_11][ed][v_12][ed]...[ed][v_1k]\\\\n[token_2][ed][v_21][ed][v_22][ed]...[ed]
+ [v_2k]\\\\n...'
where k is the length of the embedding vector `vec_len`.
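To illustrate the file format this docstring describes, here is a hedged sketch that writes a two-token embedding file (a space as `elem_delim`, `vec_len` k = 2) and loads it back with `CustomEmbedding`; the file name `my_embedding.txt` is made up for the example, and the printed values are the expected results:

```python
from mxnet import nd
from mxnet.contrib import text

# Two tokens with 2-dimensional vectors, in the documented layout:
# [token_1][ed][v_11][ed][v_12]\n[token_2][ed][v_21][ed][v_22]\n
with open('my_embedding.txt', 'w') as f:
    f.write('hello 0.1 0.2\nworld 0.3 0.4\n')

my_embedding = text.embedding.CustomEmbedding('my_embedding.txt', elem_delim=' ',
                                              init_unknown_vec=nd.zeros)

print(my_embedding.vec_len)                        # 2, inferred from the file
print(my_embedding.token_to_idx)                   # {'<unk>': 0, 'hello': 1, 'world': 2}
print(my_embedding.get_vecs_by_tokens(['world']))  # NDArray holding [0.3, 0.4]
```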
diff --git a/python/mxnet/contrib/text/vocab.py b/python/mxnet/contrib/text/vocab.py
index 04c3326..9e44acb 100644
--- a/python/mxnet/contrib/text/vocab.py
+++ b/python/mxnet/contrib/text/vocab.py
@@ -52,7 +52,7 @@ class Vocabulary(object):
argument has no effect.
min_freq : int, default 1
The minimum frequency required for a token in the keys of `counter` to be indexed.
- unknown_token : hashable object, default '<unk>'
+ unknown_token : hashable object, default '&lt;unk&gt;'
The representation for any unknown token. In other words, any unknown token will be indexed
as the same representation. Keys of `counter`, `unknown_token`, and values of
`reserved_tokens` must be of the same hashable type. Examples: str, int, and tuple.
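A small sketch of how the `min_freq` parameter documented here interacts with `unknown_token`, assuming the same `mxnet.contrib.text` API (a plain `collections.Counter` is accepted as the `counter` argument):

```python
from collections import Counter
from mxnet.contrib import text

counter = Counter({'world': 3, 'hello': 2, 'hi': 1})

# With min_freq=2, 'hi' falls below the frequency threshold and is not
# indexed; looking it up yields the unknown token's index instead.
vocab = text.vocab.Vocabulary(counter, min_freq=2, unknown_token='<unk>')
print(vocab.token_to_idx)        # {'<unk>': 0, 'world': 1, 'hello': 2}
print(vocab.to_indices(['hi']))  # [0]
```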
--
To stop receiving notification emails like this one, please contact
jxie@apache.org.