Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/01/30 06:25:43 UTC
[GitHub] piiswrong closed pull request #9626: Fix skipping error in docstr and API docs
URL: https://github.com/apache/incubator-mxnet/pull/9626
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/docs/api/python/contrib/text.md b/docs/api/python/contrib/text.md
index f203a117ba..8bd67d2b50 100644
--- a/docs/api/python/contrib/text.md
+++ b/docs/api/python/contrib/text.md
@@ -138,11 +138,11 @@ data set.
The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
Suppose that we want to build indices for the 2 most frequent keys in `counter` with the unknown
-token representation '<UnK>' and a reserved token '<pad>'.
+token representation '&lt;unk&gt;' and a reserved token '&lt;pad&gt;'.
```python
->>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2, unknown_token='<UnK>',
-... reserved_tokens=['<pad>'])
+>>> my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2, unknown_token='&lt;unk&gt;',
+... reserved_tokens=['&lt;pad&gt;'])
```
@@ -153,18 +153,18 @@ of any unknown token) and `reserved_tokens`.
```python
>>> my_vocab.token_to_idx
-{'<UnK>': 0, '<pad>': 1, 'world': 2, 'hello': 3}
+{'&lt;unk&gt;': 0, '&lt;pad&gt;': 1, 'world': 2, 'hello': 3}
>>> my_vocab.idx_to_token
-['<UnK>', '<pad>', 'world', 'hello']
+['&lt;unk&gt;', '&lt;pad&gt;', 'world', 'hello']
>>> my_vocab.unknown_token
-'<UnK>'
+'&lt;unk&gt;'
>>> my_vocab.reserved_tokens
-['<pad>']
+['&lt;pad&gt;']
>>> len(my_vocab)
4
```
-Besides the specified unknown token '<UnK>' and reserved_token '<pad>' are indexed, the 2 most
+Besides the specified unknown token '&lt;unk&gt;' and reserved_token '&lt;pad&gt;' are indexed, the 2 most
frequent words 'world' and 'hello' are also indexed.
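For readers who want to try the indexing behavior shown above end to end, here is a
minimal self-contained sketch; the toy corpus is illustrative (chosen so the counts
match the example), and only the `text.vocab.Vocabulary` call itself comes from the
documentation:
```python
import collections

from mxnet.contrib import text

# Toy corpus chosen so that 'world' occurs 3 times and 'hello' twice,
# matching the counts in the documentation example above.
counter = collections.Counter('hello world hello nice world hi world'.split())

# Index only the 2 most frequent words, plus the unknown and reserved tokens.
my_vocab = text.vocab.Vocabulary(counter, most_freq_count=2,
                                 unknown_token='<unk>',
                                 reserved_tokens=['<pad>'])

print(my_vocab.token_to_idx)  # {'<unk>': 0, '<pad>': 1, 'world': 2, 'hello': 3}
print(my_vocab.idx_to_token)  # ['<unk>', '<pad>', 'world', 'hello']
print(len(my_vocab))          # 4
```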
@@ -259,9 +259,9 @@ We can also access properties such as `token_to_idx` (mapping tokens to indices)
```python
>>> my_embedding.token_to_idx
-{'<unk>': 0, 'world': 1, 'hello': 2}
+{'&lt;unk&gt;': 0, 'world': 1, 'hello': 2}
>>> my_embedding.idx_to_token
-['<unk>', 'world', 'hello']
+['&lt;unk&gt;', 'world', 'hello']
>>> len(my_embedding)
3
>>> my_embedding.vec_len
@@ -302,7 +302,7 @@ word embedding file, we do not need to specify any vocabulary.
We can access properties such as `token_to_idx` (mapping tokens to indices), `idx_to_token` (mapping
indices to tokens), `vec_len` (length of each embedding vector), and `unknown_token` (representation
-of any unknown token, default value is '<unk>').
+of any unknown token, default value is '&lt;unk&gt;').
```python
>>> my_embedding.token_to_idx['nice']
@@ -312,15 +312,15 @@ of any unknown token, default value is '<unk>').
>>> my_embedding.vec_len
300
>>> my_embedding.unknown_token
-'<unk>'
+'&lt;unk&gt;'
```
-For every unknown token, if its representation '<unk>' is encountered in the pre-trained token
+For every unknown token, if its representation '&lt;unk&gt;' is encountered in the pre-trained token
embedding file, index 0 of property `idx_to_vec` maps to the pre-trained token embedding vector
loaded from the file; otherwise, index 0 of property `idx_to_vec` maps to the default token
embedding vector specified via `init_unknown_vec` (set to nd.zeros here). Since the pre-trained file
-does not have a vector for the token '<unk>', index 0 has to map to an additional token '<unk>' and
+does not have a vector for the token '&lt;unk&gt;', index 0 has to map to an additional token '&lt;unk&gt;' and
the number of tokens in the embedding is 111,052.
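A rough sketch of the loading step this passage describes, following the surrounding
documentation's use of fastText's `wiki.simple.vec` and `init_unknown_vec=nd.zeros`
(the file name and the 111,052 count come from the docs; the download is large, so
treat this as illustrative):
```python
from mxnet import nd
from mxnet.contrib import text

# Load pre-trained fastText vectors without specifying a vocabulary,
# substituting an all-zero vector for the unknown token.
my_embedding = text.embedding.create('fasttext',
                                     pretrained_file_name='wiki.simple.vec',
                                     init_unknown_vec=nd.zeros)

print(my_embedding.unknown_token)  # '<unk>'
print(my_embedding.vec_len)        # 300
print(len(my_embedding))           # 111052: 111,051 file tokens plus '<unk>'

# '<unk>' has no vector in the file, so index 0 holds the nd.zeros default.
print(my_embedding.idx_to_vec[0].sum())  # sums to 0: the unknown vector is all zeros
```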
diff --git a/python/mxnet/contrib/text/embedding.py b/python/mxnet/contrib/text/embedding.py
index 4fc6aacf67..961fbb02a8 100644
--- a/python/mxnet/contrib/text/embedding.py
+++ b/python/mxnet/contrib/text/embedding.py
@@ -646,12 +646,12 @@ class CustomEmbedding(_TokenEmbedding):
This is to load embedding vectors from a user-defined pre-trained text embedding file.
- Denote by '<ed>' the argument `elem_delim`. Denote by <v_ij> the j-th element of the token
- embedding vector for <token_i>, the expected format of a custom pre-trained token embedding file
+ Denote by '[ed]' the argument `elem_delim`. Denote by [v_ij] the j-th element of the token
+ embedding vector for [token_i], the expected format of a custom pre-trained token embedding file
is:
- '<token_1><ed><v_11><ed><v_12><ed>...<ed><v_1k>\\\\n<token_2><ed><v_21><ed><v_22><ed>...<ed>
- <v_2k>\\\\n...'
+ '[token_1][ed][v_11][ed][v_12][ed]...[ed][v_1k]\\\\n[token_2][ed][v_21][ed][v_22][ed]...[ed]
+ [v_2k]\\\\n...'
where k is the length of the embedding vector `vec_len`.
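Concretely, with a space as `elem_delim`, such a file is just one token followed by
its vector elements per line. A minimal sketch (hypothetical file name, arbitrary toy
vectors):
```python
from mxnet.contrib import text

# Write a two-token embedding file in the documented format:
# '[token][ed][v_1][ed]...[ed][v_k]' per line, with ' ' as [ed] and k = 3.
with open('my_embedding.txt', 'w', encoding='utf8') as f:
    f.write('hello 0.1 0.2 0.3\n')
    f.write('world 0.4 0.5 0.6\n')

my_embedding = text.embedding.CustomEmbedding('my_embedding.txt', elem_delim=' ')
print(my_embedding.vec_len)       # 3
print(my_embedding.idx_to_token)  # e.g. ['<unk>', 'hello', 'world']
```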
diff --git a/python/mxnet/contrib/text/vocab.py b/python/mxnet/contrib/text/vocab.py
index 04c3326841..9e44acb101 100644
--- a/python/mxnet/contrib/text/vocab.py
+++ b/python/mxnet/contrib/text/vocab.py
@@ -52,7 +52,7 @@ class Vocabulary(object):
argument has no effect.
min_freq : int, default 1
The minimum frequency required for a token in the keys of `counter` to be indexed.
- unknown_token : hashable object, default '<unk>'
+ unknown_token : hashable object, default '&lt;unk&gt;'
The representation for any unknown token. In other words, any unknown token will be indexed
as the same representation. Keys of `counter`, `unknown_token`, and values of
`reserved_tokens` must be of the same hashable type. Examples: str, int, and tuple.
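To illustrate the `min_freq` and `unknown_token` parameters documented in this
docstring, a small sketch (the counts are made up, and `to_indices` is the lookup
method documented elsewhere in this module):
```python
import collections

from mxnet.contrib import text

counter = collections.Counter({'world': 3, 'hello': 2, 'hi': 1})

# With min_freq=2, 'hi' falls below the frequency threshold and is not
# indexed; any lookup of 'hi' resolves to the unknown token at index 0.
my_vocab = text.vocab.Vocabulary(counter, min_freq=2, unknown_token='<unk>')
print(my_vocab.idx_to_token)                 # ['<unk>', 'world', 'hello']
print(my_vocab.to_indices(['hello', 'hi']))  # [2, 0]
```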
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services