Posted to commits@mxnet.apache.org by zh...@apache.org on 2018/03/27 18:28:26 UTC
[incubator-mxnet] 01/02: Add gluon.text vocab/embedding demo (#18)
This is an automated email from the ASF dual-hosted git repository.
zhasheng pushed a commit to branch nlp_toolkit
in repository https://gitbox.apache.org/repos/asf/incubator-mxnet.git
commit 14d7499431e4e90efdebc832250f316c81dd019b
Author: Aston Zhang <22...@users.noreply.github.com>
AuthorDate: Mon Mar 26 23:03:14 2018 -0700
Add gluon.text vocab/embedding demo (#18)
* Add word embedding example
* clean
* Add text descriptions
---
example/gluon/word_embedding.ipynb | 1049 ++++++++++++++++++++++++++++++++++
python/mxnet/gluon/text/embedding.py | 6 +
2 files changed, 1055 insertions(+)
diff --git a/example/gluon/word_embedding.ipynb b/example/gluon/word_embedding.ipynb
new file mode 100644
index 0000000..f3c3217
--- /dev/null
+++ b/example/gluon/word_embedding.ipynb
@@ -0,0 +1,1049 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Using Pre-trained Word Embeddings"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here we introduce how to use pre-trained word embeddings via `mxnet.gluon.text`. \n",
+ "\n",
+ "The used GloVe and fastText word embeddings in this tutorial are from the following sources:\n",
+ "\n",
+ "* GloVe project website:https://nlp.stanford.edu/projects/glove/\n",
+ "* fastText project website:https://fasttext.cc/\n",
+ "\n",
+ "Let us first import the following packages."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:34.447895Z",
+ "start_time": "2018-03-27T00:03:33.503038Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from mxnet import gluon\n",
+ "from mxnet import nd\n",
+ "from mxnet.gluon import text\n",
+ "from collections import Counter"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Creating Vocabulary with Word Embeddings\n",
+ "\n",
+ "As a common use case, let us index words, attach pre-trained word embeddings for them, and use such embeddings in `gluon` in just a few lines of code.\n",
+ "\n",
+ "### Creating Vocabulary from Data Sets\n",
+ "\n",
+ "To begin with, suppose that we have a simple text data set in the string format. We can count word frequency in the data set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:34.453636Z",
+ "start_time": "2018-03-27T00:03:34.449760Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "data = \" hello world \\n hello nice world \\n hi world \\n\"\n",
+ "counter = text.utils.count_tokens_from_str(data)"
+ ]
+ },
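+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Before indexing, we can peek at a few of the counts. This is a quick sketch that assumes the returned counter supports standard dict-style lookups, as `collections.Counter` does; the frequencies follow from the data string above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A sketch, assuming dict-style access on `counter`:\n",
+ "# 'world' appears 3 times, 'hello' twice, 'nice' and 'hi' once each.\n",
+ "print(counter['world'], counter['hello'])"
+ ]
+ },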
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The obtained `counter` has key-value pairs whose keys are words and values are word frequencies. This allows us to filter out infrequent words via `Vocabulary` arguments such as `max_size` and `min_freq`. Suppose that we want to build indices for all the keys in counter. We need a `Vocabulary` instance with counter as its argument."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:34.459747Z",
+ "start_time": "2018-03-27T00:03:34.456473Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "vocab = text.vocab.Vocabulary(counter)"
+ ]
+ },
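+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a sketch of the `max_size`/`min_freq` filtering mentioned above (assuming the `min_freq` keyword argument of `Vocabulary`), we can drop words that appear fewer than two times. Under that assumption, 'nice' and 'hi' would be filtered out, leaving 'hello', 'world', and the unknown token:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A sketch, assuming the `min_freq` keyword described above.\n",
+ "# 'nice' and 'hi' appear only once each and would be filtered out,\n",
+ "# leaving 'hello', 'world', and the unknown token (size 3).\n",
+ "vocab_min2 = text.vocab.Vocabulary(counter, min_freq=2)\n",
+ "len(vocab_min2)"
+ ]
+ },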
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To attach word embedding to indexed words in `vocab`, let us go on to create a fastText word embedding instance by specifying the embedding name `fasttext` and the pre-trained file name `wiki.simple.vec`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:53.199585Z",
+ "start_time": "2018-03-27T00:03:34.462702Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/Users/astonz/WorkDocs/Programs/git_repo/mxnet/python/mxnet/gluon/text/embedding.py:264: UserWarning: At line 1 of the pre-trained token embedding file: token 111051 with 1-dimensional vector [300.0] is likely a header and is skipped.\n",
+ " 'skipped.' % (line_num, token, elems))\n"
+ ]
+ }
+ ],
+ "source": [
+ "fasttext_simple = text.embedding.create('fasttext', file_name='wiki.simple.vec')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "So we can attach word embedding `fasttext_simple` to indexed words in `vocab`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:53.214582Z",
+ "start_time": "2018-03-27T00:03:53.201953Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "vocab.set_embedding(fasttext_simple)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To see other pre-trained file names under the fastText word embedding, we can use `text.embedding.get_file_names`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:53.240556Z",
+ "start_time": "2018-03-27T00:03:53.217839Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['crawl-300d-2M.vec',\n",
+ " 'wiki.aa.vec',\n",
+ " 'wiki.ab.vec',\n",
+ " 'wiki.ace.vec',\n",
+ " 'wiki.ady.vec']"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "text.embedding.get_file_names('fasttext')[:5]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The created vocabulary `vocab` includes four different words and a special unknown token. Let us check the size of `vocab`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:53.250542Z",
+ "start_time": "2018-03-27T00:03:53.243313Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "5"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(vocab)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "By default, the vector of any token that is unknown to `vocab` is a zero vector. Its length is equal to the vector dimension of the fastText word embeddings: 300."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:53.262146Z",
+ "start_time": "2018-03-27T00:03:53.253051Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(300,)"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "vocab.embedding['beautiful'].shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The first five elements of the vector of any unknown token are zeros."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:53.273198Z",
+ "start_time": "2018-03-27T00:03:53.264987Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\n",
+ "[ 0. 0. 0. 0. 0.]\n",
+ "<NDArray 5 @cpu(0)>"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "vocab.embedding['beautiful'][:5]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let us check the shape of the vectors of words 'hello' and 'world' from `vocab`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:53.283862Z",
+ "start_time": "2018-03-27T00:03:53.276282Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(2, 300)"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "vocab.embedding['hello', 'world'].shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-26T23:29:07.340108Z",
+ "start_time": "2018-03-26T23:29:07.334790Z"
+ }
+ },
+ "source": [
+ "We can access the first five elements of the vectors of 'hello' and 'world'."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:53.296482Z",
+ "start_time": "2018-03-27T00:03:53.287022Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\n",
+ "[[ 0.39567 0.21454 -0.035389 -0.24299 -0.095645 ]\n",
+ " [ 0.10444 -0.10858 0.27212 0.13299 -0.33164999]]\n",
+ "<NDArray 2x5 @cpu(0)>"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "vocab.embedding['hello', 'world'][:, :5]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Using Pre-trained Word Embeddings in `gluon.nn.Embedding`\n",
+ "\n",
+ "To demonstrate how to use pre-trained word embeddings in the `gluon` package, let us first obtain indices of the words 'hello' and 'world'."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:53.306574Z",
+ "start_time": "2018-03-27T00:03:53.300400Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[2, 1]"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "vocab['hello', 'world']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can obtain the vectors for the words 'hello' and 'world' by specifying their indices (2 and 1) and the weight matrix `vocab.embedding.idx_to_vec` in `gluon.nn.Embedding`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:53.327785Z",
+ "start_time": "2018-03-27T00:03:53.309979Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\n",
+ "[[ 0.39567 0.21454 -0.035389 -0.24299 -0.095645 ]\n",
+ " [ 0.10444 -0.10858 0.27212 0.13299 -0.33164999]]\n",
+ "<NDArray 2x5 @cpu(0)>"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "input_dim, output_dim = vocab.embedding.idx_to_vec.shape\n",
+ "layer = gluon.nn.Embedding(input_dim, output_dim)\n",
+ "layer.initialize()\n",
+ "layer.weight.set_data(vocab.embedding.idx_to_vec)\n",
+ "layer(nd.array([2, 1]))[:, :5]"
+ ]
+ },
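+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Equivalently, instead of hard-coding the indices 2 and 1, we can look them up from `vocab` first (a small sketch reusing the lookup shown earlier):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Look up the indices from `vocab` instead of hard-coding [2, 1].\n",
+ "layer(nd.array(vocab['hello', 'world']))[:, :5]"
+ ]
+ },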
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Creating Vocabulary from Pre-trained Word Embeddings\n",
+ "\n",
+ "We can also create vocabulary by using vocabulary of pre-trained word embeddings, such as GloVe. Below are a few pre-trained file names under the GloVe word embedding."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:03:53.338638Z",
+ "start_time": "2018-03-27T00:03:53.330822Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['glove.42B.300d.txt',\n",
+ " 'glove.6B.50d.txt',\n",
+ " 'glove.6B.100d.txt',\n",
+ " 'glove.6B.200d.txt',\n",
+ " 'glove.6B.300d.txt']"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "text.embedding.get_file_names('glove')[:5]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For simplicity of demonstration, we use a smaller word embedding file, such as the 50-dimensional one. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:04.229138Z",
+ "start_time": "2018-03-27T00:03:53.341827Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "glove_6b50d = text.embedding.create('glove', file_name='glove.6B.50d.txt')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now we create vocabulary by using all the tokens from `glove_6b50d`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:06.032364Z",
+ "start_time": "2018-03-27T00:04:04.231212Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "vocab = text.vocab.Vocabulary(Counter(glove_6b50d.idx_to_token))\n",
+ "vocab.set_embedding(glove_6b50d)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Below shows the size of `vocab` including a special unknown token."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:06.042843Z",
+ "start_time": "2018-03-27T00:04:06.034933Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "400001"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(vocab.idx_to_token)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can access attributes of `vocab`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:06.056449Z",
+ "start_time": "2018-03-27T00:04:06.046106Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "71421\n",
+ "beautiful\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(vocab['beautiful'])\n",
+ "print(vocab.idx_to_token[71421])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Applications of Word Embeddings\n",
+ "\n",
+ "To apply word embeddings, we need to define cosine similarity. It can compare similarity of two vectors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:06.067188Z",
+ "start_time": "2018-03-27T00:04:06.059379Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from mxnet import nd\n",
+ "def cos_sim(x, y):\n",
+ " return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The range of cosine similarity between two vectors is between -1 and 1. The larger the value, the similarity between two vectors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:06.272263Z",
+ "start_time": "2018-03-27T00:04:06.070098Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "[ 1.]\n",
+ "<NDArray 1 @cpu(0)>\n",
+ "\n",
+ "[-1.]\n",
+ "<NDArray 1 @cpu(0)>\n"
+ ]
+ }
+ ],
+ "source": [
+ "x = nd.array([1, 2])\n",
+ "y = nd.array([10, 20])\n",
+ "z = nd.array([-1, -2])\n",
+ "\n",
+ "print(cos_sim(x, y))\n",
+ "print(cos_sim(x, z))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Word Similarity\n",
+ "\n",
+ "Given an input word, we can find the nearest $k$ words from the vocabulary (400,000 words excluding the unknown token) by similarity. The similarity between any pair of words can be represented by the cosine similarity of their vectors."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:06.292283Z",
+ "start_time": "2018-03-27T00:04:06.274721Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def norm_vecs_by_row(x):\n",
+ " return x / nd.sqrt(nd.sum(x * x, axis=1)).reshape((-1,1))\n",
+ "\n",
+ "def get_knn(vocab, k, word):\n",
+ " word_vec = vocab.embedding[word].reshape((-1, 1))\n",
+ " vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)\n",
+ " dot_prod = nd.dot(vocab_vecs, word_vec)\n",
+ " indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k+2, ret_typ='indices')\n",
+ " indices = [int(i.asscalar()) for i in indices]\n",
+ " # Remove unknown and input tokens.\n",
+ " return vocab.to_tokens(indices[2:])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let us find the 5 most similar words of 'baby' from the vocabulary (size: 400,000 words)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:06.687950Z",
+ "start_time": "2018-03-27T00:04:06.295771Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['babies', 'boy', 'girl', 'newborn', 'pregnant']"
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "get_knn(vocab, 5, 'baby')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can verify the cosine similarity of vectors of 'baby' and 'babies'."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:06.698920Z",
+ "start_time": "2018-03-27T00:04:06.691103Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\n",
+ "[ 0.83871299]\n",
+ "<NDArray 1 @cpu(0)>"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cos_sim(vocab.embedding['baby'], vocab.embedding['babies'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let us find the 5 most similar words of 'computers' from the vocabulary."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:07.084357Z",
+ "start_time": "2018-03-27T00:04:06.702292Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['computer', 'phones', 'pcs', 'machines', 'devices']"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "get_knn(vocab, 5, 'computers')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let us find the 5 most similar words of 'run' from the vocabulary."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:07.504323Z",
+ "start_time": "2018-03-27T00:04:07.087221Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['running', 'runs', 'went', 'start', 'ran']"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "get_knn(vocab, 5, 'run')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let us find the 5 most similar words of 'beautiful' from the vocabulary."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:07.967072Z",
+ "start_time": "2018-03-27T00:04:07.507039Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['lovely', 'gorgeous', 'wonderful', 'charming', 'beauty']"
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "get_knn(vocab, 5, 'beautiful')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Word Analogy\n",
+ "\n",
+ "We can also apply pre-trained word embeddings to the word analogy problem. For instance, \"man : woman :: son : daughter\" is an analogy. The word analogy completion problem is defined as: for analogy 'a : b :: c : d', given teh first three words 'a', 'b', 'c', find 'd'. The idea is to find the most similar word vector for vec('c') + (vec('b')-vec('a')).\n",
+ "\n",
+ "In this example, we will find words by analogy from the 400,000 indexed words in `vocab`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:08.040101Z",
+ "start_time": "2018-03-27T00:04:07.973776Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def get_top_k_by_analogy(vocab, k, word1, word2, word3):\n",
+ " word_vecs = vocab.embedding[word1, word2, word3]\n",
+ " word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2]).reshape((-1, 1))\n",
+ " vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)\n",
+ " dot_prod = nd.dot(vocab_vecs, word_diff)\n",
+ " indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k+1, ret_typ='indices')\n",
+ " indices = [int(i.asscalar()) for i in indices]\n",
+ "\n",
+ " # Filter out unknown tokens.\n",
+ " if vocab.to_tokens(indices[0]) == vocab.unknown_token:\n",
+ " return vocab.to_tokens(indices[1:])\n",
+ " else:\n",
+ " return vocab.to_tokens(indices[:-1])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Complete word analogy 'man : woman :: son :'."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:08.519697Z",
+ "start_time": "2018-03-27T00:04:08.051060Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['daughter']"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "get_top_k_by_analogy(vocab, 1, 'man', 'woman', 'son')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let us verify the cosine similarity between vec('son')+vec('woman')-vec('man') and vec('daughter')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:08.535690Z",
+ "start_time": "2018-03-27T00:04:08.522548Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\n",
+ "[ 0.9658342]\n",
+ "<NDArray 1 @cpu(0)>"
+ ]
+ },
+ "execution_count": 29,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def cos_sim_word_analogy(vocab, word1, word2, word3, word4):\n",
+ " words = [word1, word2, word3, word4]\n",
+ " vecs = vocab.embedding[words]\n",
+ " return cos_sim(vecs[1] - vecs[0] + vecs[2], vecs[3])\n",
+ "\n",
+ "cos_sim_word_analogy(vocab, 'man', 'woman', 'son', 'daughter')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Complete word analogy 'beijing : china :: tokyo : '."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:08.939664Z",
+ "start_time": "2018-03-27T00:04:08.538918Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['japan']"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "get_top_k_by_analogy(vocab, 1, 'beijing', 'china', 'tokyo')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Complete word analogy 'bad : worst :: big : '."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:09.319291Z",
+ "start_time": "2018-03-27T00:04:08.942078Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['biggest']"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "get_top_k_by_analogy(vocab, 1, 'bad', 'worst', 'big')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Complete word analogy 'do : did :: go :'."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2018-03-27T00:04:09.735225Z",
+ "start_time": "2018-03-27T00:04:09.323663Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['went']"
+ ]
+ },
+ "execution_count": 32,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "get_top_k_by_analogy(vocab, 1, 'do', 'did', 'go')"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.1"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/python/mxnet/gluon/text/embedding.py b/python/mxnet/gluon/text/embedding.py
index 1839212..fcbc6df 100644
--- a/python/mxnet/gluon/text/embedding.py
+++ b/python/mxnet/gluon/text/embedding.py
@@ -155,6 +155,8 @@ class TokenEmbedding(object):
Properties
----------
+ idx_to_token : list of strs
+ A list of indexed tokens where the list indices and the token indices are aligned.
idx_to_vec : mxnet.ndarray.NDArray
For all the indexed tokens in this embedding, this NDArray maps each token's index to an
embedding vector.
@@ -285,6 +287,10 @@ class TokenEmbedding(object):
self._idx_to_vec[C.UNKNOWN_IDX] = nd.array(loaded_unknown_vec)
@property
+ def idx_to_token(self):
+ return self._idx_to_token
+
+ @property
def idx_to_vec(self):
return self._idx_to_vec