Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/12/19 11:46:35 UTC

[GitHub] jegalgo closed pull request #9141: add lstm sentimental analysis tutorial

jegalgo closed pull request #9141: add lstm sentimental analysis tutorial
URL: https://github.com/apache/incubator-mxnet/pull/9141
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/example/lstm-sentimal_analysis/lstm_cell.png b/example/lstm-sentimal_analysis/lstm_cell.png
new file mode 100644
index 0000000000..442b50b6f0
Binary files /dev/null and b/example/lstm-sentimal_analysis/lstm_cell.png differ
diff --git a/example/lstm-sentimal_analysis/lstm_diagram.png b/example/lstm-sentimal_analysis/lstm_diagram.png
new file mode 100644
index 0000000000..b2cebc14e0
Binary files /dev/null and b/example/lstm-sentimal_analysis/lstm_diagram.png differ
diff --git a/example/lstm-sentimal_analysis/tutorial_lstm.ipynb b/example/lstm-sentimal_analysis/tutorial_lstm.ipynb
new file mode 100644
index 0000000000..571288096b
--- /dev/null
+++ b/example/lstm-sentimal_analysis/tutorial_lstm.ipynb
@@ -0,0 +1,491 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# LSTM: Sentimental Analysis using IMDB movie review data set\n",
+    "\n",
+    "- In this tutorial, we will create a model that determines whether movie reviews are positive or negative.\n",
+    "- This tutorial is intended for those who have a good understanding of python and a brief understanding of the LSTM model.\n",
+    "\n",
+    "\n",
+    "## Prerequisites\n",
+    "\n",
+    "You need to install \n",
+    "    - Python 3.6. https://www.python.org/downloads/\n",
+    "    - mxnet(gpu ver.) https://mxnet.incubator.apache.org/get_started/install.html\n",
+    "\n",
+    "You also need to understand the LSTM model.  \n",
+    "In this tutorial, a simple knowledge already presented on several websites is sufficient. However, if you want a deep understanding, please refer to the following article.\n",
+    "    - https://arxiv.org/pdf/1503.00185v5.pdf\n",
+    "    \n",
+    "Finally, you need to know about python, mxnet, gloun package, and so on.\n",
+    "\n",
+    "\n",
+    "## The Data\n",
+    "\n",
+    "We will use IMDB. You can download it from the following link.  \n",
+    "It is divided into 25000 train data and test data each labeled.  \n",
+    "See the data set's README for more information.\n",
+    "    - http://ai.stanford.edu/~amaas/data/sentiment/\n",
+    "    \n",
+    "We will use Stanford's Global Vector for Word Representation (GloVe) embedding for word embedding.  \n",
+    "Download glove.42B.300d.zip from the following link:\n",
+    "    - https://nlp.stanford.edu/projects/glove/\n",
+    "    \n",
+    "    \n",
+    "## Prepare the Data\n",
+    "\n",
+    "##### 1. Read the data and label it. (1 means positive and 0 means negative.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "# helper function to read files\n",
+    "def read_files(folder_name):\n",
+    "    sentiments =[]\n",
+    "    filenames = os.listdir(os.curdir+\"/\"+folder_name)\n",
+    "\n",
+    "    for file in filenames:\n",
+    "        with open(folder_name+\"/\"+file, \"r\", encoding=\"utf8\") as f:\n",
+    "            data = f.read().replace(\"\\n\", \"\")\n",
+    "            sentiments.append(data)\n",
+    "\n",
+    "    return sentiments\n",
+    "\n",
+    "data_path = '../data/aclImdb'\n",
+    "\n",
+    "# prepare train data\n",
+    "train_pos_foldername = data_path + \"/train/pos\"\n",
+    "train_pos_sentiments = read_files(train_pos_foldername)\n",
+    "\n",
+    "train_neg_foldername = data_path + \"/train/neg\"\n",
+    "train_neg_sentiments = read_files(train_neg_foldername)\n",
+    "\n",
+    "train_pos_labels = [1 for _ in train_pos_sentiments]\n",
+    "train_neg_labels = [0 for _ in train_neg_sentiments]\n",
+    "train_all_labels = train_pos_labels + train_neg_labels\n",
+    "\n",
+    "train_all_sentiments = train_pos_sentiments + train_neg_sentiments\n",
+    "\n",
+    "# prepare test data\n",
+    "test_pos_foldername = data_path + \"/test/pos\"\n",
+    "test_pos_sentiments = read_files(test_pos_foldername)\n",
+    "\n",
+    "test_neg_foldername = data_path + \"/test/neg\"\n",
+    "test_neg_sentiments = read_files(test_neg_foldername)\n",
+    "\n",
+    "test_pos_labels = [1 for _ in test_pos_sentiments]\n",
+    "test_neg_labels = [0 for _ in test_neg_sentiments]\n",
+    "test_all_labels = test_pos_labels + test_neg_labels\n",
+    "\n",
+    "test_all_sentiments = train_pos_sentiments + train_neg_sentiments"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### 2. We will make a dictionary from the words in the train data. Before that, let's define some helper functions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Delete various special characters from the review sentence.\n",
+    "def clear_str(sentiment):\n",
+    "    import re\n",
+    "    string = sentiment.lower().replace(\"<br />\", \" \")\n",
+    "    remove_spe_chars = re.compile(\"[^A-Za-z0-9 ]+\")\n",
+    "\n",
+    "    return re.sub(remove_spe_chars, \"\", string.lower())\n",
+    "\n",
+    "# Count the frequency of occurrence of a word.\n",
+    "def count_word(sentiments):\n",
+    "    from collections import Counter\n",
+    "    \n",
+    "    word_counter = Counter()\n",
+    "    for sentiment in sentiments:\n",
+    "        for word in (clear_str(sentiment)).split():\n",
+    "            if word not in word_counter.keys():\n",
+    "                word_counter[word] = 1\n",
+    "            else:\n",
+    "                word_counter[word] += 1\n",
+    "\n",
+    "    return word_counter\n",
+    "\n",
+    "# make dictionary\n",
+    "def create_word_index(word_counter):\n",
+    "    idx = 1\n",
+    "    word_dict = {}\n",
+    "\n",
+    "    for word in word_counter.most_common():\n",
+    "        word_dict[word[0]] = idx\n",
+    "        idx += 1\n",
+    "\n",
+    "    return word_dict \n",
+    "\n",
+    "# save file. \n",
+    "# The dictionary is very large. \n",
+    "# Therefore, it is better to do this only the first time you create it, and then save and load it as a file.\n",
+    "import _pickle as pkl\n",
+    "def save_file(file_obj, file_name):\n",
+    "    f = open(file_name, 'wb')\n",
+    "    pkl.dump(file_obj, f, -1)\n",
+    "    f.close()"
+   ]
+  },
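+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a small illustrative check (the review text below is made up, not taken from the data set), here is what the helpers do to a raw review before we build the dictionary.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# a made-up snippet in the same style as a raw IMDB review\n",
+    "example_review = \"This movie is GREAT!<br />Really great.\"\n",
+    "\n",
+    "print(clear_str(example_review))     # 'this movie is great really great'\n",
+    "print(count_word([example_review]))  # Counter({'great': 2, 'this': 1, 'movie': 1, 'is': 1, 'really': 1})"
+   ]
+  },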
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### 3. Now let's make a dictionary!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def make_dictionary(dictionary_file_name, sentiments):\n",
+    "    # if file exist, return file\n",
+    "    if os.path.exists(dictionary_file_name):\n",
+    "        f = open(dictionary_file_name, 'rb')\n",
+    "        word_dict = pkl.load(f)\n",
+    "        f.close()\n",
+    "        return word_dict\n",
+    "\n",
+    "    else:\n",
+    "        # count word\n",
+    "        word_counter = count_word(sentiments)\n",
+    "        word_dict = create_word_index(word_counter)\n",
+    "        save_file(word_dict, dictionary_file_name)\n",
+    "        return word_dict\n",
+    "    \n",
+    "\n",
+    "dictionary_file_name = 'imdb.dict.pkl'\n",
+    "word_dict = make_dictionary( dictionary_file_name, train_all_sentiments )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### 4. Encode the sentence using the dictionary created and fit it into numpy format."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def encode_sentences(input_file, word_dict):\n",
+    "    output_string = []\n",
+    "    for line in input_file:\n",
+    "        output_line = []\n",
+    "        for word in clear_str(line).split():\n",
+    "            if word in word_dict:\n",
+    "                output_line.append(word_dict[word])\n",
+    "        output_string.append(output_line)\n",
+    "\n",
+    "    return output_string\n",
+    "\n",
+    "train_pos_encoded = encode_sentences(train_pos_sentiments, word_dict)\n",
+    "train_neg_encoded = encode_sentences(train_neg_sentiments, word_dict)\n",
+    "train_all_encoded = train_pos_encoded + train_neg_encoded\n",
+    "\n",
+    "test_pos_encoded = encode_sentences(test_pos_sentiments, word_dict)\n",
+    "test_neg_encoded = encode_sentences(test_neg_sentiments, word_dict)\n",
+    "test_all_encoded = test_pos_encoded + test_neg_encoded\n",
+    "\n",
+    "\n",
+    "import numpy as np\n",
+    "voca_size = 10000 # total number of voca we will track\n",
+    "train_data = [np.array([i if i < (voca_size-1) else (voca_size-1) for i in s]) for s in train_all_encoded]\n",
+    "test_data = [np.array([i if i < (voca_size-1) else (voca_size-1) for i in s]) for s in test_all_encoded]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### 5. We now want to show what each word means through word embedding, which is another big task. In this case, we would like to use the already created model, Stanford's Global Vector for Word Representation (Glove) embedding. You have already downloaded it from The Data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py:46: DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom instead\n",
+      "  import OpenSSL.SSL\n"
+     ]
+    }
+   ],
+   "source": [
+    "from mxnet import nd\n",
+    "def load_glove_index(path):\n",
+    "    import io\n",
+    "    f = io.open(path, encoding=\"utf8\")\n",
+    "    embeddings_index = {}\n",
+    "\n",
+    "    for line in f:\n",
+    "        values = line.split()\n",
+    "        word = values[0]\n",
+    "        coefs = np.asarray(values[1:], dtype = 'float32')\n",
+    "        embeddings_index[word] = coefs\n",
+    "    f.close()\n",
+    "    return embeddings_index\n",
+    "\n",
+    "def create_embed(filename, glove_path, num_embed):\n",
+    "    if os.path.exists(filename):\n",
+    "        f = open(filename, 'rb')\n",
+    "        embedding_matrix = pkl.load(f)\n",
+    "        f.close()\n",
+    "        return embedding_matrix\n",
+    "\n",
+    "    embedding_index = load_glove_index(glove_path)\n",
+    "    \n",
+    "    embedding_matrix = np.zeros((voca_size, num_embed))\n",
+    "    for word, i in word_dict.items():\n",
+    "        if i >= voca_size:\n",
+    "            continue\n",
+    "        embedding_vector = embedding_index.get(word)\n",
+    "        if embedding_vector is not None:\n",
+    "            embedding_matrix[i] = embedding_vector\n",
+    "    embedding_matrix = nd.array(embedding_matrix)\n",
+    "    save_file(embedding_matrix, filename)\n",
+    "    return embedding_matrix\n",
+    "\n",
+    "num_embed = 300\n",
+    "embed_matrix_filename = 'embed.metrix.pkl'\n",
+    "glove_path = '../data/glove.42B.300d.txt'\n",
+    "embedding_matrix = create_embed(embed_matrix_filename, glove_path, num_embed)"
+   ]
+  },
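+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick illustrative check (not part of the original run), we can peek at the embedding matrix we just built. Row *i* holds the 300-dimensional GloVe vector for the word with index *i* in `word_dict`; rows left at zero belong to words without a GloVe entry.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 'movie' is one of the most frequent words in this corpus, so its index is well below voca_size\n",
+    "print(embedding_matrix.shape)                    # (10000, 300), i.e. (voca_size, num_embed)\n",
+    "print(embedding_matrix[word_dict['movie']][:5])  # first five GloVe dimensions for 'movie'"
+   ]
+  },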
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### 6. Shuffle the train data. As you already know, train data is in positive-negative order. However, if learning proceeds in this order, positive will be learned first, and correct learning will not be achieved."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.model_selection import train_test_split\n",
+    "x_train, x_test, y_train, y_test = train_test_split(train_data, train_all_labels, test_size=0, random_state=42)\n",
+    "\n",
+    "x_test = test_data\n",
+    "y_test = test_all_labels"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "##### 7. Match the length of the sentence and change it to nd array form of mxnet.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "seq_len = 250 # set the max word length of each movie review\n",
+    "\n",
+    "# if sentence is greater than max_len, truncates\n",
+    "# if less, pad with value\n",
+    "def pad_sequences(sentences, max_len=500, value = 0):\n",
+    "    padded_sentences = []\n",
+    "    for sentence in sentences:\n",
+    "        new_sentence = []\n",
+    "        if (len(sentence) > max_len):\n",
+    "            new_sentence = sentence[:max_len]\n",
+    "            padded_sentences.append(new_sentence)\n",
+    "        else:\n",
+    "            new_sentence = np.append(sentence, [value]*(max_len-len(sentence)))\n",
+    "            padded_sentences.append(new_sentence)\n",
+    "\n",
+    "    return padded_sentences\n",
+    "\n",
+    "import mxnet as mx\n",
+    "context = mx.gpu()\n",
+    "X_train = nd.array(pad_sequences(x_train, max_len=seq_len, value=0), context)\n",
+    "X_test = nd.array(pad_sequences(x_test, max_len=seq_len, value=0), context)\n",
+    "Y_train = nd.array(y_train, context)\n",
+    "Y_test = nd.array(y_test, context)"
+   ]
+  },
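+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A quick sanity check on the shapes (illustrative; not part of the original run): every review is now a fixed-length row of word indices.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# expected: (25000, 250) and (25000,) for the 25,000 training reviews padded/truncated to 250 indices\n",
+    "print(X_train.shape, Y_train.shape)\n",
+    "print(X_test.shape, Y_test.shape)"
+   ]
+  },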
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create the Model\n",
+    "\n",
+    "As you already know, lstm unit has the following form, which saves the previous state:\n",
+    "\n",
+    "![Alt text](lstm_cell.png)\n",
+    "\n",
+    "We will build this lstm unit and build a model that determines the positive and negative by setting the sigmoid fucntion on top of it to obtain the output between 0 and 1.\n",
+    "\n",
+    "![Alt text](lstm_diagram.png)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "num_classes = 1\n",
+    "num_hidden = 16\n",
+    "learning_rate = .01\n",
+    "epochs = 10\n",
+    "batch_size = 100\n",
+    "\n",
+    "from mxnet.gluon import nn, rnn\n",
+    "\n",
+    "model = nn.Sequential()\n",
+    "with model.name_scope():\n",
+    "    model.embed = nn.Embedding(voca_size, num_embed)\n",
+    "    model.add(rnn.LSTM(num_hidden, layout = 'NTC', dropout=0.5, bidirectional=False))\n",
+    "    model.add(nn.Dense(num_classes))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Train & Evaluate the Model\n",
+    "\n",
+    "1. We will proceed initialization with the xavior initializer. It is known to have the best performance among existing initializers.\n",
+    "2. We will use adadelta as the optimizer. This is also known to be better than classic sgd."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Epoch 0. Train_acc 0.83348, Test_acc 0.82272\n",
+      "Epoch 1. Train_acc 0.87936, Test_acc 0.85492\n",
+      "Epoch 2. Train_acc 0.90864, Test_acc 0.86396\n",
+      "Epoch 3. Train_acc 0.93144, Test_acc 0.8658\n",
+      "Epoch 4. Train_acc 0.94948, Test_acc 0.86368\n",
+      "Epoch 5. Train_acc 0.95868, Test_acc 0.85964\n",
+      "Epoch 6. Train_acc 0.9522, Test_acc 0.84844\n",
+      "Epoch 7. Train_acc 0.96992, Test_acc 0.85184\n",
+      "Epoch 8. Train_acc 0.9814, Test_acc 0.8524\n",
+      "Epoch 9. Train_acc 0.98608, Test_acc 0.84772\n"
+     ]
+    }
+   ],
+   "source": [
+    "from mxnet import gluon, autograd\n",
+    "\n",
+    "model.collect_params().initialize(mx.init.Xavier(), ctx=context)\n",
+    "\n",
+    "model.embed.weight.set_data(embedding_matrix.as_in_context(context))\n",
+    "\n",
+    "trainer = gluon.Trainer(model.collect_params(), 'adadelta')\n",
+    "\n",
+    "sigmoid_cross_entropy = gluon.loss.SigmoidBCELoss()\n",
+    "\n",
+    "# helper function: evaluate accuracy\n",
+    "def eval_accuracy(x, y, batch_size):\n",
+    "    accuracy = mx.metric.Accuracy()\n",
+    "\n",
+    "    for i in range(x.shape[0] // batch_size):\n",
+    "        data = x[i*batch_size:(i*batch_size + batch_size), ]\n",
+    "        target = y[i*batch_size:(i*batch_size + batch_size), ]\n",
+    "\n",
+    "        output = model(data)\n",
+    "        predictions = nd.array([( 1 if out >= 0.5 else 0 ) for out in output ] , context)\n",
+    "        accuracy.update(preds=predictions, labels=target)\n",
+    "\n",
+    "    return accuracy.get()[1]\n",
+    "\n",
+    "# train\n",
+    "for epoch in range(epochs):\n",
+    "\n",
+    "    for b in range(X_train.shape[0] // batch_size):\n",
+    "        data = X_train[b*batch_size:(b*batch_size + batch_size),]\n",
+    "        target = Y_train[b*batch_size:(b*batch_size + batch_size),]\n",
+    "\n",
+    "        data = data.as_in_context(context)\n",
+    "        target = target.as_in_context(context)\n",
+    "        \n",
+    "        with autograd.record():\n",
+    "            output = model(data)\n",
+    "            L = sigmoid_cross_entropy(output, target)\n",
+    "        L.backward()\n",
+    "        trainer.step(data.shape[0])\n",
+    "\n",
+    "    # filename = \"lstm_net.params_epoch\" + str(epoch)\n",
+    "    # model.save_params(filename)\n",
+    "\n",
+    "    test_accuracy = eval_accuracy(X_test, Y_test, batch_size)\n",
+    "    train_accuracy = eval_accuracy(X_train, Y_train, batch_size)\n",
+    "    print(\"Epoch %s. Train_acc %s, Test_acc %s\" %\n",
+    "          (epoch, train_accuracy, test_accuracy))"
+   ]
+  },
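+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a final illustrative sketch (not part of the original run), the cell below classifies a single hand-written review with the trained model, reusing the preprocessing helpers defined above. The review text is made up, and since the model outputs a raw logit, we apply the sigmoid to get a probability.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# minimal sketch: classify one new review with the trained model\n",
+    "# (assumes word_dict, voca_size, seq_len, model and context from the cells above)\n",
+    "sample_review = \"this movie was surprisingly good, I enjoyed every minute of it\"\n",
+    "\n",
+    "encoded = encode_sentences([sample_review], word_dict)\n",
+    "clipped = [np.array([i if i < (voca_size-1) else (voca_size-1) for i in s]) for s in encoded]\n",
+    "padded = nd.array(pad_sequences(clipped, max_len=seq_len, value=0), context)\n",
+    "\n",
+    "logit = model(padded)\n",
+    "probability = nd.sigmoid(logit).asscalar()\n",
+    "print(\"positive\" if probability >= 0.5 else \"negative\", probability)"
+   ]
+  },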
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary\n",
+    "\n",
+    "So far, sentimental analysis has been done through the model using lstm.  \n",
+    "As you can see from the code above, the accuracy is about 0.86, but it is very difficult to raise it. You can get better accuracy by adjusting the batch size, optimizer, and number of hidden layers, but it will not change much.   \n",
+    "This is because of the limitations of the LSTM model itself, there are other models with better accuracy of sentimental analysis."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python [conda env:mxnet_p36]",
+   "language": "python",
+   "name": "conda-env-mxnet_p36-py"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services