Instructions
User Manual: Pdf
Open the PDF directly: View PDF .
Page Count: 8
Download | |
Open PDF In Browser | View PDF |
ENS C ACHAN , M ASTER MVA Deep Learning for NLP Word and sentence embeddings Sentence Classification with LSTMs/ConvNets Alexis Conneau 2016/2017 1 I NTRODUCTION The first part of the assignment will guide you through the use of word2vec for word embeddings. We will explore the word embedding space, and we will see how we can generate sentence embeddings from word embeddings. The 2nd and 3rd part of the assignment will introduce two types of Deep Learning models to tackle the problem of sentence classification: • Recurrent Neural Networks (LSTMs) • Temporal Convolutional Neural Networks (1D ConvNets) We will use Python and two of its packages Gensim (for word2vec) and Keras (LSTMs/ConvNets). For all the questions in the assignmnent, short answers are preferred. We expect a a zip/tar.gz/.tgz file named "surname_firstname_deliverable_NLP", containing a pdf, and python scripts (*only your code*). The first part will be on the installation of Keras and gensim. The installation of Keras (and the setting of the theano backend) can be tricky, especially for windows users. Make sure you are able to install the python packages before going to part 3. 1 2 G ETTING STARTED WITH P YTHON AND K ERAS Python is among the most important languages for data science because of its simplicity, its large community and especially for its useful packages (scikit-learn for Machine Learning, nltk for NLP, Pytorch/Theano/TensorFlow/Keras for Deep Learning etc). In this assignment, we will guide you through the installation of python and its packages : gensim (word2vec part) and Keras (sentence classification part). Among the well-known frameworks for Deep Learning (Pytorch, Torch, Caffe, TensorFlow, Theano), Keras is the most famous in Kaggle competitions because it enables easy and fast prototyping. In the following, we guide you through the installation of python and these packages. 2.1 I NSTALL PYTHON 1. Download the latest1 Anaconda distribution for python 2.7 : from the link HERE. 2. Choose python 2.7 and the graphical installation. 3. Download the Pycharm IDE from the link HERE (for windows users : "create associations .py" when installing). 4. Open Pycharm, and follow these instructions: a) "Create a new project". b) Choose the Anaconda interpreter (IMPORTANT!). c) Create a new python script ("File", "New...", "Python file", and type the name of your file). d) Write in this script : print ’Hello World’ e) Run the script ("Run" option in the menu bar, and then "Run"). Please first ask a colleague for help if you have troubles. 2.2 PYTHON If you are new to python, don’t worry, with the right IDE (PyCharm for instance) the environment is very similar to the one of Matlab. Take some time to read this short tutorial on Python : Python in 10 minutes. 2.3 I NSTALL THE PYTHON PACKAGES G ENSIM AND K ERAS Gensim and Keras are "packages" of python. It means that these are libraries written in python and available to anyone. In a python script, you can "import" those packages to use them. For instance "import gensim" will import the gensim package. Installing them is rather easy with the "pip" command line. In the following, we guide you through the installation: 1 The latest python version includes the easy package installation tool : "pip" 2 1. Open a terminal • Windows users : how to open the command prompt in Windows • Linux/MAC users if you probably are on MAC/Linux you already know how to open the terminal. 2.3.1 I NSTALL G ENSIM ( WORD 2 VEC ) 2. Type in the terminal : pip install gensim (or try sudo pip install gensim). 3. In the python file you created, type import gensim. Run the file. If there is no error, gensim has been installed properly. 2.3.2 I NSTALL K ERAS ( DEEP LEARNING FRAMEWORK ) 4. Type in the terminal : pip install keras==1.2.0 (or try sudo pip install keras==1.2.0). 5. In the python file you created, type import keras. Run the file. If there is no error, keras has been installed properly. 6. You now need to change the backend from tensorflow (default) to theano to avoid code errors: • Windows users : Type "notepad %USERPROFILE%/.keras/keras.json" in the command prompt, it will open a file. In this file, change the line "backend": "tensorflow" to "backend": "theano". • Linux/MAC users : in the script "∼/.keras/keras.json", change "backend": "tensorflow" to "backend": "theano". 7. Once you’ve done it : "import keras" should output the following line "Using Theano backend", and run without problem.2 https://github.com/fchollet/keras/blob/master/docs/templates/backend.md That’s it : you have python, gensim and keras installed3 . Congrats!. Now that this is done, follow the instructions of the next sections. For the assignment, you will need to report your results in a pdf file, and include the python scripts (we won’t try to debug your scripts so please double-check before submitting it). 2 Note that the windows installation of python packages can be tricky. Make sure this works before going further in the assignment. 3 You can even "pip install sklearn" and take a look at the Machine Learning package that is scikit-learn. 3 3 W ORD AND SENTENCES EMBEDDINGS WITH WORD 2 VEC You are provided with the following script : embedding_word2vec_students.py. In this script, we import gensim, and from gensim we import the word2vec functionality. The ’first part’ trains the model, but you don’t need to train it until the end. You are provided with "text8.model" and "text8.phrase.model" in the "models/" directory, which are pretrained word2vec models that you will load in ’second part’. QUESTION 1 Run the "first part" (you don’t need to finish the training of the model). From the INFO that is shown while training : 1. What is the total number of raw words found in the corpus? 2. What is the number of words retained in the word2vec vocabulary (with default min_count = 5)? QUESTION 2 The "second part" loads two models, model and model_phrase. 1. What is the similarity between (’apple’ and ’mac’), between (’apple’ and ’peach’), between (’banana’ and ’peach’)? In your opinion, why are you asked about the three previous examples? 2. What is the closest word to the word ’difficult’, for ’model’ and ’model_phrase’. Comment about the difference between model and model_phrase. Find the three phrases that are closest to the word ’clinton’. 3. Find the closest word to the vector "vect(france) - vect(germany) + vect(berlin)" and report it similarity measure. 4. Explore the embedding space using these functions and report some interesting behaviour (of your choice). QUESTION 3 The "third part" proposes a way to construct sentence embeddings from word embeddings, by simply taking the mean of the word embeddings contained in the sentence. The "avgword2vec" vector of a sentence is computed : PS avgword2vec(s) = i =1 αi word2vec(w i ) PS i =1 αi (3.1) with αi = 1, ∀i = 1, . . . , S, the weights of the word vectors. 1. Fill the blanks of the "cosine_similarity" function. Report the closest sentence to the sentence with idx "777", and their similarity score. 2. Fill the blanks of the "most_5_similar" function. Report the 5 closest sentences to the sentence with idx "777", and the associated similarity scores. 4 QUESTION 4 ( OPTIONAL ) In the "fourth part", we are going to define the weights αi as the IDF scores of each word. The TF-IDF is a classical method to construct a representation of a document. TF-IDF is composed of TF (Term Frequency) and IDF (Inverse Document Frequency). In our case, documents are just sentences, and the term frequency of each word inside a sentence is not very informative (they usually all appear once or twice max). But the IDF score is informative, and we are going to use them to embed a sentence using a weighted average of word2vec embeddings. PS IDF_avgword2vec(s) = i =1 idf(w i ).word2vec(w i ) PS i =1 idf(w i ) (3.2) 1. Implement the "IDF" function. Report the IDF score of the word "the", the word "a", and the word "clinton". 2. Implement the "avg_word2vec_idf" (very small modification from "avg_word2vec" function) using the "word2idf" dictionary. Report the closest sentence to sentence with idx 777. 4 S IMPLE LSTM FOR S EQUENCE C LASSIFICATION You are provided with the following script : lstm_imdb.py. The script contains the following classical steps for training a simple LSTM for sequence classification : • Load the data (IMDB). • Build the model (assemble the modules - embedding, lstm, classifier - in a container). • Define the loss function (cross entropy), the metrics (accuracy), and the optimizer (sgd or adam). • Train the model on train, evaluate generalization error on a held-out valid set. • Evaluate your model on test set. The dataset we use is the IMDB dataset for sequence classification. It’s a binary classification (good/bad) from movie reviews written on the imdb website. Take a look at the doc here : imdb dataset. Keras has a very simple function for training a model : the "fit" function. that will train the model on a certain number of epoches, using a certain optimizer (sgd, adam..). QUESTION 1 Read the lstm_imdb.py code and the associated comments. What is the (minibatch)shape of : 1. the input of the embedding layer. 2. the input of the LSTM layer. 3. the output of the LSTM layer. 5 QUESTION 2 1. From the output of the lstm_imdb.py script, report the number of parameters of the model with the standard set of hyper-parameters. Report also the number of sentences in the training set. In standard statistics, a rule of thumb is to have less parameters than samples in your dataset. How do you think it’s possible to train a model that has so many parameters compared to this number of samples? (we expect one key word) QUESTION 3 The word embeddings are fed to the LSTM, which outputs a sentence embedding. This sentence embedding is then fed to a classifier, that outputs the prediction of the label. 1. For a single sentence, the LSTM has states h 1 , . . . , h T where T is the number of words in the sentence. The sentence embeddings that is fed to the classifier is thus computed as f (h 1 , . . . , h T ). What is the exact form of f (h 1 , . . . , h T ) used in the python script (look at the keras doc of the lstm function)? QUESTION 4 Run the model with and without dropout (training should take less than 10 minutes). 1. For each, plot the evolution of the train and valid accuracy per epoch, and write the test errors that you obtain. You can use the python package "matplotlib" that’s already installed (import matplotlib). Matplotlib tutorial. QUESTION 5 1. Explain what is the difference between SGD and Adam. (no more than 3-4 lines) (more info here : Optimization methods). 5 S IMPLE C ONV N ET FOR S EQUENCE C LASSIFICATION You are provided with the script cnn_imdb.py that is very similar to lstm_imdb.py except the model used is a 1D Convolution (or Temporal Convolution). For Computer Vision, 2D Convolution is used directly on top of the image (3D : (RGB, width, height). For text processing, Convolution1D is done over the matrix of the embeddings of the words contained in the sentence (2D : (number of words, size of embedding)). To understand the difference, take a look at the the documentation : here The model implemented in the script is composed of several modules : 1. It takes a sequence of words as input and transforms it into a sequence of word embeddings of dimension E using a lookup table (the "embedding" module). 2. The lookup table provides a matrix of word embeddings of size S × E which is the input to the 1D Convolution. The 1D Convolution applies N kernels to it (here N=256, with kernel size K=3). This outputs several "feature maps" (FMt ∈ RN , t = 1, . . . , S). 6 Figure 5.1: Temporal convolution : input is on top, output is on the bottom. Each filter is of the size of the Figure 5.2: Spatial Convolution : input is grey part (in terms of number of on top, output is on the bottom. parameters). One filter will outEach filter is of the size of the put a (1, height, width) output : grey part (in terms of number of for instance, a filter will be apparameters). One filter will outplied to the grey zone and output a (1, height, width) output : put the cyan output (a scalar). for instance, a filter will be applied to the grey zone and output the cyan output (a scalar). QUESTION 1 Fill the gaps of cnn_imdb.py. Run the script, and report the results (test loss and test error) that you obtain. QUESTION 2 1. In cnn_imdb.py: what is the input and output shape of Convolution1D? QUESTION 3 (+1.5 points) 1. In a new cnn_lstm_imdb.py script that you create (and include in your assignment), build a model where on top of the convolution, you have an LSTM. It means that the input of the LSTM will be the output of your ConvNet (yes, it’s possible, and usually done in the litterature). Run the model with the best parameters you find (sufficiently small so that you have time to run the model, limiting over/under fitting). Report your best results. 7 6 P OINTS TO FURTHER STUDY Using Deep Learning models on rather small datasets is often not the best practice. In general, if you have a small dataset, take a look at simpler methods that use unigrams or bigrams, such as the "TF-IDF" method, with PCA ("Latent Semantic Analysis"), or more recently "FastText" from Facebook. This provides very fast and strong baselines that are usually hard to beat. Deep Learning methods such as LSTMs can be applied for a various number of applications, not only text classification and are thus efficient tool for Natural Language Processing. They are state-of-the art for many different text processing tasks as we have seen in the Deep Learning for NLP course. 8
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.5 Linearized : No Page Count : 8 Page Mode : UseOutlines Author : Title : Subject : Creator : LaTeX with hyperref package Producer : pdfTeX-1.40.16 Create Date : 2017:02:03 10:14:34-08:00 Modify Date : 2017:02:03 10:14:34-08:00 Trapped : False PTEX Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015) kpathsea version 6.2.1EXIF Metadata provided by EXIF.tools