POS Tagging using RNN Variants

Pratima Rathore
12 min read · Oct 26, 2020


POS Tagging

In NLP, POS tagging comes under syntactic analysis, where our aim is to understand the roles played by the words in a sentence and the relationships between words, and to parse the grammatical structure of sentences.

POS tagging is the process of marking up each word in a corpus with its corresponding part-of-speech tag, based on its context and definition. The POS tags identify the linguistic role of the word in the sentence.

Part-of-speech tagging in itself may not be the solution to any particular NLP problem. It is, however, a prerequisite that simplifies a lot of different problems.

POS tagging finds applications in Named Entity Recognition (NER), sentiment analysis, question answering, and word sense disambiguation. In the sentences “I left the room” and “Left of the room”, the word “left” conveys different meanings. A POS tagger helps to differentiate between the two meanings of the word “left”.

Love knows no language… Really!! 👼

Let’s get acquainted with some of the popular tags 😎

There are different techniques for POS Tagging:

1. Lexical-Based Methods (Majority Wins 🎢)

For each word, this method assigns the POS tag that most frequently occurs for that word in some training corpus, which means the word will be wrongly tagged in some sentences. Such a tagging approach also cannot handle unknown/ambiguous words.
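
As a minimal illustration (not from the original article), a majority-vote tagger can be built with a couple of dictionaries. The NLTK treebank corpus and the NOUN fallback used here are assumptions:

```python
# A minimal sketch of a lexical (most-frequent-tag) tagger.
# Assumes the NLTK treebank corpus with the universal tagset; the corpus choice is illustrative.
from collections import Counter, defaultdict
import nltk

nltk.download('treebank')
nltk.download('universal_tagset')
tagged_sents = nltk.corpus.treebank.tagged_sents(tagset='universal')

# Count how often each tag occurs for each word in the training corpus
tag_counts = defaultdict(Counter)
for sent in tagged_sents:
    for word, tag in sent:
        tag_counts[word.lower()][tag] += 1

# Keep only the majority tag per word; fall back to NOUN for unknown words
most_frequent_tag = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def lexical_tag(sentence):
    return [(w, most_frequent_tag.get(w.lower(), 'NOUN')) for w in sentence]

print(lexical_tag(['I', 'left', 'the', 'room']))
```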

2. Rule-Based Methods (Follow the Rules 👩‍🏫)

These methods first assign a tag using the lexicon method and then apply predefined rules. The rules in rule-based POS tagging are built manually. Some example rules are listed below, followed by a small sketch that applies them:

  • Change the tag to VBG for words ending with ‘-ing’
  • Change the tag to VBD for words ending with ‘-ed’
  • Replace VBD with VBN if the previous word is ‘has/have/had’
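
For illustration only, the three rules above could be layered on top of a lexicon-based tagger roughly like this (the function name and example tags are mine, not from the article):

```python
# Hedged sketch: apply the three hand-written rules to an already-tagged sentence.
# The rules use Penn Treebank-style tags (VBG/VBD/VBN), as in the examples above.
def apply_rules(tagged_sentence):
    """tagged_sentence: list of (word, tag) pairs produced by a lexicon-based tagger."""
    result = []
    prev_word = None
    for word, tag in tagged_sentence:
        if word.lower().endswith('ing'):
            tag = 'VBG'                      # rule 1: words ending with -ing
        elif word.lower().endswith('ed'):
            tag = 'VBD'                      # rule 2: words ending with -ed
        if tag == 'VBD' and prev_word in ('has', 'have', 'had'):
            tag = 'VBN'                      # rule 3: has/have/had + VBD -> VBN
        result.append((word, tag))
        prev_word = word.lower()
    return result

print(apply_rules([('He', 'PRP'), ('has', 'VBZ'), ('finished', 'VB'), ('running', 'NN')]))
# [('He', 'PRP'), ('has', 'VBZ'), ('finished', 'VBN'), ('running', 'VBG')]
```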

3. Stochastic/Probabilistic Methods:

Any model which somehow incorporates frequency or probability may properly be labelled stochastic. These methods assign a POS tag to a word based on the probability that the word belongs to a particular tag, or based on the probability of a word having a tag given the sequence of preceding/succeeding words. These are the preferred, most used and most successful methods so far.

Among these, two types of automated probabilistic methods can be defined: discriminative probabilistic classifiers (examples are Logistic Regression, SVMs and Conditional Random Fields, CRFs) and generative probabilistic classifiers (examples are Naive Bayes and Hidden Markov Models, HMMs).

The Hidden Markov Model, an extension to the Markov process, is used to model phenomena where the states are hidden and they emit observations. The transition and the emission probabilities specify the probabilities of transition between states and emission of observations from states, respectively. In POS tagging, the states are the POS tags while the words are the observations. To summarise, a Hidden Markov Model is defined by the initial state, emission, and the transition probabilities.
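
As a quick, hedged illustration (not part of the original walkthrough), NLTK ships a supervised HMM trainer that estimates exactly these transition and emission probabilities from a tagged corpus; the corpus and the train slice below are assumptions:

```python
# Minimal HMM POS-tagger sketch using NLTK's built-in supervised trainer.
import nltk
from nltk.tag import hmm

nltk.download('treebank')
nltk.download('universal_tagset')
tagged_sents = list(nltk.corpus.treebank.tagged_sents(tagset='universal'))
train_data = tagged_sents[:3000]   # illustrative training split

# train_supervised counts tag transitions and word emissions to build the model
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train_data)

print(hmm_tagger.tag(['I', 'left', 'the', 'room']))
```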

4. Deep Learning Methods — Recurrent Neural Networks for POS tagging.

In this article, I will be focusing on explaining the POS tagger with RNNs.

This can be framed as a sequence data problem: within a sentence, the meaning/context of a word can change due to the presence of preceding or succeeding words; refer to the image below.

We can clearly see how the other words present in the sentence, and their order, can impact the meaning of a word. And we know that RNNs are variants of vanilla neural networks that are tailored to learn sequential patterns.

RNNs address context by using memory units. The output that the RNN produces at a given step is affected by the output from the previous step. So, in general, an RNN has two sources of input: one is the actual input at the current step, and the other is the context (memory) carried over from the previous steps. Refer to my earlier article for a deeper understanding of RNNs.

POS Tagging using RNN

RNNs have various variants. We will see what those are, as we will be doing POS tagging using all of the variants and comparing the results:

  • Vanilla RNN
  • LSTM
  • GRU
  • Bidirectional LSTM

Let’s start to code 🐍

1. Importing the dataset

Let’s begin by importing the necessary libraries and loading the dataset (a sketch of this step follows).
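
A minimal sketch of the imports and data loading, assuming the NLTK treebank corpus with the universal tagset (12 tags); the exact corpus used in the article may differ:

```python
import numpy as np
import nltk
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, SimpleRNN, LSTM, GRU,
                                     Bidirectional, Dense, TimeDistributed)

nltk.download('treebank')
nltk.download('universal_tagset')

# Each item is one sentence: a list of (word, tag) tuples
tagged_sentences = list(nltk.corpus.treebank.tagged_sents(tagset='universal'))
print(tagged_sentences[0])
print('Total number of tagged sentences:', len(tagged_sentences))
```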

2. Preprocess data

As part of preprocessing, we’ll perform various steps such as dividing the data into words and tags, vectorising X and Y, and padding the sequences.

  • First, let’s look at our data. We can see it is a list of tagged sentences, where each sentence is a list of (word, tag) tuples. As in a classification problem, we can put our independent variable, i.e. the words, in X and the dependent variable, i.e. the tags, in Y.

Here we create two empty lists, X and Y, and append the word and tag sequences into them (a minimal sketch of this step follows).
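
A minimal sketch, reusing tagged_sentences from the loading step above:

```python
# Separate each tagged sentence into a word sequence (for X) and a tag sequence (for Y)
X, Y = [], []
for sentence in tagged_sentences:
    words, tags = [], []
    for word, tag in sentence:
        words.append(word.lower())   # lower-case so "BIG" and "big" become one token
        tags.append(tag)
    X.append(words)
    Y.append(tags)

num_words = len(set(w for s in X for w in s))
num_tags = len(set(t for s in Y for t in s))
print('Vocabulary size:', num_words)
print('Total number of tags:', num_tags)   # 12 with the universal tagset
```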

  • Since this is a many-to-many problem, each data point is a different sentence of the corpus. Each data point has multiple words in the input sequence X, and each word has its corresponding tag in the output sequence Y.
  • We convert all words to lower case so as to avoid cases where “BIG” and “big” are treated as different values.
  • Notice that we use a set for num_words and num_tags in order to remove duplicates; that is the reason the “Total number of tagged sentences” and the “Vocabulary size” differ. If we treat this as a multi-class classification problem, we have 12 classes, i.e. the “Total number of tags”.
  • We’ll use the Tokenizer() class from the Keras library to encode the text sequences as integer sequences. A model cannot understand or process raw text, so we always need to convert it into numerical data (see the sketch after this list).
  • The next step after encoding the data is to define the sequence length. As of now, the sentences present in the data are of various lengths; we need to either pad short sentences or truncate long sentences to a fixed length. This fixed length, however, is a hyperparameter. Here we cap sentences at a maximum length of 100, at the cost of losing some data.
  • Currently, each word and each tag is encoded as an integer. We’ll use a more sophisticated technique to represent the input words (X) using what’s known as word embeddings. We are using the word2vec model (the smart guy who knows king − man + woman ≈ queen).
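
A hedged sketch of the encoding, padding and embedding-matrix steps; the pre-trained word2vec file name and the padding/truncating choices are assumptions:

```python
from gensim.models import KeyedVectors

MAX_SEQ_LENGTH = 100
EMBEDDING_SIZE = 300

# Encode words and tags as integer sequences
word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts(X)
X_encoded = word_tokenizer.texts_to_sequences(X)

tag_tokenizer = Tokenizer()
tag_tokenizer.fit_on_texts(Y)
Y_encoded = tag_tokenizer.texts_to_sequences(Y)

# Pad/truncate every sentence to the fixed length of 100
X_padded = pad_sequences(X_encoded, maxlen=MAX_SEQ_LENGTH, padding='pre', truncating='post')
Y_padded = pad_sequences(Y_encoded, maxlen=MAX_SEQ_LENGTH, padding='pre', truncating='post')

# Build the embedding matrix from pre-trained word2vec vectors
# (the file name is an assumption; any 300-dimensional word2vec binary works)
word2vec = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
vocab_size = len(word_tokenizer.word_index) + 1        # +1 for the padding index 0
embedding_weights = np.zeros((vocab_size, EMBEDDING_SIZE))
for word, index in word_tokenizer.word_index.items():
    if word in word2vec:                               # unknown words keep all-zero rows
        embedding_weights[index] = word2vec[word]
```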

Here, we define the size of the embedding as 300. The embedding has its own weights and its own layer (not a simple guy like one-hot encoding; word2vec is a model in itself).

```python
# let's look at the embedding of a word
embedding_weights[word_tokenizer.word_index['joy']]
```

Out[22]:

array([ 0.4453125 , -0.20019531,  0.20019531, -0.03149414,  0.078125  ,
        ...
        0.0291748 , -0.0246582 , -0.07714844, -0.04663086, -0.17578125])

(a 300-dimensional vector; the full output is truncated here)
  • We use one-hot encoding for Y, as it has only 12 tags; unlike X, we need not worry about a tag’s meaning changing with context (a one-line sketch follows).
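
A short sketch with Keras’s to_categorical, reusing the padded tag sequences from above:

```python
# One-hot encode the padded tag sequences: 12 tags + 1 padding index
NUM_CLASSES = len(tag_tokenizer.word_index) + 1
Y_onehot = to_categorical(Y_padded, num_classes=NUM_CLASSES)
print(Y_onehot.shape)   # (number_of_sentences, 100, NUM_CLASSES)
```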

3. Split data into training, validation and testing sets
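
A sketch of this step, assuming scikit-learn’s train_test_split and an illustrative split ratio:

```python
from sklearn.model_selection import train_test_split

# First carve out a test set, then a validation set (the ratios are illustrative)
X_train, X_test, Y_train, Y_test = train_test_split(X_padded, Y_onehot,
                                                    test_size=0.15, random_state=42)
X_train, X_valid, Y_train, Y_valid = train_test_split(X_train, Y_train,
                                                      test_size=0.15, random_state=42)
print(X_train.shape, X_valid.shape, X_test.shape)
```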

4.0 Training on Vanilla RNN

Now, while training the model, you can also train the word embeddings along with the network weights. These are often called the embedding weights. While training, the embedding weights are treated like the normal weights of the network and are updated in each iteration. If we do not allow them to train, we get lower accuracy.

For comparison, let’s assume the model is an organisation and the parameters are freshers, whom you have to train first. The more trainees there are, the more time it takes to train them, but once they are trained, you get good results.
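
A hedged sketch of the vanilla RNN tagger. The build_tagger helper and the 64 recurrent units are my own illustrative choices, so the parameter counts printed by summary() will not exactly match the numbers quoted below; the embedding layer is seeded with the word2vec matrix and kept trainable, as discussed above:

```python
from tensorflow.keras import Input
from tensorflow.keras.initializers import Constant

def build_tagger(recurrent_layer):
    """Build a sequence tagger: embedding -> recurrent layer -> per-step softmax."""
    model = Sequential()
    model.add(Input(shape=(MAX_SEQ_LENGTH,), dtype='int32'))
    # Embedding seeded with the word2vec matrix; trainable=True lets the
    # embedding weights be fine-tuned along with the network weights
    model.add(Embedding(input_dim=vocab_size,
                        output_dim=EMBEDDING_SIZE,
                        embeddings_initializer=Constant(embedding_weights),
                        trainable=True))
    # return_sequences=True gives one output per time step (one tag per word)
    model.add(recurrent_layer)
    # The same Dense softmax over the tag classes is applied at every position
    model.add(TimeDistributed(Dense(NUM_CLASSES, activation='softmax')))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

rnn_model = build_tagger(SimpleRNN(64, return_sequences=True))
rnn_model.summary()
rnn_history = rnn_model.fit(X_train, Y_train, batch_size=128, epochs=10,
                            validation_data=(X_valid, Y_valid))
```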

Now let’s use another variant of the RNN for POS tagging.

4.1 Training with LSTM (Long Short-Term Memory network)

The drastic improvement that LSTMs have brought comes from a novel change in the structure of the neuron itself. LSTMs were introduced to overcome the problem of exploding/vanishing gradients in RNNs. In the case of LSTMs, the neurons are called cells, and an LSTM cell is different from a normal neuron in many ways: it has an explicit memory unit which stores information relevant for learning some task, and it has gating mechanisms that regulate the information the network stores (and passes on to the next layer) or forgets. The structure of an LSTM cell allows an LSTM network to have a smooth and uninterrupted flow of gradients while backpropagating. This flow is also called the constant error carousel.

Then we’ll compile and fit it as we did for the RNN (a sketch follows).
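
Reusing the build_tagger helper from the RNN sketch, only the recurrent layer changes (again, the layer size is illustrative):

```python
# Swap the SimpleRNN layer for an LSTM layer; everything else stays the same
lstm_model = build_tagger(LSTM(64, return_sequences=True))
lstm_model.summary()
lstm_history = lstm_model.fit(X_train, Y_train, batch_size=128, epochs=10,
                              validation_data=(X_valid, Y_valid))
```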

For the RNN, each epoch took approximately 46 s, but here the time has increased to 96 s on average. Wonder why 🤔

Look at the compile step: our trainable parameters have increased from 17,858,905 in the RNN to 17,928,985 in the LSTM. See the diagram of the LSTM; more trainees are working here to make it more efficient.

This is what is responsible for the rise in accuracy. Compare the model accuracy curves for training and test in the RNN and the LSTM; the curves are closer together in the case of the LSTM than in the RNN.

4.2 Training with GRU (Gated Recurrent Unit)

The GRU, or Gated Recurrent Unit, is an RNN architecture similar to the LSTM unit. The GRU comprises a reset gate and an update gate instead of the input, output and forget gates of the LSTM. The reset gate determines how to combine the new input with the previous memory, and the update gate defines how much of the previous memory to keep around.

Refer to this article; it has animations that explain the different variants of RNNs.
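
The same build_tagger helper works here too; only the recurrent layer changes (a sketch, sizes illustrative):

```python
# Swap in a GRU layer: reset and update gates instead of the LSTM's three gates
gru_model = build_tagger(GRU(64, return_sequences=True))
gru_model.summary()
gru_history = gru_model.fit(X_train, Y_train, batch_size=128, epochs=10,
                            validation_data=(X_valid, Y_valid))
```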

Now look again at the number of trainable parameters: 17,905,625, which is more than the RNN (17,858,905) but less than the LSTM (17,928,985), for the same reason, the number of trainees. This was expected, since the parameters of an LSTM layer and a GRU layer are roughly 4x and 3x those of a simple RNN layer, respectively.

Let’s compare the time now: 84 s on average, versus the RNN (46 s) and the LSTM (96 s). The more trainable parameters there are, the more time is taken.

4.3 Training with Bidirectional LSTM

In a bidirectional RNN, we consider two separate sequences: one from left to right and the other in the reverse direction. A bidirectional RNN can only be applied to offline sequences (where the entire sequence is available before you start processing it).

By using bidirectional RNNs, it is almost certain that you’ll get better results. However, bidirectional RNNs take almost double the time to train, since the number of parameters in the network increases. Therefore, you have a tradeoff between training time and performance.

Left: Bidirectional RNN | Right: Bidirectional LSTM
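
Once more, only the recurrent layer changes in the sketch; the Bidirectional wrapper runs one LSTM left-to-right and one right-to-left and concatenates their outputs:

```python
# Wrap the LSTM in a Bidirectional layer so each word sees both past and future context
bilstm_model = build_tagger(Bidirectional(LSTM(64, return_sequences=True)))
bilstm_model.summary()
bilstm_history = bilstm_model.fit(X_train, Y_train, batch_size=128, epochs=10,
                                  validation_data=(X_valid, Y_valid))
```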

Now look again at the number of trainable parameters: 18,023,257, which is more than the GRU (17,905,625), the LSTM (17,928,985) and the RNN (17,858,905). Here we have more trainees than in the LSTM, as we have trainees working in both directions.

The same reasoning applies to the time taken, i.e. 174 s, the maximum among all the models.

5. Running the models on test data

This brings us to the last step of the process: running our models on the test data (a short sketch follows).
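
A minimal sketch of the evaluation loop over the four trained models:

```python
# Evaluate every model on the held-out test set
for name, model in [('RNN', rnn_model), ('LSTM', lstm_model),
                    ('GRU', gru_model), ('Bi-LSTM', bilstm_model)]:
    loss, acc = model.evaluate(X_test, Y_test, verbose=0)
    print(f'{name}: test loss = {loss:.4f}, test accuracy = {acc:.4f}')
```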

The bidirectional LSTM did increase the accuracy substantially (considering that the accuracy was already hitting the roof). This shows the power of bidirectional LSTMs. However, this increased accuracy comes at a cost: the time taken was almost double that of a normal LSTM network.

That’s all for now 🤗

If you liked the article, show your support by clapping for it. This article is basically a collation of many articles from Machine Learning Mastery, Medium, Analytics Vidhya, upGrad material, etc.

If you are also learning machine learning like me, follow me for more articles. Let’s go on this trip together :)

You can also follow me on LinkedIn.
