Neural Networks for Text Processing

Pratima Rathore
8 min read · Nov 1, 2020

When we hear about Convolutional Neural Networks (CNNs), we typically think of Computer Vision, but we can also apply CNNs to problems in Natural Language Processing.

Let's first understand CNNs in brief -

Recall convolution: it's nothing but a sliding window function applied to a matrix. The sliding window is called a kernel, filter, or feature detector. Say we use a 3×3 filter: we multiply its values element-wise with the patch of the original matrix it sits on, then sum them up. To get the full convolution we do this at each position, sliding the filter over the whole matrix.
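To make this concrete, here is a minimal NumPy sketch of that sliding-window operation, assuming a toy 5×5 input matrix and a hand-picked 3×3 filter (both are made-up values, not taken from any real image):

```python
import numpy as np

# Toy 5x5 "image" and a 3x3 filter (made-up values for illustration only)
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])  # a simple vertical-edge-style filter

out_h = image.shape[0] - kernel.shape[0] + 1   # no padding, stride 1
out_w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]             # window under the filter
        feature_map[i, j] = np.sum(patch * kernel)  # element-wise product, then sum

print(feature_map)   # 3x3 map of filter responses
```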

ANALOGY🔍 — I like to think of it as moving a lens over a picture. But here our lens may be tinted, and what we see is a modified image because of the influence of the lens. The lens is our filter/kernel, because of which the image might now have, say, a pinkish shade. Different filters have different effects: averaging each pixel with its neighboring values blurs an image, while taking the difference between a pixel and its neighbors detects edges. And we move the lens in a nice regular pattern, which corresponds to the stride in a CNN.

CNN for Text

How does it work?

Instead of image pixels, the input to most NLP tasks is sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could also be a character. That is, each row is a vector that represents a word.

Typically, these vectors are word embeddings (low-dimensional representations) like word2vec or GloVe, or one-hot vectors that index the word into a vocabulary.
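As a toy illustration (the words and the 3-dimensional vectors below are made up, not real word2vec/GloVe values), the sentence matrix is simply the word vectors stacked row by row:

```python
import numpy as np

# Made-up 3-d embeddings for a tiny vocabulary
embeddings = {
    "i":     np.array([0.2, 0.1, 0.5]),
    "am":    np.array([0.7, 0.3, 0.2]),
    "going": np.array([0.1, 0.9, 0.4]),
    "home":  np.array([0.6, 0.5, 0.8]),
}

sentence = "i am going home".split()
sentence_matrix = np.stack([embeddings[w] for w in sentence])
print(sentence_matrix.shape)   # (4, 3): 4 words x 3-dim embeddings
```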

Just as in image processing our filters slide over local patches of an image, in NLP we typically use filters that slide over full rows of the matrix (words). Thus, the “width” of our filters is usually the same as the width of the input matrix. The height, or region size, may vary, but sliding windows over 2–5 words at a time is typical.

When we are working with sequential data, like text, we work with one-dimensional convolutions, but the idea and the application stay the same. We still want to pick up on patterns in the sequence, which become more complex with each added convolutional layer.

Why 1D convolution?

Let’s take a sentence: “I am going home”.

As you can see above, the width of the filter (3 in this case) is the same as the length of the word embedding (3 in this case). The filter can move in only one direction (top to bottom here), instead of in two directions (x and y axes) as in a 2D convolution. And that's precisely why it's called a 1D convolution.

A 1D CNN is almost the same as a 2D CNN. The only difference is that the convolution moves along a single direction, the sequence of words, instead of sweeping over both axes. It doesn't make sense to convolve over text exactly the way you would convolve over an image, and that's why it's called a 1D convolution.

If the region size of the filter is 2 words, it can process a sequence of two words at a time. At the first step, the filter processes 'I' and 'am'. Next, it processes 'am' and 'going', and then 'going' and 'home'.
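Here is a rough NumPy sketch of that, reusing the made-up embeddings from above and a random filter spanning 2 words; the actual numbers are placeholders, only the sliding pattern matters:

```python
import numpy as np

# Sentence matrix for "I am going home" (made-up 3-d embeddings, one row per word)
sentence_matrix = np.array([[0.2, 0.1, 0.5],   # I
                            [0.7, 0.3, 0.2],   # am
                            [0.1, 0.9, 0.4],   # going
                            [0.6, 0.5, 0.8]])  # home

filt = np.random.randn(2, 3)   # filter covering 2 words x full embedding width

features = []
for start in range(sentence_matrix.shape[0] - 2 + 1):
    window = sentence_matrix[start:start + 2]   # (I, am), (am, going), (going, home)
    features.append(np.sum(window * filt))      # one feature value per window

print(features)   # 3 values, one per 2-word window
```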

Text classification using CNN

CNNs vs traditional NLP models

A big argument for CNNs is that they are fast. Very fast. Compared to something like n-grams, CNNs are also efficient in terms of representation. With a large vocabulary, computing anything more than 3-grams quickly becomes expensive; even Google doesn't provide anything beyond 5-grams. Convolutional filters learn good representations automatically, without needing to represent the whole vocabulary, so it's completely reasonable to have filters of size larger than 5. I like to think that many of the learned filters in the first layer capture features quite similar (but not limited) to n-grams, but represent them in a more compact way.

CNN vs RNN

It turns out that CNNs are also really effective at extracting features from text, not just RNNs. Although you can feed text directly to an RNN, when the sequences are really long it becomes harder to use an RNN because it is computationally expensive.

Using RNN for text processing

But what about location invariance and local compositionality in CNNs? You probably do care a lot about where in the sentence a word appears. Pixels close to each other are likely to be semantically related (part of the same object), but the same isn't always true for words: in many languages, parts of phrases can be separated by several other words. The compositional aspect isn't obvious either. Clearly, words compose in some ways, like an adjective modifying a noun, but how exactly this works, and what the higher-level representations actually “mean”, isn't as obvious as in the Computer Vision case.

CNN with RNN

“ A CNN can extract local features of the input, and an RNN (recurrent neural network) can process sequential input and learn long-term dependencies, so we combine both of them and voilà ”

So now we know the strengths and weaknesses of CNNs and RNNs with respect to NLP / language processing tasks. Let's employ the best of both.

  • We can use a 1D CNN to extract meaningful features, which results in much shorter sequences, and do it in a much faster way.
  • We can then feed these features to an RNN in the same way one would feed it a sentence.

The original work behind using 1D CNNs with RNNs was proposed in this paper titled “Combination of Convolutional and Recurrent Neural Network for Sentiment Analysis of Short Texts”.

Let's take an example of text classification -

  • We take the word embeddings (e.g., GloVe embeddings) as the input of our CNN model, in which windows of different lengths and various weight matrices are applied to generate a number of feature maps.
  • After the convolution and pooling operations, the encoded feature maps are taken as the input to the RNN model.

We apply windows of different lengths and various weight matrices in the convolutional operation. The max pooling operates on adjacent features and moves from left to right, instead of pooling over the entire sentence. In this way, the feature maps generated by our CNN model retain the sequential information in the sentence context.

  • The long-term dependencies learned by the RNN can be viewed as the sentence-level representation.

The deep learning architecture of our model takes advantage of the encoded local features extracted by the CNN model and the long-term dependencies captured by the RNN model.

  • The sentence-level representation is fed to the fully connected network, and the softmax output gives the classification result.

In particular, our pooling operation on adjacent words is able to retain the local features and their sequential relations in a sentence. Besides, the RNN can learn the long-term dependencies and the positional relations of features, as well as the global features of the whole sentence.
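As a rough illustration of this kind of pipeline, here is a minimal Keras sketch; it is not the exact architecture from the paper, and the vocabulary size, sequence length, filter count, region size, and LSTM width are all placeholder choices:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_CLASSES = 20000, 100, 100, 2  # placeholders

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),                 # word indices
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),        # word embeddings (rows = words)
    # 1D convolution over windows of words (region size 3 here)
    layers.Conv1D(filters=128, kernel_size=3, activation="relu"),
    # pooling over adjacent positions, not the whole sentence,
    # so the output is still a (shorter) sequence of local features
    layers.MaxPooling1D(pool_size=2),
    # the RNN reads that shorter sequence and learns long-term dependencies
    layers.LSTM(64),
    # fully connected layer with softmax output over the categories
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Swapping the LSTM for a GRU, as mentioned below, changes only one line.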

Convolution and Pooling

  • Generally, let l and d be the length of the sentence and of the word vector, respectively. Let C ∈ ℝ^(d×l) be the sentence matrix.
  • A convolution operation involves a convolutional kernel H ∈ ℝ^(d×w) which is applied to a window of w words to produce a new feature. The convolutional kernel is applied to each possible window of words in the sentence to produce a feature map.
  • Next, we apply a pairwise max pooling operation over the feature map to capture the most important features (see the numeric sketch below). The pooling operation can be considered as feature selection in natural language processing.
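A small NumPy sketch of this notation, with made-up sizes for d, l, and w, might look like this:

```python
import numpy as np

d, l, w = 4, 6, 2              # embedding size, sentence length, window size (made up)
C = np.random.randn(d, l)      # sentence matrix, one column per word
H = np.random.randn(d, w)      # convolutional kernel over windows of w words

# Feature map: one value per window of w consecutive words
feature_map = np.array([np.sum(C[:, i:i + w] * H) for i in range(l - w + 1)])

# Pairwise max pooling over adjacent features (keeps their sequential order)
pooled = np.array([max(feature_map[i], feature_map[i + 1])
                   for i in range(0, len(feature_map) - 1, 2)])

print(feature_map.shape, pooled.shape)
```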

Recurrent neural network

Recurrent neural networks (RNNs) are a kind of time-recursive neural network able to learn long-term dependencies in sequential data. Since we can view the words in a sentence as a left-to-right sequence, an RNN can be modelled in accordance with how people read and understand a sentence.

  • We take these features as the input of the recurrent neural network.
  • We can apply either an LSTM or a GRU; both give good results. The output of the RNN is deemed the encoding of the whole sentence.

Fully Connected Network with Softmax Output

The features generated by the RNN form the penultimate layer and are passed to a fully connected softmax layer, whose output is the probability distribution over all the categories: the softmax is computed over the scores of all the categories.

  • We take cross-entropy as the loss function, which measures the discrepancy between the real distribution and the model's output distribution.
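For a concrete sense of these two steps, here is a tiny NumPy sketch of softmax followed by cross-entropy for a single example (the scores and the true class are made up):

```python
import numpy as np

scores = np.array([2.0, 0.5, -1.0])              # fully connected layer output (3 classes)
probs = np.exp(scores) / np.sum(np.exp(scores))  # softmax: probability per class

true_class = 0
loss = -np.log(probs[true_class])                # cross-entropy for this example

print(probs, loss)
```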

So, that was deep learning for an NLP use case using the best of CNNs and RNNs. You can find the full code here on GitHub. That's all for now ..🤗

If you liked the article, show your support by clapping for it. This article is basically a compilation of many articles from Machine Learning Mastery, Medium, Analytics Vidhya, upGrad material, etc.

If you are also learning Machine Learning like me, follow me for more articles. Let's go on this trip together :)

You can also follow me on LinkedIn
