
### Introduction

In this post, I’ll walk through an implementation of the Continuous Bag of Words (CBOW) model for generating word embedding vectors.

#### Applications of Word Embeddings

• Semantic Analogies
• Sentiment Analysis
• Classification of customer feedback
• Machine Translation
• Information Extraction
• Question Answering

#### Why not One-Hot Vectors?

One-hot vectors can also serve as word vectors: they are simple and impose no implied ordering. However, their dimensionality grows with the vocabulary, so they become extremely large for any sizeable corpus, and they encode no meaning. A quick size comparison is sketched below.
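
This is a minimal illustration of the size difference; the vocabulary size and embedding dimension here are assumed for demonstration, not values from this post:

```
import numpy as np

V = 50_000  # assumed vocabulary size (illustrative)
N = 300     # assumed embedding dimension (illustrative)

# One-hot: a single word costs V numbers, almost all of them zero.
one_hot = np.zeros(V)
one_hot[123] = 1.0  # hypothetical index of some word

# Embedding: the same word is a dense vector of only N numbers.
embedding = np.random.rand(N)

print(one_hot.size, embedding.size)  # 50000 vs 300
```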

#### Advantages of Word Embedding Vectors

• Low dimensionality
• Embedded meaning

#### Other Popular Models

• word2vec
• GloVe (Global Vectors)
• fastText
• BERT
• ELMo
• GPT-2

### Importing Libraries

Let’s get started by importing the required libraries and downloading the ‘punkt’ tokenizer data from the nltk module.
```
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
from collections import Counter

nltk.data.path.append('.')  # Also look for nltk data in the current directory
nltk.download('punkt')      # Tokenizer models used by word_tokenize
```

### Pre-Processing

We need to do some pre-processing on the corpus: replace punctuation with periods, keep only alphabetical tokens (and periods), and lowercase everything.

```
import re

with open('shakespeare.txt') as f:
    data = f.read()

data = re.sub(r'[,!?;-]', '.', data)   # Replace punctuation with periods
data = nltk.word_tokenize(data)        # Tokenize into words
data = [ch.lower() for ch in data
        if ch.isalpha() or ch == '.']  # Lowercase; keep words and periods

freq_dist = nltk.FreqDist(word for word in data)  # Word frequencies
```
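
The functions below also expect a `word2Ind` mapping from each vocabulary word to an integer index, along with the vocabulary size `V`. That mapping isn’t shown in this post, so here is a minimal sketch of how it could be built from the frequency distribution above (an assumption, not the post’s exact code):

```
# Assumed helper step: build word -> index and index -> word mappings.
# Sorting makes the indexing deterministic.
words = sorted(freq_dist.keys())
word2Ind = {word: i for i, word in enumerate(words)}
Ind2word = {i: word for i, word in enumerate(words)}
V = len(word2Ind)  # Vocabulary size
```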

### Building the Model

#### Initialising the Model

```
def initialize(N, V, random_seed=1):
    # N: embedding dimension, V: vocabulary size
    np.random.seed(random_seed)

    w1 = np.random.rand(N, V)  # Input-to-hidden weights
    w2 = np.random.rand(V, N)  # Hidden-to-output weights
    b1 = np.random.rand(N, 1)  # Hidden-layer bias
    b2 = np.random.rand(V, 1)  # Output-layer bias

    return w1, w2, b1, b2
```
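
A quick shape check, with illustrative dimensions rather than values from the post:

```
w1, w2, b1, b2 = initialize(N=4, V=10)
print(w1.shape, w2.shape, b1.shape, b2.shape)  # (4, 10) (10, 4) (4, 1) (10, 1)
```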

#### The softmax activation function

```
def softmax(z):
    e = np.exp(z)
    y_hat = e / np.sum(e, axis=0)  # Normalize each column into a probability distribution

    return y_hat
```
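
Each column of the output sums to 1, one distribution per training example in the batch:

```
z = np.array([[1.0, 2.0],
              [3.0, 0.5]])
print(softmax(z).sum(axis=0))  # [1. 1.]
```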

#### Forward Propagation

```
def forward_propagation(x, w1, w2, b1, b2):
    h = np.dot(w1, x) + b1  # Hidden-layer pre-activation
    h = np.maximum(0, h)    # ReLU activation
    z = np.dot(w2, h) + b2  # Output-layer logits

    return z, h
```

#### Cost Function

```
def cost(y, y_hat, batch_size):
    # Cross-entropy cost, averaged over the batch
    log_prob = np.multiply(np.log(y_hat), y) + np.multiply(np.log(1 - y_hat), 1 - y)
    cost = -1 / batch_size * np.sum(log_prob)
    cost = np.squeeze(cost)

    return cost
```
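
A tiny sanity check on a hypothetical two-example batch (values are illustrative):

```
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
y_hat = np.array([[0.9, 0.2],
                  [0.1, 0.8]])
print(cost(y, y_hat, batch_size=2))  # ~0.33; shrinks as y_hat approaches y
```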

#### Back Propagation

```
def back_prop(x, y_hat, y, h, w1, w2, b1, b2, batch_size):
    l1 = np.dot(w2.T, (y_hat - y))  # Error propagated back to the hidden layer
    l1 = np.maximum(0, l1)          # Make-shift ReLU gate, as in the forward pass
    grad_w1 = (1 / batch_size) * np.dot(l1, x.T)
    grad_w2 = (1 / batch_size) * np.dot(y_hat - y, h.T)
    grad_b1 = (1 / batch_size) * np.sum(l1, axis=1, keepdims=True)          # Bias grads sum
    grad_b2 = (1 / batch_size) * np.sum(y_hat - y, axis=1, keepdims=True)   # over the batch

    return grad_w1, grad_w2, grad_b1, grad_b2
```

#### Getting the Batches

```
def get_batches(data, word2Ind, V, C, batch_size):
    batch_x = []
    batch_y = []
    for x, y in get_vectors(data, word2Ind, V, C):
        batch_x.append(x)
        batch_y.append(y)
        if len(batch_x) == batch_size:
            yield np.array(batch_x).T, np.array(batch_y).T
            batch_x = []  # Reset for the next batch
            batch_y = []
```
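
`get_vectors` isn’t shown in this post (it presumably lives with the rest of the code in the linked repository). For context, here is a minimal sketch of what such a helper could look like: for each center word, average the one-hot vectors of the C words on either side to form the CBOW input, with the center word’s one-hot vector as the target. Treat this as an assumption about its behavior, not the post’s exact code:

```
def get_vectors(data, word2Ind, V, C):
    # Assumed helper: yields (context, center) training pairs for CBOW.
    for i in range(C, len(data) - C):
        y = np.zeros(V)
        y[word2Ind[data[i]]] = 1       # One-hot vector for the center word
        x = np.zeros(V)
        context = data[i - C:i] + data[i + 1:i + C + 1]
        for word in context:
            x[word2Ind[word]] += 1     # Count the context words...
        x /= len(context)              # ...and average their one-hot vectors
        yield x, y
```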

#### Gradient Descent

```
def grad_descent(data, word2Ind, N, V, num_iters, alpha=0.03):
    w1, w2, b1, b2 = initialize(N, V, random_seed=282)
    batch_size = 128
    iters = 0
    C = 2  # Context half-width: 2 words on each side of the center word
    for x, y in get_batches(data, word2Ind, V, C, batch_size):
        z, h = forward_propagation(x, w1, w2, b1, b2)
        y_hat = softmax(z)
        batch_cost = cost(y, y_hat, batch_size)

        if (iters + 1) % 10 == 0:
            print(f"iters: {iters + 1} cost: {batch_cost:.6f}")

        grad_w1, grad_w2, grad_b1, grad_b2 = back_prop(x, y_hat, y, h, w1, w2, b1, b2, batch_size)

        # Update the parameters
        w1 -= alpha * grad_w1
        w2 -= alpha * grad_w2
        b1 -= alpha * grad_b1
        b2 -= alpha * grad_b2

        iters += 1
        if iters == num_iters:
            break
        if iters % 100 == 0:
            alpha *= 0.66  # Decay the learning rate every 100 iterations

    return w1, w2, b1, b2
```
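
To train the model and read off the embeddings, one common choice (an assumption here, not something specified in this post) is to average the two weight matrices, since both the columns of `w1` and the rows of `w2` can be read as word vectors:

```
N = 50  # Assumed embedding dimension (illustrative)
w1, w2, b1, b2 = grad_descent(data, word2Ind, N, V, num_iters=150)

# Each column of w1 and each row of w2 corresponds to a vocabulary word;
# averaging the two views is one common way to extract the final embeddings.
embeddings = (w1 + w2.T) / 2.0  # Shape: (N, V), one column per word
```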

#### Link to Github Repository for this Post

https://github.com/psych0man/Continuous-Bag-of-Words-Model