THE EPIC CODE


In this post, I’ll walk through an implementation of the Continuous Bag of Words (CBOW) model for generating word embedding vectors. Word embeddings power a wide range of NLP tasks, including:

- Semantic Analogies
- Sentiment Analysis
- Classification of customer feedback
- Machine Translation
- Information Extraction
- Question Answering

One-hot vectors can also be used as word vectors: they are simple and carry no implied ordering. However, for any sufficiently large vocabulary they become extremely high-dimensional, and they embed no meaning. Ideally, we want word vectors that:

- Have low dimensionality
- Embed meaning
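As a concrete illustration (with a toy vocabulary, not from the post), a one-hot vector has the vocabulary's size and a single 1 at the word's index:

```python
import numpy as np

vocab = ["king", "man", "queen", "woman"]       # toy vocabulary
word2Ind = {w: i for i, w in enumerate(vocab)}  # word -> index

def one_hot(word, V):
    """Column vector of length V with a 1 at the word's index."""
    v = np.zeros((V, 1))
    v[word2Ind[word]] = 1.0
    return v

x = one_hot("queen", len(vocab))
# x.ravel() -> array([0., 0., 1., 0.])
```

With a realistic vocabulary of tens of thousands of words, each such vector would have tens of thousands of entries with only a single non-zero, which is what motivates low-dimensional embeddings.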

Several models can learn such embeddings:

- word2vec
- GloVe (Global Vectors)
- fastText
- BERT
- ELMo
- GPT-2

Let’s get started by importing the required libraries and downloading ‘punkt’ from the nltk module.

```python
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
from collections import Counter

nltk.data.path.append('.')
nltk.download('punkt')
```

We need to do some pre-processing on the corpus: removing punctuation and non-alphabetical characters, and lowercasing everything.

```python
import re

with open('shakespeare.txt') as f:
    data = f.read()

data = re.sub(r'[,!?;-]', '.', data)              # replace punctuation with '.'
data = nltk.word_tokenize(data)                   # tokenize
data = [ch.lower() for ch in data
        if ch.isalpha() or ch == '.']             # lowercase; keep words and '.'
freq_dist = nltk.FreqDist(word for word in data)  # word frequency distribution
```
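The training code below relies on a `word2Ind` mapping that the post never defines. A minimal version (my assumption, mirroring the usual setup) maps each unique token to an index; shown here with toy tokens so it runs on its own:

```python
def get_dict(tokens):
    """Map each unique token to an index and back.
    (Assumed helper; the post uses `word2Ind` later without defining it.)"""
    words = sorted(set(tokens))
    word2Ind = {w: i for i, w in enumerate(words)}
    Ind2word = {i: w for i, w in enumerate(words)}
    return word2Ind, Ind2word

tokens = ["all", "the", "world", ".", "a", "stage", "."]  # toy tokens
word2Ind, Ind2word = get_dict(tokens)
V = len(word2Ind)   # vocabulary size
```

In the post's pipeline, `tokens` would be the pre-processed `data` from above.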

Next, we initialize the weights and biases of the two-layer network with random values, where N is the embedding dimension and V the vocabulary size:

```python
def initialize(N, V, random_seed=1):
    """Randomly initialize weights and biases.
    N: embedding dimension, V: vocabulary size."""
    np.random.seed(random_seed)
    w1 = np.random.rand(N, V)
    w2 = np.random.rand(V, N)
    b1 = np.random.rand(N, 1)
    b2 = np.random.rand(V, 1)
    return w1, w2, b1, b2
```

The softmax turns the output scores into a probability distribution over the vocabulary:

```python
def softmax(z):
    e = np.exp(z)
    y_hat = e / e.sum(axis=0)   # normalize each column
    return y_hat
```
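One caveat: `np.exp` overflows for large scores. A numerically safer variant (a common trick, not part of the original post) subtracts the per-column maximum before exponentiating:

```python
import numpy as np

def stable_softmax(z):
    # Subtracting the column-wise max leaves the result unchanged
    # but keeps np.exp from overflowing.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

z = np.array([[1000.0, 1.0],
              [1001.0, 2.0]])
y = stable_softmax(z)   # finite values, columns sum to 1
```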

Forward propagation projects the context vector into the hidden layer and back out to vocabulary-sized scores:

```python
def forward_prop(x, w1, w2, b1, b2):
    h = np.dot(w1, x) + b1
    h = np.maximum(0, h)        # make-shift ReLU
    z = np.dot(w2, h) + b2
    return z, h
```
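As a quick shape check with toy sizes (restating the forward pass so the snippet runs on its own):

```python
import numpy as np

def forward_prop(x, w1, w2, b1, b2):
    h = np.maximum(0, np.dot(w1, x) + b1)   # hidden layer with ReLU
    z = np.dot(w2, h) + b2                  # vocabulary-sized scores
    return z, h

N, V, batch_size = 3, 5, 2                  # toy sizes
x = np.random.rand(V, batch_size)           # one context vector per column
w1, w2 = np.random.rand(N, V), np.random.rand(V, N)
b1, b2 = np.random.rand(N, 1), np.random.rand(V, 1)

z, h = forward_prop(x, w1, w2, b1, b2)
# h has shape (N, batch_size); z has shape (V, batch_size)
```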

The cross-entropy cost averaged over a batch:

```python
def cost(y, y_hat, batch_size):
    log_prob = np.multiply(np.log(y_hat), y) + np.multiply(np.log(1 - y_hat), 1 - y)
    cost = -1 / batch_size * np.sum(log_prob)
    cost = np.squeeze(cost)
    return cost
```
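To sanity-check the cost, we can evaluate it on toy predictions (restating the function so the snippet is self-contained): a confident correct prediction should yield a much smaller cost than a confident wrong one.

```python
import numpy as np

def cost(y, y_hat, batch_size):
    log_prob = np.multiply(np.log(y_hat), y) + np.multiply(np.log(1 - y_hat), 1 - y)
    return np.squeeze(-1 / batch_size * np.sum(log_prob))

y = np.array([[1.0], [0.0]])       # true center word: index 0
good = np.array([[0.99], [0.01]])  # confident, correct prediction
bad = np.array([[0.01], [0.99]])   # confident, wrong prediction

c_good = cost(y, good, 1)          # small
c_bad = cost(y, bad, 1)            # large
```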

Back-propagation computes the gradients of the cost with respect to the weights and biases:

```python
def back_prop(x, y_hat, y, h, w1, w2, b1, b2, batch_size):
    l1 = np.dot(w2.T, (y_hat - y))
    l1 = np.maximum(0, l1)      # make-shift ReLU
    grad_w1 = (1 / batch_size) * np.dot(l1, x.T)
    grad_w2 = (1 / batch_size) * np.dot(y_hat - y, h.T)
    grad_b1 = np.sum((1 / batch_size) * l1, axis=1, keepdims=True)
    grad_b2 = np.sum((1 / batch_size) * (y_hat - y), axis=1, keepdims=True)
    return grad_w1, grad_w2, grad_b1, grad_b2
```

We then group the training examples into batches, with one example per column:

```python
def get_batches(data, word2Ind, V, C, batch_size):
    batch_x = []
    batch_y = []
    for x, y in get_vectors(data, word2Ind, V, C):
        if len(batch_x) < batch_size:
            batch_x.append(x)
            batch_y.append(y)
        else:
            yield np.array(batch_x).T, np.array(batch_y).T
            batch_x = []
            batch_y = []
```
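`get_vectors` is another helper the post assumes but never shows. In CBOW it typically yields, for each position, the average of the one-hot vectors of the C context words on each side (x) together with the one-hot vector of the center word (y); a sketch under that assumption:

```python
import numpy as np

def get_vectors(data, word2Ind, V, C):
    """Yield (context, center) training pairs: x is the average of the
    one-hot vectors of the C words on each side of position i, and y is
    the center word's one-hot vector.
    (Assumed implementation; the post does not define this helper.)"""
    for i in range(C, len(data) - C):
        y = np.zeros(V)
        y[word2Ind[data[i]]] = 1.0
        x = np.zeros(V)
        context = data[i - C:i] + data[i + 1:i + C + 1]
        for w in context:
            x[word2Ind[w]] += 1.0 / len(context)
        yield x, y
```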

Finally, gradient descent ties everything together, periodically decaying the learning rate:

```python
def grad_descent(data, word2Ind, N, V, num_iters, alpha=0.03):
    w1, w2, b1, b2 = initialize(N, V, random_seed=282)
    batch_size = 128
    iters = 0
    C = 2                       # context half-width
    for x, y in get_batches(data, word2Ind, V, C, batch_size):
        z, h = forward_prop(x, w1, w2, b1, b2)
        y_hat = softmax(z)
        batch_cost = cost(y, y_hat, batch_size)
        if (iters + 1) % 10 == 0:
            print(f"iters: {iters + 1} cost: {batch_cost:.6f}")
        grad_w1, grad_w2, grad_b1, grad_b2 = back_prop(x, y_hat, y, h,
                                                       w1, w2, b1, b2, batch_size)
        w1 -= alpha * grad_w1
        w2 -= alpha * grad_w2
        b1 -= alpha * grad_b1
        b2 -= alpha * grad_b2
        iters += 1
        if iters == num_iters:
            break
        if iters % 100 == 0:
            alpha *= 0.66       # decay the learning rate
    return w1, w2, b1, b2
```
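Once training finishes, each column of `w1` (equivalently, each row of `w2`) can serve as a word's embedding; averaging the two matrices is a common choice. A sketch with toy shapes (the trained `w1`, `w2`, and a real `word2Ind` would be used in practice):

```python
import numpy as np

# Toy shapes: N-dimensional embeddings for a V-word vocabulary.
N, V = 4, 6
w1 = np.random.rand(N, V)
w2 = np.random.rand(V, N)

# Average the input and output weight matrices (a common choice).
embeddings = (w1 + w2.T) / 2         # shape (N, V): one column per word

word2Ind = {"a": 0, "b": 1}          # stand-in mapping for illustration
vec = embeddings[:, word2Ind["b"]]   # embedding for "b", length N
```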

## Saurav Maheshkar