Introduction

In this post, I’ll walk through an implementation of the Continuous Bag of Words (CBOW) model for generating word embedding vectors.
 

Applications of Word Embeddings

  • Semantic Analogies
  • Sentiment Analysis 
  • Classification of customer feedback
  • Machine Translation 
  • Information Extraction
  • Question Answering 

 

Why not One-Hot Vectors?

One-hot vectors can also be used as word vectors: they are simple and impose no implied ordering. However, their length equals the vocabulary size, so they become extremely large for any sizeable corpus, and they carry no embedded meaning.
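
As a rough illustration, here is a one-hot encoding for a tiny, hypothetical five-word vocabulary; with a realistic vocabulary of tens of thousands of words, every vector would have that many dimensions and carry nothing beyond the word’s index.

import numpy as np

vocab = ['king', 'queen', 'man', 'woman', 'crown']   # hypothetical vocabulary, for illustration only
word_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, V):
    vec = np.zeros(V)              # one dimension per vocabulary word
    vec[word_index[word]] = 1      # a single 1 marks the word's position
    return vec

print(one_hot('queen', len(vocab)))  # [0. 1. 0. 0. 0.]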
 

Advantages of Word Embedding Vectors

  • Low dimensionality
  • Embedded meaning

 

Other Famous Models

  • word2vec
  • Global Vectors (GloVe)
  • fastText
  • BERT
  • ELMo
  • GPT-2

 

Importing Libraries 

Let’s get started by importing the required libraries and downloading the ‘punkt’ tokenizer from the nltk module.
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
from collections import Counter

nltk.data.path.append('.')

nltk.download('punkt')

Pre-Processing

We need to do some pre-processing on the corpus: replacing punctuation with periods, dropping non-alphabetical tokens, and lowercasing everything.

import re
with open('shakespeare.txt') as f:
  data = f.read()
data = re.sub(r'[,!?;-]', '.', data)                             # Replace punctuation with periods
data = nltk.word_tokenize(data)                                  # Tokenise into words
data = [ch.lower() for ch in data if ch.isalpha() or ch == '.']  # Lowercase; keep only words and periods

freq_dist = nltk.FreqDist(word for word in data)
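
The training code later in this post refers to a word2Ind mapping and a get_vectors generator that are not shown here. Below is a minimal sketch of what they might look like, assuming the standard CBOW setup where the input x is the averaged one-hot encoding of the C words on each side of the centre word; the actual helpers in the repository may differ.

def get_dict(data):
    # Assumed helper: map each unique word to an index and back.
    words = sorted(set(data))
    word2Ind = {w: i for i, w in enumerate(words)}
    Ind2word = {i: w for w, i in word2Ind.items()}
    return word2Ind, Ind2word

def get_vectors(data, word2Ind, V, C):
    # Assumed helper: yield (context, centre) training pairs.
    # x averages the one-hot vectors of the 2*C surrounding words,
    # y is the one-hot vector of the centre word.
    for i in range(C, len(data) - C):
        y = np.zeros(V)
        y[word2Ind[data[i]]] = 1
        context = data[i - C:i] + data[i + 1:i + C + 1]
        x = np.zeros(V)
        for w in context:
            x[word2Ind[w]] += 1
        x = x / len(context)
        yield x, y

word2Ind, Ind2word = get_dict(data)
V = len(word2Ind)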

Building the Model

Initialising the Model

def initialize(N, V, random_seed=1):
  np.random.seed(random_seed)

  w1 = np.random.rand(N, V)   # input-to-hidden weights
  w2 = np.random.rand(V, N)   # hidden-to-output weights
  b1 = np.random.rand(N, 1)   # hidden-layer bias
  b2 = np.random.rand(V, 1)   # output-layer bias

  return w1, w2, b1, b2
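
As a quick sanity check on the shapes, using the vocabulary size V computed above and an illustrative embedding dimension of 50:

N = 50                                         # embedding dimension (illustrative choice)
w1, w2, b1, b2 = initialize(N, V)
print(w1.shape, w2.shape, b1.shape, b2.shape)  # (N, V) (V, N) (N, 1) (V, 1)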

The softmax activation function

def softmax(z):
  e = np.exp(z)
  y_hat = e / np.sum(e, axis=0)   # normalise each column into a probability distribution

  return y_hat
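
Because np.exp can overflow for large scores, a common variant (not in the original code, shown here as an optional alternative) subtracts the column-wise maximum before exponentiating; the resulting probabilities are mathematically unchanged.

def softmax_stable(z):
  # Shifting by the per-column max keeps np.exp in a safe numeric range.
  e = np.exp(z - np.max(z, axis=0, keepdims=True))
  return e / np.sum(e, axis=0)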

Forward Propagation

def forward_propagation(x,w1,w2,b1,b2):

  h = np.dot(w1,x) + b1
  h = np.maximum(0,h) # Make-shift ReLU
  z = np.dot(w2,h)+b2

  return z,h
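
Putting the pieces together, pushing a single (toy) context vector through the network gives a score vector z over the whole vocabulary, which softmax turns into probabilities:

x = np.zeros((V, 1))
x[word2Ind[data[0]], 0] = 1.0                  # toy context: just the first token, for illustration
z, h = forward_propagation(x, w1, w2, b1, b2)
y_hat = softmax(z)
print(z.shape, h.shape, y_hat.sum())           # (V, 1) (N, 1) 1.0 (up to rounding)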

Cost Function

def cost(y,y_hat, batch_size):
  log_prob = np.multiply(np.log(y_hat),y) + np.multiply(np.log(1 - y_hat), 1 - y)
  cost = -1/batch_size * np.sum(log_prob)
  cost = np.squeeze(cost)

  return cost

Back Propagation 

def back_prop(x, y_hat, y, h, w1, w2, b1, b2, batch_size):
  l1 = np.dot(w2.T, (y_hat - y))
  l1 = np.maximum(0, l1)                                    # Make-shift ReLU
  grad_w1 = (1/batch_size) * np.dot(l1, x.T)
  grad_w2 = (1/batch_size) * np.dot(y_hat - y, h.T)
  grad_b1 = (1/batch_size) * np.sum(l1, axis=1, keepdims=True)
  grad_b2 = (1/batch_size) * np.sum(y_hat - y, axis=1, keepdims=True)

  return grad_w1, grad_w2, grad_b1, grad_b2

Getting the Batches

def get_batches(data, word2Ind, V, C, batch_size):
    batch_x = []
    batch_y = []
    for x, y in get_vectors(data, word2Ind, V, C):
        batch_x.append(x)
        batch_y.append(y)
        if len(batch_x) == batch_size:
            yield np.array(batch_x).T, np.array(batch_y).T
            batch_x = []
            batch_y = []
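
Assuming the get_vectors sketch above, pulling a single batch shows the shapes the training loop expects:

x_batch, y_batch = next(get_batches(data, word2Ind, V, C=2, batch_size=128))
print(x_batch.shape, y_batch.shape)   # (V, 128) (V, 128)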

Gradient Descent 

def grad_descent(data, word2Ind, N, V, num_iters, alpha=0.03):

    w1, w2, b1, b2 = initialize(N, V, random_seed=282)
    batch_size = 128
    iters = 0
    C = 2
    for x, y in get_batches(data, word2Ind, V, C, batch_size):

        z, h = forward_propagation(x, w1, w2, b1, b2)
        y_hat = softmax(z)
        batch_cost = cost(y, y_hat, batch_size)

        if (iters + 1) % 10 == 0:
            print(f"iters: {iters + 1} cost: {batch_cost:.6f}")

        grad_w1, grad_w2, grad_b1, grad_b2 = back_prop(x, y_hat, y, h, w1, w2, b1, b2, batch_size)

        w1 -= alpha*grad_w1
        w2 -= alpha*grad_w2
        b1 -= alpha*grad_b1
        b2 -= alpha*grad_b2

        iters += 1
        if iters == num_iters:
            break
        if iters % 100 == 0:
            alpha *= 0.66      # decay the learning rate every 100 iterations

    return w1, w2, b1, b2
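
With everything defined, training can be run and the embeddings read off the learned weights. Averaging w1 and the transpose of w2 is one common convention (a sketch with illustrative settings, not necessarily what the linked repository does); column i is then the embedding of the word with index i.

w1, w2, b1, b2 = grad_descent(data, word2Ind, N=50, V=V, num_iters=150)

# One common convention: average the input and output weight matrices.
embeddings = (w1 + w2.T) / 2

# Look up a word's vector (assumes the word occurs in the corpus).
vector = embeddings[:, word2Ind['king']]
print(vector.shape)   # (N,)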

Link to Github Repository for this Post

https://github.com/psych0man/Continuous-Bag-of-Words-Model
