Continuous Bag of Words Model from Scratch


In this post, I’ll walk through an implementation of the Continuous Bag of Words (CBOW) model for generating word embedding vectors. 

Applications of Word Embeddings

  • Semantic Analogies
  • Sentiment Analysis 
  • Classification of customer feedback
  • Machine Translation 
  • Information Extraction
  • Question Answering 


Why not One-Hot Vectors?

One-hot vectors can also serve as word vectors: they are simple and imply no ordering between words. However, their length equals the vocabulary size, so they become extremely large for any sizeable corpus, and they carry no embedded meaning: every one-hot vector is equally distant from every other.
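To make the size difference concrete, here is a tiny sketch (the vocabulary size and embedding dimension are just illustrative numbers):

import numpy as np

V = 10000                       # vocabulary size (illustrative)
N = 300                         # embedding dimension (illustrative)

one_hot = np.zeros(V)           # V components, a single 1, equally distant from every other word
one_hot[42] = 1                 # index of the word in the vocabulary
embedding = np.random.rand(N)   # dense N-dimensional vector (random here; CBOW learns it so its geometry encodes meaning)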

Advantages of Word Embedding Vectors

  • Low-dimensional
  • Embed meaning


Other Famous Models

  • word2vec
  • GloVe (Global Vectors)
  • fastText
  • BERT
  • ELMo
  • GPT-2


Importing Libraries 

Let’s get started by importing the required libraries and downloading ‘punkt’ from the nltk module.
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
from collections import Counter
'punkt')   # Tokenizer models used by nltk.word_tokenize


We need to do some pre-processing on the corpus: replacing punctuation with full stops, dropping non-alphabetical tokens, and lowercasing everything.

import re
with open('shakespeare.txt') as f:
  data =                                                  # Read the whole corpus as one string
data = re.sub(r'[,!?;-]', '.', data)                               # Replace punctuation with full stops
data = nltk.word_tokenize(data)                                    # Tokenize into words
data = [ch.lower() for ch in data if ch.isalpha() or ch == '.']    # Lowercase; keep words and full stops

freq_dist = nltk.FreqDist(word for word in data)
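The functions below also rely on a word2Ind dictionary that maps every vocabulary word to an integer index (and Ind2word for the reverse lookup). It isn’t defined in the snippets above, so here is a minimal sketch of one way to build it from the processed data (get_dict is just my name for this helper):

def get_dict(data):
  # Assign each unique word a stable integer index, and keep the reverse mapping
  words = sorted(set(data))
  word2Ind = {word: i for i, word in enumerate(words)}
  Ind2word = {i: word for word, i in word2Ind.items()}
  return word2Ind, Ind2word

word2Ind, Ind2word = get_dict(data)
V = len(word2Ind)   # Vocabulary size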

Building the Model

Initialising the Model

def initialize(N, V, random_seed=1):

  np.random.seed(random_seed)   # Make the random initialisation reproducible
  w1 = np.random.rand(N, V)     # Input-to-hidden weights
  w2 = np.random.rand(V, N)     # Hidden-to-output weights
  b1 = np.random.rand(N, 1)     # Hidden-layer bias
  b2 = np.random.rand(V, 1)     # Output-layer bias

  return w1, w2, b1, b2
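Here N is the embedding dimension and V the vocabulary size, so w1 is N×V, w2 is V×N, and the biases are column vectors. A quick usage check (the value of N is just an example):

N = 50                                          # Embedding dimension (example value)
w1, w2, b1, b2 = initialize(N, V)
print(w1.shape, w2.shape, b1.shape, b2.shape)   # (N, V) (V, N) (N, 1) (V, 1)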

The softmax activation function

def softmax(z):
  e = np.exp(z)
  y_hat = e / np.sum(e, axis=0)   # Column-wise: one probability distribution per example

  return y_hat
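The input z holds one column per training example, so softmax is applied column-wise and every column of the output is a probability distribution over the V words. A quick sanity check with made-up numbers:

z = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [0.2, 0.1]])
print(softmax(z).sum(axis=0))   # Every column sums to 1: [1. 1.]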

Forward Propagation

def forward_propagation(x, w1, w2, b1, b2):

  h =, x) + b1
  h = np.maximum(0, h)    # Make-shift ReLU
  z =, h) + b2

  return z, h

Cost Function

def cost(y, y_hat, batch_size):
  # Cross-entropy cost averaged over the batch
  log_prob = np.multiply(np.log(y_hat), y) + np.multiply(np.log(1 - y_hat), 1 - y)
  cost = -1/batch_size * np.sum(log_prob)
  cost = np.squeeze(cost)

  return cost
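As a quick sanity check (made-up numbers, a single example and a three-word vocabulary), the cost is close to zero when the prediction matches the one-hot target:

y = np.array([[1.0], [0.0], [0.0]])
y_hat = np.array([[0.999], [0.0005], [0.0005]])
print(cost(y, y_hat, batch_size=1))   # ≈ 0.002, close to zero for a near-perfect prediction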

Back Propagation 

def back_prop(x, y_hat, y, h, w1, w2, b1, b2, batch_size):
  l1 =, y_hat - y)   # Error propagated back to the hidden layer
  l1 = np.maximum(0, l1)         # Make-shift ReLU
  grad_w1 = (1/batch_size) *, x.T)
  grad_w2 = (1/batch_size) * - y, h.T)
  grad_b1 = np.sum((1/batch_size) * l1, axis=1, keepdims=True)
  grad_b2 = np.sum((1/batch_size) * (y_hat - y), axis=1, keepdims=True)

  return grad_w1, grad_w2, grad_b1, grad_b2

Getting the Batches

def get_batches(data, word2Ind, V, C, batch_size):
    batch_x = []
    batch_y = []
    for x, y in get_vectors(data, word2Ind, V, C):
        batch_x.append(x)
        batch_y.append(y)
        if len(batch_x) == batch_size:
            # A full batch is ready: yield it with examples as columns
            yield np.array(batch_x).T, np.array(batch_y).T
            batch_x = []
            batch_y = []
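get_batches relies on a get_vectors generator that walks over the corpus and yields, for each centre word, the average of the one-hot vectors of the C context words on either side together with the one-hot vector of the centre word itself. That helper lives in the repository rather than in this post; a minimal sketch of it could look like this:

def get_vectors(data, word2Ind, V, C):
    # The centre word is the target; the 2*C surrounding words are the context
    for i in range(C, len(data) - C):
        y = np.zeros(V)
        y[word2Ind[data[i]]] = 1                   # One-hot centre word
        context = data[i-C:i] + data[i+1:i+C+1]    # C words on each side
        x = np.zeros(V)
        for word in context:
            x[word2Ind[word]] += 1
        x /= len(context)                          # Average of the context one-hot vectors
        yield x, y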

Gradient Descent 

def grad_descent(data, word2Ind, N, V, num_iters, alpha=0.03):

    w1, w2, b1, b2 = initialize(N, V, random_seed=282)
    batch_size = 128
    iters = 0
    C = 2   # Context half-window size
    for x, y in get_batches(data, word2Ind, V, C, batch_size):

        z, h = forward_propagation(x, w1, w2, b1, b2)
        y_hat = softmax(z)
        batch_cost = cost(y, y_hat, batch_size)

        if (iters + 1) % 10 == 0:
            print(f"iters: {iters + 1} cost: {batch_cost:.6f}")

        grad_w1, grad_w2, grad_b1, grad_b2 = back_prop(x, y_hat, y, h, w1, w2, b1, b2, batch_size)
        w1 -= alpha * grad_w1
        w2 -= alpha * grad_w2
        b1 -= alpha * grad_b1
        b2 -= alpha * grad_b2

        iters += 1
        if iters == num_iters:
            break
        if iters % 100 == 0:
            alpha *= 0.66   # Decay the learning rate every 100 iterations
    return w1, w2, b1, b2
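Once training finishes, the embeddings themselves can be read off the learned weights; a common choice is to average the input and output weight matrices so that each vocabulary word gets one N-dimensional vector. A sketch of how you might extract and query them (the hyper-parameter values are only examples):

w1, w2, b1, b2 = grad_descent(data, word2Ind, N=50, V=V, num_iters=150)
embeddings = (w1.T + w2) / 2.0            # Shape (V, N): one embedding per vocabulary word

word = 'king'                             # Any word present in the vocabulary
if word in word2Ind:
    vec = embeddings[word2Ind[word]]      # The learned embedding vector for that word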

Link to Github Repository for this Post
