
Wide Residual Networks

To follow along, please refer to this Kaggle Kernel.

Introduction

Prior to the introduction of Wide Residual Networks (WRNs) by Sergey Zagoruyko and Nikos Komodakis, deep residual networks gained only fractional improvements in performance at the cost of nearly doubling the number of layers. This led to the problem of diminishing feature reuse and made the models slow to train. WRNs showed that widening a residual network leads to better performance and improved the then-SOTA results on CIFAR, SVHN and COCO.

In this notebook, we run through a simple demonstration of training a WideResnet on the CIFAR-10 dataset using the Trax framework. Trax is an end-to-end library for deep learning that focuses on clear code and speed. It is actively used and maintained by the Google Brain team.

Issues with Traditional Residual Networks

Figure 1: Various ResNet Blocks (WRN-1.png)

Diminishing Feature Reuse

The residual block with an identity mapping, which allows us to train very deep networks, is also a weakness. As the gradient flows through the network, there is nothing to force it to go through the residual block weights, so a block can avoid learning anything useful during training. As a result, only a few blocks may learn valuable representations, while many blocks share very little information and contribute little to the final goal. An earlier attempt to address this problem used a special case of dropout applied to residual blocks, in which an identity scalar weight, on which dropout is applied, is added to each residual block.

As widening the residual blocks increases the number of parameters, the authors also studied the effect of dropout to regularize training and prevent overfitting. They argued that dropout should be inserted between the convolutional layers of a block rather than in its identity (shortcut) path, and showed that this gives consistent gains, yielding new SOTA results.
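The sketch below illustrates this placement in Trax: a basic residual block with tl.Dropout inserted between its two 3×3 convolutions rather than on the shortcut. The layer names are real Trax layers, but the block itself is an illustrative approximation written for this notebook, not the block used inside trax.models.resnet.WideResnet.

from trax import layers as tl

def block_with_dropout(filters, dropout_rate=0.3):
    # Illustrative basic residual block with dropout between the convolutions,
    # as the paper proposes, rather than in the identity (shortcut) path.
    # Assumes the input already has `filters` channels so the identity shortcut
    # matches the block output.
    return tl.Residual(
        tl.BatchNorm(), tl.Relu(),
        tl.Conv(filters, (3, 3), padding='SAME'),
        tl.Dropout(rate=dropout_rate),
        tl.BatchNorm(), tl.Relu(),
        tl.Conv(filters, (3, 3), padding='SAME'),
    )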

The paper Wide Residual Networks attempts to answer the question of how wide deep residual networks should be and addresses the problem of training them.

Residual Networks

[latex] $x_{l+1} = x_{l} + \mathbb{F} (x_l, W_l)$ [/latex]

This is the representation of a Residual block with an identity mapping.

  • [latex]$x_l$[/latex] and [latex]$x_{l+1}$[/latex] represent the input and output of the [latex]$l$[/latex]-th unit in the network
  • [latex]$\mathbb{F}$[/latex] is the residual function
  • [latex]$W_l$[/latex] are the parameters of the block

Figures 1(a) and 1(c) show the fundamental difference between the basic and the basic-wide blocks used.
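To see the identity-mapping formula in code, the snippet below uses Trax's tl.Residual combinator, which adds the output of the wrapped layers to their input. The residual function here is just a tl.Dense layer chosen for illustration; it is not the convolutional block used by WideResnet.

import numpy as np
from trax import layers as tl
from trax import shapes

# x_{l+1} = x_l + F(x_l, W_l): tl.Residual(F) adds the output of F to its input.
F = tl.Dense(8)                 # stand-in residual function, purely for illustration
residual_unit = tl.Residual(F)

x = np.ones((2, 8), dtype=np.float32)
residual_unit.init(shapes.signature(x))
print(np.allclose(residual_unit(x), x + F(x)))  # True: output equals input plus F(x)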

Architecture

Figure: Overall structure of Wide Residual Networks (WRN-2.png)

This is the basic structure of Wide Residual Networks. In the paper, the size of conv1 is fixed across all experiments, while the widening factor k is varied in the next three groups. Here k multiplies the number of feature maps in the convolutional layers.
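To make the role of k concrete, the small helper below (written for this notebook, not part of Trax or the paper's code) lists the per-group channel widths; with k = 1 these reduce to the original thin ResNet widths.

def group_widths(k):
    # Number of feature maps in the conv2, conv3 and conv4 groups
    # for a widening factor k.
    return [16 * k, 32 * k, 64 * k]

print(group_widths(1))   # [16, 32, 64]    -> the original "thin" ResNet
print(group_widths(10))  # [160, 320, 640] -> e.g. WRN-28-10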

Let B(M) denote various residual block structures, where M is a list with the kernel sizes of the convolutional layers in a block. The following architectures were used in experimentation:

  • B(3,3) – the original “basic” block (Figure 1(a))
  • B(3,1,3) – same as basic, but with an extra 1×1 layer in between
  • B(1,3,1) – the “bottleneck” block (Figure 1(b))
  • B(1,3) – alternating 1×1–3×3 convolutions
  • B(3,1) – alternating 3×3–1×1 convolutions
  • B(3,1,1) – a Network-in-Network style block

Experimental Results

Figure: Test error (%, median over 5 runs) on CIFAR-10 of residual networks with k = 1 and different block types; time is for one training epoch. (WRN-3.png)

The paper highlights that the block structure B(3,3) beats B(3,1) and B(3,1,3) by a small margin.

Key Takeaways

The paper presents a method that gives a total improvement of 4.4% over ResNet-1001 and shows that:

  • widening consistently improves performance across residual networks of different depth
  • increasing both depth and width helps until the number of parameters becomes too high and stronger regularization is required
  • there doesn’t seem to be a regularization effect from very high depth in residual networks, as wide networks with the same number of parameters as thin ones can learn the same or better representations. Furthermore, wide networks can successfully learn with 2 or more times as many parameters as thin ones, which would require doubling the depth of thin networks, making them infeasibly expensive to train.

Importing Libraries

import trax
from trax import layers as tl
from trax.supervised import training

# Trax offers the WideResnet architecture in its models module
from trax.models.resnet import WideResnet

Downloading the Dataset

Trax offers a rich collection of trax.data APIs to create input pipelines. One of these is trax.data.TFDS(), which returns an iterator of numpy arrays representing the dataset.

If you’d like to learn more about the trax.data APIs, please check out the notebook here, where I explain the most common APIs in depth.

train_stream = trax.data.TFDS('cifar10', keys=('image', 'label'), train=True)()
eval_stream = trax.data.TFDS('cifar10', keys=('image', 'label'), train=False)()
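As a quick sanity check (ours, not part of the original notebook), we can pull a single example from the training stream; each element should be an (image, label) pair of numpy arrays. Note that this consumes one example from the stream.

# Peek at one (image, label) pair from the training stream.
image, label = next(train_stream)
print(image.shape, image.dtype, label)   # expected: (32, 32, 3) uint8 and an integer class id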

Batch Generator

Here, we create pre-processing pipelines using the Shuffle(), Batch() and AddLossWeights() functions from the trax.data API.

train_data_pipeline = trax.data.Serial(
    trax.data.Shuffle(),
    trax.data.Batch(64),
    trax.data.AddLossWeights(),
)

train_batches_stream = train_data_pipeline(train_stream)

eval_data_pipeline = trax.data.Serial(
    trax.data.Batch(64),
    trax.data.AddLossWeights(),
)

eval_batches_stream = eval_data_pipeline(eval_stream)
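Pulling one batch confirms the shapes the pipeline produces: AddLossWeights() appends a per-example weight array, so each batch is an (images, labels, weights) triple. This check is ours, not part of the original notebook.

# Inspect a single batch from the training pipeline.
images, labels, weights = next(train_batches_stream)
print(images.shape, labels.shape, weights.shape)   # expected: (64, 32, 32, 3) (64,) (64,)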

Model Architecture

We use the WideResnet architecture defined in the trax.models.resnet module. By default the widening factor is set to 1, so we experiment with four values of widen_factor: 1, 2, 3 and 4. The architecture doesn’t contain a tl.LogSoftmax() layer, so we append one to our model using the tl.Serial() combinator.

thin_model = tl.Serial(
    WideResnet(widen_factor = 1),
    tl.LogSoftmax()
)

wide_model = tl.Serial(
    WideResnet(widen_factor = 2),
    tl.LogSoftmax()
)

wider_model = tl.Serial(
    WideResnet(widen_factor = 3),
    tl.LogSoftmax()
)

widest_model = tl.Serial(
    WideResnet(widen_factor = 4),
    tl.LogSoftmax()
)
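To compare the sizes of the four models, a small helper like the one below can be used; count_params is our own illustrative function (not a Trax API) and assumes a CIFAR-10-shaped input for initialization.

import jax
import numpy as np
from trax import shapes

def count_params(model):
    # Initialize with a dummy CIFAR-10-shaped input to materialize the weights,
    # then sum the sizes of all weight arrays.
    model.init(shapes.ShapeDtype((1, 32, 32, 3), dtype=np.float32))
    return sum(w.size for w in jax.tree_util.tree_leaves(model.weights))

for name, model in [('thin', thin_model), ('wide', wide_model),
                    ('wider', wider_model), ('widest', widest_model)]:
    print(name, count_params(model))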

Once we have our model and the data, we use trax.supervised.training to define training and eval tasks and create a training loop. The Trax training loop optimizes training and creates TensorBoard logs and model checkpoints for you.

train_task = training.TrainTask(
    labeled_data=train_batches_stream,
    loss_layer=tl.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adam(0.01),
    n_steps_per_checkpoint=1000,
)

eval_task = training.EvalTask(
    labeled_data=eval_batches_stream,
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
    n_eval_batches=20,
)
training_loop = training.Loop(thin_model, 
                              train_task, 
                              eval_tasks=[eval_task], 
                              output_dir='./thin_model')

training_loop.run(5000)
Step      1: Ran 1 train steps in 50.80 secs
Step      1: train CrossEntropyLoss |  2.23222899
Step      1: eval  CrossEntropyLoss |  2.29001374
Step      1: eval          Accuracy |  0.17578125

Step   1000: Ran 999 train steps in 450.67 secs
Step   1000: train CrossEntropyLoss |  1.51696503
Step   1000: eval  CrossEntropyLoss |  1.24389263
Step   1000: eval          Accuracy |  0.55156250

Step   2000: Ran 1000 train steps in 391.61 secs
Step   2000: train CrossEntropyLoss |  1.14578664
Step   2000: eval  CrossEntropyLoss |  1.07358952
Step   2000: eval          Accuracy |  0.60625000

Step   3000: Ran 1000 train steps in 379.49 secs
Step   3000: train CrossEntropyLoss |  0.98207539
Step   3000: eval  CrossEntropyLoss |  0.97521102
Step   3000: eval          Accuracy |  0.66406250

Step   4000: Ran 1000 train steps in 375.78 secs
Step   4000: train CrossEntropyLoss |  0.87311232
Step   4000: eval  CrossEntropyLoss |  0.86088786
Step   4000: eval          Accuracy |  0.71328125

Step   5000: Ran 1000 train steps in 382.36 secs
Step   5000: train CrossEntropyLoss |  0.79590148
Step   5000: eval  CrossEntropyLoss |  0.90162431
Step   5000: eval          Accuracy |  0.67968750
training_loop = training.Loop(wide_model, 
                              train_task, 
                              eval_tasks=[eval_task], 
                              output_dir='./wide_model')

training_loop.run(5000)
Step      1: Ran 1 train steps in 47.65 secs
Step      1: train CrossEntropyLoss |  2.45481586
Step      1: eval  CrossEntropyLoss |  2.29830315
Step      1: eval          Accuracy |  0.17265625

Step   1000: Ran 999 train steps in 912.74 secs
Step   1000: train CrossEntropyLoss |  1.53430188
Step   1000: eval  CrossEntropyLoss |  1.26175007
Step   1000: eval          Accuracy |  0.54296875

Step   2000: Ran 1000 train steps in 898.69 secs
Step   2000: train CrossEntropyLoss |  1.13339007
Step   2000: eval  CrossEntropyLoss |  0.98637019
Step   2000: eval          Accuracy |  0.65390625

Step   3000: Ran 1000 train steps in 914.89 secs
Step   3000: train CrossEntropyLoss |  0.94035697
Step   3000: eval  CrossEntropyLoss |  0.89243511
Step   3000: eval          Accuracy |  0.68906250

Step   4000: Ran 1000 train steps in 878.78 secs
Step   4000: train CrossEntropyLoss |  0.82797289
Step   4000: eval  CrossEntropyLoss |  0.86628049
Step   4000: eval          Accuracy |  0.70156250

Step   5000: Ran 1000 train steps in 896.67 secs
Step   5000: train CrossEntropyLoss |  0.74435514
Step   5000: eval  CrossEntropyLoss |  0.82093265
Step   5000: eval          Accuracy |  0.70859375
training_loop = training.Loop(wider_model, 
                              train_task, 
                              eval_tasks=[eval_task], 
                              output_dir='./wider_model')

training_loop.run(5000)
Step      1: Ran 1 train steps in 49.64 secs
Step      1: train CrossEntropyLoss |  2.46369076
Step      1: eval  CrossEntropyLoss |  2.42145765
Step      1: eval          Accuracy |  0.16015625

Step   1000: Ran 999 train steps in 1462.86 secs
Step   1000: train CrossEntropyLoss |  1.55000281
Step   1000: eval  CrossEntropyLoss |  1.31225752
Step   1000: eval          Accuracy |  0.53203125

Step   2000: Ran 1000 train steps in 1417.44 secs
Step   2000: train CrossEntropyLoss |  1.14296257
Step   2000: eval  CrossEntropyLoss |  1.05580651
Step   2000: eval          Accuracy |  0.61796875

Step   3000: Ran 1000 train steps in 1412.55 secs
Step   3000: train CrossEntropyLoss |  0.96064937
Step   3000: eval  CrossEntropyLoss |  0.91904441
Step   3000: eval          Accuracy |  0.66093750

Step   4000: Ran 1000 train steps in 1394.65 secs
Step   4000: train CrossEntropyLoss |  0.86051035
Step   4000: eval  CrossEntropyLoss |  0.79895681
Step   4000: eval          Accuracy |  0.71875000

Step   5000: Ran 1000 train steps in 1325.14 secs
Step   5000: train CrossEntropyLoss |  0.76998872
Step   5000: eval  CrossEntropyLoss |  0.79824924
Step   5000: eval          Accuracy |  0.73046875
training_loop = training.Loop(widest_model, 
                              train_task, 
                              eval_tasks=[eval_task], 
                              output_dir='./widest_model')

training_loop.run(5000)
Step      1: Ran 1 train steps in 50.33 secs
Step      1: train CrossEntropyLoss |  2.46660376
Step      1: eval  CrossEntropyLoss |  2.54380413
Step      1: eval          Accuracy |  0.18828125

Step   1000: Ran 999 train steps in 2232.45 secs
Step   1000: train CrossEntropyLoss |  1.60101640
Step   1000: eval  CrossEntropyLoss |  1.32047499
Step   1000: eval          Accuracy |  0.51562500

Step   2000: Ran 1000 train steps in 2207.08 secs
Step   2000: train CrossEntropyLoss |  1.23905230
Step   2000: eval  CrossEntropyLoss |  1.14217659
Step   2000: eval          Accuracy |  0.59140625

Step   3000: Ran 1000 train steps in 2193.96 secs
Step   3000: train CrossEntropyLoss |  1.02972627
Step   3000: eval  CrossEntropyLoss |  0.95948886
Step   3000: eval          Accuracy |  0.66015625

Step   4000: Ran 1000 train steps in 2184.98 secs
Step   4000: train CrossEntropyLoss |  0.88306051
Step   4000: eval  CrossEntropyLoss |  0.92772388
Step   4000: eval          Accuracy |  0.66796875

Step   5000: Ran 1000 train steps in 2179.92 secs
Step   5000: train CrossEntropyLoss |  0.79552704
Step   5000: eval  CrossEntropyLoss |  0.73757971
Step   5000: eval          Accuracy |  0.72968750
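Finally, as a rough sanity check (ours, not part of the original notebook), we can run the trained widest model directly on one eval batch and compute its accuracy by hand; this assumes the same batch format used during training.

import numpy as np

# Run the trained widest model on a single eval batch and measure accuracy by hand.
images, labels, _ = next(eval_batches_stream)
log_probs = widest_model(images)               # log-probabilities, shape (64, 10)
predictions = np.argmax(log_probs, axis=-1)
print('batch accuracy:', float(np.mean(predictions == labels)))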
