To follow along, please refer to this Kaggle Kernel.
Prior to the introduction of Wide Residual Networks (WRNs) by Sergey Zagoruyko and Nikos Komodakis, deep residual networks gained only fractional improvements in performance at the cost of nearly doubling the number of layers. This led to the problem of diminishing feature reuse and made the models slow to train. WRNs showed that a wider residual network leads to better performance, improving the then-SOTA results on CIFAR, SVHN and COCO.
In this notebook, we run through a simple demonstration of training a WideResnet on the cifar10 dataset using the Trax framework. Trax is an end-to-end library for deep learning that focuses on clear code and speed. It is actively used and maintained in the Google Brain team.
Figure 1: Various ResNet Blocks
A residual block with an identity mapping, which allows us to train very deep networks, is at the same time a weakness. As the gradient flows through the network there is nothing to force it to go through the residual block weights, so a block can avoid learning anything during training. As a result, only a few blocks may learn valuable representations, or many blocks may share very little information and contribute little to the final goal. Earlier work tried to address this problem with a special case of dropout applied to residual blocks, in which an identity scalar weight is added to each residual block and dropout is applied to it.
Since widening the residual blocks increases the number of parameters, the authors studied the effect of dropout to regularize training and prevent overfitting. They argued that dropout should be inserted between the convolutional layers instead of in the identity part of the block, and showed that this yields consistent gains and new SOTA results.
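To make the idea concrete, here is a minimal sketch (not the exact block used by trax.models.resnet) of a wide basic block B(3,3) with dropout inserted between its two convolutions; the helper name wide_block_with_dropout is purely illustrative.

from trax import layers as tl

# Sketch of a wide basic block B(3,3) with dropout between the convolutions,
# as suggested in the paper. Assumes the input already has `channels` feature
# maps so the residual addition is shape-compatible.
def wide_block_with_dropout(channels, dropout_rate=0.3, mode='train'):
    return tl.Residual(
        tl.BatchNorm(),
        tl.Relu(),
        tl.Conv(channels, (3, 3), padding='SAME'),
        tl.Dropout(rate=dropout_rate, mode=mode),
        tl.BatchNorm(),
        tl.Relu(),
        tl.Conv(channels, (3, 3), padding='SAME'),
    )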
The paper Wide Residual Networks attempts to answer the question of how wide deep residual networks should be and to address the problem of training them.
$x_{l+1} = x_{l} + \mathcal{F}(x_l, W_l)$
This is the representation of a Residual block with an identity mapping.
Figure 1(a) and 1(c) represent the fundamental difference between the basic and the basic-wide blocks used.
This is the basic structure of Wide Residual Networks. In the paper, the size of conv1 was fixed in all experiments, while the “widening” factor k was varied in the next three groups. Here k is the widening factor, which multiplies the number of feature planes in the convolutional layers.
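To see what k does to the network's width, the first convolution keeps 16 feature maps while the three following groups use 16k, 32k and 64k maps; a tiny illustrative helper (hypothetical, not part of Trax):

# Feature maps per group in a WRN as a function of the widening factor k.
def group_widths(k):
    return [16, 16 * k, 32 * k, 64 * k]

print(group_widths(1))   # [16, 16, 32, 64]    -> the original "thin" ResNet
print(group_widths(10))  # [16, 160, 320, 640] -> e.g. WRN-28-10 from the paper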
Let B(M) denote a residual block structure, where M is a list with the kernel sizes of the convolutional layers in a block. The following architectures were used in the experiments:
Test error (%, median over 5 runs) on CIFAR-10 of residual networks with k = 1 and different block types. Time represents one training epoch
The paper highlights that the block structure B(3,3) beats B(3,1) and B(3,1,3) by a small margin.
The paper highlights a method giving a total improvement of 4.4% over ResNet-1001.
import trax
from trax import layers as tl
from trax.supervised import training

# Trax offers the WideResnet architecture in its models module
from trax.models.resnet import WideResnet
Trax offers a rich collection of trax.data APIs to create input pipelines. One of these is trax.data.TFDS(), which creates a function that, when called, returns an iterator of numpy arrays representing the dataset. If you'd like to learn more about the trax.data APIs, please check out the notebook here, where I explain the most common APIs in depth.
train_stream = trax.data.TFDS('cifar10', keys=('image', 'label'), train=True)()
eval_stream = trax.data.TFDS('cifar10', keys=('image', 'label'), train=False)()
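As an optional sanity check (not part of the original pipeline), we can pull one example from the raw stream; CIFAR-10 examples should be 32x32 RGB images with integer class labels from 0 to 9.

# Peek at a single raw example: an image array and its class label.
image, label = next(train_stream)
print(image.shape, image.dtype, label)   # expected: (32, 32, 3) uint8 and an int in 0-9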
Here, we create pre-processing pipelines using the Shuffle(), Batch() and AddLossWeights() functions from the trax.data API.
train_data_pipeline = trax.data.Serial(
    trax.data.Shuffle(),
    trax.data.Batch(64),
    trax.data.AddLossWeights(),
)
train_batches_stream = train_data_pipeline(train_stream)

eval_data_pipeline = trax.data.Serial(
    trax.data.Batch(64),
    trax.data.AddLossWeights(),
)
eval_batches_stream = eval_data_pipeline(eval_stream)
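Optionally, we can inspect one batch to see what the pipeline produces: AddLossWeights() appends per-example weights, so each batch is an (images, labels, weights) triple.

# Inspect one training batch produced by the pipeline.
images, labels, weights = next(train_batches_stream)
print(images.shape, labels.shape, weights.shape)   # expected: (64, 32, 32, 3) (64,) (64,)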
We use the WideResnet architecture defined in the trax.models.resnet module. By default the “widening factor” is set to 1, so we experiment with four values of widen_factor: 1, 2, 3 and 4. The architecture doesn't contain a tl.LogSoftmax() layer, so we add it to our model using the tl.Serial() combinator.
thin_model = tl.Serial(WideResnet(widen_factor=1), tl.LogSoftmax())
wide_model = tl.Serial(WideResnet(widen_factor=2), tl.LogSoftmax())
wider_model = tl.Serial(WideResnet(widen_factor=3), tl.LogSoftmax())
widest_model = tl.Serial(WideResnet(widen_factor=4), tl.LogSoftmax())
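Printing a Trax model shows its layer structure, which is a quick way to confirm that the tl.LogSoftmax() was appended as expected.

# Display the layer structure of one of the models.
print(thin_model)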
Once we have our models and the data, we use trax.supervised.training to define training and eval tasks and create a training loop. The Trax training loop optimizes training and will create TensorBoard logs and model checkpoints for you.
train_task = training.TrainTask(
    labeled_data=train_batches_stream,
    loss_layer=tl.CrossEntropyLoss(),
    optimizer=trax.optimizers.Adam(0.01),
    n_steps_per_checkpoint=1000,
)

eval_task = training.EvalTask(
    labeled_data=eval_batches_stream,
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
    n_eval_batches=20,
)
training_loop = training.Loop(thin_model, train_task, eval_tasks=[eval_task], output_dir='./thin_model')
training_loop.run(5000)
Step 1: Ran 1 train steps in 50.80 secs
Step 1: train CrossEntropyLoss | 2.23222899
Step 1: eval CrossEntropyLoss | 2.29001374
Step 1: eval Accuracy | 0.17578125
Step 1000: Ran 999 train steps in 450.67 secs
Step 1000: train CrossEntropyLoss | 1.51696503
Step 1000: eval CrossEntropyLoss | 1.24389263
Step 1000: eval Accuracy | 0.55156250
Step 2000: Ran 1000 train steps in 391.61 secs
Step 2000: train CrossEntropyLoss | 1.14578664
Step 2000: eval CrossEntropyLoss | 1.07358952
Step 2000: eval Accuracy | 0.60625000
Step 3000: Ran 1000 train steps in 379.49 secs
Step 3000: train CrossEntropyLoss | 0.98207539
Step 3000: eval CrossEntropyLoss | 0.97521102
Step 3000: eval Accuracy | 0.66406250
Step 4000: Ran 1000 train steps in 375.78 secs
Step 4000: train CrossEntropyLoss | 0.87311232
Step 4000: eval CrossEntropyLoss | 0.86088786
Step 4000: eval Accuracy | 0.71328125
Step 5000: Ran 1000 train steps in 382.36 secs
Step 5000: train CrossEntropyLoss | 0.79590148
Step 5000: eval CrossEntropyLoss | 0.90162431
Step 5000: eval Accuracy | 0.67968750
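Since the loop wrote checkpoints under ./thin_model, the trained weights can later be restored into a freshly built model. A minimal sketch, assuming Trax's default checkpoint file name model.pkl.gz:

# Restore the trained thin model from its checkpoint directory
# (assumes the default checkpoint file name written by training.Loop).
restored_model = tl.Serial(WideResnet(widen_factor=1), tl.LogSoftmax())
restored_model.init_from_file('./thin_model/model.pkl.gz', weights_only=True)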
training_loop = training.Loop(wide_model, train_task, eval_tasks=[eval_task], output_dir='./wide_model')
training_loop.run(5000)
Step 1: Ran 1 train steps in 47.65 secs
Step 1: train CrossEntropyLoss | 2.45481586
Step 1: eval CrossEntropyLoss | 2.29830315
Step 1: eval Accuracy | 0.17265625
Step 1000: Ran 999 train steps in 912.74 secs
Step 1000: train CrossEntropyLoss | 1.53430188
Step 1000: eval CrossEntropyLoss | 1.26175007
Step 1000: eval Accuracy | 0.54296875
Step 2000: Ran 1000 train steps in 898.69 secs
Step 2000: train CrossEntropyLoss | 1.13339007
Step 2000: eval CrossEntropyLoss | 0.98637019
Step 2000: eval Accuracy | 0.65390625
Step 3000: Ran 1000 train steps in 914.89 secs
Step 3000: train CrossEntropyLoss | 0.94035697
Step 3000: eval CrossEntropyLoss | 0.89243511
Step 3000: eval Accuracy | 0.68906250
Step 4000: Ran 1000 train steps in 878.78 secs
Step 4000: train CrossEntropyLoss | 0.82797289
Step 4000: eval CrossEntropyLoss | 0.86628049
Step 4000: eval Accuracy | 0.70156250
Step 5000: Ran 1000 train steps in 896.67 secs
Step 5000: train CrossEntropyLoss | 0.74435514
Step 5000: eval CrossEntropyLoss | 0.82093265
Step 5000: eval Accuracy | 0.70859375
training_loop = training.Loop(wider_model, train_task, eval_tasks=[eval_task], output_dir='./wider_model')
training_loop.run(5000)
Step 1: Ran 1 train steps in 49.64 secs
Step 1: train CrossEntropyLoss | 2.46369076
Step 1: eval CrossEntropyLoss | 2.42145765
Step 1: eval Accuracy | 0.16015625
Step 1000: Ran 999 train steps in 1462.86 secs
Step 1000: train CrossEntropyLoss | 1.55000281
Step 1000: eval CrossEntropyLoss | 1.31225752
Step 1000: eval Accuracy | 0.53203125
Step 2000: Ran 1000 train steps in 1417.44 secs
Step 2000: train CrossEntropyLoss | 1.14296257
Step 2000: eval CrossEntropyLoss | 1.05580651
Step 2000: eval Accuracy | 0.61796875
Step 3000: Ran 1000 train steps in 1412.55 secs
Step 3000: train CrossEntropyLoss | 0.96064937
Step 3000: eval CrossEntropyLoss | 0.91904441
Step 3000: eval Accuracy | 0.66093750
Step 4000: Ran 1000 train steps in 1394.65 secs
Step 4000: train CrossEntropyLoss | 0.86051035
Step 4000: eval CrossEntropyLoss | 0.79895681
Step 4000: eval Accuracy | 0.71875000
Step 5000: Ran 1000 train steps in 1325.14 secs
Step 5000: train CrossEntropyLoss | 0.76998872
Step 5000: eval CrossEntropyLoss | 0.79824924
Step 5000: eval Accuracy | 0.73046875
training_loop = training.Loop(widest_model, train_task, eval_tasks=[eval_task], output_dir='./widest_model')
training_loop.run(5000)
Step 1: Ran 1 train steps in 50.33 secs
Step 1: train CrossEntropyLoss | 2.46660376
Step 1: eval CrossEntropyLoss | 2.54380413
Step 1: eval Accuracy | 0.18828125
Step 1000: Ran 999 train steps in 2232.45 secs
Step 1000: train CrossEntropyLoss | 1.60101640
Step 1000: eval CrossEntropyLoss | 1.32047499
Step 1000: eval Accuracy | 0.51562500
Step 2000: Ran 1000 train steps in 2207.08 secs
Step 2000: train CrossEntropyLoss | 1.23905230
Step 2000: eval CrossEntropyLoss | 1.14217659
Step 2000: eval Accuracy | 0.59140625
Step 3000: Ran 1000 train steps in 2193.96 secs
Step 3000: train CrossEntropyLoss | 1.02972627
Step 3000: eval CrossEntropyLoss | 0.95948886
Step 3000: eval Accuracy | 0.66015625
Step 4000: Ran 1000 train steps in 2184.98 secs
Step 4000: train CrossEntropyLoss | 0.88306051
Step 4000: eval CrossEntropyLoss | 0.92772388
Step 4000: eval Accuracy | 0.66796875
Step 5000: Ran 1000 train steps in 2179.92 secs
Step 5000: train CrossEntropyLoss | 0.79552704
Step 5000: eval CrossEntropyLoss | 0.73757971
Step 5000: eval Accuracy | 0.72968750
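After training, we can run the most recently trained widest model on one eval batch to get predictions. A minimal sketch, not part of the original notebook; it assumes the Loop has synced the trained weights into the model object.

# Run inference on one eval batch with the trained widest model.
# The cast to float32 is only a precaution; the model outputs log-probabilities.
images, labels, _ = next(eval_batches_stream)
log_probs = widest_model(images.astype('float32'))   # shape (64, 10)
predictions = log_probs.argmax(axis=-1)
print('batch accuracy:', (predictions == labels).mean())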
Saurav Maheshkar