After the rather long post on how to implement a neural network, here is a brief summary of how each hyper-parameter affects the network.
Learning rate
The learning rate needs careful tuning: training error as a function of the learning rate follows a U-shaped curve. When it is set too low or too high, the effective capacity of the model is reduced because the model can’t be optimised properly.
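As a minimal sketch (not code from the original post), here is how the learning rate enters a plain gradient-descent update; the function name and numbers are illustrative only:

```python
import numpy as np

def sgd_step(params, grads, learning_rate):
    # Move the parameters a step along the negative gradient,
    # scaled by the learning rate.
    return params - learning_rate * grads

params = np.array([1.0, -2.0])
grads = np.array([0.5, -0.5])

# A rate that is too small barely moves the parameters; one that is too large
# can overshoot the minimum and make the loss diverge.
print(sgd_step(params, grads, learning_rate=0.01))  # small, cautious step
print(sgd_step(params, grads, learning_rate=10.0))  # huge step, likely to overshoot
```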
Number of hidden units
The more hidden units, the greater the capacity of the model. More units also mean a higher computational cost (time and memory) to train and run the model.
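A quick illustrative calculation (assumed layer sizes, not from the original post) shows where the extra capacity and cost come from: the parameter count of a fully connected layer grows with the number of hidden units.

```python
def dense_layer_params(n_inputs, n_hidden):
    # Weight matrix plus one bias per hidden unit.
    return n_inputs * n_hidden + n_hidden

for n_hidden in (16, 64, 256):
    print(n_hidden, dense_layer_params(784, n_hidden))
```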
Convolutional kernel size
The wider the kernel, the more parameters the model has and the greater its capacity.
However, a wide kernel also shrinks the output dimensions, which reduces the model capacity. Zero-padding can be used to counter this effect.
A wider kernel increases the computational cost of the convolution itself, but the smaller output it produces reduces the cost of the layers that follow.
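A small sketch of the standard output-size formula makes the shrinkage concrete (stride 1, no padding; the sizes are assumptions for illustration):

```python
def conv_output_width(input_width, kernel_width, padding=0, stride=1):
    # Standard convolution output-size formula.
    return (input_width - kernel_width + 2 * padding) // stride + 1

# Each extra unit of kernel width removes one unit of output width.
for k in (3, 5, 7):
    print(k, conv_output_width(32, k))  # 30, 28, 26
```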
Zero-padding
Using zero-padding keeps the output size large, which preserves the model capacity. The larger output also increases the computational cost.
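Continuing the output-size sketch above (still illustrative sizes), a padding of (k - 1) // 2 on each side for an odd kernel width keeps the output width equal to the input width:

```python
def conv_output_width(input_width, kernel_width, padding=0, stride=1):
    return (input_width - kernel_width + 2 * padding) // stride + 1

# "Same" padding cancels the shrinkage caused by a wider kernel.
for k in (3, 5, 7):
    pad = (k - 1) // 2
    print(k, conv_output_width(32, k, padding=pad))  # 32, 32, 32
```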
Weight decay
A small weight decay increases the model capacity: decreasing the decay allows the model parameters to grow larger.
This parameter is intended to control overfitting.
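As a minimal sketch of L2 weight decay folded into a gradient step (function name and values are illustrative, not from the original post), the decay term pulls the weights towards zero, so a larger decay means smaller weights and less capacity:

```python
import numpy as np

def sgd_step_with_decay(weights, grads, learning_rate, weight_decay):
    # The weight_decay * weights term shrinks the weights towards zero
    # on every update.
    return weights - learning_rate * (grads + weight_decay * weights)

w = np.array([2.0, -3.0])
g = np.array([0.1, 0.1])
print(sgd_step_with_decay(w, g, learning_rate=0.1, weight_decay=0.0))  # no shrinkage
print(sgd_step_with_decay(w, g, learning_rate=0.1, weight_decay=0.1))  # weights pulled towards zero
```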
Dropout
Decreasing the dropout rate increases the model capacity because it allows units to co-adapt (“combine” with each other) to fit the training set better.
This parameter is intended to control overfitting.
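A minimal inverted-dropout sketch (assumed helper, not code from the original post) shows the effect of the rate: at train time each unit is zeroed with probability `rate` and the survivors are rescaled, so units cannot rely on specific partners; a lower rate keeps more units active together, raising capacity along with the risk of overfitting.

```python
import numpy as np

def dropout(activations, rate, rng):
    # Zero each unit with probability `rate` and rescale the survivors
    # so the expected activation stays the same (inverted dropout).
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1, 8))
print(dropout(a, rate=0.5, rng=rng))  # roughly half the units zeroed, rest scaled by 2
print(dropout(a, rate=0.1, rng=rng))  # most units kept
```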