Deep Learning Week 6 Notes
1. Benefits of depth
\(\text{Consider ReLU MLPs with a single input/output. There exists a network }f\text{ with }D^*\text{ layers and }2D^*\text{ internal units such that, for any network }g\text{ with }D\text{ layers of sizes }\{W^{(1)},...,W^{(D)}\}\text{, since the number of linear pieces of }g\text{ satisfies }k(g)\leq 2^D \prod_{d=1}^D W^{(d)}\text{:}\)
\[\begin{align} ||f-g||_1\geq 1-\frac{2^D}{2^{D^*}}\prod_{d=1}^DW^{(d)} \end{align} \]\(\text{In particular, with }g\text{ a single-hidden-layer network:}\)
\[||f-g||_1\geq 1-2\frac{W^{(1)}}{2^{D^*}} \]\(\textbf{To approximate }f\textbf{ properly, the width }W^{(1)}\textbf{ of }g\textbf{'s hidden layer has to increase exponentially with }f\textbf{'s depth }D^*.\)
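Rearranging the single-hidden-layer bound gives a concrete sense of the required width (illustrative numbers):
\[\begin{align} ||f-g||_1\leq \frac{1}{2} \;\Rightarrow\; 1-2\frac{W^{(1)}}{2^{D^*}}\leq \frac{1}{2} \;\Rightarrow\; W^{(1)}\geq 2^{D^*-2} \end{align} \]so for \(D^*=20\), approximating \(f\) within \(\frac{1}{2}\) in \(L_1\) already requires \(W^{(1)}\geq 2^{18}\approx 2.6\cdot 10^5\) hidden units.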
2. Rectifiers
\(\text{The derivative of }\tanh\text{ has an exponential tail on both sides and collapses to 0 very quickly, while ReLU passes the gradient of positive activations through unchanged, and positive activations typically make up about half of them.}\)
Leaky-ReLU
\[\begin{align} \max(ax,x) \end{align} \]where \(0\leq a< 1\)
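A minimal sketch of this in PyTorch; the slope value 0.01 below is just an illustrative choice (it happens to be nn.LeakyReLU's default negative_slope):

```python
import torch
from torch import nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

# Built-in module: computes max(a*x, x) with a = negative_slope
leaky = nn.LeakyReLU(negative_slope = 0.01)

# Equivalent manual formulation of max(a*x, x)
a = 0.01
manual = torch.max(a * x, x)

print(leaky(x))   # tensor([-0.0200, -0.0050,  0.0000,  1.5000])
print(manual)     # same values
```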
3. Dropout
Suppose each unit is dropped with probability \(p\). In expectation:
\[\begin{align} \mathbb{E}(X) = (1-p)\cdot X+p\cdot 0 \end{align} \]Therefore, to keep the expectation unchanged, it suffices to multiply the activations by \(\frac{1}{1-p}\) during training and to leave the network unchanged at test time. This method is called \(\text{Inverted Dropout}\).
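A minimal hand-written sketch of this inverted-dropout scaling (the function name and values are illustrative; PyTorch's nn.Dropout, shown next, does the same internally):

```python
import torch

def inverted_dropout(x, p=0.5, train=True):
    # At test time the input passes through unchanged.
    if not train:
        return x
    # Keep each component with probability 1 - p ...
    mask = (torch.rand_like(x) > p).float()
    # ... and rescale by 1 / (1 - p) so that the expectation equals x.
    return mask * x / (1 - p)

x = torch.ones(3, 5)
print(inverted_dropout(x, p=0.75))  # surviving entries equal 1 / (1 - 0.75) = 4
```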
```python
>>> import torch
>>> from torch import nn
>>> x = torch.full((3, 5), 1.0).requires_grad_()
>>> x
tensor([[ 1., 1., 1., 1., 1.],
        [ 1., 1., 1., 1., 1.],
        [ 1., 1., 1., 1., 1.]])
>>> dropout = nn.Dropout(p = 0.75)
>>> y = dropout(x)
>>> y
tensor([[ 0., 0., 4., 0., 4.],
        [ 0., 4., 4., 4., 0.],
        [ 0., 0., 4., 0., 0.]])
>>> l = y.norm(2, 1).sum()
>>> l.backward()
>>> x.grad
tensor([[ 0.0000, 0.0000, 2.8284, 0.0000, 2.8284],
        [ 0.0000, 2.3094, 2.3094, 2.3094, 0.0000],
        [ 0.0000, 0.0000, 4.0000, 0.0000, 0.0000]])
```
\(\text{Simply add dropout layers:}\)
```python
model = nn.Sequential(
    nn.Linear(10, 100), nn.ReLU(), nn.Dropout(),
    nn.Linear(100, 50), nn.ReLU(), nn.Dropout(),
    nn.Linear(50, 2),
)
```
4. Batch Normalization
\(\text{Forcing the activation statistics during the forward pass by re-normalizing them}\)
\(\\\)
Motivation:
\(\large\textbf{If the statistics of the activations are not controlled during training, a layer will have to adapt to the changes of the activations computed by the previous layers in addition to making changes to its own output to reduce the loss.}\)
\(\\\)
\(\text{During training batch normalization shifts and rescales according to the mean and variance estimated on the batch.}\)
\(\text{Let }x_b\in \mathbb{R}^D,\ b=1,...,B\text{ be the samples in the batch; the empirical mean and variance are:}\)
\[\begin{align} \hat{m}&=\frac{1}{B}\sum_{b=1}^Bx_b\\ \hat{v}&=\frac{1}{B}\sum_{b=1}^B(x_b-\hat{m})^2 \end{align} \]\(\text{Then do the normalization:}\)
\[\begin{align} z_b&=\frac{x_b-\hat{m}}{\sqrt{\hat{v}+\epsilon}}\\ y&=\gamma\odot z_b+\beta \end{align} \]\(\text{where }z_b,y,\gamma,\beta \in \mathbb{R}^D\)
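Written out directly, this training-time computation looks like the following sketch (the values of \(\gamma\), \(\beta\) and \(\epsilon\) are illustrative; nn.BatchNorm1d performs the same per-component operation, plus running-statistics bookkeeping):

```python
import torch

B, D = 1000, 3
x = torch.randn(B, D) * 5.0 + 2.0       # a batch with non-trivial statistics

m_hat = x.mean(dim=0)                   # empirical mean over the batch
v_hat = x.var(dim=0, unbiased=False)    # empirical (biased) variance

eps = 1e-5
gamma = torch.tensor([1.0, 2.0, 3.0])   # learned scale (illustrative values)
beta = torch.tensor([2.0, 4.0, 8.0])    # learned shift (illustrative values)

z = (x - m_hat) / torch.sqrt(v_hat + eps)
y = gamma * z + beta

print(y.mean(0))   # close to beta
print(y.std(0))    # close to gamma
```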
\(\\\)
\(\large\textbf{During inference: }\text{batch normalization shifts and rescales each component of the input }x\text{ independently, according to statistics estimated during training.}\)
```python
>>> import torch
>>> from torch import nn
>>> bn = nn.BatchNorm1d(3)
>>> with torch.no_grad():
...     bn.bias.copy_(torch.tensor([2., 4., 8.]))
...     bn.weight.copy_(torch.tensor([1., 2., 3.]))
...
Parameter containing:
tensor([2., 4., 8.], requires_grad=True)
Parameter containing:
tensor([1., 2., 3.], requires_grad=True)
>>> x = torch.randn(1000, 3)
>>> x = x * torch.tensor([2., 5., 10.]) + torch.tensor([-10., 25., 3.])
>>> x.mean(0)
tensor([-9.9669, 25.0213, 2.4361])
>>> x.std(0)
tensor([1.9063, 5.0764, 9.7474])
>>> y = bn(x)
>>> y.mean(0)
tensor([2.0000, 4.0000, 8.0000], grad_fn=<MeanBackward2>)
>>> y.std(0)
tensor([1.0005, 2.0010, 3.0015], grad_fn=<StdBackward1>)
```
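The inference behaviour can be checked by switching the same module to evaluation mode: it then normalizes with the running estimates accumulated during training rather than with the statistics of the current batch (a sketch continuing the session above):

```python
# Continuing with the bn module from the session above
# (a sketch; the exact values depend on the random training batch).
bn.eval()                        # inference mode: use the running estimates
print(bn.running_mean)           # tracked during the training-mode forward pass
print(bn.running_var)
with torch.no_grad():
    y_test = bn(torch.randn(5, 3))   # normalized with the stored statistics,
                                     # not with this tiny batch's own mean/var
```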
5. Layer Normalization
\(\text{Given a single sample }x\in\mathbb{R}^D,\text{ layer normalization normalizes the components of }x\text{:}\)
\[\begin{align} \mu&=\frac{1}{D}\sum_{d=1}^Dx_d\\ \sigma&=\sqrt{\frac{1}{D}\sum_{d=1}^D(x_d-\mu)^2}\\ \forall d,\ y_d&=\frac{x_d-\mu}{\sigma} \end{align} \]
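A minimal sketch of this per-sample normalization (nn.LayerNorm also applies a learned elementwise affine transform, left at its default \(\gamma=1,\beta=0\) here; the sizes are illustrative):

```python
import torch
from torch import nn

x = torch.randn(4, 10)                   # 4 samples, D = 10

# By hand, per sample:
mu = x.mean(dim=1, keepdim=True)
sigma = x.std(dim=1, keepdim=True, unbiased=False)
y_manual = (x - mu) / sigma

# Built-in module (learned affine parameters at their default init):
ln = nn.LayerNorm(10)
y_module = ln(x)

print((y_manual - y_module).abs().max())  # small, up to eps in the denominator
```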
6. ResNet

```python
import torch
from torch import nn
from torch.nn import functional as F

class ResBlock(nn.Module):
    def __init__(self, nb_channels, kernel_size):
        super().__init__()
        # Two conv/BN pairs; the padding keeps the spatial size unchanged
        self.conv1 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                               padding = (kernel_size - 1) // 2)
        self.bn1 = nn.BatchNorm2d(nb_channels)
        self.conv2 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                               padding = (kernel_size - 1) // 2)
        self.bn2 = nn.BatchNorm2d(nb_channels)

    def forward(self, x):
        y = self.bn1(self.conv1(x))
        y = F.relu(y)
        y = self.bn2(self.conv2(y))
        y += x              # identity pass-through (residual connection)
        y = F.relu(y)
        return y
```
\(\text{Stack ResBlocks to build a ResNet:}\)
```python
class ResNet(nn.Module):
    def __init__(self, nb_channels, kernel_size, nb_blocks):
        super().__init__()
        self.conv0 = nn.Conv2d(1, nb_channels, kernel_size = 1)
        self.resblocks = nn.Sequential(
            # A bit of fancy Python
            *(ResBlock(nb_channels, kernel_size) for _ in range(nb_blocks))
        )
        self.avg = nn.AvgPool2d(kernel_size = 28)
        self.fc = nn.Linear(nb_channels, 10)

    def forward(self, x):
        x = F.relu(self.conv0(x))
        x = self.resblocks(x)
        x = F.relu(self.avg(x))
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
```
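A quick usage sketch, assuming single-channel 28×28 inputs (which is what conv0 and the 28×28 average pooling imply, e.g. MNIST-sized images); the hyper-parameter values below are illustrative:

```python
model = ResNet(nb_channels = 16, kernel_size = 3, nb_blocks = 8)
x = torch.randn(64, 1, 28, 28)       # a batch of 64 single-channel 28x28 images
logits = model(x)
print(logits.shape)                  # torch.Size([64, 10])
```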
\(\textbf{Veit et al. (2016) interpret a residual network as an ensemble, which explains in part its stability}\)
\(e.g. \text{ with 3 blocks we have:}\)
\[\begin{align} x_1 &= x_0+f_1(x_0)\\ x_2 &= x_1+f_2(x_1)\\ x_3 &= x_2+f_3(x_2) \end{align} \]\(\text{Hence there are 4 paths:}\)
\[\begin{align} x_3 &= x_2+f_3(x_2)\\ &=x_1+f_2(x_1)+f_3(x_1+f_2(x_1))\\ &=x_0+f_1(x_0)+f_2(x_0+f_1(x_0))+f_3(x_0+f_1(x_0)+f_2(x_0+f_1(x_0))) \end{align} \]
- \(\textbf{(1) performance reduction correlates with the number of paths removed from the ensemble, not with the number of blocks removed}\)
- \(\textbf{(2) only gradients through shallow paths matter during training.}\)
\(\\\)
\(\large\textbf{Summarize:}\)
- \(\text{ReLU to prevent the gradient from vanishing during the backward pass}\)
- \(\text{Dropout to force a distributed representation}\)
- \(\text{Batch Normalization to dynamically maintain the statistics of activations}\)
- \(\text{Identity pass-through to keep a structured gradient and distribute representation}\)
- \(\text{Smart initialization to put the gradient in a good regime}\)