Deep Learning Week 6 Notes
1. Benefits of depth
\(\text{Consider ReLU MLPs with a single input/output. There exists a network }f\text{ with }D^*\text{ layers and }2D^*\text{ internal units such that, for any network }g\text{ with }D\text{ layers of sizes }\{W^{(1)},...,W^{(D)}\}\text{, since the number of linear pieces of }g\text{ satisfies }k(g)\leq 2^D \prod_{d=1}^D W^{(d)}\text{:}\)
\[\begin{align} ||f-g||_1\geq 1-\frac{2^D}{2^{D^*}}\prod_{d=1}^DW^{(d)} \end{align} \]\(\text{In particular, with }g\text{ a single-hidden-layer network:}\)
\[||f-g||_1\geq 1-2\frac{W^{(1)}}{2^{D^*}} \]\(\textbf{To approximate }f\textbf{ properly, the width }W^{(1)}\textbf{ of }g\textbf{'s hidden layer has to increase exponentially with }f\textbf{'s depth }D^*.\)
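Rearranging the single-hidden-layer bound gives a concrete sense of the required width (illustrative numbers):
\[\begin{align} ||f-g||_1\leq \frac{1}{2} \;\Rightarrow\; 1-2\frac{W^{(1)}}{2^{D^*}}\leq \frac{1}{2} \;\Rightarrow\; W^{(1)}\geq 2^{D^*-2} \end{align} \]so for \(D^*=20\), approximating \(f\) within \(\frac{1}{2}\) in \(L_1\) already requires \(W^{(1)}\geq 2^{18}\approx 2.6\cdot 10^5\) hidden units.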
2. Rectifiers
\(\text{The derivative of }\tanh\text{ has an exponential tail on both sides and collapses to 0 very quickly, while ReLU passes the gradient of positive activations through unchanged, and positive activations typically make up about half of them.}\)
Leaky-ReLU
\[\begin{align} \max(ax,x) \end{align} \]where \(0\leq a< 1\)
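A minimal sketch of this in PyTorch; the slope value 0.01 below is just an illustrative choice (it happens to be nn.LeakyReLU's default negative_slope):

```python
import torch
from torch import nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

# Built-in module: computes max(a*x, x) with a = negative_slope
leaky = nn.LeakyReLU(negative_slope = 0.01)

# Equivalent manual formulation of max(a*x, x)
a = 0.01
manual = torch.max(a * x, x)

print(leaky(x))   # tensor([-0.0200, -0.0050,  0.0000,  1.5000])
print(manual)     # same values
```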
3. Dropout
Suppose each unit is dropped with probability \(p\). In expectation:
\[\begin{align} \mathbb{E}(X) = (1-p)\cdot X+p\cdot 0 \end{align} \]Therefore, to keep the expectation unchanged, it suffices to multiply the activations by \(\frac{1}{1-p}\) during training and to leave the network unchanged at test time. This method is called \(\text{Inverted Dropout}\).
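A minimal hand-written sketch of this inverted-dropout scaling (the function name and values are illustrative; PyTorch's nn.Dropout, shown next, does the same internally):

```python
import torch

def inverted_dropout(x, p=0.5, train=True):
    # At test time the input passes through unchanged.
    if not train:
        return x
    # Keep each component with probability 1 - p ...
    mask = (torch.rand_like(x) > p).float()
    # ... and rescale by 1 / (1 - p) so that the expectation equals x.
    return mask * x / (1 - p)

x = torch.ones(3, 5)
print(inverted_dropout(x, p=0.75))  # surviving entries equal 1 / (1 - 0.75) = 4
```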
```python
>>> import torch
>>> from torch import nn
>>> x = torch.full((3, 5), 1.0).requires_grad_()
>>> x
tensor([[ 1., 1., 1., 1., 1.],
        [ 1., 1., 1., 1., 1.],
        [ 1., 1., 1., 1., 1.]])
>>> dropout = nn.Dropout(p = 0.75)
>>> y = dropout(x)
>>> y
tensor([[ 0., 0., 4., 0., 4.],
        [ 0., 4., 4., 4., 0.],
        [ 0., 0., 4., 0., 0.]])
>>> l = y.norm(2, 1).sum()
>>> l.backward()
>>> x.grad
tensor([[ 0.0000, 0.0000, 2.8284, 0.0000, 2.8284],
        [ 0.0000, 2.3094, 2.3094, 2.3094, 0.0000],
        [ 0.0000, 0.0000, 4.0000, 0.0000, 0.0000]])
```
\(\text{Simply add dropout layers:}\)
```python
model = nn.Sequential(
    nn.Linear(10, 100), nn.ReLU(), nn.Dropout(),
    nn.Linear(100, 50), nn.ReLU(), nn.Dropout(),
    nn.Linear(50, 2),
)
```
4. Batch Normalization
\(\text{Forcing the activation statistics during the forward pass by re-normalizing them}\)
\(\\\)
Motivation:
\(\large\textbf{If the statistics of the activations are not controlled during training, a layer will have to adapt to the changes of the activations computed by the previous layers in addition to making changes to its own output to reduce the loss.}\)
\(\\\)
\(\text{During training batch normalization shifts and rescales according to the mean and variance estimated on the batch.}\)
\(\text{Let }x_b\in \mathbb{R}^D,\ b=1,...,B\text{ be the samples in the batch; the empirical mean and variance are:}\)
\[\begin{align} \hat{m}&=\frac{1}{B}\sum_{b=1}^Bx_b\\ \hat{v}&=\frac{1}{B}\sum_{b=1}^B(x_b-\hat{m})^2 \end{align} \]\(\text{Then do the normalization:}\)
\[\begin{align} z_b&=\frac{x_b-\hat{m}}{\sqrt{\hat{v}+\epsilon}}\\ y&=\gamma\odot z_b+\beta \end{align} \]\(\text{where }z_b,y,\gamma,\beta \in \mathbb{R}^D\)
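Written out directly, this training-time computation looks like the following sketch (the values of \(\gamma\), \(\beta\) and \(\epsilon\) are illustrative; nn.BatchNorm1d performs the same per-component operation, plus running-statistics bookkeeping):

```python
import torch

B, D = 1000, 3
x = torch.randn(B, D) * 5.0 + 2.0       # a batch with non-trivial statistics

m_hat = x.mean(dim=0)                   # empirical mean over the batch
v_hat = x.var(dim=0, unbiased=False)    # empirical (biased) variance

eps = 1e-5
gamma = torch.tensor([1.0, 2.0, 3.0])   # learned scale (illustrative values)
beta = torch.tensor([2.0, 4.0, 8.0])    # learned shift (illustrative values)

z = (x - m_hat) / torch.sqrt(v_hat + eps)
y = gamma * z + beta

print(y.mean(0))   # close to beta
print(y.std(0))    # close to gamma
```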
\(\\\)
\(\large\textbf{During inference: }\text{batch normalization shifts and rescales each component of the input }x\text{ independently, according to statistics estimated during training.}\)
```python
>>> import torch
>>> from torch import nn
>>> bn = nn.BatchNorm1d(3)
>>> with torch.no_grad():
...     bn.bias.copy_(torch.tensor([2., 4., 8.]))
...     bn.weight.copy_(torch.tensor([1., 2., 3.]))
...
Parameter containing:
tensor([2., 4., 8.], requires_grad=True)
Parameter containing:
tensor([1., 2., 3.], requires_grad=True)
>>> x = torch.randn(1000, 3)
>>> x = x * torch.tensor([2., 5., 10.]) + torch.tensor([-10., 25., 3.])
>>> x.mean(0)
tensor([-9.9669, 25.0213, 2.4361])
>>> x.std(0)
tensor([1.9063, 5.0764, 9.7474])
>>> y = bn(x)
>>> y.mean(0)
tensor([2.0000, 4.0000, 8.0000], grad_fn=<MeanBackward2>)
>>> y.std(0)
tensor([1.0005, 2.0010, 3.0015], grad_fn=<StdBackward1>)
```
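The inference behaviour can be checked by switching the same module to evaluation mode: it then normalizes with the running estimates accumulated during training rather than with the statistics of the current batch (a sketch continuing the session above):

```python
# Continuing with the bn module from the session above
# (a sketch; the exact values depend on the random training batch).
bn.eval()                        # inference mode: use the running estimates
print(bn.running_mean)           # tracked during the training-mode forward pass
print(bn.running_var)
with torch.no_grad():
    y_test = bn(torch.randn(5, 3))   # normalized with the stored statistics,
                                     # not with this tiny batch's own mean/var
```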
5. Layer Normalization
\(\text{Given a single sample }x\in\mathbb{R}^D,\text{ layer normalization normalizes the components of }x\text{:}\)
\[\begin{align} \mu&=\frac{1}{D}\sum_{d=1}^Dx_d\\ \sigma&=\sqrt{\frac{1}{D}\sum_{d=1}^D(x_d-\mu)^2}\\ \forall d,\ y_d&=\frac{x_d-\mu}{\sigma} \end{align} \]
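A minimal sketch of this per-sample normalization (nn.LayerNorm also applies a learned elementwise affine transform, left at its default \(\gamma=1,\beta=0\) here; the sizes are illustrative):

```python
import torch
from torch import nn

x = torch.randn(4, 10)                   # 4 samples, D = 10

# By hand, per sample:
mu = x.mean(dim=1, keepdim=True)
sigma = x.std(dim=1, keepdim=True, unbiased=False)
y_manual = (x - mu) / sigma

# Built-in module (learned affine parameters at their default init):
ln = nn.LayerNorm(10)
y_module = ln(x)

print((y_manual - y_module).abs().max())  # small, up to eps in the denominator
```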
6. ResNet

```python
import torch
from torch import nn
from torch.nn import functional as F

class ResBlock(nn.Module):
    def __init__(self, nb_channels, kernel_size):
        super().__init__()
        # Two conv/BN pairs; the padding keeps the spatial size unchanged
        self.conv1 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                               padding = (kernel_size - 1) // 2)
        self.bn1 = nn.BatchNorm2d(nb_channels)
        self.conv2 = nn.Conv2d(nb_channels, nb_channels, kernel_size,
                               padding = (kernel_size - 1) // 2)
        self.bn2 = nn.BatchNorm2d(nb_channels)

    def forward(self, x):
        y = self.bn1(self.conv1(x))
        y = F.relu(y)
        y = self.bn2(self.conv2(y))
        y += x              # identity pass-through (residual connection)
        y = F.relu(y)
        return y
```
\(\text{Stack ResBlocks to build a ResNet:}\)
```python
class ResNet(nn.Module):
    def __init__(self, nb_channels, kernel_size, nb_blocks):
        super().__init__()
        self.conv0 = nn.Conv2d(1, nb_channels, kernel_size = 1)
        self.resblocks = nn.Sequential(
            # A bit of fancy Python
            *(ResBlock(nb_channels, kernel_size) for _ in range(nb_blocks))
        )
        self.avg = nn.AvgPool2d(kernel_size = 28)
        self.fc = nn.Linear(nb_channels, 10)

    def forward(self, x):
        x = F.relu(self.conv0(x))
        x = self.resblocks(x)
        x = F.relu(self.avg(x))
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x
```
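A quick usage sketch, assuming single-channel 28×28 inputs (which is what conv0 and the 28×28 average pooling imply, e.g. MNIST-sized images); the hyper-parameter values below are illustrative:

```python
model = ResNet(nb_channels = 16, kernel_size = 3, nb_blocks = 8)
x = torch.randn(64, 1, 28, 28)       # a batch of 64 single-channel 28x28 images
logits = model(x)
print(logits.shape)                  # torch.Size([64, 10])
```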
\(\textbf{Veit et al. (2016) interpret a residual network as an ensemble, which explains in part its stability}\)
\(e.g. \text{ with 3 blocks we have:}\)
\[\begin{align} x_1 &= x_0+f_1(x_0)\\ x_2 &= x_1+f_2(x_1)\\ x_3 &= x_2+f_3(x_2) \end{align} \]\(\text{Hence there are 4 paths:}\)
\[\begin{align} x_3 &= x_2+f_3(x_2)\\ &=x_1+f_2(x_1)+f_3(x_1+f_2(x_1))\\ &=x_0+f_1(x_0)+f_2(x_0+f_1(x_0))+f_3(x_0+f_1(x_0)+f_2(x_0+f_1(x_0))) \end{align} \]
- \(\textbf{(1) performance reduction correlates with the number of paths removed from the ensemble, not with the number of blocks removed}\)
- \(\textbf{(2) only gradients through shallow paths matter during training.}\)
\(\\\)
\(\large\textbf{Summarize:}\)
- \(\text{ReLU to prevent the gradient from vanishing during the backward pass}\)
- \(\text{Dropout to force a distributed representation}\)
- \(\text{Batch Normalization to dynamically maintain the statistics of activations}\)
- \(\text{Identity pass-through to keep a structured gradient and distribute representation}\)
- \(\text{Smart initialization to put the gradient in a good regime}\)