Deep Learning Week3 Notes

1. Perceptron

\(\text{If }\sum_iw_ix_i+b\ge 0\)

\[\begin{align} f(x)=1 \end{align} \]

\(\text{Otherwise, } f(x)=0\)

\(\large \textbf{Perceptron Algorithm:}\)

  • \(\text{Start with }w^0=0\)
  • $\text{While }\exists n_k \text{ s.t. } y_{n_k}(w^k\cdot x_{n_k})\leq 0,\text{ update }w^{k+1} = w^k+y_{n_k}x_{n_k}$

\(\text{Code}\):

```python
import torch

def train_perceptron(x, y, nb_epochs_max):
    w = torch.zeros(x.size(1))
    for e in range(nb_epochs_max):
        nb_changes = 0
        for i in range(x.size(0)):  # iterate over the samples
            if x[i].dot(w) * y[i] <= 0:
                w = w + y[i] * x[i]
                nb_changes = nb_changes + 1
        if nb_changes == 0:  # no mistakes left: converged
            break
    return w
```
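A quick sanity check of the algorithm on toy data (the function is repeated so the snippet runs on its own; the two Gaussian blobs and the seed are illustrative choices, not from the notes):

```python
import torch

def train_perceptron(x, y, nb_epochs_max):
    w = torch.zeros(x.size(1))
    for e in range(nb_epochs_max):
        nb_changes = 0
        for i in range(x.size(0)):  # iterate over the samples
            if x[i].dot(w) * y[i] <= 0:
                w = w + y[i] * x[i]
                nb_changes = nb_changes + 1
        if nb_changes == 0:
            break
    return w

# Two well-separated blobs, labels in {-1, +1}
torch.manual_seed(0)
n = 100
x = torch.cat([torch.randn(n, 2) + 5.0, torch.randn(n, 2) - 5.0])
y = torch.cat([torch.ones(n), -torch.ones(n)])

w = train_perceptron(x, y, nb_epochs_max=100)
# After convergence, every sample satisfies y_n (w . x_n) > 0
```

Since the data are linearly separable through the origin with a comfortable margin, the loop terminates well before `nb_epochs_max`.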

\(\text{We can get convergence under 2 assumptions:}\)

  • \(\text{The }x_n\text{ are in the sphere of radius }R:\)

\[\begin{align} \exists R>0,\forall n, ||x_n||\leq R \end{align} \]

  • \(\text{The two populations can be separated with a margin }\gamma\):

\[\begin{align} \exists w^*,||w^*||=1,\exists\gamma>0, \forall n,y_n(w^*\cdot x_n)\geq \gamma/2 \end{align} \]

\(\large\text{The larger the margin, the faster the algorithm classifies all the samples correctly: under these two assumptions, the perceptron makes at most }(2R/\gamma)^2\text{ updates.}\)

\(\textbf{SVMs }\text{achieve a large margin by minimizing:}\)

\[\begin{align} \mathcal{L}(w,b) = \lambda||w||^2+\frac{1}{N}\sum_n\max(0,1-y_n(w\cdot x_n+b)) \end{align} \]

\(\textbf{which is convex and has a global optimum.}\)

\(\large\text{Hinge Loss:}\)

\[\begin{align} \max(0,1-a) \end{align} \]
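Since the SVM objective above is convex, plain gradient descent finds its global minimum; autograd handles the hinge's subgradient at the kink. A minimal sketch (the toy data, \(\lambda\), step size, and iteration count are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
n = 100
x = torch.cat([torch.randn(n, 2) + 2.0, torch.randn(n, 2) - 2.0])
y = torch.cat([torch.ones(n), -torch.ones(n)])

w = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lam, eta = 0.01, 0.1

for _ in range(200):
    # lambda ||w||^2 + (1/N) sum_n max(0, 1 - y_n (w . x_n + b))
    loss = lam * w.pow(2).sum() + (1 - y * (x @ w + b)).clamp(min=0).mean()
    loss.backward()
    with torch.no_grad():
        w -= eta * w.grad
        b -= eta * b.grad
    w.grad.zero_()
    b.grad.zero_()
```

`clamp(min=0)` implements the hinge \(\max(0, 1-a)\) elementwise.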

2. Probabilistic view of Linear Classifier

\(\text{Consider the following class populations:}\)

\[\begin{align} \mu_{X|Y=y}(x) = \frac{1}{\sqrt{(2\pi)^D|\Sigma|}}\exp{[-\frac{1}{2}(x-m_y)\Sigma^{-1}(x-m_y)^T]} \end{align} \]

\(\text{where }\forall y\in\{0,1\} ,x\in\mathbb{R}^D\)

\(\text{We have:}\)

\[\begin{align} P(Y=1|X=x)&=\frac{\mu_{X|Y=1}(x)P(Y=1)}{\mu_X(x)}\\ &=\frac{\mu_{X|Y=1}(x)P(Y=1)}{\mu_{X|Y=0}(x)P(Y=0)+\mu_{X|Y=1}(x)P(Y=1)}\\ &=\frac{1}{1+\frac{\mu_{X|Y=0}(x)P(Y=0)}{\mu_{X|Y=1}(x)P(Y=1)}}\\ &=\sigma(\log{\frac{\mu_{X|Y=1}(x)}{\mu_{X|Y=0}(x)}}+\log{\frac{P(Y=1)}{P(Y=0)}}) \end{align} \]

\(\text{where}\)

\[\begin{align} \sigma(x) = \frac{1}{1+e^{-x}} \end{align} \]

\(\text{So with our Gaussians }\mu_{X|Y=y}\text{ of the same }\Sigma,\text{ we get:}\)

\[\begin{align} P(Y=1|X=x)&=\sigma(\log{\frac{\mu_{X|Y=1}(x)}{\mu_{X|Y=0}(x)}}+\log{\frac{P(Y=1)}{P(Y=0)}})\\ &=\sigma(\log{\mu_{X|Y=1}(x)}-\log{\mu_{X|Y=0}(x)}+Z)\\ &=\sigma({-\frac{1}{2}(x-m_1)\Sigma^{-1}(x-m_1)^T}+{\frac{1}{2}(x-m_0)\Sigma^{-1}(x-m_0)^T}+Z)\\ &=\sigma[(m_1-m_0)\Sigma^{-1}x^T+\frac{1}{2}(m_0\Sigma^{-1}m_0^T-m_1\Sigma^{-1}m_1^T)+Z]\\ &=\sigma(w\cdot x+b) \end{align} \]
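The identity can be checked numerically: with a shared covariance \(\Sigma\), the Bayes posterior equals \(\sigma(w\cdot x+b)\) for \(w=\Sigma^{-1}(m_1-m_0)\) and \(b=\frac{1}{2}(m_0\Sigma^{-1}m_0^T-m_1\Sigma^{-1}m_1^T)+Z\). All numeric values below are arbitrary illustrative choices:

```python
import torch

torch.manual_seed(0)
D = 3
m0, m1 = torch.randn(D), torch.randn(D)
A = torch.randn(D, D)
Sigma = A @ A.t() + torch.eye(D)          # a valid (SPD) covariance matrix
p1 = 0.3                                  # P(Y=1); P(Y=0) = 1 - p1
Z = torch.tensor(p1 / (1 - p1)).log()     # log prior ratio

Sinv = Sigma.inverse()
w = Sinv @ (m1 - m0)
b = 0.5 * (m0 @ Sinv @ m0 - m1 @ Sinv @ m1) + Z

def density(x, m):
    # Gaussian density mu_{X|Y=y}(x) with mean m and covariance Sigma
    d = x - m
    return torch.exp(-0.5 * d @ Sinv @ d) / torch.sqrt(
        (2 * torch.pi) ** D * torch.det(Sigma))

x = torch.randn(D)
posterior = density(x, m1) * p1 / (density(x, m0) * (1 - p1) + density(x, m1) * p1)
# posterior and sigma(w . x + b) agree up to floating-point error
```

Note that \(\Sigma^{-1}\) is symmetric, so \((m_1-m_0)\Sigma^{-1}x^T = w\cdot x\).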

3. Multi-layer Perceptrons

\(\textbf{Universal Approximation Theorem}\)
\(\text{Any continuous function on a compact domain can be approximated arbitrarily well by a one-hidden-layer MLP. A better approximation requires a larger hidden layer, which means that we can make the }\textbf{training error }\text{as low as we want. The theorem states nothing about the }\textbf{test error}.\)

4. Gradient Descent

\[\begin{align} w_{t+1} = w_t-\eta\nabla L(w_t) \end{align} \]

\(\text{Consider logistic regression loss:}\)

\[{L}(w,b) = -\sum_n\log{\sigma[y_n(w\cdot x_n+b)]} \]

\(\text{Therefore:}\)

\[\begin{align} \frac{\partial L}{\partial w_d} = -\sum_n x_{n,d}y_n\sigma[-y_n(w\cdot x_n+b)] \end{align} \]

\(\textbf{Note that:}\)

\[[\log{\sigma(x)}]' = \sigma(-x) \]

\(\textbf{Code}:\)

```python
def gradient(x, y, w, b):
    # per-sample factor y_n * sigma(-y_n (w . x_n + b))
    u = y * (-y * (x @ w + b)).sigmoid()
    # w's gradient: broadcast u across the feature dimension
    v = x * u.view(-1, 1)
    return -v.sum(0), -u.sum(0)  # (dL/dw, dL/db)
```
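Plugging this gradient into the update \(w_{t+1}=w_t-\eta\nabla L(w_t)\) gives a complete training loop (the function is repeated so the snippet runs on its own; the toy data, step size, and iteration count are illustrative assumptions):

```python
import torch

def gradient(x, y, w, b):
    u = y * (-y * (x @ w + b)).sigmoid()
    v = x * u.view(-1, 1)
    return -v.sum(0), -u.sum(0)  # (dL/dw, dL/db)

torch.manual_seed(0)
n = 100
x = torch.cat([torch.randn(n, 2) + 2.0, torch.randn(n, 2) - 2.0])
y = torch.cat([torch.ones(n), -torch.ones(n)])

w, b = torch.zeros(2), torch.zeros(1)
eta = 0.01
for _ in range(100):
    dw, db = gradient(x, y, w, b)
    w, b = w - eta * dw, b - eta * db
```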

5. Forward and Backpropagation

\(\text{Forward:}\)

\[\begin{cases} s^{(l)} &= w^{(l)}x^{(l-1)}+b^{(l)}\\ x^{(l)} &= \sigma[s^{(l)}] \end{cases} \]

\(\text{Backward:}\)

\[\begin{align} \frac{\partial L}{\partial s^{(l)}}&= \frac{\partial L}{\partial x^{(l)}}\odot\sigma'[s^{(l)}]\\ \frac{\partial L}{\partial x^{(l)}} &= [w^{(l+1)}]^T\frac{\partial L}{\partial s^{(l+1)}}\\ \frac{\partial L}{\partial w^{(l)}}&=\frac{\partial L}{\partial s^{(l)}}[x^{(l-1)}]^T\\ \frac{\partial L}{\partial b^{(l)}} &=\frac{\partial L}{\partial s^{(l)}} \end{align} \]
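The recursions above can be sketched directly for a two-layer MLP. This is a minimal illustration, assuming \(\sigma=\tanh\) (so \(\sigma'(s)=1-\tanh^2(s)\)), a squared-error loss, and column-vector activations; all shapes and values are arbitrary choices, not from the notes:

```python
import torch

torch.manual_seed(0)
D, H = 4, 5
w1, b1 = torch.randn(H, D) * 0.1, torch.zeros(H, 1)
w2, b2 = torch.randn(1, H) * 0.1, torch.zeros(1, 1)

x0 = torch.randn(D, 1)            # input, as a column vector
t = torch.tensor([[1.0]])         # target

# Forward pass: s(l) = w(l) x(l-1) + b(l),  x(l) = sigma(s(l))
s1 = w1 @ x0 + b1; x1 = torch.tanh(s1)
s2 = w2 @ x1 + b2; x2 = torch.tanh(s2)

# Backward pass for L = 0.5 * ||x2 - t||^2
dL_dx2 = x2 - t
dL_ds2 = dL_dx2 * (1 - x2 ** 2)   # dL/ds(l) = dL/dx(l) . sigma'(s(l))
dL_dx1 = w2.t() @ dL_ds2          # dL/dx(l) = w(l+1)^T dL/ds(l+1)
dL_ds1 = dL_dx1 * (1 - x1 ** 2)
dL_dw2 = dL_ds2 @ x1.t()          # dL/dw(l) = dL/ds(l) [x(l-1)]^T
dL_dw1 = dL_ds1 @ x0.t()
dL_db2, dL_db1 = dL_ds2, dL_ds1   # dL/db(l) = dL/ds(l)
```

The hand-computed gradients match what autograd produces for the same network, which is a convenient way to check each line of the backward pass.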