To understand the multilayer perceptron (MLP), we must first understand the perceptron.

## Perceptron

The perceptron is a supervised learning algorithm for binary classification. The model is

$\mathbf{\hat y}=h(\mathbf w\cdot \mathbf x)$

where $h$ is the Heaviside step function

$h(x)=\begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \ge 0 \end{cases}$

A **heuristic rule** is used to update the model on each new data point $(\mathbf x, y)$:

$\mathbf w \leftarrow \mathbf w - r\cdot\mathbf x\cdot (\hat y-y)$

where $r$ is the learning rate.

## Logistic Regression

Logistic regression is similar to the perceptron. The model is

$\mathbf{\hat y}=\sigma(\mathbf w\cdot \mathbf x)$

where $\sigma$ is the sigmoid function

$\sigma(x)={\frac {1}{1+e^{-x}}}$

**Gradient descent** is used to update the model on a batch of new data $(\mathbf X, \mathbf y)$:

$\mathbf w\leftarrow \mathbf w - \frac{r}{n}\cdot\mathbf X^\mathrm T \cdot (\mathbf{\hat y}-\mathbf y)$

where $n$ is the number of data points in the batch.

## Comparison

The most obvious difference between the perceptron and logistic regression is their **activation function**: the perceptron uses a step function whereas logistic regression uses the sigmoid. Surprisingly, this also affects how they can learn. Their update formulas look similar, but the perceptron can only learn one data point at a time and cannot be vectorised the way logistic regression can:

- If we vectorise and **average**: the update vanishes when errors are sparse.
- If we vectorise and **sum**: the update is magnified when errors are dense.

## XOR Problem

Both the perceptron and logistic regression (each a single **neuron**) fail when the data is not linearly separable. A classic example is the XOR problem:

| Input 1 | Input 2 | XOR |
| ------- | ------- | --- |
| 0       | 0       | 0   |
| 0       | 1       | 1   |
| 1       | 0       | 1   |
| 1       | 1       | 0   |

## Multilayer Perceptron

The multilayer perceptron (MLP) solves this by stacking layers of neurons. A solution using perceptrons needs two layers, a middle layer and an output layer:

1. Middle layer with 2 neurons.
   Each neuron takes in both inputs.
   1. $\hat y_{1,1} = h(x_1+x_2-1)$
   2. $\hat y_{1,2} = h(x_1+x_2-2)$
2. Output layer with 1 neuron. It takes the outputs of the middle layer as its inputs.
   1. $\hat y_2 = h(\hat y_{1,1}-\hat y_{1,2}-1)$

| Input  | $\hat y_{1,1}$ (OR) | $\hat y_{1,2}$ (AND) | $\hat y_2$ (XOR) |
| ------ | ------------------- | -------------------- | ---------------- |
| (0, 0) | 0                   | 0                    | 0                |
| (0, 1) | 1                   | 0                    | 1                |
| (1, 0) | 1                   | 0                    | 1                |
| (1, 1) | 1                   | 1                    | 0                |

Layering the neurons breaks linearity because of the non-linear activation function $h$ or $\sigma$. If the activations were linear, the layers would simply collapse into yet another linear function.

The example solution above uses perceptrons as neurons, but only for demonstration purposes. In practice, the perceptron cannot be used as a neuron because its step function has zero gradient almost everywhere, so it cannot be trained with backpropagation.

### Backpropagation

Backpropagation is an efficient algorithm for calculating the gradients needed for gradient descent. The gradient for a node with weight $\mathbf w$ is

$\dfrac{\partial L}{\partial \mathbf w}$

where $L$ is the loss. Intuitively, this value tells us how much the weight $\mathbf w$ affects the loss, and we nudge the weight in the direction that minimises the loss.
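As a quick sanity check before deriving the gradients, the hand-crafted two-layer perceptron network above can be evaluated directly. This is only an illustrative sketch; the names `h` and `xor_net` are not from the text:

```python
def h(x):
    # Heaviside step function: 0 if x < 0, else 1
    return 0 if x < 0 else 1

def xor_net(x1, x2):
    # Middle layer: an OR neuron and an AND neuron
    y11 = h(x1 + x2 - 1)  # OR
    y12 = h(x1 + x2 - 2)  # AND
    # Output layer fires only when OR is on and AND is off
    return h(y11 - y12 - 1)

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```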
The weight $\mathbf w$ affects the node's output $\mathbf{\hat y}$, so we can break the gradient down using the chain rule

$\dfrac{\partial L}{\partial \mathbf w}=\dfrac{\partial \mathbf{\hat y}}{\partial \mathbf w}\cdot \dfrac{\partial L}{\partial \mathbf{\hat y}}$

If $\mathbf{\hat y}$ is the final output, then with the squared-error loss

$\begin{aligned}L&=\frac{1}{2}(\mathbf{\hat y} - \mathbf y)^2 \\ \dfrac{\partial L}{\partial \mathbf{\hat y}}&=\mathbf{\hat y} - \mathbf y\end{aligned}$

Otherwise, $\mathbf{\hat y}$ affects every output $\mathbf{\hat y}_i$ of the next layer, so

$\dfrac{\partial L}{\partial \mathbf{\hat y}}=\sum_i\dfrac{\partial \mathbf{\hat y}_i}{\partial \mathbf{\hat y}}\cdot \dfrac{\partial L}{\partial \mathbf{\hat y}_i}$

This gives us a recursive algorithm to calculate the gradients.
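Putting the pieces together, here is a minimal sketch of a 2-2-1 MLP of logistic neurons trained on XOR with backpropagation and gradient descent. All names (`MLP`, `train_step`, `loss`, etc.) are illustrative, not from the text, and the initial weights, learning rate, and epoch count are arbitrary choices:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# XOR dataset: ((x1, x2), target)
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

class MLP:
    """A 2-2-1 network of logistic neurons; each weight list ends with a bias."""

    def __init__(self, rng):
        # Middle layer: 2 neurons, each with 2 weights and a bias
        self.w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
        # Output layer: 1 neuron with 2 weights and a bias
        self.w2 = [rng.uniform(-1, 1) for _ in range(3)]

    def forward(self, x):
        h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in self.w1]
        y = sigmoid(self.w2[0] * h[0] + self.w2[1] * h[1] + self.w2[2])
        return h, y

    def train_step(self, x, t, r):
        h, y = self.forward(x)
        # Output node: dL/dy = y - t, times the sigmoid derivative y(1 - y)
        d_out = (y - t) * y * (1 - y)
        # Middle nodes: chain rule through the output weights,
        # i.e. dL/dh_i = dL/d_out * w2_i, times h_i(1 - h_i)
        d_hid = [d_out * self.w2[i] * h[i] * (1 - h[i]) for i in range(2)]
        # Gradient descent updates with learning rate r
        for i in range(2):
            self.w2[i] -= r * d_out * h[i]
            for j in range(2):
                self.w1[i][j] -= r * d_hid[i] * x[j]
            self.w1[i][2] -= r * d_hid[i]  # middle-layer bias
        self.w2[2] -= r * d_out  # output bias

def loss(net):
    # Total squared-error loss L = sum of (1/2)(y_hat - y)^2 over the dataset
    return sum(0.5 * (net.forward(x)[1] - t) ** 2 for x, t in DATA)

net = MLP(random.Random(0))
before = loss(net)
for _ in range(5000):
    for x, t in DATA:
        net.train_step(x, t, r=0.5)
after = loss(net)
print(f"loss: {before:.4f} -> {after:.4f}")
```

Whether the network learns XOR exactly depends on the random initialisation (gradient descent can get stuck in a local minimum), but the loss should decrease as training proceeds.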