Syllabus
- Linear Algebra basics
- Numpy Essentials
- Linear Regression
- Cost function
- Gradient Descent
- Multivariable Linear Regression
- Logistic Regression
- Theano Basics
- Neural Networks
- Introduction to Keras
- Backpropagation
Linear Algebra Revision
To study advanced machine learning it helps to know a few concepts from linear algebra such as matrix calculus and eigendecomposition, but for this course we only need the matrix operations from high school: addition, multiplication and inversion. Let's start with how we represent a matrix. A matrix is typically two-dimensional, with shape [m,n] (m rows, n columns). A matrix with just one row or one column is called a vector.
To practice these operations, instead of the pen-and-paper approach, let us run them using numpy. Numpy is a Python library that lets us create and operate on large matrices efficiently. Open up a terminal to get started.
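A minimal sketch of such a session (assuming numpy is installed; A, I and B are the matrices used in the worked examples below):

```python
import numpy as np

# the matrices used in the examples that follow
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
I = np.eye(3)       # 3x3 identity matrix
B = np.array([[6, 5, 7], [2, 8, 8], [0, 3, 9]])

print(A + I)        # matrix addition
print(A - I)        # matrix subtraction
print(A.dot(I))     # multiplying by the identity gives back A
print(A.dot(B))     # matrix multiplication A.B
```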
Addition
\[A + I = C\\ \begin{bmatrix}1 & 2 & 3\\4 & 5 & 6 \\ 7 & 8 & 9\end{bmatrix} + \begin{bmatrix}1 & 0 & 0\\0 & 1 & 0 \\ 0 & 0 & 1\end{bmatrix} = \begin{bmatrix}2 & 2 & 3\\4 & 6 & 6 \\ 7 & 8 & 10\end{bmatrix}\]

Subtraction
\[A - I = D\\ \begin{bmatrix}1 & 2 & 3\\4 & 5 & 6 \\ 7 & 8 & 9\end{bmatrix} - \begin{bmatrix}1 & 0 & 0\\0 & 1 & 0 \\ 0 & 0 & 1\end{bmatrix} = \begin{bmatrix}0 & 2 & 3\\4 & 4 & 6 \\ 7 & 8 & 8\end{bmatrix}\]

Multiplication
\[A \cdot I = I \cdot A = A\\ \begin{bmatrix}1 & 2 & 3\\4 & 5 & 6 \\ 7 & 8 & 9\end{bmatrix} \cdot \begin{bmatrix}1 & 0 & 0\\0 & 1 & 0 \\ 0 & 0 & 1\end{bmatrix} = \begin{bmatrix}1 & 2 & 3\\4 & 5 & 6 \\ 7 & 8 & 9\end{bmatrix}\\\\ A \cdot B = E\\ \begin{bmatrix}1 & 2 & 3\\4 & 5 & 6 \\ 7 & 8 & 9\end{bmatrix} \cdot \begin{bmatrix}6 & 5 & 7\\2 & 8 & 8 \\ 0 & 3 & 9\end{bmatrix} = \begin{bmatrix}10 & 30 & 50\\34 & 78 & 122 \\ 58 & 126 & 194\end{bmatrix}\\\]

More Operations
Numpy Essentials
Task | Function | Snippet |
---|---|---|
Convert a list to numpy array | np.array() | np.array([1,2,3,4]) |
Create a null vector of size 10 | np.zeros() | np.zeros(10) |
Create a vector with values ranging from 10 to 49 | np.arange() | np.arange(10,50) |
Create a 3x3 matrix with values ranging from 0 to 8 | np.reshape() | np.arange(9).reshape(3,3) |
Create a 3x3 identity matrix | np.eye() | np.eye(3) |
Create a 3x2x2 array with random values | np.random.random() | np.random.random([3,2,2]) |
Create a 4x4 array (x) with random integers from 0-99 | np.random.randint() | x = np.random.randint(0,100,[4,4]) |
Find the index of maximum of x | np.argmax(), np.unravel_index | np.unravel_index(x.argmax(),x.shape) |
Find the index of minimum of x | np.argmin(), np.unravel_index | np.unravel_index(x.argmin(),x.shape) |
Find mean of x | np.mean() | x.mean() |
Find sum of all elements in x | np.sum() | x.sum() |
Find the datatype of x | np.dtype | x.dtype |
Set datatype of x as float32 | np.astype() | x.astype('float32') |
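For instance, the last few rows of the table can be strung together into a short session (a sketch; the exact numbers will differ each run since x is random):

```python
import numpy as np

x = np.random.randint(0, 100, [4, 4])          # 4x4 array of random integers from 0-99
print(np.unravel_index(x.argmax(), x.shape))   # (row, column) index of the maximum
print(np.unravel_index(x.argmin(), x.shape))   # (row, column) index of the minimum
print(x.mean(), x.sum())                       # mean and sum of all elements
x = x.astype('float32')                        # cast to float32
print(x.dtype)                                 # float32
```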
Linear Regression
Regression is a statistical method for modeling the relationship between a dependent variable and a set of independent variables. Linear regression assumes a linear model. What is a linear model? A model that is linear in its parameters, but can still use the independent variables in non-linear forms like \(x^{2}\) and \(\log x\).
The model is defined by a hypothesis, which is a function of the variables and the parameters: \(H = a + bx\)
In the equation above, [a,b] is the set of parameters and x is the independent variable. Our objective in regression is to find the parameters [a,b]. Usually we are given a table of values of the variable (x) and the outputs (y), and asked to build a model that approximates the relationship between x and y.
Download the whole table from here.
x | y |
---|---|
6.1101 | 17.592 |
5.5277 | 9.1302 |
8.5186 | 13.662 |
7.0032 | 11.854 |
5.8598 | 6.8233 |
…. | …. |
5.8707 | 7.2029 |
5.3054 | 1.9869 |
8.2934 | 0.14454 |
13.394 | 9.0551 |
First we define a hypothesis, \(h=a + bx\). Now our objective is to find the best values of a and b that most closely fit the relationship between x and y; in other words, find a and b such that \(h \approx y\).
Cost Function
The cost (also called loss or error) function measures the difference between the hypothesis and the actual outputs, i.e. how wrong are the current values of a and b? For linear regression, we define the cost function as:
\[L = (1/2m)\sum_{i=1}^{m}(h-y)^2\]

By measuring the error in the hypothesis we can adjust the parameters a and b to decrease the error. This becomes the core loop of regression: we adjust the parameters, check the error, adjust them again, and so on; eventually we get the best set of parameters and hence the best-fitting model. We call this iterative process learning. But how exactly do we adjust the parameters?
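Before answering that, here is a sketch of computing this cost in numpy for a fixed guess of a and b (x and y are the first few rows of the table above):

```python
import numpy as np

# first few (x, y) pairs from the table above
x = np.array([6.1101, 5.5277, 8.5186, 7.0032, 5.8598])
y = np.array([17.592, 9.1302, 13.662, 11.854, 6.8233])

def cost(a, b, x, y):
    """L = (1/2m) * sum((h - y)^2) with h = a + b*x"""
    m = len(x)
    h = a + b * x
    return np.sum((h - y) ** 2) / (2 * m)

print(cost(0.0, 1.0, x, y))   # error for an initial guess a=0, b=1
print(cost(0.0, 2.0, x, y))   # a different guess gives a different error
```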
Gradient Descent
Gradient Descent is an optimization technique that improves the parameters of the model step by step. In each iteration, a small step is taken in the direction of the local minimum of the cost function. The size of each step is controlled by the learning rate. If the learning rate is too small, it takes a long time for the model to converge (to fit the data well); if it is too big, the model might not converge at all. The value of the learning rate (\(\alpha\)) is thus crucial to the learning process.
\[a := a - (1/m) \alpha \nabla_{a}\\ b := b - (1/m) \alpha \nabla_{b}\\ \nabla_{a} = \sum_{i=1}^{m} (h_{i} - y_{i})\\ \nabla_{b} = \sum_{i=1}^{m} (h_{i} - y_{i})x_{i}\\\]
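A minimal gradient descent loop for this model might look like the following sketch (x and y are the arrays from the cost example above; the learning rate and number of iterations are arbitrary choices):

```python
import numpy as np

x = np.array([6.1101, 5.5277, 8.5186, 7.0032, 5.8598])
y = np.array([17.592, 9.1302, 13.662, 11.854, 6.8233])

a, b = 0.0, 0.0          # initial parameters
alpha = 0.01             # learning rate (arbitrary choice)
m = len(x)

for step in range(1000):
    h = a + b * x                    # hypothesis
    grad_a = np.sum(h - y)           # gradient w.r.t. a
    grad_b = np.sum((h - y) * x)     # gradient w.r.t. b
    a -= (alpha / m) * grad_a        # take a small step for each parameter
    b -= (alpha / m) * grad_b

print(a, b)   # learned parameters
```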
Multivariable Linear Regression

What we have seen so far is Simple Linear Regression, which models the relationship between two scalar variables. In multivariable linear regression, we deal with a vector of inputs.
Hypothesis, \(H = \theta^{T} X\\\) where \(X, \theta\) are vectors
X is the vector [\(x_0, x_1, x_2,...x_n\)] with \(x_0 = 1\) (so that \(\theta_0\) acts as the intercept), and \(\theta\) is the vector of all the parameters [\(\theta_0, \theta_1,...\theta_n\)].
Cost function
\[L = (1/2m)\sum (H-Y)^2\]
Gradient Descent

\[\frac{\partial J(\theta)}{\partial \theta} = (1/m)\, X \cdot (H - Y)\\ \theta := \theta - \alpha \frac{\partial J(\theta)}{\partial \theta}\]
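In vector form the whole loop collapses to a few lines of numpy. A sketch with made-up toy data, assuming the data matrix X carries a leading column of ones so that \(\theta_0\) plays the role of the intercept:

```python
import numpy as np

# toy data: 5 examples, 2 features, plus a leading column of ones for theta_0
X = np.array([[1.0, 6.1101, 2.0],
              [1.0, 5.5277, 1.0],
              [1.0, 8.5186, 3.0],
              [1.0, 7.0032, 2.5],
              [1.0, 5.8598, 1.5]])
y = np.array([17.592, 9.1302, 13.662, 11.854, 6.8233])

theta = np.zeros(X.shape[1])   # one parameter per column of X
alpha, m = 0.01, len(y)

for step in range(2000):
    h = X.dot(theta)               # hypothesis for every example at once
    grad = X.T.dot(h - y)          # gradient of the cost w.r.t. theta
    theta -= (alpha / m) * grad    # vectorized gradient descent update

print(theta)
```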
Logistic Regression
Logistic Regression is a regression technique where the output variable is categorical (note that the output variable was continuous in the previous cases). Categorical means that the output variable can only take a limited set of values. Take the case of a binary logistic regression model, where the output is either zero or one [0,1]. Similarly, we can build a model whose output is one of [0,1,2,3,4,5,6,7,8,9]. Such a model can be used for classification of images of characters [A-Z] or digits [0-9]. A logistic regression model can also be called a classifier.
The hypothesis is defined as
\[h = g(\theta^{T}X)\]where \(g\) is a sigmoid function, which squashes the input \(\theta^{T}X\) into the range (0,1).
Sigmoid, \(g(z) = 1/(1 + e^{-z})\).
Cost Function, \(J(\theta) = (1/m) \sum_{i=1}^{m} [-y^{(i)} log( h_{\theta}(x^{(i)})) - (1-y^{(i)}) log( 1 - h_{\theta}(x^{(i)})) ]\)
In vector form, \(J(\theta) = -(1/m) \sum [ Y \cdot log(H) + (1 - Y) \cdot log(1-H) ]\)
Gradients, \(\frac{\partial J(\theta)}{\partial \theta} = (1/m) \sum [ (H - Y) \cdot X ]\)
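These pieces translate almost line for line into numpy. A sketch with a made-up, linearly separable binary dataset (the data, learning rate and iteration count are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    # squashes its input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# made-up binary classification data; first column of ones for theta_0
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0, 0, 1, 1])

theta = np.zeros(X.shape[1])
alpha, m = 0.1, len(y)

for step in range(1000):
    H = sigmoid(X.dot(theta))                                           # hypothesis
    J = -(1.0 / m) * np.sum(y * np.log(H) + (1 - y) * np.log(1 - H))    # cost
    grad = (1.0 / m) * X.T.dot(H - y)                                   # gradients
    theta -= alpha * grad

print(theta, J)
```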
Tabulation of Models
Model | Hypothesis | Cost Function | Gradients |
---|---|---|---|
Simple Linear Regression | \(H = a + bx\) | \(L = (1/2m)\sum_{i=1}^{m}(h-y)^2\) | \(\nabla_{a} = \sum_{i=1}^{m} (h_{i} - y_{i})\\ \nabla_{b} = \sum_{i=1}^{m} (h_{i} - y_{i})x_{i}\) |
Multivariable Linear Regression | \(H = \theta^{T} X\) | \(J(\theta) = (1/2m)\sum (H-Y)^2\) | \(\frac{\partial J(\theta)}{\partial \theta} = (1/m) (X \cdot (H - Y))\) |
Logistic Regression | \(H = g(\theta^{T}X)\) | \(J(\theta) = -(1/m) \sum [ Y \cdot log(H) + (1 - Y) \cdot log(1-H) ]\) | \(\frac{\partial J(\theta)}{\partial \theta} = (1/m) ( X \cdot (H - Y) )\) |
Neural Networks
Neural Networks, or Artificial Neural Networks, are multilayered architectures of interconnected neurons (computational units, to be accurate) that map inputs to outputs. They are function approximators, known for their representational power. Deep neural networks are data hungry: the more data, the better the model. These networks learn the mapping between input and output by adjusting the numerical weights that define the connections between neurons. There are a great many categories of neural networks. The kind we will learn is the Feed Forward Neural Network, or Multi-Layer Perceptron (MLP).
In an MLP, there are 3 types of layers: the input layer, the hidden layer(s) and the output layer.
The representation power of a neural network depends on the number of hidden layers (its depth); the name "deep" learning comes from this depth. Notice the connections in the figure: every node in one layer is connected to every node in the next or previous layer, but there are no connections between nodes in the same layer. Layers like these are called fully connected layers. Also notice that the arrows point in the forward direction (from input to output). This means the flow of information happens from the input to the output, hence the name Feed Forward Neural Network. The network can be seen as a collection of non-linear functions, parameterized by the weights (the connections between neurons).
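To make that "collection of non-linear functions" concrete, here is a sketch of a forward pass through a tiny MLP with one hidden layer. The sizes and the random weights are placeholders, purely to illustrate the flow of information; nothing is being learned here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.random.random(4)            # input layer: 4 features

W1 = np.random.random([4, 8])      # weights: input layer -> hidden layer (fully connected)
b1 = np.zeros(8)
W2 = np.random.random([8, 3])      # weights: hidden layer -> output layer (fully connected)
b2 = np.zeros(3)

hidden = sigmoid(x.dot(W1) + b1)          # non-linear transformation of the input
output = sigmoid(hidden.dot(W2) + b2)     # the network's output
print(output)
```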
Representation learning is another important property of neural networks that should be understood. In a deep network, the raw input is transformed into a useful form/feature/representation through a series of non-linear transformations. This representation is useful in the sense that it is convenient for the network to use the input in this form to learn the best input-output mapping. Moreover, the inputs, which are physical phenomena like sound, images, etc., are complex, noisy and highly redundant. While learning to map inputs to outputs, the network automatically learns the best representation from the complex raw inputs, the one that helps it get the job done. The job here could be classification, regression, etc.
Introduction to Keras
Keras is a minimalist, highly modular neural networks library, written in Python and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
Install keras via pip.
Let us learn Keras by creating an MLP (Multi-Layer Perceptron) to classify digits from the MNIST dataset.
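A sketch of such a model, assuming Keras (2.x-style API) and a backend are installed via `pip install keras`; the layer sizes, optimizer and number of epochs are illustrative choices:

```python
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# load MNIST and flatten each 28x28 image into a 784-dimensional vector
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(-1, 784).astype('float32') / 255.0
X_test = X_test.reshape(-1, 784).astype('float32') / 255.0
Y_train = to_categorical(y_train, 10)   # one-hot encode the digit labels
Y_test = to_categorical(y_test, 10)

# a small MLP: one hidden layer, softmax output over the 10 digits
model = Sequential()
model.add(Dense(128, input_dim=784, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=5, batch_size=128, validation_data=(X_test, Y_test))
```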