Intro to Machine Learning

About this article: you'll read about the concepts and applications of Machine Learning. I will present the building blocks for successfully applying Machine Learning to real-world problem solving.

Here you'll learn about the main concepts of ML, basic terminology, and setting up your Python environment for data analysis.

Let's get started.


There is no doubt that the most abundant resource of the 21st century is data. Data is everywhere: in your fingerprints, credit card transactions, purchase history, web browser history... Machine Learning involves a wide range of techniques for extracting knowledge from the absurd amount of data we have stored. For instance, earlier today (April 30th 2020) I was analyzing a 6 TB database of text data. Can you imagine the size of this dataset?

Machine Learning is a subfield of Artificial Intelligence made up of self-learning algorithms that derive knowledge from data in order to make predictions. Have you ever wondered how a Tesla can drive itself with a high level of autonomy? Well, no programmer wrote code to "teach" the car every aspect of driving; machine learning did that. The driving model was created by feeding driving data into learning algorithms.

Machine Learning Types

Let's take a look at Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

In a nutshell, here are the main differences between the three types:

Supervised Learning works with labeled data and predicts outcomes for new or future data based on what it learned from the dataset.

Unsupervised Learning works with unlabeled data and finds hidden structures in it.

Reinforcement Learning is a decision process based on a reward system. It learns from the outcome (good or bad) of a series of actions.

Now, let's dive deep into each one.

Make Predictions with Supervised Learning

The goal of Supervised Learning is to learn a model from labeled training data that allows us to make predictions about unseen or future data. "Supervised" refers to a set of training examples (data inputs) where the desired outputs (labels) are already known.

Here's a typical Supervised Learning workflow:

The labeled training data is passed to an ML algorithm, which fits a predictive model that can make predictions on new, unlabeled inputs.

For instance, you could download all your emails and label each one as spam or not spam. Each message would then carry a label stating whether it is spam (a discrete class label), and a Supervised algorithm would classify new messages as spam or not spam according to what it has learned from your labeled messages. This kind of task is called classification. The other subcategory of supervised learning is regression, where the outcome signal is a continuous value.


The goal of classification is to predict the categorical class labels of new instances based on past observations. Class labels are discrete, unordered values. The email spam example is a typical binary classification task, where the ML algorithm learns a set of rules in order to distinguish between two possible classes: spam and non-spam messages. The two classes of a binary classification task are often called the positive and the negative class. In a simple illustration of such a task, the dataset is two-dimensional, which means that each example has two values associated with it: x1 and x2.

Multiclass classification also exists. Handwritten character recognition is a typical example: the letters of the alphabet represent the different unordered categories, or class labels, that we want to predict.
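To make the classification idea more concrete, here is a minimal sketch of a classifier for two-dimensional examples (x1, x2). It uses a nearest-centroid rule, which is just one simple technique I picked for illustration. The data points, the labels, and the function names fit_centroids and predict are all made up for this example.

```python
from math import dist

# Toy, made-up training data: each example is ((x1, x2), label).
training_examples = [
    ((1.0, 1.2), "negative"),
    ((1.5, 0.8), "negative"),
    ((0.9, 1.6), "negative"),
    ((4.0, 4.2), "positive"),
    ((4.5, 3.9), "positive"),
    ((3.8, 4.6), "positive"),
]


def fit_centroids(examples):
    """Training step: compute the mean (x1, x2) point of each class."""
    sums = {}
    for (x1, x2), label in examples:
        total_x1, total_x2, count = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (total_x1 + x1, total_x2 + x2, count + 1)
    return {label: (s1 / n, s2 / n) for label, (s1, s2, n) in sums.items()}


def predict(centroids, point):
    """Prediction step: pick the class whose centroid is closest to the point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


centroids = fit_centroids(training_examples)
print(predict(centroids, (1.1, 1.0)))  # a new, unlabeled input
print(predict(centroids, (4.2, 4.0)))
```

The same two functions also work with more than two labels, which is how a multiclass problem would look in this toy setting.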

Regression for predicting continuous outcomes

The second type of Supervised Learning is the prediction of continuous outcomes, also referred to as regression analysis.

In regression analysis, we are given a number of predictor (explanatory) variables and a continuous response variable (outcome), and we try to find a relationship between them that allows us to predict an outcome.

Predictor variables are commonly called features.

Response variables are usually referred to as target variables.

Let's say we want to predict the future performance of a selected stock. If there's a relationship between the transaction volume and the price of this particular stock, we could use historical data as training data to learn a model that uses the volume to predict the future stock price.
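As a rough sketch of the stock example, the code below fits a straight line (simple least-squares linear regression) that maps volume to price. The numbers are invented for illustration and are not real market data, and the function name fit_line is my own. A real stock would rarely follow such a clean relationship.

```python
# Made-up (volume, price) pairs for illustration only; not real market data.
volumes = [1.0, 2.0, 3.0, 4.0, 5.0]       # feature x, e.g. millions of shares
prices = [10.5, 12.0, 13.5, 15.0, 16.5]   # target y


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(volumes, prices)

# Predict the continuous outcome (price) for a volume the model has not seen.
predicted_price = slope * 6.0 + intercept
print(slope, intercept, predicted_price)
```

Here the volume is the feature and the price is the target variable, matching the terminology above.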

Solving problems with Reinforcement Learning

The goal in reinforcement learning is to develop a system (agent) that improves its performance based on interactions with the environment. The information about the current state of the environment typically includes a reward signal. We can think of reinforcement learning as a field related to supervised learning. However, the feedback in reinforcement learning is not the correct label or value, but a measure of how good the action was, as given by a reward function.

An agent can use reinforcement learning to learn a series of actions that maximizes this reward via an exploratory trial and error method.
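The trial and error idea can be sketched with a very small toy problem. The code below is an epsilon-greedy agent on a multi-armed bandit, which is a simplification I chose for illustration: there is only one state, so the agent learns which single action pays best rather than a full series of actions. The reward probabilities, the function name run_bandit, and all parameter values are assumptions of this sketch.

```python
import random


def run_bandit(reward_probabilities, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-looking action, sometimes explore."""
    rng = random.Random(seed)
    n_actions = len(reward_probabilities)
    counts = [0] * n_actions             # how often each action was tried
    value_estimates = [0.0] * n_actions  # running average reward per action
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: random action
        else:
            # exploit: action with the highest estimated reward so far
            action = max(range(n_actions), key=lambda a: value_estimates[a])
        # The environment answers with a reward signal (1 = good, 0 = bad).
        reward = 1 if rng.random() < reward_probabilities[action] else 0
        counts[action] += 1
        # Incremental update of the average reward observed for this action.
        value_estimates[action] += (reward - value_estimates[action]) / counts[action]
        total_reward += reward
    return value_estimates, counts, total_reward


estimates, counts, total_reward = run_bandit([0.2, 0.5, 0.8])
print(estimates, counts, total_reward)
```

Notice that the agent is never told which action is correct. It only sees rewards, which is the difference from supervised learning described above.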

An interesting reinforcement learning project is training a model in the game Grand Theft Auto 5 to make its vehicles drive themselves. The project was originally called Deepdrive, and it seems the team has since joined a bigger company.

Here's how they started:

Here's where they are right now:

Discover hidden structures with Unsupervised Learning

In Unsupervised Learning, we are dealing with data of unknown structure. Using these techniques, we are able to explore the structure of our data to extract valuable information without the guidance of a known outcome variable, as in supervised learning, or a reward function, as in reinforcement learning.

Clustering is an exploratory technique that organizes a pile of data into meaningful subgroups (clusters) without having prior knowledge of the dataset itself. Each cluster defines a group of objects that share a certain degree of similarity but are more dissimilar to objects in other clusters. Clustering is sometimes called unsupervised classification.

Clustering is a great approach for structuring information and deriving meaningful relationships from data. It allows marketers to discover customer groups according to their interests in order to segment their audience and develop custom-made campaigns. In fact, as of May 2020 I'm developing software for the marketing world using this logic.
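Here is a minimal sketch of clustering, using k-means as the example algorithm (my choice for illustration; many other clustering methods exist). The customer data is invented, and the naive initialization with the first k points is an assumption that keeps the code short. Real implementations usually initialize more carefully.

```python
from math import dist


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly moving centroids to cluster means."""
    # Naive initialization (an assumption of this sketch): the first k points.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    assignments = [
        min(range(k), key=lambda i: dist(p, centroids[i])) for p in points
    ]
    return centroids, assignments


# Made-up customers described by (visits per month, average spend); no labels.
customers = [(1.0, 2.0), (9.0, 8.0), (1.5, 1.8), (8.5, 9.0), (1.2, 2.2), (9.2, 8.4)]
centroids, assignments = k_means(customers, k=2)
print(assignments)
```

No labels are passed in anywhere. The groups come only from how similar the points are to each other.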

Data Compression: Dimensionality Reduction

A subfield of unsupervised learning is dimensionality reduction.

High-dimensional data is data where each observation comes with a large number of measurements, which can present a challenge for memory storage and computational performance when processing it.

Unsupervised dimensionality reduction is a technique used in feature preprocessing (a very important step in the machine learning pipeline) to remove noise from data, which can otherwise degrade the predictive performance of certain algorithms, and to compress the data onto a smaller dimensional subspace while retaining most of the relevant information.

Think of it like MP3. Remember when you illegally downloaded songs from P2P clients back in the 2000s? Well, thanks to the Moving Picture Experts Group (MPEG), audio files can be compressed by discarding only the least relevant data. That is how songs could be squeezed into 4-megabyte files and easily downloaded over a dial-up internet connection. Your ears would barely notice the information lost from the original CD track.

Sometimes, dimensionality reduction can be useful for visualizing data. A high-dimensional feature set can be projected onto one-, two-, or three-dimensional feature spaces in order to visualize it in 2D or 3D scatterplots or histograms.
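To illustrate what projecting data onto a lower dimensional subspace means, here is a small sketch that compresses two-dimensional points into one dimension. It uses the idea behind Principal Component Analysis (PCA), a technique this article has not introduced, so treat that choice as an assumption of the sketch. It only handles the 2D case, the sample data is invented, and it assumes the points are not all identical.

```python
from math import atan2, cos, sin


def project_onto_first_component(points):
    """Project 2D points onto the direction along which they vary the most."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    var_x = sum(x * x for x, _ in centered) / n
    var_y = sum(y * y for _, y in centered) / n
    cov_xy = sum(x * y for x, y in centered) / n
    # Angle of the main axis of the 2x2 covariance matrix (closed form for 2D).
    angle = 0.5 * atan2(2 * cov_xy, var_x - var_y)
    direction = (cos(angle), sin(angle))
    # Each point is now described by one number instead of two.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    # Fraction of the total variance kept after compression.
    retained = (sum(p * p for p in projected) / n) / (var_x + var_y)
    return projected, retained


# Made-up measurements where the two features are strongly related.
measurements = [(2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8), (6.0, 12.1)]
projected, retained = project_onto_first_component(measurements)
print(projected, retained)
```

When the two features are strongly related, as in this toy data, a single number per observation keeps almost all of the variation.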

3D scatterplot

Machine Learning Roadmap

Now that we are aware of the three types of learning, let's dive into a machine learning roadmap which can be applied to build most ML projects.

Basic Terminology

Training Example: A row in a table representing the dataset. Synonymous with observation, record, instance, or sample (in most contexts, sample refers to a collection of training examples).

Training: Model fitting. For parametric models, similar to parameter estimation.

Feature, abbrev. x: A column in a data table. Synonymous with predictor, variable, input, attribute, or covariate.

Target, abbrev. y: Synonymous with outcome, output, response variable, dependent variable, or (class) label.

Loss function: Often used synonymously with cost function. Sometimes the loss function is also called an error function.

Machine Learning Pipeline

At this point, we will define the typical roadmap applied to most machine learning projects. Take a moment to look at the diagram below:

Preprocessing: get the data in the right shape

Machine Learning algorithms are developed to work optimally when the data is formatted in the right way. The preprocessing of the data will put the dataset in the right format, which is necessary to feed it to the algorithms.

We can think of raw data as unprocessed data, which may be stored in a SQL or NoSQL database, an Excel spreadsheet, or even a .csv file.

By looking at the raw data, you can look for relevant features. In a stock market dataset, relevant features could be volume_of_transactions and equity_price. Many ML algorithms require the features to be on the same scale for optimal performance. That can be achieved by rescaling the features to the range [0, 1] or by standardizing them to have zero mean and unit variance (I will cover that in another article, when diving deep into preprocessing).
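As a quick preview, the two scaling approaches can be sketched in a few lines of plain Python. The sample values are invented, and the function names min_max_scale and standardize are my own. Both functions assume the values are not all identical, otherwise they would divide by zero.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


# Made-up feature column for illustration.
volume_of_transactions = [1200.0, 4500.0, 3100.0, 9800.0, 560.0]

scaled = min_max_scale(volume_of_transactions)
standardized = standardize(volume_of_transactions)
print(scaled)
print(standardized)
```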

When selecting relevant features, take care to avoid redundancy (for instance, choosing two or more highly correlated features). Dimensionality reduction is useful for compressing the features onto a lower dimensional subspace.
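One simple way to spot redundant features is to compute the Pearson correlation between two feature columns. The sketch below does that by hand. The data is made up, and the second column is deliberately just the first one in different units, so the correlation should come out very close to 1.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long feature columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov_xy / sqrt(var_x * var_y)


# Made-up columns: the second one carries no new information.
volume_of_transactions = [1200.0, 4500.0, 3100.0, 9800.0, 560.0]
volume_in_thousands = [v / 1000 for v in volume_of_transactions]

correlation = pearson_correlation(volume_of_transactions, volume_in_thousands)
print(correlation)
```

A value close to 1 or -1 suggests that one of the two features could probably be dropped. Where exactly to draw the line is a judgment call.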

To determine whether our algorithm will generalize well to new data, we can also divide the dataset into two sets: a training set and a test set. The training set is used to train and optimize the model, and the test set is used to evaluate the final model.
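A train/test split can be sketched as a shuffle followed by a cut. The 70/30 proportion, the fixed seed, and the function name train_test_split are assumptions of this example, and the dataset is a toy one.

```python
import random


def train_test_split(examples, test_fraction=0.3, seed=42):
    """Shuffle a copy of the examples and cut it into training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy dataset of (feature, target) pairs.
dataset = [(x, 2 * x + 1) for x in range(10)]
train_set, test_set = train_test_split(dataset)
print(len(train_set), len(test_set))
```

The important part is that the test examples are kept away from training, so they behave like genuinely new data when the final model is evaluated.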

Selecting a Predictive model

Each classification algorithm has its biases, and no single classification model enjoys superiority if we make no assumptions about the task we have to perform. It's useful to compare a few different algorithms in order to train and select the best-performing ML model.

Classification Accuracy is a metric to measure algorithm performance. It is the proportion of correctly classified instances.
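In code, accuracy is just a count of matches divided by the number of examples. The labels below are invented for illustration, and the function name classification_accuracy is my own.

```python
def classification_accuracy(true_labels, predicted_labels):
    """Proportion of examples whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up labels for five emails.
true_labels = ["spam", "not spam", "spam", "not spam", "not spam"]
predicted_labels = ["spam", "not spam", "not spam", "not spam", "spam"]

accuracy = classification_accuracy(true_labels, predicted_labels)
print(accuracy)  # 3 of the 5 predictions match
```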

Cross-Validation can also be useful. We divide a dataset into training and validation subsets in order to estimate the generalization performance of the model.
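One common variant is k-fold cross-validation, where every example is used for validation exactly once. The sketch below only generates the index splits. The function name k_fold_splits is hypothetical, and in practice you would usually shuffle the data before splitting.

```python
def k_fold_splits(n_examples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


for training, validation in k_fold_splits(10, 3):
    print("train:", training, "validate:", validation)
```

You would train a model on each training part, score it on the matching validation part, and average the scores to estimate how well the model generalizes.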

Congratulations if you've reached this point. We've walked together from the definitions of Machine Learning to an introduction to preprocessing and predictive models.

In the next articles and videos, we will dive into ML through hands-on projects.

Stay hungry, stay foolish.