Machine Learning

Machine learning is NOT magical! - Data exploration, model selection and evaluation

Overview

Teaching: 10 min
Exercises: 0 min
Questions
  • What are good practices of building machine learning models?

  • How to objectively evaluate model performance?

  • What are some tools for model diagnosis?

Objectives
  • Understand the typical procedures of building machine learning models.

  • Understand the usage of training, validation and test sets.

  • Understand the basic concept of underfitting and overfitting (bias-variance trade-off).

Machine Learning is NOT Magical!

In an ideal world, machine learning, as a sub-category of artificial intelligence (AI), can automatically learn information from data. In reality, machine learning is still far from being that magical.

As we have seen in previous sections, there exist numerous machine learning algorithms. How should we choose the best algorithm for a specific problem? How do we evaluate model performance? In this section, we will talk about the typical procedure and good practices of building machine learning models.

Data Exploration: Always Try to Understand Your Data First!

Before building a machine learning model, always try to understand your data better. Various statistical and visualization tools can be used for exploratory data analysis (EDA), including:

  • Summary statistics (e.g., mean, standard deviation, quartiles)

  • Histograms and density plots of individual variables

  • Scatterplots and pairwise scatterplot matrices

  • Correlation analysis between variables

A good understanding of the basic data behavior helps choose suitable models and is key to model diagnosis.
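As a minimal sketch of such exploration (the data here is randomly generated for illustration), pandas can produce summary statistics and a correlation matrix in a couple of lines:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 100 samples with three numeric variables
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
    "y": rng.normal(size=100),
})

# Summary statistics: count, mean, std, min, quartiles, max per column
print(df.describe())

# Pairwise correlations -- the numeric counterpart of a
# scatterplot matrix like the one shown below
print(df.corr())
```

For a visual pairwise view, `seaborn.pairplot(df)` draws the scatterplot matrix directly.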


Figure: You need a solid understanding of your data before being effective with machine learning (image source: https://en.wikipedia.org/wiki/Exploratory_data_analysis).


Figure: Example pairwise scatterplot matrix of multiple variables (image source: https://blogs.sas.com/content/iml/2011/08/26/visualizing-correlations-between-variables-in-sas.html).

Machine Learning Procedure: Training, Validation and Test Sets

Suppose we are facing a supervised learning problem, and have a set of observed input and outcome data. As a typical machine learning procedure, we divide the entire dataset into three subsets:

  • Training set: used to fit the model parameters.

  • Validation set: used to tune model settings (e.g., hyperparameters) and diagnose problems such as overfitting.

  • Test set: held out until the very end, to give an objective estimate of how the final model performs on unseen data.

A common rule of thumb for dividing the three subsets: training : validation : test = 6:2:2, 7:1.5:1.5, or 8:1:1.
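A 6:2:2 split can be sketched with scikit-learn's `train_test_split` applied twice (the toy arrays here are placeholders for real data): first hold out the test set, then carve the validation set out of the remainder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # toy inputs
y = np.arange(100)                  # toy outcomes

# Hold out 20% as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 25% of the remaining 80% = 20% of the total, giving 60:20:20
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```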

Underfitting and overfitting

The validation set provides the opportunity to diagnose and improve the model setup. Quite commonly, when we initially train a model, we might observe that:

  • the model performs well on the training set but much worse on the validation set, or

  • the model performs poorly on both the training and validation sets.

To understand why these happen, we need to introduce the concept of model underfitting and overfitting. Typically:

  • An underfit model is too simple to capture the underlying pattern in the data; it performs poorly on both the training and validation sets (high bias).

  • An overfit model is too complex and fits noise in the training data; it performs very well on the training set but poorly on the validation set (high variance).
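The pattern can be seen numerically with a small sketch (synthetic sine data, polynomial regression of increasing degree): as model complexity grows, training error keeps shrinking, while validation error eventually rises again.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: noisy sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=30)

# Simple even/odd split into training and validation sets
X_train, X_val = X[::2], X[1::2]
y_train, y_val = y[::2], y[1::2]

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    val_err = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree={degree:2d}  train MSE={train_err:.4f}  val MSE={val_err:.4f}")
```

Degree 1 underfits (both errors high), a moderate degree fits well, and degree 15 drives the training error toward zero while the validation error blows up, the signature of overfitting.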


Figure: Illustration of underfitting and overfitting - regression example (image source: https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76)


Figure: Illustration of underfitting and overfitting - classification example (image source: https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76)

Key Points

  • Always explore and understand your data before building a machine learning model.

  • Divide the data into training, validation, and test sets; only the test set gives an objective estimate of model performance.

  • Underfitting and overfitting (the bias-variance trade-off) can be diagnosed by comparing training and validation performance.