Training Models 101: Understanding What It Is and Why It’s Important
A key concept in machine learning (ML) is the idea that computer programs can learn to do things they aren’t explicitly programmed to do. We’re familiar with software code and programming a machine with specific instructions to achieve tasks. ML is more akin to how we (humans) learn, which involves interacting with data and then making decisions based on the findings contained within that data.
The process of building an ML model is called training. Training involves an ML algorithm looking at a lot of data to find patterns. The model can then apply the patterns found in the data to solving a business problem.
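As a minimal sketch of what "finding patterns in data" means, here is a toy model that learns the slope and intercept of a line from example points by least squares; the data values are purely illustrative:

```python
# "Training" in miniature: fit y = w*x + b to example data so the
# learned pattern (w, b) can predict values it was never
# explicitly programmed with.

def train_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares slope and intercept, learned from the data.
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]   # roughly y = 2x, with noise
w, b = train_line(xs, ys)
print(round(w, 1))  # slope close to 2.0
```

No one told the program "the answer is 2"; the pattern was recovered from the data, which is the essence of training.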
It’s easy to train a model, but it’s not necessarily easy to do it properly so that the model can accomplish its intended objectives.
Start by Properly Defining the Problem and Support It with Data
It’s important to have a clearly defined business use case in mind that has been framed in terms of an ML problem. Redefining a business problem as an ML experiment is the hardest and most important task that ML developers do. There is a misconception that most time is spent on developing and training the model; however, it’s defining the business objective and framing it as an ML problem that takes the most time. This involves defining the objective of the model, choosing the right performance metric, and deciding the threshold for what good performance means. It also means ensuring that appropriate data is available to support addressing the business problem.
It’s not enough just to have access to a lot of historical data. The data needs to contain useful information and be processed to be usable within an ML context.
There are two primary categories of data:
- Structured data is easily readable by machines and is usually categorized and labeled. It’s the type of data we might find in a database or Excel spreadsheet, such as a table with multiple rows and columns.
- Unstructured data is, as the name implies, more free-form and not neatly categorized. Until recently it was problematic to work with this kind of data (it typically needed to be converted into a structured format first), but recent advances in deep learning are pushing the boundaries of what is possible.
Part of the workflow in an ML project often involves sourcing and cleaning data to ensure it is suitable for use in a model. Having available data that aligns with a business use case and ensuring it’s formatted appropriately is a critical step in training an ML model.
Feature Engineering and Feature Selection
A key role for ML developers is preparing the data that will be used to train the algorithms, and an important part of this process is feature engineering and feature selection. Features are columns in an Excel spreadsheet or SQL table. Raw data may contain an enormous number of features that can be combined in endless ways, which may not always yield the most useful results. An experienced ML developer will be able to select the features and datasets most likely to assist in solving the particular business problem and create new features that provide more signal.
Let’s consider a simple example of feature engineering. Say we have data that contains the weight and height of a person. However, to better address our problem, it might be more useful to understand Body Mass Index (BMI), which is computed as weight divided by height squared. The process of identifying that BMI may be useful and creating a column to represent it is feature engineering.
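The BMI example can be sketched in a few lines; the field names and values here are illustrative, not from any real dataset:

```python
# Feature engineering sketch: derive a BMI column from the raw
# weight (kg) and height (m) columns already in the data.

patients = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 90.0, "height_m": 1.80},
]

for p in patients:
    # BMI = weight / height^2 -- a new feature built from two raw ones
    p["bmi"] = p["weight_kg"] / p["height_m"] ** 2

print(round(patients[0]["bmi"], 1))  # 22.9
```

The model never sees "create BMI" as an instruction; a human decided this derived column carries more signal than weight and height alone.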
In the end it may not be optimal to use all the generated features, and it is typically better to select a relevant subset of them. It’s not practical to test every combination of features in a large dataset (there may be more combinations than there are atoms in the universe). Instead, a special class of algorithms, broadly categorized as heuristic search, is used to find the subset of features deemed most useful in solving the business problem.
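One common heuristic is greedy forward selection: rather than testing every subset, repeatedly add whichever remaining feature most improves a score. The sketch below uses a made-up per-feature "usefulness" score purely for illustration; in practice the score would be a model's validation performance:

```python
# Greedy forward selection -- one simple heuristic search over
# feature subsets, avoiding the exponential blow-up of trying
# every combination.

def forward_select(features, score, k):
    """Greedily pick up to k features, adding the best one each step."""
    chosen = []
    remaining = list(features)
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda f: score(chosen + [f]))
        if score(chosen + [best]) <= score(chosen):
            break  # no remaining feature improves the score
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy score: an invented usefulness value per feature.
useful = {"age": 3.0, "bmi": 2.0, "zip": 0.1}
score = lambda subset: sum(useful[f] for f in subset)

print(forward_select(useful, score, 2))  # ['age', 'bmi']
```

For n features there are 2^n subsets, but this greedy pass evaluates only on the order of n*k candidates, which is what makes heuristic search practical.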
Selecting and engineering features is an iterative process. That is to say, developing ML solutions involves repeating and refining the analysis to make improvements. Human decisions help set the conditions for a good learning experience by the machine.
Preparing the Data and Training the Model
Before ML takes place, human experts must prepare the data and select which algorithms make the most sense given the problem and available data. When it comes to running the algorithms, a series of hyperparameters, which establish the conditions for learning, is then tried for each of them.
Think of ML as gardening. Hyperparameters might then be things such as the soil conditions, the level of sunlight exposure, and how much water is given to the plant. These gardening hyperparameters indirectly shape the environment and impact how the plant will grow.
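Trying a series of hyperparameters often means a grid search: evaluate every combination from a small grid and keep the best. In this sketch, `train_and_score` is a stand-in for real model training, and the grid values are invented for illustration:

```python
# Hyperparameter grid search sketch: try every combination from a
# small grid and keep the setting with the best validation score.
from itertools import product

def train_and_score(learning_rate, depth):
    # Stand-in for training a model and measuring validation
    # performance; here we pretend moderate values work best.
    return -abs(learning_rate - 0.1) - abs(depth - 3)

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 3, 5]}
best = max(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=lambda params: train_and_score(**params),
)
print(best)  # {'learning_rate': 0.1, 'depth': 3}
```

Like watering schedules in the gardening analogy, none of these values are learned by the model itself; humans choose the candidates and the search picks among them.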
In addition, human experts make decisions about what data to use as the inputs to train the model. For example, there may be available data that’s less useful or less pertinent to the business use case, or conversely, there may be additional datasets that would prove useful for the project and need to be added to the mix.
Beyond the ‘Black Box’
ML models are sometimes referred to as a “black box” because the decisions made by the model are not always easily explainable. However, there are trade-offs that can be made in how we approach the design of ML models that can increase explainability.
For example, decision tree models, which are based on “if-then” type decisions that are easily understood by humans, might actually be considered “white box” models. In addition, there is work being done in the area of explainable AI (XAI) to develop models that are able to explain the decisions of other models.
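A hand-sized example shows why decision trees read as "white box": every prediction is a chain of if-then rules a human can follow. The function name, thresholds, and loan scenario here are entirely hypothetical:

```python
# A "white box" sketch: a tiny decision tree written out as the
# if-then rules it amounts to -- each decision path is readable.

def approve_loan(income, debt_ratio):
    if income >= 50_000:
        if debt_ratio < 0.4:
            return "approve"   # high income, low debt
        return "review"        # high income, high debt
    return "decline"           # low income

print(approve_loan(income=60_000, debt_ratio=0.3))  # approve
```

A deep neural network making the same call would give no comparable trace, which is the explainability trade-off the text describes.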
Each model will have its strengths and weaknesses, and sometimes trying to attain greater explainability may result in less accuracy or reliability.
How Well Does It Work?
How well does the model work? This is a key question as businesses will need to be able to trust a model before implementing it.
Performance is estimated through various evaluation metrics. One common measure is accuracy, a metric that shows the proportion of correct predictions a model makes. Accuracy sounds like a simple concept; however, it can be one of the most misleading metrics.
For example, let’s say a model has 95 per cent accuracy. That sounds very good on the surface; however, the relative strength or weakness of accuracy really depends on the underlying data distribution. If a model is supposed to distinguish healthy vs. sick patients and the underlying dataset was gathered from 95 per cent healthy patients and only five per cent sick patients, a model can achieve a 95 per cent accuracy rate simply by saying everyone is healthy.
In contrast, a model may have only a 90 per cent accuracy rate but make sure all the sick patients are accurately diagnosed as being sick. The trade-off in this second model is that some of the healthy patients are incorrectly diagnosed as being sick.
However, if our objective is to make 100 per cent sure we identify all the sick patients, then the second model is a better fit for our problem. Thus, evaluating a model isn’t simply a matter of choosing the higher overall accuracy percentage; it really comes down to determining how well the model fits the business use case.
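The healthy/sick example can be checked with a few lines of arithmetic. A model that predicts "healthy" for everyone scores 95 per cent accuracy on a 95/5 split while catching zero sick patients (i.e., zero recall on the sick class):

```python
# Metrics sketch: accuracy vs. recall on an imbalanced dataset.

def accuracy(truth, pred):
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def recall_sick(truth, pred):
    # Of the truly sick patients, what fraction did we catch?
    caught = [p for t, p in zip(truth, pred) if t == "sick"]
    return caught.count("sick") / len(caught) if caught else 0.0

truth = ["healthy"] * 95 + ["sick"] * 5
always_healthy = ["healthy"] * 100   # the lazy 95%-accurate model

print(accuracy(truth, always_healthy))     # 0.95
print(recall_sick(truth, always_healthy))  # 0.0
```

High accuracy, zero usefulness for the stated objective; this is exactly why the metric must be chosen to match the business problem.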
Beware of Overfitting
A model that performs well on training data but fails on held-out test data is said to be overfitting. There are numerous techniques that can be used to avoid this scenario and make the model generalizable, which means it will be applicable to future data. In general terms, the model is built in a way that prefers simpler models over complex ones, even though complex models will have better performance on the training dataset. The data used to train the model should also be robust and representative of the whole population to which the model will be applied.
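An extreme case makes overfitting concrete: a "model" that simply memorizes its training points is perfect on training data and useless on anything unseen. The lookup-table model and the data below are deliberately contrived:

```python
# Overfitting sketch: memorization scores perfectly on training
# data but learns no pattern, so it fails on held-out test data.

data = [(x, 2 * x) for x in range(20)]   # true rule: y = 2x
train, test = data[:15], data[15:]       # hold out the last 5 points

memorized = dict(train)                  # pure memorization, no rule

def predict(x):
    return memorized.get(x, 0)           # unseen inputs: no idea

train_err = sum(abs(predict(x) - y) for x, y in train)
test_err = sum(abs(predict(x) - y) for x, y in test)
print(train_err, test_err)  # 0 on training, large on test
```

This is why performance is always measured on data the model never saw during training: training error alone would rate the memorizer as flawless.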
The concept of how to train a model is integral to ML. Successful ML projects hinge on finding a good business use case that is properly defined as an ML problem supported by access to relevant data. The data science team will perform the “heavy lifting” needed to ensure the data gathered best addresses the problem and is properly structured.
The data can then be applied in building the ML model. During this phase, choices are made by the data science team to ensure the best model “fit” in terms of addressing the business problem and to make sure errors like overfitting do not occur.
Through applying proper and rigorous model training methodology, the team ensures that the business can rely on the models and can use them to make decisions.