The previous phase of CRISP-DM was Phase 4: Model Building. We developed several models of varying complexity to compare to the baseline model. We ran experiments, collected metrics, and assessed the performance of the models.
Now, in the fifth phase of CRISP-DM, the most widely used data mining lifecycle, we will evaluate the models in the context of the use case.
In the 9 Laws of Data Mining (2022), Dr. Tom Khabaza, one of the original authors of CRISP-DM, wrote that “CRISP-DM’s Evaluation phase should have been called ‘Business Evaluation’, because it involves evaluating the results of analytics or data mining in terms of the impact they have on the business or in the domain.”
In essence, the evaluation phase revisits the project goals outlined during the Business/Problem Understanding phase at the beginning of the project and determines if the model aligns with the needs and goals of the business.
According to the CRISP-DM 1.0 Step-by-Step Guide (1999), this phase also assesses other data mining results generated during the previous phases. The data mining results can include models that relate to the initial business objectives and other findings that may not be directly related. These additional results can provide valuable insights into potential challenges, information, or opportunities for future directions.
Evaluating a model from the business perspective is critical to ensure that it meets the goals and needs of the organization, provides a good return on investment (ROI), is resource efficient, mitigates risks, has a positive impact on the business, and is aligned with the overall strategy of the organization.
We need to test the model’s performance to gain some of this information, such as speed, ROI and resource efficiency. But we also need to look at the model holistically and consider the ethical impacts of the model from multiple perspectives.
The following are some key considerations for evaluating the performance of the model from the business perspective. Remember to document the results of each of these factors to be included on the model cards:
Evaluating the fairness and ethics of machine learning models is crucial before they can be deployed in real-world settings. Here are key aspects to consider during the model evaluation phase of the data science lifecycle; as above, document the results of each of these factors so they can be included on the model cards:
If you are wondering how you will answer all of these questions, there are some suggestions below. Keep in mind that in addition to quantifying the business impact using data-driven testing, it’s important to gather qualitative feedback from end-users and subject matter experts.
During the planning phase of the model building step, you may have established a validation dataset (train/test/validation sets), also called a hold-out dataset. This should be a representative sample of your data that is held out from training and is not used at any point in the data exploration or model building process. Accidentally including test or validation data in your training data is called “data leakage” and leads to overfitting and inflated performance metrics.
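As a concrete illustration, here is a minimal sketch of creating train/validation/test splits with scikit-learn so that the hold-out data never touches exploration or model building. The toy dataset and 60/20/20 split ratios are illustrative assumptions, not a prescription.

```python
# A minimal sketch: carve out validation and test sets before any exploration.
# The toy dataset and 60/20/20 split ratios are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# First set aside a 20% hold-out test set and do not touch it again
# until final evaluation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Split the remainder into training and validation sets (~60/20/20 overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```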
It’s also important to construct edge cases to test specific rare events, especially high-impact ones.
Depending on your use case, you may need to use simulated data. As mentioned above, this can become necessary if you have rare events without a lot of examples in the dataset. It’s possible to bootstrap or simulate some examples to bolster your model validation step.
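As a rough sketch of what this can look like, the snippet below bootstraps (resamples with replacement) a rare-event class so that the validation step has enough high-impact examples to produce a stable estimate. The synthetic “fraud” data, column names, and sample sizes are hypothetical.

```python
# A minimal sketch of bootstrapping rare-event examples for model validation.
# The synthetic "fraud" data, column names, and sample sizes are hypothetical.
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "amount": rng.lognormal(mean=3, sigma=1, size=10_000),
    "is_fraud": rng.binomial(n=1, p=0.002, size=10_000),  # ~0.2% rare event
})

rare = df[df["is_fraud"] == 1]
common = df[df["is_fraud"] == 0]

# Resample the rare class with replacement so the stress-test set contains
# enough high-impact cases for a meaningful performance estimate.
rare_boot = resample(rare, replace=True, n_samples=500, random_state=42)
stress_set = pd.concat([common.sample(n=500, random_state=42), rare_boot])

print(stress_set["is_fraud"].value_counts())
```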
Testing online with real data can be time and resource intensive, since it requires deploying a version of the model. If your organization is set up for this, A/B testing with real data allows you to compare multiple models in real time.
In this case, you would run a test instance of the model on live data and compare the predicted values to the actual values. The longer you can run a test like this, the better your estimate of real-world performance will be. It is best to combine this strategy with offline testing on synthetic or “canary” cases, designed to make sure you also estimate performance for edge cases.
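One simple way to close the loop on such a test is to join the logged predictions from the live (or shadow) model to the actual outcomes once they are known, and compute the same metrics you used offline. The log format and values below are made up for illustration.

```python
# A minimal sketch of scoring a test/shadow model's logged predictions against
# actual outcomes collected later. The log format and values are made up.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

log = pd.DataFrame({
    "request_id": [101, 102, 103, 104, 105],
    "predicted":  [0, 1, 0, 0, 1],
    "actual":     [0, 1, 1, 0, 1],  # ground truth joined in once it is known
})

print("accuracy:", accuracy_score(log["actual"], log["predicted"]))
print("f1:", f1_score(log["actual"], log["predicted"]))
```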
No amount of quantitative testing and analysis, ROI calculation, or speed testing, however thorough, will tell you about the user experience and perception of your model or data product. You need to collect some qualitative feedback, and not just from stakeholders.
Ask users if the interface to interact with the model works for them. Do they trust the output? Will they use the output?
Ask subject matter experts for a gut check. Does the model output make sense based on their experience? Does the model structure itself make sense?
Recall that we made a baseline model in Phase 4 and gradually added complexity. We likely have more than one model built, so it is worth bringing more than one of them to Phase 5.
We might see one model rise to the top during the model assessment process, when we are mainly focused on prediction performance. However, we have now discussed a myriad of other evaluations and created a more holistic view of the model through both performance and ethical evaluations.
We can use these various metrics to decide which model, if any, we will move forward to the next steps. We may find that the return on investment is greater if we sacrifice some accuracy for computational efficiency. Or we may find that the highest-performing model is biased when it comes to high-impact rare events.
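A simple way to support this decision is to lay the candidate models side by side on every dimension you evaluated, not just predictive performance. The sketch below is hypothetical; the model names, metrics, and numbers are made up.

```python
# A hypothetical side-by-side comparison to support the deploy/iterate decision.
# Model names, metrics, and numbers are made up for illustration.
import pandas as pd

comparison = pd.DataFrame({
    "model": ["baseline_logreg", "gradient_boosting", "deep_net"],
    "f1": [0.71, 0.83, 0.85],
    "latency_ms": [2, 15, 120],
    "monthly_cost_usd": [50, 400, 3000],
    "bias_gap": [0.02, 0.03, 0.09],  # e.g. recall difference between groups
})

# Weight the trade-offs however the business decides; here, F1 per dollar.
comparison["f1_per_dollar"] = comparison["f1"] / comparison["monthly_cost_usd"]
print(comparison.sort_values("f1_per_dollar", ascending=False))
```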
The whole picture needs to be clearly communicated by the data team. Ultimately, the business stakeholders need to be given the complete, holistic picture and involved in the decision making process.
The final model should be reviewed with the help of stakeholders and subject matter experts to ensure that nothing was missed along the way. Ensure that all of the attributes the model uses will be available in the deployment environment. Double-check all of the assumptions about resources and the deployment environment. This is an important quality assurance step.
During this step of the evaluation phase, while you are gathering all of the information about the model or data product, you need to be thinking about how you will tell the data story.
Data scientists and teams communicate the results to the stakeholders and subject matter experts to help facilitate decision-making. Clear communication can help the whole team gain confidence in the analysis, or understand the issues to support further iteration.
I mention model cards below as a best practice for documenting and communicating model details. Having a standardized model card format in your organization helps facilitate communication about models as audiences learn the standardized format over time.
Here are a few key considerations when deciding whether to deploy, iterate, or kill the project:
Once you have chosen your final model, the best practice for communicating the holistic view of the model to both technical and non-technical audiences is to use model cards. Adopting these in your organization is a great way to increase data literacy and communicate model strengths and weaknesses to a wide audience.
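If your organization does not yet have a template, a model card can start as something as simple as a structured record of the model’s details, intended use, data, metrics, and caveats. The sketch below is only a hypothetical starting point; the field names and values are illustrative, not a required standard.

```python
# A hypothetical, minimal model card represented as a plain dictionary.
# Section names and contents are illustrative, not a required standard.
model_card = {
    "model_details": "Gradient boosting classifier, v1.2, trained 2024-01-15",
    "intended_use": "Flag potentially fraudulent transactions for human review",
    "training_data": "12 months of anonymized transaction history",
    "evaluation_data": "Hold-out set plus bootstrapped rare-event stress set",
    "metrics": {"f1": 0.83, "latency_ms": 15},
    "ethical_considerations": "Recall gap of 0.03 between customer segments",
    "caveats": "Not validated for transactions outside the training regions",
}

for section, content in model_card.items():
    print(f"{section}: {content}")
```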
Outputs from this phase include the documented evaluation results (captured in the model cards), updated project documentation, and a decision on next steps: deploy, iterate, or end the project.
In this phase of CRISP-DM, we conducted a holistic model review from the business or stakeholder perspective. We updated critical documentation and decided next steps.
While Phase 4: Model Building might have shown us some great performing models, Phase 5: Model/Business Evaluation gave us confidence in the performance and business impact of the model, so we know whether or not we are likely to see real-world results by deploying it.
From here, we either go back to a previous phase to reassess, or we move on to the next phase: Phase 6, Deployment.