The previous phase of CRISP-DM was Phase 4: Model Building. We developed several models of varying complexity to compare to the baseline model. We ran experiments, collected metrics, and assessed the performance of the models.
Now, in the fifth phase of CRISP-DM, the most widely used data mining lifecycle, we will evaluate the models in the context of the use case.
In the 9 Laws of Data Mining (2022), Dr. Tom Khabaza, one of the original authors of CRISP-DM, wrote that “CRISP-DM’s Evaluation phase should have been called ‘Business Evaluation’, because it involves evaluating the results of analytics or data mining in terms of the impact they have on the business or in the domain.”
In essence, the evaluation phase revisits the project goals outlined during the Business/Problem Understanding phase at the beginning of the project and determines if the model aligns with the needs and goals of the business.
According to the CRISP-DM 1.0 Step-by-Step Guide (1999), this phase also assesses other data mining results generated during the previous phases. The data mining results can include models that relate to the initial business objectives and other findings that may not be directly related. These additional results can provide valuable insights into potential challenges, information, or opportunities for future directions.
Evaluating a model from the business perspective is critical to ensure that it meets the goals and needs of the organization, provides a good return on investment (ROI), is resource efficient, mitigates risks, has a positive impact on the business, and is aligned with the overall strategy of the organization.
We need to test the model’s performance to gain some of this information, such as speed, ROI and resource efficiency. But we also need to look at the model holistically and consider the ethical impacts of the model from multiple perspectives.
The following are some key considerations for evaluating the performance of the model from the business perspective. Remember to document the results of each of these factors to be included on the model cards:
Evaluating the fairness and ethics of machine learning models is crucial before they can be deployed in real-world settings. Here are key aspects to consider during the model evaluation phase of the data science lifecycle; as above, document the results of each of these factors so they can be included on the model cards:
If you are wondering how you will answer all of these questions, there are some suggestions below. Keep in mind that in addition to quantifying the business impact using data-driven testing, it’s important to gather qualitative feedback from end-users and subject matter experts.
During the planning phase of the model building step, you may have established a validation dataset (train/test/validation sets), also called a hold-out dataset. This should be a representative sample of your data that is held out from training and is not used at any point in the data exploration or model building process. Accidentally including test or validation data in your training data is called “data leakage” and leads to overfitting and inflated performance metrics.
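As a concrete illustration, here is a minimal sketch of creating train/validation/test splits with scikit-learn so that the hold-out data never touches exploration or model building. The toy dataset and 60/20/20 split ratios are illustrative assumptions, not a prescription.

```python
# A minimal sketch: carve out validation and test sets before any exploration.
# The toy dataset and 60/20/20 split ratios are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# First set aside a 20% hold-out test set and do not touch it again
# until final evaluation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Split the remainder into training and validation sets (~60/20/20 overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```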
It’s also important to construct edge cases to test specific rare events, especially high-impact ones.
Depending on your use case, you may need to use simulated data. As mentioned above, this can become necessary if you have rare events without a lot of examples in the dataset. It’s possible to bootstrap or simulate some examples to bolster your model validation step.
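As a rough sketch of what this can look like, the snippet below bootstraps (resamples with replacement) a rare-event class so that the validation step has enough high-impact examples to produce a stable estimate. The synthetic “fraud” data, column names, and sample sizes are hypothetical.

```python
# A minimal sketch of bootstrapping rare-event examples for model validation.
# The synthetic "fraud" data, column names, and sample sizes are hypothetical.
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "amount": rng.lognormal(mean=3, sigma=1, size=10_000),
    "is_fraud": rng.binomial(n=1, p=0.002, size=10_000),  # ~0.2% rare event
})

rare = df[df["is_fraud"] == 1]
common = df[df["is_fraud"] == 0]

# Resample the rare class with replacement so the stress-test set contains
# enough high-impact cases for a meaningful performance estimate.
rare_boot = resample(rare, replace=True, n_samples=500, random_state=42)
stress_set = pd.concat([common.sample(n=500, random_state=42), rare_boot])

print(stress_set["is_fraud"].value_counts())
```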
Testing online with real data can be time and resource intensive, since it requires deploying a version of the model. If your organization is set up for this, A/B testing with real data allows you to compare multiple models in real time.
In this case, you would run a test instance of the model on live data and compare the predicted values to the actual values. The longer you can run a test like this, the better your estimate of real-world performance will be. It is best to combine this strategy with offline testing on synthetic or “canary” cases, designed to make sure you also estimate performance for edge cases.
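One simple way to close the loop on such a test is to join the logged predictions from the live (or shadow) model to the actual outcomes once they are known, and compute the same metrics you used offline. The log format and values below are made up for illustration.

```python
# A minimal sketch of scoring a test/shadow model's logged predictions against
# actual outcomes collected later. The log format and values are made up.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

log = pd.DataFrame({
    "request_id": [101, 102, 103, 104, 105],
    "predicted":  [0, 1, 0, 0, 1],
    "actual":     [0, 1, 1, 0, 1],  # ground truth joined in once it is known
})

print("accuracy:", accuracy_score(log["actual"], log["predicted"]))
print("f1:", f1_score(log["actual"], log["predicted"]))
```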
No amount of quantitative testing and analysis, ROI calculation, or speed testing, however thorough, will tell you about the user experience and perception of your model or data product. You need to collect some qualitative feedback, and not just from stakeholders.
Ask users if the interface to interact with the model works for them. Do they trust the output? Will they use the output?
Ask subject matter experts for a gut check. Does the model output make sense based on their experience? Does the model structure itself make sense?
Recall that we made a baseline model in Phase 4 and gradually added complexity. We likely have more than one model built, so it is worth bringing more than one of them to Phase 5.
We might see one model rise to the top during the model assessment process, when we are mainly focused on prediction performance. However, we have now discussed a myriad of other evaluations and created a more holistic view of the model through both performance and ethical evaluations.
We can use these various metrics to decide which model, if any, we will move forward to the next steps. We may find that the return on investment is greater if we sacrifice some accuracy for computational efficiency. Or we may find that the highest-performing model is biased when it comes to high-impact rare events.
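A simple way to support this decision is to lay the candidate models side by side on every dimension you evaluated, not just predictive performance. The sketch below is hypothetical; the model names, metrics, and numbers are made up.

```python
# A hypothetical side-by-side comparison to support the deploy/iterate decision.
# Model names, metrics, and numbers are made up for illustration.
import pandas as pd

comparison = pd.DataFrame({
    "model": ["baseline_logreg", "gradient_boosting", "deep_net"],
    "f1": [0.71, 0.83, 0.85],
    "latency_ms": [2, 15, 120],
    "monthly_cost_usd": [50, 400, 3000],
    "bias_gap": [0.02, 0.03, 0.09],  # e.g. recall difference between groups
})

# Weight the trade-offs however the business decides; here, F1 per dollar.
comparison["f1_per_dollar"] = comparison["f1"] / comparison["monthly_cost_usd"]
print(comparison.sort_values("f1_per_dollar", ascending=False))
```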
The whole picture needs to be clearly communicated by the data team. Ultimately, the business stakeholders need to be given the complete, holistic picture and involved in the decision making process.
The final model should be reviewed with the help of stakeholders and subject matter experts to ensure that nothing was missed along the way. Ensure that all of the attributes the model uses will be available in the deployment environment. Double-check all of the assumptions about resources and the deployment environment. This is an important quality assurance step.
During this step of the evaluation phase, while you are gathering all of the information about the model or data product, you need to be thinking about how you will tell the data story.
Data scientists and teams communicate the results to the stakeholders and subject matter experts to help facilitate decision-making. Clear communication can help the whole team gain confidence in the analysis, or understand the issues to support further iteration.
I mention model cards below as a best practice for documenting and communicating model details. Having a standardized model card format in your organization helps facilitate communication about models as audiences learn the standardized format over time.
Here are a few key considerations when deciding whether to deploy, iterate, or kill the project:
Once you have chosen your final model, the best practice for communicating the holistic view of the model to both technical and non-technical audiences is to use model cards. Adopting these in your organization is a great way to increase data literacy and communicate model strengths and weaknesses to a wide audience.
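If your organization does not yet have a template, a model card can start as something as simple as a structured record of the model’s details, intended use, data, metrics, and caveats. The sketch below is only a hypothetical starting point; the field names and values are illustrative, not a required standard.

```python
# A hypothetical, minimal model card represented as a plain dictionary.
# Section names and contents are illustrative, not a required standard.
model_card = {
    "model_details": "Gradient boosting classifier, v1.2, trained 2024-01-15",
    "intended_use": "Flag potentially fraudulent transactions for human review",
    "training_data": "12 months of anonymized transaction history",
    "evaluation_data": "Hold-out set plus bootstrapped rare-event stress set",
    "metrics": {"f1": 0.83, "latency_ms": 15},
    "ethical_considerations": "Recall gap of 0.03 between customer segments",
    "caveats": "Not validated for transactions outside the training regions",
}

for section, content in model_card.items():
    print(f"{section}: {content}")
```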
Outputs from this phase include the documented evaluation results (captured in the model cards), updated project documentation, and a decision on next steps: deploy, iterate, or end the project.
In this phase of CRISP-DM, we conducted a holistic model review from the business or stakeholder perspective. We updated critical documentation and decided next steps.
While Phase 4: Model Building might have shown us some great performing models, Phase 5: Model/Business Evaluation gave us confidence in the performance and business impact of the model, so we know whether or not we are likely to see real-world results by deploying it.
From here, we either go back to a previous phase to reassess, or we move on to the next phase: Phase 6, Deployment.