The use of machine learning (ML) is growing. A recent report by IDG Research Services shows that two-thirds of companies use ML, and 86 per cent of participants stated that they have a budget for ML projects. Sixty per cent of the companies that had already used ML also reported that they saw results after just three months.
But despite this positive dynamic, the failure rate for ML projects is high. According to reports from Gartner and others, between 85 and 96 per cent of these projects never go live. A recent report from Databricks and MIT shows that scaling ML use cases seems incredibly complex for many companies. According to 55 per cent of respondents, the biggest challenge is the lack of a central place to store and retrieve ML models. This gap, together with error-prone hand-offs between data science and operations teams and a shortage of skilled ML resources (cited by 39 per cent of respondents), points to significant difficulties in collaboration among ML, data, and business application teams.
Common reasons for failure are overly ambitious goals at the outset and, above all, vague ideas about how the ML models will be used later. The strategy and the business case are essential prerequisites at the start of every ML project. Data is equally crucial and one of the critical components of ML. The sayings “garbage in, garbage out” and “80 per cent of a data scientist’s time is spent cleaning data” are well known in the data science community. Both refer to the importance of data quality for successful model training: if the input data for training is of low quality, the resulting model will be too. However, model training is only one part of a productive ML system.
Various data must be collected to meet business and technical needs and to measure and maintain success, for example, tracking the business KPIs for a specific project. It must also be established who is responsible for the model and how its development can be traced. The diagram below illustrates a possible data flow in a fictional web application that uses ML to recommend plants to buyers, along with the data team members responsible for each stage.
The source data flows from the web application into a cache and then into derived tables. These are used for monitoring, reporting, feature engineering, and model training. Additional metadata about the model is extracted, and logs of testing and deployment are collected for audit and compliance. A project that cannot manage this data is in danger of failing, no matter how well the ML model performs at a given step.
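The data flow described above can be sketched in a few lines of Python. This is a minimal, in-memory illustration only; the `Event` record, the cache list, and the `views_per_user` derived table are hypothetical stand-ins for the stages in the diagram, not part of any real system.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical event record emitted by the plant-shop web application.
@dataclass
class Event:
    user_id: str
    plant_viewed: str

# In-memory stand-ins for the cache and derived tables in the diagram.
cache: List[Event] = []
derived_tables: Dict[str, list] = {"views_per_user": []}

def ingest(event: Event) -> None:
    """Source data flows from the web application into a cache."""
    cache.append(event)

def build_derived_tables() -> None:
    """Cache contents are transformed into derived tables, which feed
    monitoring, reporting, feature engineering, and model training."""
    counts: Dict[str, int] = {}
    for e in cache:
        counts[e.user_id] = counts.get(e.user_id, 0) + 1
    derived_tables["views_per_user"] = sorted(counts.items())

ingest(Event("alice", "fern"))
ingest(Event("alice", "cactus"))
ingest(Event("bob", "fern"))
build_derived_tables()
print(derived_tables["views_per_user"])  # [('alice', 2), ('bob', 1)]
```

In a real project each arrow in this sketch would be a managed, monitored pipeline rather than a function call, which is precisely the data-management burden the article describes.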
ML Engineering, MLOps, And Model Governance
DevOps and data governance have reduced the risk of failed ML projects and have become disciplines in their own right. ML engineering has likewise developed into its own domain. Its goal is to ensure both the operation of ML applications (also called MLOps) and their governance. Two types of risk are associated with these tasks: first, the risk inherent in the ML application's own system, and second, the risk of non-conformity with external requirements. Data pipeline infrastructure, KPIs, model monitoring, and documentation must not be missing; otherwise, the system becomes unstable or ineffective. On the other hand, a well-engineered application that does not meet corporate, regulatory, and ethical requirements risks losing budgetary resources, receiving fines, or even damaging the company's reputation. MLOps and model governance are still in their infancy, and there are no official standards or definitions for them yet, but they represent promising approaches to mitigating the above risks.
MLOps is the active management of a productive model and its mission, including its stability and effectiveness. In other words, MLOps is primarily concerned with keeping the ML application functional through better management of data, models, and development. Simply put: MLOps = ModelOps + DataOps + DevOps.
Model governance, on the other hand, controls and regulates a model, its mission, and its impact on the surrounding systems. It primarily addresses the broader implications of how an ML application works in the real world. So, to build a system that is functional and respects human values at the same time, you need both.
Having distinguished between operations and governance, we can now examine what specific skills are required to support them. The findings can be divided into six categories:
Data Processing And Management
Since most innovation in ML is open source, supporting structured and unstructured data types with open formats and APIs is a prerequisite. The system must also process and manage pipelines for KPIs, model training/inference, target drift, testing, and logging. It should be noted that not all pipelines process data in the same way or with the same SLAs. Depending on the use case, a training pipeline may require GPUs, a monitoring pipeline may require streaming, and an inference pipeline may require low-latency online serving. Features need to be kept consistent between training (offline) and serving (online) environments, which leads many teams to consider feature stores as a solution.
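The train/serve consistency problem mentioned above can be made concrete with a small sketch. The idea behind a feature store, reduced to its essence, is that a feature is defined exactly once and shared by both paths; the function and variable names here (`price_log_feature`, `build_training_rows`, `serve_request`) are hypothetical, chosen only for illustration.

```python
import math
from typing import Dict, List

def price_log_feature(price: float) -> float:
    """Single shared feature definition, so the offline (training) and
    online (serving) computations cannot silently diverge."""
    return math.log1p(price)

def build_training_rows(prices: List[float]) -> List[Dict[str, float]]:
    # Offline/batch path: compute features over historical data.
    return [{"price_log": price_log_feature(p)} for p in prices]

def serve_request(price: float) -> Dict[str, float]:
    # Online path: compute the same feature per request at low latency.
    return {"price_log": price_log_feature(price)}

offline = build_training_rows([9.0, 49.0])
online = serve_request(9.0)
assert offline[0]["price_log"] == online["price_log"]  # train/serve consistency
```

When the two paths are implemented independently, say in SQL for batch and in application code for serving, even a small discrepancy (log vs. log1p, different null handling) degrades the model in production without any error being raised; sharing one definition removes that failure mode.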
Secure Collaboration
ML engineering in the real world is a cross-functional endeavour; comprehensive project management and ongoing collaboration between the data team and business stakeholders are critical to success. Access controls play a significant role here, allowing the right groups to work on data, code, and models in the same place while limiting the risk of human error or misconduct.
Testing
To ensure that the system meets quality expectations, tests should be performed on code, data, and models. This includes unit testing of the pipeline code, covering feature engineering, training, serving, and metrics, as well as end-to-end integration testing. Models should be tested for their accuracy across demographic and geographic segments, feature importance, bias, input schema conflicts, and computational efficiency. The data should be tested for sensitive information, training/serving skew, and validation thresholds for feature and target drift. Ideally, testing is automated; this reduces the likelihood of human error and helps with regulatory compliance.
Monitoring
Regular monitoring of the system helps identify and respond to events that threaten its stability and effectiveness. How quickly can it be discovered that a critical pipeline has failed, a model has become obsolete, or a new version is leaking memory in production? When were all input feature tables last updated, and has someone tried to access restricted data? Answering these questions can require a mix of live (streaming), periodic (batch), and event-driven updates.
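One common way to detect the model obsolescence mentioned above is a periodic batch job that compares the current feature distribution against a training-time reference. The sketch below uses the population stability index (PSI), a standard drift metric; the bin values and the 0.2 alert threshold are illustrative assumptions (0.2 is a widely used rule of thumb, not a universal constant).

```python
import math
from typing import List

def psi(reference: List[float], current: List[float]) -> float:
    """Population stability index between two binned distributions,
    given as lists of bin proportions that each sum to 1."""
    eps = 1e-6  # guard against empty bins
    return sum(
        (c - r) * math.log((c + eps) / (r + eps))
        for r, c in zip(reference, current)
    )

# Periodic (batch) monitoring job: compare today's feature distribution
# against the training-time reference and raise an alert on drift.
reference_bins = [0.25, 0.50, 0.25]  # distribution at training time
current_bins = [0.10, 0.30, 0.60]    # distribution observed today

score = psi(reference_bins, current_bins)
if score > 0.2:  # common rule of thumb for a significant shift
    print(f"ALERT: feature drift detected (PSI={score:.3f})")
```

The same function can run in a streaming context over sliding windows; what changes between live, periodic, and event-driven monitoring is the trigger, not the metric.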
Reproducibility
This refers to the ability to validate the results of a model by recreating its definition (code), inputs (data), and system environment (dependencies). Suppose a new model performs unexpectedly poorly or is biased against a population segment. In that case, organisations need to be able to review the code and data used for feature development and training, and to reproduce and redeploy an alternative version.
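A minimal way to make the three ingredients named above (code, data, dependencies) traceable is to fingerprint each one and store the hashes alongside the model run. The `record_run` helper and its inputs below are hypothetical; real systems typically delegate this to an experiment-tracking tool rather than hand-rolled hashing.

```python
import hashlib
import json
import sys
from typing import Dict

def fingerprint(text: str) -> str:
    """Short, stable hash of an artefact's content."""
    return hashlib.sha256(text.encode()).hexdigest()[:12]

def record_run(feature_code: str, training_data: str,
               dependencies: Dict[str, str]) -> Dict[str, str]:
    """Store hashes of the model's definition (code), inputs (data), and
    environment (dependencies) so a run can later be reproduced exactly."""
    return {
        "code_hash": fingerprint(feature_code),
        "data_hash": fingerprint(training_data),
        "env_hash": fingerprint(json.dumps(dependencies, sort_keys=True)),
        "python": sys.version.split()[0],
    }

run_a = record_run("def f(x): return x * 2", "user,plant\nalice,fern",
                   {"scikit-learn": "1.4.0"})
run_b = record_run("def f(x): return x * 2", "user,plant\nalice,fern",
                   {"scikit-learn": "1.4.0"})
assert run_a == run_b  # identical inputs yield identical lineage records
```

If an audit later questions a model, the stored hashes identify exactly which code, data snapshot, and dependency set produced it, which is the precondition for reproducing and redeploying an alternative version.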
Documentation
Documenting an ML application scales operational knowledge, reduces the risk of technical issues, and serves as a bulwark against compliance violations. This includes an accounting and visualisation of the system architecture; the schemas, parameters, and dependencies of features, models, and metrics; and reports on each model in production and its associated governance requirements.
The Need For A Data-Centric Machine Learning Platform
Data science tools born from a model-centric approach are fundamentally limited in enterprise-wide adoption, integration with data infrastructure, and collaboration capabilities. They provide advanced model management capabilities in software that is separate from critical data pipelines and production environments. This disjointed architecture relies on other services to manage the most crucial component of the infrastructure: the data.
As a result, access control, testing, and documentation for the entire data flow are distributed across multiple platforms. Separation at this point seems arbitrary and, as previously noted, unnecessarily increases the complexity and risk of failure for any ML application. A data-centric ML platform, such as a lakehouse, brings models and features together with the data used for business metrics, monitoring, and compliance.
By definition, lakehouses are data-centric, combining the flexibility and scalability of data lakes with the performance and data management capabilities of data warehouses. Their open-source nature makes it easy to integrate ML where the data resides. There is no need to export data from a proprietary system to leverage ML frameworks, which also makes adoption much more straightforward.
A model-centric approach to ML applications can unintentionally become a significant source of risk, whether that risk lies in the functioning of the application itself or in compliance with external mandates; the shift to a data-centric system addresses both. MLOps and model governance are emerging disciplines for instilling confidence in and de-risking ML initiatives, which they achieve through the foundational capabilities described above. The future lies in a lakehouse architecture that is open and easy to adopt.