This article was originally written for Deloitte by my colleague Erik Bookholt and me, as posted in this article.
How DataOps and MLOps enable analytics at scale
Back in 2006, Clive Humby coined the famous phrase “data is the new oil” to describe how valuable data can be. Fast forward to 2021 and many organizations have realized the role that (processed) data can play in gaining competitive advantage. In almost every industry there are examples of cost savings, product innovations and even business models that would not be possible without data.
As with crude oil, raw data needs to be processed, or “refined”, to be valuable. This is where many organizations struggle. As data becomes more abundant, the demands and expectations from the organization increase. This puts pressure on the people and the technology involved in “refining” the crude data into usable insights and products.
In data science, discovery is a process requiring deep expertise and high creative control. Data scientists solve a business problem by writing code. Code to ingest, clean and transform data. Code to build, train and tune models. The outcome is a recommended action and/or a predicted outcome.
Getting from there to business value is a bigger step than it seems. Most business decisions need to be taken not once, but on an ongoing basis. We are suddenly talking about an ‘insights product’ rather than a ‘proof of concept’, and entirely different challenges arise. Predictions need to be updated continuously. Data pipelines need to be highly automated and maintainable. The solution needs to handle data that changes over time (sometimes in unanticipated ways). Models will have to be retrained to avoid quality loss over time. On top of that, the solution should be able to handle changing business priorities (and thereby optimization objectives) and continue to evolve as new data sources are unlocked.
Suddenly the data scientist is tasked with coordinating the business owners, process owners, data owners, IT owners and leadership, all of that next to her day job. Weeks or months pass, and the model is outdated if it ever gets deployed.
This is where the oil analogy is still as relevant today as it was in 2006 (if not more): Moving beyond the ‘proof-of-concept’ stage of data and ML applications requires an industrialized refinery type of setting.
Combining existing and new practices to solve your data challenges
A new approach emerged to industrialize data operations and ML applications: DataOps and MLOps take the lessons learned from software development (Agile & DevOps to be precise) and complement those with techniques from the field of data science. This approach has been recognized as a key trend in Deloitte’s 2021 release of the annual Tech Trends report.
The DevOps principles concern many aspects of both the software delivery/operations organization and its supporting technology. DevOps emphasizes (among other things) the use of multi-disciplinary teams that incrementally deliver value, with a strong focus on automating the process and measuring all aspects for better insights. For further reading about DevOps and Deloitte’s point of view, see Deloitte’s perspective on DevOps.
When applying the DevOps principles to the delivery of data products, an immediate difference becomes clear: the data. Automating how codebases are handled and measuring the delivery and operational processes is only half the story when working with data.
Because organizations are working with an ever-increasing scope of continuously refreshed data, attention should be paid to versioning which data was ingested and what transformations were applied to it. And since data is not static, we need to monitor what kind and what quality of data we continuously receive and process. Finally, we need to know at any time which dataset and which parameters were used to train a machine-learning model. How else could we determine the quality, performance and bias, or lack thereof, of a model in production?
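To make that last point concrete, here is a minimal sketch of recording which dataset and which parameters produced a model. The function names and record layout are illustrative assumptions, not the API of any particular MLOps tool; dedicated platforms track far more, but the core idea is a reproducible fingerprint of the training inputs:

```python
import hashlib
import json
from datetime import datetime, timezone

def dataset_fingerprint(rows):
    # Hash a canonical serialization of the training data, so the exact
    # dataset behind a model can be identified (and re-checked) later.
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def training_record(rows, params):
    # Capture what went into a training run: the data fingerprint,
    # the hyperparameters, and when training happened.
    return {
        "dataset_sha256": dataset_fingerprint(rows),
        "params": params,
        "trained_at": datetime.now(timezone.utc).isoformat(),
    }

rows = [{"feature": 1.0, "label": 0}, {"feature": 2.5, "label": 1}]
record = training_record(rows, {"learning_rate": 0.1, "n_estimators": 100})
```

With such a record stored alongside each deployed model, questions like “which data and settings produced the model now in production?” have an auditable answer instead of relying on a data scientist’s memory.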
For these reasons we need additional measures to be in place besides the typical DevOps practices to stay in control of our code, our data and our ML models. Luckily, the field of operationalizing data and analytics is rapidly evolving and there is a lively technology ecosystem that solves most of the technical challenges described above. This is the machinery that we need to distill value from data in our refinery.
Technology and organization are needed for a true refinery in DataOps and MLOps
Technology plays a key role in implementing DataOps and MLOps, but the surrounding organization needs to operate like a refinery as well. A data scientist, much like a software developer, cannot be the lone superstar of an organization’s data aspirations. Nor does it suffice to give developers and data scientists the latest toys. Leadership needs to understand the value of data and organize for success: it needs to enable continuous, end-to-end value creation from data, just as a refinery is organized around the flow of oil.
DataOps and MLOps are the refinery-equivalent for data and AI
The oil analogy was coined quite some time ago, but recent advances in technology and organizational thinking have made an industrialized refinery in your organization actually possible. The implications of industrializing data and ML solutions are immense: it changes what tools developers should use, how to assess data quality, and how to measure the impact of a data solution. And it changes how teams organize, work, and collaborate.
As more and more organizations are venturing into the field of Analytics at scale, my ambition is to help our clients define and develop true end-to-end data products that are scalable, maintainable and actionable.
(Cover image: Photo by PilMo Kang on Unsplash)