The misery of a Data Scientist and the hope for their salvation
Platform Engineering and Internal Developer Platforms for Data DevX
Much has been said about the plight of software engineers. Their hardship in productionising applications is well understood and courageous efforts in DevOps and, more recently, Platform Engineering have sought to come to the rescue.
However, there is another tribe whose story is perhaps far more perilous and whose experience is far more neglected. Much like Sisyphus in the Underworld, Data Scientists are endlessly pushing a large boulder up a steep hill to release a model into production.
This boulder is larger and the hill is steeper than in the case of software engineering because data scientists experience the entirety of the software engineering struggle but are also encumbered with many additional burdens.
These burdens often relate to the complexity of working with data and include difficulty in discovering data, requesting access to data, working against disparate data sources, types, and technologies, the absence of industrialised data pipelines, missing metadata, and data quality issues.
There is another set of burdens that relate to operationalising a machine learning model, such as model training and its intense compute requirements, ethics testing and other Responsible AI practices, model governance including regulatory commitments, model monitoring such as drift detection, and the list goes on.
As such, for a data scientist to release a Machine Learning model into production with any respectable degree of velocity is a challenge that only Sisyphus might empathise with.
It is a miserable experience for the Data Scientist and the retention statistics are damning. Attrition rates are 55% higher than those of their technology peers, average tenure stands at 1.7 years, and less than 2% stick it out for more than 5 years in the same company.
So how do we turn this tide?
Internal Developer Platform
Looking at how traditional software engineering is managing to make some headway is probably a good place to start. After all, Data Scientists also write code and build applications for deployment, so they could just be regarded as another species of a larger developer family.
The rise of Internal Developer Platforms (IDPs) is allowing software engineers to operate at greater velocity and enjoy a better developer experience. The driving idea is to reduce the cognitive load of developers so that, rather than concentrating on a range of setup and operations activities, they can spend more of their time doing their actual job of writing application code.
IDPs effectively provide a consistent and friendly user interface to expose complex infrastructure services, abstracting away common operational activities through automation and self-serve pathways. That might sound a bit abstract in itself so let’s look at an example.
A developer visits a portal where they browse and select, for example, a pre-configured option to build a website.
This template experience is referred to as a Golden Path. Once selected, it automatically deploys a range of resources and presents their associated URLs.
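To make this concrete, here is a minimal sketch of what such a Golden Path template might capture. The field names, URL patterns, and `render_urls` helper are all illustrative assumptions, not taken from any specific portal product:

```python
# Hypothetical sketch of a "Build a Website" Golden Path template.
# All names and URL patterns here are illustrative assumptions.
website_golden_path = {
    "name": "build-a-website",
    "description": "Pre-configured static website with source control and CI/CD",
    "resources": [
        {"type": "git_repository", "url_pattern": "https://git.example.com/{team}/{name}"},
        {"type": "ci_pipeline", "url_pattern": "https://ci.example.com/{team}/{name}"},
        {"type": "static_site", "url_pattern": "https://{name}.apps.example.com"},
    ],
    "parameters": ["team", "name"],  # the only input the developer supplies
}

def render_urls(template, **params):
    """Expand each resource's URL pattern with the developer's parameters."""
    return {r["type"]: r["url_pattern"].format(**params) for r in template["resources"]}
```

The developer supplies two parameters; everything else is pre-decided by the platform team.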
How might this look for a data scientist?
- Data Scientist visits the portal
- Data Scientist selects the “Data Science” Golden Path
- A set of URLs appear corresponding to a code repository, a Python notebook, a PyCharm IDE, a storage bucket and a data warehouse with customer, transaction and product data.
- Data Scientist follows these URLs and is able to start working immediately
There was no need for the data scientist to define a VPC network or set up a firewall, launch a compute instance with an appropriate CPU/GPU, or build a base image with the right Python version and Data Science packages installed. All these activities have been baked into the recipe or Golden Path ahead of time.
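The split between what is baked in and what the data scientist actually sees could be sketched as follows. The specific machine types, package versions, and key names are hypothetical placeholders:

```python
# Illustrative sketch: the infrastructure decisions pre-baked into a
# Data Science Golden Path versus what the developer is handed.
# Machine types, versions, and names are hypothetical placeholders.
data_science_golden_path = {
    "name": "data-science-experimentation",
    "baked_in": {
        "network": {"vpc": "ds-experimentation", "firewall": "deny-ingress-by-default"},
        "compute": {"machine_type": "n1-standard-8", "gpu": "nvidia-t4"},
        "base_image": {"python": "3.11", "packages": ["pandas", "scikit-learn", "jupyter"]},
    },
    # What the developer receives: just a set of ready-to-use URLs.
    "developer_facing": ["code_repository", "notebook", "ide", "storage_bucket", "warehouse"],
}
```

The data scientist only ever interacts with the `developer_facing` list; the `baked_in` block is the Platform Engineers' concern.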
We use the term data scientist loosely to describe the developers who work with data as part of their development lifecycle. Data Developer is perhaps a more appropriate umbrella term and would also encompass the Data Analyst, the Quant, the Data Visualisation Engineer, the Natural Language Engineer, the Generative AI Prompt Engineer, the Data Engineer, and so on.
A simple Golden Path to go after first is “Data Science Experimentation” or “SQL Data Querying”. These Golden Paths are built by Platform Engineers who understand the lifecycle of data development.
Data Developer Platform
The concept of a Data Developer Platform or DDP is emerging. It has been positioned as “an internal developer platform (IDP) for data engineers and data scientists. Just as an IDP provides a set of tools and services to help developers build and deploy applications more easily, a DDP provides a set of tools and services to help data professionals manage and analyse data more effectively.”
It’s important, in my view, that DDPs are not built in isolation from an IDP. There needs to be a single developer platform and it should serve the need of all developers — data developers as well as application developers and whatever new species which may emerge.
Why? In short, it’s an unnecessary proliferation of tooling that will require additional build and operational complexity, but that’s not the only reason. Data and Software products may need to talk to each other: a Machine Learning model (data product) is often consumed through a web front end (software product). It makes sense that these are built off the same platform.
And once a data product has been built it should be thought of as part of an organisation’s application estate and be discoverable alongside all other engineering assets; at run-time, who cares if the underlying logic is deterministic or machine learnt? A single enterprise-wide service catalogue is an important component of an IDP and some (e.g. Spotify) would argue its raison d’être.
Software Golden Paths vs Data Golden Paths
While you should have a single unified IDP/DDP experience, there is a case for suggesting that Data Golden Paths and Software Golden Paths are fundamentally different animals.
Your data developers will behave in a much more consistent way than your software developers. As such, Data Golden Paths can be even more opinionated whereas Software Golden Paths need to allow for greater flexibility through customisation.
Your data warehouse analysts could be served by a single Golden Path that requires no further customisation, as all of these users tend to require the same tools, resources and access. With data being an organisation’s most prized asset, assurance around consistent access and consistent behaviours will also be relevant.
In contrast, there is unlikely to be a Golden Path that ever perfectly satisfies the needs of every Java microservice. In this case, a base Software Golden Path that provides a foundation and allows developers to customise with their own application configurations would be more appropriate.
In short, a Software Golden Path is extensible and a Data Golden Path is often non-extensible or, at least, less extensible.
Granularity of Data Golden Paths
Given that Data Golden Paths will largely meet the needs of the data developer community without customisation, they will need to be relatively specific in their purpose. As such, there are likely to be a far greater number of Data Golden Paths than Software Golden Paths.
For example, it is unlikely there will be a single Golden Path for Generative AI. Instead, expect “Generative AI Text — Prompt Engineering”, “Generative AI Text — RAG”, “Generative AI Text — PEFT” Golden Paths. For Data Science, there may be “Machine Learning — Batch” and “Machine Learning — Stream.” You get the idea.
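One way to picture this granularity is as a registry of narrowly scoped paths, each keyed by domain and variant. This is a sketch under assumed naming, not a prescribed taxonomy:

```python
# Illustrative sketch of how granular Data Golden Paths might be
# registered. Domain and variant names follow the examples in the text.
DATA_GOLDEN_PATHS = [
    {"domain": "generative-ai-text", "variant": "prompt-engineering"},
    {"domain": "generative-ai-text", "variant": "rag"},
    {"domain": "generative-ai-text", "variant": "peft"},
    {"domain": "machine-learning", "variant": "batch"},
    {"domain": "machine-learning", "variant": "stream"},
]

def variants_for(domain):
    """List the Golden Path variants available for one data domain."""
    return [p["variant"] for p in DATA_GOLDEN_PATHS if p["domain"] == domain]
```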
Data Developer Interfaces
Beyond the command line and APIs, the interaction layer for data developers may be slightly broader and encompass:
- IDEs like PyCharm or Visual Studio Code
- Interactive Notebooks like Jupyter
- SQL or Data Pipeline DAG user interfaces
- Natural Language interfaces for prompt engineering
On the last point, natural language interfaces may well also be an integral part of interacting with the IDP as a whole: “launch an environment that will allow me to build an experimental machine learning model” or “show me how to create a website” or “provide me with an itemised breakdown of my team’s usage and spend.”
Data Access
Another distinction is that typically Software Golden Paths will not require default access to enterprise datasets. However, this will be bread and butter for Data Golden Paths which should have an associated data access policy and relevant network connectivity.
Golden Paths can be configured so that “Data Science Experimentation” can retrieve only synthetic or scrambled data whereas “Data Science with Customer Data” can retrieve more sensitive customer account and transaction information.
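A data access policy attached to each Golden Path might be sketched like this. The policy shape, dataset names, and lookup helper are assumptions for illustration:

```python
# Hedged sketch: each Data Golden Path carries an associated data access
# policy. Dataset names and the policy structure are illustrative.
ACCESS_POLICIES = {
    "data-science-experimentation": {
        "datasets": ["synthetic_customers", "scrambled_transactions"],
        "pii_access": False,
    },
    "data-science-customer-data": {
        "datasets": ["customers", "accounts", "transactions"],
        "pii_access": True,
    },
}

def datasets_for(golden_path):
    """Return the datasets a given Golden Path is permitted to retrieve."""
    policy = ACCESS_POLICIES.get(golden_path)
    if policy is None:
        raise KeyError(f"No data access policy registered for {golden_path!r}")
    return policy["datasets"]
```

The point is that access is decided at the Golden Path level, not negotiated per developer.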
Shared Services for Data
For software developers, one might think about building out a shared Kubernetes cluster to be utilised by multiple developers or multiple teams. Shared services will also be relevant to the data community and could be leveraged across a range of Golden Paths.
With Generative AI all the rage, let’s look at the example of Large Language Models. These models are humongous in size, and open-source models often require you to host them in your own environment. Without a shared model hosting capability they can access, every team will end up doing this in their own project.
The same is true of any common data tooling that requires dedicated compute instances. There is an opportunity to build out a shared capability that can serve the needs of many rather than be stood up each and every time a Golden Path is deployed. Adopting this approach can also minimise the time it takes to instantiate resources defined in a Golden Path.
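The hosting-once pattern can be sketched as a simple shared registry: the first request provisions the model, and every subsequent team resolves to the same endpoint. The function and registry names are hypothetical:

```python
# Illustrative sketch of shared model hosting: the expensive hosting step
# runs once, after which every team reuses the same endpoint.
SHARED_ENDPOINTS = {}

def get_or_register(model_name, host_fn):
    """Return the shared endpoint for a model, hosting it only if needed.

    host_fn is whatever (expensive) provisioning routine actually deploys
    the model; it is called at most once per model name.
    """
    if model_name not in SHARED_ENDPOINTS:
        SHARED_ENDPOINTS[model_name] = host_fn(model_name)
    return SHARED_ENDPOINTS[model_name]
```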
Paved Paths vs Golden Paths for Data
In contrast with “Golden Path,” the term “Paved Path” or “Paved Road” describes what a developer might choose to do if left to their own devices; the path they would walk or pave for themselves, if you will.
Golden Paths give you an opportunity to guide developers down a more standardised and supported approach but why might this be necessary when it comes to data developers?
Well, you might wish to restrict access to open-source tooling that doesn’t have a support model or requires dedicated compute. Or you might want to prevent usage of Large Language Models (LLMs) that have been trained on copyrighted data. Or you might just want to control the number of ETL products being used so that your tooling estate is more manageable.
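Enforcing that kind of control could be as simple as an allow-list gate that the platform consults when a developer requests off-path tooling. The categories and tool names here are made-up examples:

```python
# Hypothetical sketch: an allow-list a platform team might use to keep
# tool sprawl in check. Categories and tool names are illustrative.
APPROVED_TOOLING = {
    "etl": {"dbt", "airflow"},
    "llm": {"internally-hosted-oss-model"},
}

def is_approved(category, tool):
    """True only if the requested tool is on the supported Paved Path."""
    return tool in APPROVED_TOOLING.get(category, set())
```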
Architecting your Internal Data Platform for both Software and Data developers
To enable this experience, you will likely have a three-tier architecture that broadly resembles the following:
- The first layer is where your developers operate and where workload configuration originates. As such, your developer portal, source control and IDEs cohabit this space.
- The second layer is the workhorse or execution engine. It triggers and builds the configuration defined in the first layer and acts as the glue between the top and bottom layers.
- The third and final layer is where the actual infrastructure exists and where shared services should be built out. This is what layer one and two are seeking to expose through automation engines and self-serve Golden Paths.
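The flow across the three layers could be sketched as follows. The class and method names are illustrative stand-ins, not from any specific product:

```python
# Rough sketch of the three-tier flow: developer-facing configuration
# (layer 1) is handed to an execution engine (layer 2), which provisions
# actual infrastructure (layer 3). All names here are illustrative.
class ExecutionEngine:
    """Layer 2: turns developer-facing configuration into a build plan."""
    def build_plan(self, config):
        return {"steps": [f"provision {r}" for r in config["resources"]]}

class Infrastructure:
    """Layer 3: where resources actually get created."""
    def provision(self, plan):
        return [step.replace("provision ", "") + ": created" for step in plan["steps"]]

def handle_golden_path_request(portal_request):
    # Layer 1: the portal is where workload configuration originates.
    config = portal_request["template_config"]
    plan = ExecutionEngine().build_plan(config)   # layer 2: glue and build
    return Infrastructure().provision(plan)       # layer 3: real resources
```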
There’s some advice on how to get started here.
Key takeaways
Platform Engineering and the adoption of an IDP can be transformative for the data scientist or data developer experience.
Perhaps keep in mind:
- There should be one IDP that serves both software and data developers. A separate DDP instance should not exist in isolation.
- Data Golden Paths and Software Golden Paths are different species. The former will be more opinionated and less extensible and the latter will serve as more of a foundational blueprint to be extended.
- There will likely be a wide selection of Data Golden Paths to cover the rich tapestry of data developer activity. These will typically outnumber Software Golden Paths by some way.
- In defining an enterprise pathway, accept that Data Golden Paths will not always be palatable to the entire data developer community.
- The portal’s catalogue should be home to discoverable data products and, as such, perhaps include a feature store, a data dictionary, and a model repository.
And for good measure, here’s a final link on the Data Developer Experience that may help as you design your offering. Good luck!