Access to your enterprise’s data stores and their data assets is controlled by Identity and Access Management (IAM) policies that grant or deny access depending on the role and permissions you have been assigned.
Unfortunately, the way IAM is typically implemented almost always fails to serve data science and impedes your organisation’s ability to leverage modern artificial intelligence.
This is because today’s IAM policies are largely shaped by the needs of contact-centre colleagues or database administrators: colleagues who are responding to a particular customer and need access in the moment to conduct a search or make a change. The traditional 1990s application of IAM still serves an important security function in these cases.
However, data science and its necessary dependency on data differ significantly, and the extent of that difference needs to be appreciated:
- Data Science experimentation, build and deployment can last for many, many months.
- Data Scientists (potentially with different data access privileges) need to work together in a single project space, pulling in different data assets which they re-shape collaboratively.
- Data Scientists often do not know up-front what data will be of value and need to conduct data exploration activities to assess the need and usefulness of datasets.
- Regardless of the use case, Data Scientists will almost always need access to core enterprise datasets: customer data, transaction data, product data, market data and channel data.
- When building a predictive customer model, Data Scientists will also require access to highly confidential data sources to perform bias detection to ensure the model is not discriminating against any particular demographic group.
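The bias-detection requirement above can be made concrete with a toy sketch of one common fairness check, demographic parity: comparing positive-prediction rates across demographic groups. The function name, data and threshold-free output here are illustrative assumptions, not a prescribed method; the point is that the check needs the real protected attribute alongside the model’s outputs.

```python
# Toy sketch of one common bias check: demographic parity difference.
# All names and data are hypothetical; a real check needs the genuine
# (highly confidential) protected attributes discussed above.

def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rates between any two groups."""
    rates = {}
    for pred, group in zip(predictions, groups):
        positives, total = rates.get(group, (0, 0))
        rates[group] = (positives + pred, total + 1)
    positive_rates = [p / t for p, t in rates.values()]
    return max(positive_rates) - min(positive_rates)

# Hypothetical model outputs (1 = approved) and a protected attribute.
preds  = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

gap = demographic_parity_difference(preds, groups)
print(gap)  # group A: 3/4 approved, group B: 1/4 approved -> 0.5
```

A large gap would prompt investigation into whether the model is discriminating against a group, which is exactly why the raw demographic fields must be joinable to the model’s training and scoring data.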
These data science requirements make traditional IAM policies, which often embrace the principle of least privilege and JIT (Just-in-Time) access, very challenging to apply.
Security professionals are understandably troubled by any suggestion to relax these well-established guardrails. However, by not doing so, we are gravely inhibiting an organisation’s ability to exploit their data for analytics, machine learning and artificial intelligence.
Solutions have been proposed, such as using synthetic or anonymised data in place of providing access to live data. However, raw non-anonymised data is needed at scale in these scenarios so that real-world trends and multi-variable correlations can be identified, so that data can be joined across multiple source systems, and so that ethics testing can take place.
Having up-to-date anonymised data that maintains referential integrity and statistical relationships across many millions of records over thousands of fields over hundreds of disparate source systems is not really feasible. This is especially true in large organisations with multiple business areas where different data models, storage, infrastructure and legacy systems make this landscape incredibly complex.
Some organisations will manage access to data at a business area level, some at a database level, some at a dataset level, some at a table level, and some even at a field level. Data security has evolved to allow for the most granular of controls to be configured.
These access requests are usually managed by humans, who can either approve or decline a request for a particular data asset. This is not an effective approach to keeping data secure, unless, of course, a meeting is set up for each request to discuss exactly what data the colleague needs, how long they require it, what they intend to do with it, and so on. That is not scalable and would amount to a full-time job for the data approver involved, so, to maintain their sanity, approvers tend to just click ‘approve’.
Equally, when your organisation zealously applies data access rules of increasing granularity across the breadth of your data landscape, you may end up with an unmanageable proliferation of data personas with differing permissions. As IAM roles become more complicated and your organisation struggles to make sense of them, this often has the unintended consequence of weakening your security posture.
Also, these IAM policies can be easily bypassed in the analytics arena. Let’s say you have two data scientists, Mel and Joan, working on a data science project. Assume Mel doesn’t have access to Table A but Joan does. Joan can pull the table, or some extract of it, into their shared collaboration space, entirely innocently, exposing it to Mel.
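The loophole can be captured in a toy model: the access check fires when reading from the source system, but nothing governs copies that land in the shared project space. The names (Mel, Joan, `table_a`) and the dictionary-based "workspace" are purely illustrative assumptions.

```python
# Toy model of the loophole: access control is enforced at the source,
# but not on derived copies in a shared workspace.
# All names (mel, joan, table_a) are illustrative.

ACL = {"table_a": {"joan"}}            # per-source permissions
source = {"table_a": ["row1", "row2"]}
shared_workspace = {}                  # project space both scientists share

def read_from_source(user, table):
    if user not in ACL.get(table, set()):
        raise PermissionError(f"{user} may not read {table}")
    return list(source[table])

# Joan is entitled to read the table...
shared_workspace["extract"] = read_from_source("joan", "table_a")

# ...but once the extract sits in the shared space, no check applies:
# Mel, who would be denied at the source, reads the copy freely.
print(shared_workspace["extract"])  # ['row1', 'row2']
```

Because enforcement stops at the source boundary, no amount of granularity in the IAM policy closes this gap once collaboration is involved.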
So what is the solution? We would propose:
- Establishing a single Data Scientist IAM role that has default and sweeping access to, at the very least, all your core enterprise datasets.
- Adopting a Data Route-to-Live to limit data to the confines of your production environment and effectively virtualise compute, data and developer tooling.
- Implementing intelligent monitoring and auditing capabilities to capture and alert on data access behaviours.
- Rolling out a training programme together with a data developer license agreement that provides robust contractual obligations for all Data Scientist IAM members.
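The monitoring point above can be sketched as a behavioural baseline check: flag a data scientist whose daily access volume deviates sharply from their own history. The log format, threshold and user names here are assumptions for illustration, not a prescribed design.

```python
# Minimal sketch of the proposed monitoring capability: alert when a
# user's daily data-access count deviates from their own baseline.
# The z-score threshold and the log shape are assumptions.

from statistics import mean, pstdev

def flag_anomalies(history, today, z_threshold=3.0):
    """history: {user: [daily access counts]}; today: {user: count}.
    Returns users whose count today exceeds their baseline by more
    than z_threshold standard deviations."""
    alerts = []
    for user, counts in history.items():
        mu, sigma = mean(counts), pstdev(counts)
        if sigma == 0:
            sigma = 1.0  # guard against a perfectly flat baseline
        if (today.get(user, 0) - mu) / sigma > z_threshold:
            alerts.append(user)
    return alerts

history = {"mel": [10, 12, 11, 9, 10], "joan": [50, 48, 52, 51, 49]}
today = {"mel": 400, "joan": 52}       # mel suddenly pulls 400 tables
print(flag_anomalies(history, today))  # ['mel']
```

In practice this sits on top of audit logs from the data platform; the essential shift is from gatekeeping every request up front to detecting and responding to abnormal behaviour.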
The proposal relaxes IAM policies but counterbalances that with other security measures that could provide an even more robust posture overall. If your organisation has an appetite to get value out of its data, it may need to revisit traditional approaches to IAM.