Recently, I have taken to exploring the world of synthetic data, specifically as it relates to the design of safety systems. This is motivated in no small part by Weapons of Math Destruction by Cathy O'Neil - a masterful examination of the pervasive impact of data-driven choices & their real-world impact.
It's a niche that generally lives beyond the horizon of most day-to-day interactions people have with technology. These are systems that overlay production & process systems at industrial sites to allow for safe operations, & they are designed with a singular goal: preventing loss of containment & catastrophe.
As with all engineering systems in private enterprises, these are designed with the goal of generating maximum value (in this case what can be seen as a negative cost, i.e. limiting liability & damage). However, what was traditionally the domain of a safety-factor-based philosophy has become increasingly subject to analytical methods that try to optimize the architecture of these systems using risk-based analytics.
A core challenge here is the lack of data, & this is where stochastic processes hold their appeal. Distribution functions seem to offer a repository of endless data that can be used to estimate, predict & analyze the behavior of these systems, but what is often glossed over in this approach is that their applicability to the world of safety systems in industrial facilities is highly suspect.
To develop this stochastic function, we need an event distribution - or in simpler language - I need something bad to have happened in the past to tell you how it might happen in the future. Now what if nothing bad has happened in the past? (As innocuous and favorable as this conundrum might seem, it is nothing if not a Trojan horse for introducing synthetic data as a solution.)
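To make the conundrum concrete, here is a minimal sketch of estimating a failure rate from a spotless operating history. It assumes a constant-failure-rate (Poisson) model, & every number in it is an illustrative assumption rather than data from any real facility; the point is that with zero recorded events, the estimate is driven almost entirely by the prior we choose, not by evidence.

```python
# Illustrative only: a constant-failure-rate (Poisson) model with assumed numbers.
observed_hours = 200_000.0   # assumed operating exposure
recorded_failures = 0        # "nothing bad has happened in the past"

# Naive maximum-likelihood estimate: failures / exposure.
# With zero events this collapses to zero, implying the system can never fail.
lambda_mle = recorded_failures / observed_hours

# A common workaround is a Bayesian adjustment, e.g. a Jeffreys-style Gamma(0.5, 0)
# prior on the rate. The resulting number comes almost entirely from the prior.
alpha_prior, beta_prior = 0.5, 0.0
lambda_bayes = (alpha_prior + recorded_failures) / (beta_prior + observed_hours)

print(f"MLE failure rate:          {lambda_mle:.2e} per hour")
print(f"Prior-driven failure rate: {lambda_bayes:.2e} per hour")
```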
Industrial safety systems dance along the line between known unknowns & unknown unknowns.
We are likely always in a position to deal with the known unknowns - think adding a thicker layer of concrete as a safety feature, or a fire suppression system designed to deal with fires larger than one can expect - but unknown unknowns present a staggeringly complex problem. What synthetic data is supposed to provide is an elegant (but, as we will see, a highly simplified and idealistic) solution. It rests on the idea that we abstract away the context of events and digest it into a stochastic function. Then we generate numbers as outcomes of that stochastic function - numbers that sit neatly back inside it - allowing us to point massive-scale compute at the problem. Most people have probably already recognized this method - Monte Carlo - which in theory should generate a sufficiently accurate estimate of the outcome.
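A minimal sketch of such a Monte Carlo exercise follows. The distributions & parameters are entirely hypothetical assumptions chosen for illustration, not properties of any real plant; the mechanics, however, are representative: assume distributions, sample at scale, read off a probability.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1_000_000  # number of synthetic "plant-years" to simulate

# Assumed (not measured) distributions: annual demands on the safety system
# follow a Poisson distribution, & the probability that the system fails on
# demand is drawn from a Beta distribution fitted to very little real data.
demands_per_year = rng.poisson(lam=0.3, size=N)
prob_failure_on_demand = rng.beta(a=0.5, b=200.0, size=N)

# A "loss of containment" year is one in which at least one demand coincides
# with a failure of the safety system.
failures = rng.binomial(n=demands_per_year, p=prob_failure_on_demand)
p_loss_of_containment = np.mean(failures > 0)

print(f"Estimated annual probability of loss of containment: {p_loss_of_containment:.2e}")
```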
It is time to pause here for a minute - what did we just do?
We had few or no events to begin with - approximated them with a function - generated what is seemingly a projection into the future - & then derived stochastic parameters from that projection all over again. Strange, isn't it? Especially if we factor in the intrinsic assumption: that the abstracted value contains all the context & information associated with the event of failure.
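The circularity is easy to demonstrate. In the sketch below (again with purely illustrative numbers), we assume a failure rate, generate synthetic failure times from it, & then "learn" the rate back from those synthetic samples - recovering, up to sampling noise, exactly the number we assumed. No new information about the real facility enters the loop.

```python
import numpy as np

rng = np.random.default_rng(7)

# Step 1: assume a failure-rate parameter (this is the approximation step).
assumed_rate = 2.5e-6  # failures per hour, purely an assumption

# Step 2: generate synthetic times-to-failure from that assumption.
synthetic_failures = rng.exponential(scale=1.0 / assumed_rate, size=100_000)

# Step 3: "learn" the rate back from the synthetic data.
recovered_rate = 1.0 / synthetic_failures.mean()

print(f"Assumed rate:   {assumed_rate:.3e} per hour")
print(f"Recovered rate: {recovered_rate:.3e} per hour")
# The recovered value simply echoes the assumption; the synthetic data added nothing.
```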
Aggressive interrogation of these models has so far been possible because the methods are transparent - but recent developments focused on completely black-box models (in the literal sense: outcomes are almost always unexplainable, even when they seem to be correct) are likely to drive a staggering increase in the use of the synthetic data approach.
The problem we will face is that we will be challenged to use the outcomes of a faceless algorithm, fed with synthetic data, to design systems that have very real safety needs.