We are already living in the post-data-science world
You’ve heard it here first. Not from Gartner, Forbes, or Buzzfeed. Right here in the S.A.D newsletter: We are already living in the post-data-science world.
Now that the clickbait is over and you are still here, let me clarify.
Data is everywhere; data is this and that (I have talked about this earlier). We are inundated with the importance and usefulness of data, from warfare to pet care and everything in between. Along with that, every few years we see articles like this one in the popular media (Jan 9, 2022 in the Guardian, for example): “Are we witnessing the dawn of post-theory science?” Another clickbait headline, perhaps? According to Betteridge's law of headlines, "Any headline that ends in a question mark can be answered by the word no." So the answer is no: we are not witnessing the dawn of post-theory science. But I claim post-theory does not matter, because we are already living in the post-data-science world.
On a side note, the ongoing news coverage about data and AI also shows how the media often does not (yes, there are exceptions) help drive these discussions in a more constructive way. It becomes binary, clickbait journalism: “data vs theory”, “can AI be trusted?”, “AI rules”, “look at the new shiny model”.
Let’s now look at some excerpts from the Guardian article (with some commentary for dramatic flair):
“Somewhere between Newton and Mark Zuckerberg, theory took a back seat.” Gasp. Really? From Newton we are going to draw a line straight to Zuckerberg? (By the way, the article, despite a few shortcomings, is actually well researched; here I am critiquing the state of the world more than the article itself):
In 2008, Chris Anderson, the then editor-in-chief of Wired magazine, predicted its demise. So much data had accumulated, he argued, and computers were already so much better than us at finding relationships within it, that our theories were being exposed for what they were – oversimplifications of reality. Soon, the old scientific method – hypothesise, predict, test – would be relegated to the dustbin of history. We’d stop looking for the causes of things and be satisfied with correlations.
With the benefit of hindsight, we can say that what Anderson saw is true (he wasn’t alone). The complexity that this wealth of data has revealed to us cannot be captured by theory as traditionally understood. “We have leapfrogged over our ability to even write the theories that are going to be useful for description,” says computational neuroscientist Peter Dayan, director of the Max Planck Institute for Biological Cybernetics in Tübingen, Germany. “We don’t even know what they would look like.”
Chris Anderson and others did indeed see the role of data and its future implications. However, the “leapfrogging” that Dayan mentions is a problematic framework: it allowed us to think that finding relationships among data elements is enough to declare success and deploy models and systems. We created an environment (media, academia, industry, the fossil fuel industry, venture capitalists, and war machines are all part of this) that made it easy to leapfrog and say correlation is enough. Examining the context of such leapfrogging is what is urgently needed (see my comments here). Also see Emily Bender’s recent piece on Medium: On NYT Magazine on AI: Resist the Urge to be Impressed.
Our analytical framework about data needs a revamp, so that we stop asking and repeating the same old clichéd question: which comes first, “theory” or “data”? The simple answer is: it depends. The complicated answer is: it depends on the mode of thinking, the discipline, the context, the funding, and where in the hype cycle you are now. Ideally, it should be a simultaneous and integrated process. But the real world is messy. Massive amounts of data and labelling engines can correctly identify and label pictures of cats. However, they cannot answer why we should keep looking at pictures of cats (for that, you need to write an article about cat lovers, cat memes, and cat languages on the Internet).
The Guardian article does mention several of the scholars who are thinking about these issues, in particular about understanding the context, provenance, and history of data:
Critics point out, for example, that neural nets can throw up spurious correlations, especially if the datasets they are trained on are small. And all datasets are biased, because scientists don’t collect data evenly or neutrally, but always with certain hypotheses or assumptions in mind, assumptions that worked their way damagingly into Google’s and Amazon’s AIs. As philosopher of science Sabina Leonelli of the University of Exeter explains: “The data landscape we’re using is incredibly skewed.”
And this is just an example from the science domain. The majority of data science techniques and resources are now used to solve ad-tech, news feed (and cat picture) problems instead of fixing worldwide crises.
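The spurious-correlation problem the critics raise is easy to demonstrate. Here is a minimal sketch (my own toy simulation, not from the article): with a small sample and many candidate features, some feature will correlate strongly with a purely random label by chance alone.

```python
import numpy as np

# Toy demonstration of spurious correlation in small datasets:
# 20 samples, 500 random features, and a random label that has
# no real relationship to any feature.
rng = np.random.default_rng(0)
n_samples, n_features = 20, 500

X = rng.normal(size=(n_samples, n_features))  # random "features"
y = rng.normal(size=n_samples)                # random "label"

# Absolute Pearson correlation of each feature with the label.
corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
best = max(corrs)
print(f"Strongest correlation found by pure chance: {best:.2f}")
```

With these sizes, the strongest chance correlation typically exceeds 0.5, which would look impressive if you did not know the data was pure noise.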
Building theory is complicated. So the next best option is to leapfrog over our ability to write theory and pose questions in a data-vs-theory framework. We are also teaching the next generation of data scientists in this framework. I am not implying that we all need to learn everything; data science and theory both need nuance and context. But imagine teaching engineers about building bridges without teaching them how rivers, floods, or soil erosion work. At the moment, we are teaching and learning “data science” with methods heavily tilted towards technique rather than content and process. There is a lack of understanding of what theory and data mean and how they interact in data science. For example, theory often relies on abstraction, where you need to throw away details and generalise (for example, “all birds fly”). This is where the power of data and machine learning comes in handy, helping us categorise, sort, and label things to verify and test our theory. The theory is the basis for understanding what’s going on and offers the most accurate explanation given the data and knowledge at that time (later, for instance, we find that penguins, ostriches, emus, and kiwis are birds too, so we modify our theory: “Not all birds fly”). This kind of iterative theory-making exercise is part of modern intellectual practice (be it in the social or natural sciences). So data-driven endeavours (and “data science” by extension) should follow these practices. Therefore, we do not need to ask questions about the end of theory. Done.
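The “all birds fly” example can be sketched as a toy theory-revision loop (an illustrative sketch of my own, not a real system): a theory is stated as a testable rule, checked against observations, and revised when counterexamples appear.

```python
# Illustrative sketch: a theory as a testable rule, revised against data.
# Observations: does each bird fly? (penguins, ostriches, emus, kiwis do not)
observations = {
    "sparrow": True, "eagle": True,
    "penguin": False, "ostrich": False, "emu": False, "kiwi": False,
}

theory = "all birds fly"
counterexamples = sorted(b for b, flies in observations.items() if not flies)

if counterexamples:
    # New data contradicts the theory, so we revise it rather than discard it.
    theory = "not all birds fly"

print(theory)            # the revised theory
print(counterexamples)   # the data that forced the revision
```

The point of the sketch is the loop, not the code: the theory is never “ended” by data; it is iteratively corrected by it.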
Going from abstract concepts to decision-making and policy is also not easy. As we form new ideas, new theories emerge. Abstraction helps us think in a way where we can consider several objects or ideas and try to connect them (“government”, “crisis”, “health care”, for example). Data science is not there yet. It probably never will be, and that is OK.
The Guardian article goes on to say the following:
But Anderson’s prediction of the end of theory looks to have been premature – or maybe his thesis was itself an oversimplification. There are several reasons why theory refuses to die, despite the successes of such theory-free prediction engines as Facebook and AlphaFold. All are illuminating, because they force us to ask: what’s the best way to acquire knowledge and where does science go from here?
The first reason is that we’ve realised that artificial intelligences (AIs), particularly a form of machine learning called neural networks, which learn from data without having to be fed explicit instructions, are themselves fallible. Think of the prejudice that has been documented in Google’s search engines and Amazon’s hiring tools.
The second is that humans turn out to be deeply uncomfortable with theory-free science. We don’t like dealing with a black box – we want to know why.
I think I can live with this conclusion. But I never want to see another end-of-theory news item again! So, to summarise:
We need to ask the right questions. “Why” matters. How we problematise things and issues matters. Theory matters, and so does data. But sometimes, to know “why”, we also need to ask “how”. “Theory-free” is just not going to work.
There is a difference between making theories or models understandable and making them accurate. This is an iterative process.
Over-reliance on data without domain knowledge can create problems. The COVID crisis is the perfect example. See Ioannidis, J.P., Cripps, S., and Tanner, M.A., “Forecasting for COVID-19 has failed”, International Journal of Forecasting (2020). Also see: Hundreds of AI tools have been built to catch covid. None of them helped.
“Raw data is both an oxymoron and a bad idea; to the contrary, data should be cooked with care.” —Geoffrey Bowker.
And why are we now living in the post-data-science world? “Post” here is not about chronology. It is about the mode of critique and problematisation. We need this mode of “meta”-level thinking to examine how data is being used.
We have also seen this movie, the one we call data science, before. Yes, there has been a tremendous amount of progress, but we have a long way to go. Even if we look at this chronologically, the fundamental aspects of “data science” are 50 years old. Yes, 50 years! See David Donoho’s article 50 Years of Data Science: “Data Science is a multidisciplinary area where one of its disciplines (perhaps the most important) being statistics, which all credits of data “science” comes from in a way that, without statistics, it would be just scientific content application, as same old engineering. In any case, there is no consensus yet as to where one area begins and the other ends so that there is a lot of overlap between both, thus this creates several initiatives to bring these fields together.”
So instead of mulling over data vs theory, let’s focus on what we can build that can help us and make life better for everyone.