In The Adventure of the Copper Beeches, Sherlock Holmes cries impatiently to John Watson: “Data! Data! Data! I can’t make bricks without clay.” If you have read the Sherlock Holmes canon or watched some of its film or television adaptations, and therefore have an inkling of the great detective’s tremendous deductive powers, you may wonder what sort of bricks, let alone buildings, he could make in the present day, when data is all-engulfing.
When it comes to data, perhaps the most overused phrase, albeit one with some truth in its foundation, is “Data is the new oil.” Where oil and other fossil fuels powered economic growth over the past 150 years or so, data is supposed to do the same in the years to come. With modern-day data analysis tools such as neural network-based deep learning, we do have the “combustion engine” required to use this “oil”.
Indeed, much of what we think of as “data” these days is in the digital space. Google, for instance, uses your Search, YouTube, Maps and Play activities, among others, to personalise advertisements for you. Netflix, Spotify and Amazon use sophisticated artificial intelligence methods to personalise television shows, music and books respectively for their users. Even the recent uproar in Malaysia over the ownership of the MySejahtera platform was, ultimately, a question of data.
Digital data does not stop there, of course. There are overwhelmingly large sources of data from satellites, mobile usage, smart machines and much more. In fact, I was informed that even some of the largest data collectors in the world have mountains of untapped and unexplored data for the simple reason that they do not yet have the resources to get to it. And it is undeniable that the digital data we produce on a daily basis will remain pervasive in our lives moving forward.
That being said, while all of this digital data certainly opens all kinds of avenues for exploration, the methods, or “combustion engine”, that use this data can also be applied to other kinds of data. We can digitise so much data today because we have high-speed internet. As a thought experiment, imagine if we had had high-speed internet in the 1800s. How much more data would we have produced then? Perhaps not as much as we do today, given the sheer difference in population sizes, but certainly far more than the data that survives from those times.
But how was data recorded then? Well, since the advent of writing, we have had all kinds of archives and records that have simply never been digitised. Digitising them would open a new treasure trove of data that we might use to understand our ancestors and our histories, and how the lessons of the past might be relevant today. In fact, some of the most cutting-edge research in economics today includes wonderful papers that show just how creative we can be in using data from the past.
In a recent paper, Yuhua Wang, a professor of government at Harvard University, finds that in 11th-century China, a politician’s support for a strong state and state-building increased with the geographic size of that politician’s kinship network. Kinship can be roughly defined to include one’s extended family and relatives. The difficulty, of course, is in constructing an 11th-century kinship network. To do so, Wang uses a really creative and novel source, namely, tomb epitaphs. As epitaphs were a literary genre, the texts of hundreds of them survive, and these include lengthy eulogies containing information on several generations of the deceased individual’s kin. Once Wang had digitised that data, he could unlock interesting insights from our past, adding new perspectives on the relationship between kinship and state-building.
Economists Jeanet Bentzen and Lars Andersen at the University of Copenhagen have found that, in Western Europe over the past 700 years, the greater the religiosity of individuals in a given region, the less likely those individuals were to become engineers, doctors and scientists, or to move on to advanced studies at university. Cities with a lower proportion of such professions ended up growing more slowly than cities with a higher proportion. To measure religiosity, they used parents’ naming practices. Their argument is that names signal the identity and preferences of parents, so more religious parents would choose the names of religious figures in history, particularly major patron saints. To get those names, the authors obtained a data set of 61,573 students at universities in the Holy Roman Empire in the years 1250-1550, and a database of 4.1 million authors of books in libraries across the globe. Much of this kind of data is just sitting around waiting to be digitised. Imagine how much we might have in Malaysia that we just do not typically think of as data.
Stelios Michalopoulos and Melanie Xue, economists at Brown University and the London School of Economics respectively, show yet another creative way of gleaning data from the past. In their paper titled “Folklore” (not to be confused with the album by Taylor Swift), they compile a catalogue of oral traditions spanning about 1,000 societies around the world. The myths and stories of those societies have thus been digitised, adding to the ethnographic record, and are being made available for further exploration. The authors then show how machine learning, applied to this digitised set of folklore, can help shed light on cultural traits, using gender roles, attitudes towards risk and trust as examples. For instance, societies with tales portraying men as dominant and women as submissive tend to relegate their women to subordinate positions in their communities, both historically and today.
The three papers that I describe here are but the tip of the iceberg of current research that digitises records and archives from the past. In Malaysia, primary data pre-dating the 20th century (and even from the 20th century) can be extremely difficult to come by. But what if we do not have to depend solely on written records or official documents? There is a universe of potential data out there that is more “real” because it reveals actual attitudes and preferences, as opposed to self-reported surveys. For instance, what might we learn about the evolution of Malaysian society from the lyrics of its most popular songs across time periods? Or what might we stand to learn from key words in our nation’s newspapers across time?
As such, data should not be limited to whatever we produce digitally today via our online activities. We need to be cognisant of the mountains of potential data lying all around us. Researchers have used tomb epitaphs, folklore and naming conventions from the past. Can we challenge ourselves to be as creative and, more so, to digitise our data accordingly? Recall, after all, that oil comes from the flora and fauna of millions of years ago; if data is the new oil, maybe it isn’t so different from the old oil after all. And to come back to Sherlock Holmes, perhaps with modern-day techniques, Holmes might not just be making bricks from clay; he might be building the digital version of St Peter’s Basilica. Doing so just requires him to learn from the past as well.