The next-generation data scientist
The data scientist role continues to evolve as AI takes hold and domain knowledge becomes paramount. Explore the skills and expectations required for the future data scientist.
Lately, everyone wants to be a data scientist. Numerous outlets have proclaimed data scientist the "sexiest" job out there, the ultimate vindication of the high school nerd in the lottery of life. However, it is almost certain that the role of the data scientist, or of any mathematician or statistician who uses these skills to drive analysis or action, will ultimately prove transitory, both as the technical underpinnings of data science become increasingly automated and as the need for domain knowledge on the part of the data scientist grows.
A history lesson
Living on the cutting edge of any new technology means you must be nimble. This is the state of affairs in which many people who entered the realm of data science now find themselves. Twelve years ago, the term "data scientist" had barely entered the technical lexicon. There were people who used statistical methods to analyze population dynamics, most of whom saw themselves as researchers or, perhaps, data analysts.
Statistics and statistical modeling have a long history in computing. Fortran, for instance, was one of the first computer languages to incorporate statistical libraries. Yet it wasn't until the 1990s that an open source statistical programming language was developed and made available by Robert Gentleman and Ross Ihaka, or R&R, as they called themselves. It didn't take long for the language to be christened R, and its 1.0 release came in 2000.
In 2009, AQR Capital Management released its own open source statistical extensions to the Python language in a library called Pandas, a name derived from the econometrics term "panel data." Pandas was intended to work with the NumPy library for fast numerical array processing, and, between the two libraries, a growing cadre of Python programmers began encountering statistics programming for the first time.
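To make that relationship concrete, here is a minimal sketch of how the two libraries fit together; the column names and values are purely illustrative.

```python
import numpy as np
import pandas as pd

# A Pandas DataFrame wraps a NumPy array with labels and adds
# statistical conveniences on top of it.
prices = np.array([[101.2, 99.8], [102.5, 100.1], [103.0, 99.5]])
df = pd.DataFrame(prices, columns=["asset_a", "asset_b"])

print(df.describe())        # summary statistics: count, mean, std, quartiles
print(type(df.to_numpy()))  # the underlying NumPy array is still accessible
```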
There is nothing like a good religious war to spur the rapid evolution of a language, and soon enough, R enthusiasts and Pandas aficionados were exchanging barbs in pointed blog posts, each camp trying to prove theirs was the better language for working with statistics. The R purists placed their emphasis on statistical analysis, while Python programmers began concentrating on deep matrix operations to better solve neural network problems.
Meanwhile, business analysts, who until this point had focused primarily on building complex models using Microsoft Excel or business intelligence tools, began noticing what was happening -- as did their managers. Additionally, the rise of Hadoop spurred the development of big data lakes and warehouses, but while this facilitated moving data into centralized repositories, the question of what to do with this data became a significant concern.
Finally, advances in graphics processing units (GPUs), primarily in support of self-driving vehicles, began spurring two distinct areas: neural network programming and semantic networks, both of which rely heavily on a concept known as a network graph. While graphs, like data science, have been around for a long time, they require processor power and multiple distributed pipelines to work effectively. By 2015, these things were all beginning to come together.
The ambiguous future of the data scientist
So, what is a data scientist today? If you were to combine all the attributes that appear in a sample of data science job listings, one thing that would emerge is that such a person would need to be both a super-genius and able to work at 10 times the speed of lesser mortals. Just as there are many different flavors of programmer, there is a growing plethora of data scientist roles emerging as the need for specialization arises.
To better understand the distinctions, it is worth looking first at the differences between a data scientist and a programmer. Both use computer languages, as well as specific ancillary tools such as command-line interfaces and code-based editors (such as Microsoft Visual Studio Code, RStudio, Python's IDLE or the Eclipse IDE). The distinction, in general, is that the goal of a programmer is to create an application, while the goal of a data scientist is to create a model. For instance, a person may write an application that shows weather patterns over time. That is the role of a programmer. A meteorologist, however, will use that tool to predict how specific patterns will manifest in future weather.
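A minimal sketch of that division of labor, with made-up temperature readings: one function simply presents the data (the programmer's tool), while the other fits a trend and extrapolates from it (the analyst's model).

```python
import numpy as np

# Illustrative daily high temperatures in degrees Celsius (invented values).
temps = np.array([21.0, 22.5, 21.8, 23.1, 24.0, 23.6, 24.8])


def show_pattern(series):
    """The programmer's tool: display the weather pattern over time."""
    for day, t in enumerate(series, start=1):
        print(f"day {day}: {t:.1f} C")


def predict_next(series):
    """The analyst's model: fit a simple linear trend and extrapolate."""
    days = np.arange(len(series))
    slope, intercept = np.polyfit(days, series, deg=1)
    return slope * len(series) + intercept


show_pattern(temps)
print(f"predicted next day: {predict_next(temps):.1f} C")
```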
The tool builder will likely be an engineer of some sort, while the tool user is an analyst or data scientist. This also generally means that certain roles that are often assigned to data scientists (such as creating visualizations) will most likely be taken on not by an analyst but by an engineer, or in some cases, by a designer. Designers (also frequently known as architects) can be thought of as the third leg of the stool, as they neither implement nor use data but instead shape the expression of that data in some way. There's also one more primary role -- that of the data strategist -- who serves primarily to manage how data gets utilized by an organization, making the three-legged stool into a far more stable four-legged one.
With these four "meta-roles" it becomes possible to see how data science itself will evolve. First, the formal role of "data scientist" will (and has already begun to) disappear. It's worth understanding that most data scientists are, in fact, subject matter experts, not "professional" coders. They have a deep understanding of their area of expertise, from demographics to political analysis to scientific research to business analytics, and in general see data science as a tool set instead of a profession.
This means the training involved to become a subject matter expert will become more technical, even in seemingly non-technical areas. Marketing is a good case in point here. As recently as a decade ago, marketing was considered a non-technical domain.
Increasingly, though, marketers are expected to be fluent with statistical concepts and data modeling tools. Companies looking for marketing directors are not necessarily hiring more statisticians. Rather, they are seeking out increasingly sophisticated software tools that piggyback on top of spreadsheets and similar analytics platforms.
Additionally, AI systems will increasingly be used to determine which analytics pipelines are best suited to a given problem set and, once that determination is made, to build the model for the market analyst to examine. Over time, the analyst becomes more familiar with the overall approach to be taken with the data and can develop and run such models faster. This means less need for statistical generalists or data scientists, while at the same time increasing demand for domain-specific technical analysts.
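As a rough sketch of that idea, the snippet below uses scikit-learn's grid search as a stand-in for the kind of automated pipeline selection described above; the dataset and the candidate parameters are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Candidate pipeline: scale the features, then fit a classifier.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# The automated system's role is played here by a simple grid search that
# picks the best configuration; the analyst then examines the chosen model.
search = GridSearchCV(pipeline, {"model__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```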
A similar process is affecting data engineers, though for somewhat different reasons. The SQL era is ending in favor of the graph era. This is not to say that SQL itself is likely to disappear for decades, if ever, but increasingly the back-end data systems are becoming graph-like, with SQL being simply one of possibly many different ways of accessing information. This means that the same data system can hold documents and data, and moreover can configure itself dynamically to find the best indexing optimizations.
Such systems will also likely be federated, meaning that a given query can reach out to multiple distinct data stores simultaneously, while at the same time such data can be configured to be output in whatever format is needed at the time by external processes -- likely without the need for human mediation.
In this evolution, coordination is managed by data catalogs, which identify and provide access to data in a conceptual, rather than an implementation-specific, manner. AI systems -- likely facilitated by some form of semantics processing -- would then be responsible for converting human requests for data into queries and corresponding filters for presentation and visualization. In this case, it is likely that the data engineer's role will increasingly shift toward constructing the tools that will build the pipelines and filters, especially in the area of visualization and instantiation.
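A purely hypothetical sketch of that coordination layer: a catalog entry maps a conceptual request onto backend-specific queries, so the caller never needs to know how or where the data is stored. The catalog entries, store names and query strings below are all invented for illustration.

```python
# Hypothetical catalog: one conceptual dataset, several physical stores.
CATALOG = {
    "customer_orders": {
        "warehouse": "SELECT * FROM orders WHERE customer_id = :id",
        "graph": "MATCH (c:Customer {id: $id})-[:PLACED]->(o:Order) RETURN o",
    }
}


def resolve(concept: str, backend: str) -> str:
    """Translate a conceptual data request into a store-specific query."""
    try:
        return CATALOG[concept][backend]
    except KeyError:
        raise ValueError(f"no mapping for {concept!r} on {backend!r}")


print(resolve("customer_orders", "graph"))
```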
From mathematician to designer
There has been a quiet revolution taking place in the realm of visualizations, as the process of creating "technical art" -- diagrams, presentations, graphs and charts -- has led to the deployment of diagram languages that can in turn be generated by data systems. We're already moving into the next phase of this with the dynamic presentation, which is a presentation -- likely built on the HTML ecosystem -- that changes itself in response to changes in external data.
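As one concrete illustration of a diagram language a data system could emit, the sketch below builds a Vega-Lite chart specification as plain JSON; the data values are invented, and the same approach would apply to other declarative charting formats.

```python
import json

# Illustrative data points; in a dynamic presentation these would be
# refreshed from an external source and the chart re-rendered.
data_points = [{"month": "Jan", "sales": 120}, {"month": "Feb", "sales": 135}]

spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": data_points},
    "mark": "bar",
    "encoding": {
        "x": {"field": "month", "type": "nominal"},
        "y": {"field": "sales", "type": "quantitative"},
    },
}

# Serialized, the spec can be embedded in an HTML page and updated
# whenever the underlying data changes.
print(json.dumps(spec, indent=2))
```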
This means that the data storyteller, too, will likely shift from being a technical specialist to being more of a designer who tweaks the presentation based upon the audience, potentially in real time. As media becomes more fungible and GPUs become faster, such presentations will have production values similar to those of blockbuster movies from a few years ago.
Similarly, instantiation is a fancy word for printing, with the caveat that this printing extends well beyond books and into 3D printing of physical goods. There has been a concept floating around for a while called the Digital Twin, in which physical objects create data trails that represent them.
However, this process is likely to go the other way as well, with physical goods being designed virtually, then 3D-printed into existence, likely with embedded transceivers in the final product that can communicate with the virtual twin. It is likely that by 2030 such instantiation will be commonplace and tied into smart contracts built around distributed ledger systems.
Ultimately, the data scientist's most tangible products are models. When you deploy a model, you are in effect publishing it, transforming real-world data into tangible actions that can control robotic processes or guide human ones, with the latter's scope increasingly falling into the former's domain. Getting a loan, for instance, used to be a wholly human decision. At many banks, however, getting that loan is increasingly determined not by a banker, but by a model created by a data scientist that ultimately produces a recommendation, often with "analysis" indicating what factors went into that decision. The banker can, of course, override that recommendation, but must justify the decision to do so.
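A toy sketch of such a model, with invented features, weights and threshold: it returns a recommendation along with the factors that contributed most to it, which the banker can then review or override.

```python
import numpy as np

FEATURES = ["income", "debt_ratio", "years_employed"]
WEIGHTS = np.array([0.8, -1.5, 0.4])  # stand-ins for learned coefficients
BIAS = -0.2


def recommend(applicant: np.ndarray):
    """Return an approve/decline recommendation plus the factors behind it."""
    contributions = WEIGHTS * applicant
    score = 1 / (1 + np.exp(-(contributions.sum() + BIAS)))  # logistic score
    factors = sorted(zip(FEATURES, contributions), key=lambda f: -abs(f[1]))
    return ("approve" if score >= 0.5 else "decline"), score, factors


decision, score, factors = recommend(np.array([1.2, 0.6, 0.5]))
print(decision, round(score, 2), factors)
```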
The upshot of this shift is that while the title of data scientist is likely to disappear, the role itself is not. The data scientist will shift to become a subject matter expert in a specific domain who uses knowledge of that domain to effectively model it, with the model then driving subsequent recommendations or actions. The role will become more design-oriented as the tools deal with higher levels of abstraction, moving away from the underlying mathematics through code to pipelines and filters, before finally being assembled directly by artificial intelligence based upon requests made by the modeler.