A “Data Scientist” is often described as a unicorn, a
mythical creature. My name card says (Regional) Data Scientist. However, I
don’t see myself as a unicorn at all. To me, my role consists simply of using
data to solve problems; this is where Mr Tomko comes in. In the WWE, he was
known as ‘the problem solver’. So what is a ‘Data Scientist’ (DS)? What does
he/she do? And especially, how different is he/she from a Data Miner/Analyst
(DM/A)?
Forbes has a great article on the various definitions of
‘Data scientist’ (http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/#4b34bc621a97).
The Oxford dictionary focuses on the complexity of the data
used: “A person employed to analyse and interpret complex digital data, such as
the usage statistics of a website, especially in order to assist a business in
its decision-making.”
Wikipedia has a broader definition: “Data scientists use
their data and analytical ability to find and interpret rich data sources;
manage large amounts of data despite hardware, software, and bandwidth
constraints; merge data sources; ensure consistency of datasets; create
visualizations to aid in understanding data; build mathematical models using
the data; and present and communicate the data insights/findings.”
These definitions focus on the complexity of the data and how it is
used (including visualisation); they also put weight on assisting
decision making and communicating insights.
Tom Davenport has a succinct definition: “What data
scientists do is make discoveries while swimming in data”. The focus is on
‘discovery’ and data.
My basic question is: which part of the first three definitions
do DM/As not do? Is it the size of the data (the 3 or 4 Vs)? Is it the
visualisations? It can’t be the models, unless you are thinking of models that
could not be built earlier for lack of computing resources, data… It can’t
be purely visualisation since, to me, visualisation is an aid to
analysis/discovery and communication, not an end in itself. (I used to have huge
issues with a previous supervisor who preferred the looks of a slide to its content; I
still believe in content first, looks second.) It can’t be discovery itself,
since DM/As have always had the choice of being hypothesis driven or data driven. So what is
it?
I personally like IBM’s definition: “A data scientist
represents an evolution from the business or data analyst role. The formal
training is similar, with a solid foundation typically in computer science and
applications, modeling, statistics, analytics and math. What sets the data
scientist apart is strong business acumen, coupled with the ability to
communicate findings”
In other words, in IBM’s view, a DS is simply closer to the
business than a DM/A was, or is a DM/A who has moved even closer to the business.
To me, you can’t blame DM/As simply because the tools they
had 10 years ago were inferior to those available today. But you can blame them
if they do not evolve and embrace the new tools, techniques and capabilities
these enable :)
So to me, a DS is just an evolution of a DM/A enabled by better technology and
driven by the desire to learn and improve – which is inherent in any good DM/A
by nature. Also, a good DM/A should be close to the business; castles built in
the air tend to fall to the ground and crash spectacularly.
I love this article, which basically says the data scientist
has always been around (http://www.computing.co.uk/ctg/analysis/2405050/has-the-data-scientist-always-been-around),
and specifically Stephen Brobst’s defining characteristic of a DS: “Two year olds always ask ‘why, why, why’ and
data scientists are the same - this is the personality trait I want to see in
them, those who want to know why generally make good data scientists”. I
think this ‘attitude’ or ‘soft skill’ has been overlooked in most definitions
of DS, and I believe good DM/As have it too.
What I think is creating the confusion is that the combination
of skills of people who call themselves DS is different from the recent skill
mix of people who call themselves DM/A.
The above diagram represents the 3 hard skill sets a DS should
have (www.oreilly.com). Today, in ‘Data Science’, a large premium
is placed on exploiting the newest technologies, and by
definition this is a role that suits people with hacking/computer science
skills. This is exactly how DM/A was fifteen years ago, when the data had to
be extracted from various source systems, using UNIX for example… And then came
the warehouse and more democratic Windows-based (point and click) software such as SAS or SPSS,
which allowed people whose background was in the 2 other
skills to shine too, as they learnt the computer skills.
The same thing is happening in ‘Data Science’ today. Most DS
have skills in Python or Hive… but there is software, such as Aster or Revolution
Analytics…, that is making it easier for people whose background is in the 2 other
skills. Furthermore, as people realise that data from various sources can
help decision making, more of these systems are being designed to keep data in
a way that is easier to use, reducing the need for data wrangling.
In sum I would say that a DS is just an evolution of a DM/A;
what makes a good DS or DM/A is the same thing. The unicorn is simply Tyson
Tomko who has decided to evolve and adopt a costume that makes him even cooler.