Monday, 23 May 2016

Tyson Tomko v/s The Unicorn


Tyson Tomko pic By Robertlbeukema - File:Christian_cage_2.JPG, Public Domain,

A “Data Scientist” is often described as a unicorn, a mythical creature. My name card says (Regional) Data Scientist. However, I don’t see myself as a unicorn at all. To me, my role consists in simply using data to solve problems; this is where Mr Tomko comes in. In the WWE, he was known as ‘the problem solver’. So what is a ‘Data Scientist’ (DS)? What does he/she do? And especially, how different is he/she from a Data Miner/Analyst (DM/A)?

Forbes has a great article on the various definitions of ‘Data scientist’ (http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/#4b34bc621a97).

The Oxford dictionary focuses on the complexity of the data used: “A person employed to analyse and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business in its decision-making.”

Wikipedia has a broader definition: “Data scientists use their data and analytical ability to find and interpret rich data sources; manage large amounts of data despite hardware, software, and bandwidth constraints; merge data sources; ensure consistency of datasets; create visualizations to aid in understanding data; build mathematical models using the data; and present and communicate the data insights/findings.”

These definitions rest on the complexity of the data and how it is used (including visualisation), and they also put weight on the data scientist’s role in assisting decision making and communicating insights.

Tom Davenport has a succinct definition: “What data scientists do is make discoveries while swimming in data”. The focus is on ‘discovery’ and data.

My basic question is: which part of the first 3 definitions do DM/As not do? Is it the size of the data (the 3 or 4 Vs)? Is it the visualisations? It can’t be the models, unless you are thinking of models that could not be built earlier for lack of computing resources, data… It can’t be purely visualisation, since to me visualisation is an aid to analysis/discovery and communication, not an end in itself. (I used to have huge issues with a previous supervisor who preferred the looks of a slide to its content; I still believe in content first, looks second.) It can’t be discovery itself, since DM/As had the choice of being hypothesis driven or data driven. So what is it?

I personally like IBM’s definition: “A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings”.

In other words, in IBM’s view, a DS is simply closer to the business than a DM/A was, or is a DM/A who has moved even closer to the business.

To me, you can’t blame DM/As simply because the tools they had 10 years ago were inferior to those available today. But you can blame them if they do not evolve and embrace the new tools and techniques, and the capabilities these enable :). So to me, a DS is just an evolution of a DM/A, enabled by better technology and driven by the desire to learn and improve, which is inherent in any good DM/A by nature. Also, a good DM/A should be close to the business; castles built in the air tend to fall to the ground and crash spectacularly.

I love this article, which basically says the data scientist has always been around (http://www.computing.co.uk/ctg/analysis/2405050/has-the-data-scientist-always-been-around), and specifically Stephen Brobst’s defining characteristic of a DS: “Two year olds always ask ‘why, why, why' and data scientists are the same - this is the personality trait I want to see in them, those who want to know why generally make good data scientists”. I think this ‘attitude’ or ‘soft skill’ has been overlooked in most definitions of DS, and I believe good DM/As have these skills.

What I think is creating the confusion is that the combination of skills of people who call themselves DS is different from the recent skill mix of people who call themselves DM/A.


The above diagram represents 3 hard skill sets a DS should have (www.oreilly.com). Today, ‘Data Science’ places a large premium on exploiting the newest technologies, and by definition this is a role that suits people with hacking/computer science skills. This is exactly where DM/A was fifteen years ago, when the data had to be extracted from various source systems, using UNIX for example… Then came the warehouse and more democratic Windows-based software such as SAS or SPSS (point and click), which allowed people whose background was in the 2 other skills to shine too, as they learnt the computer skills.

The same thing is happening in ‘Data Science’ today. Most DS have skills in Python or Hive… but software such as Aster, Revolution Analytics… is making it easier for people whose background is in the 2 other skills. Furthermore, as people realise that data from various sources can help decision making, more of these systems are being designed to keep data in a way that is easier to use, reducing the need for data wrangling.

In sum I would say that a DS is just an evolution of a DM/A; what makes a good DS or DM/A is the same thing. The unicorn is simply Tyson Tomko who has decided to evolve and adopt a costume that makes him even cooler.


