Monday 23 May 2016

Tyson Tomko v/s The Unicorn


Tyson Tomko pic by Robertlbeukema - File:Christian_cage_2.JPG, Public Domain

A “Data Scientist” is often described as a unicorn, a mythical creature. My name card says (Regional) Data Scientist. However, I don’t see myself as a unicorn at all. To me, my role consists simply of using data to solve problems; this is where Mr Tomko comes in. In the WWE, he was known as ‘the problem solver’. So what is a ‘Data Scientist’ (DS)? What does he/she do? And especially, how different is he/she from a Data Miner/Analyst (DM/A)?

Forbes has a great article on the various definitions of ‘Data scientist’ (http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/#4b34bc621a97).

The Oxford dictionary focuses on the complexity of the data used: “A person employed to analyse and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business in its decision-making.”

Wikipedia has a broader definition: “Data scientists use their data and analytical ability to find and interpret rich data sources; manage large amounts of data despite hardware, software, and bandwidth constraints; merge data sources; ensure consistency of datasets; create visualizations to aid in understanding data; build mathematical models using the data; and present and communicate the data insights/findings.”

These definitions focus on the complexity of the data and how it is used (including visualisation), and they also put weight on the data scientist’s role in assisting decision-making and communicating insights.

Tom Davenport has a succinct definition: “What data scientists do is make discoveries while swimming in data”. The focus is on ‘discovery’ and data.

My basic question is: which part of the first 3 definitions do DM/As not do? Is it the size of the data (the 3 or 4 Vs)? Is it the visualisations? It can’t be the models, unless you are thinking of models that could not be built earlier through lack of computing resources, data… It can’t be purely visualisation since, to me, visualisation is an aid to analysis/discovery and communication, not an end in itself. (I used to have huge issues with a previous supervisor who preferred the looks of a slide to its content; I still believe in content first, looks second.) It can’t be discovery itself, since DM/As had the choice of being hypothesis driven or data driven. So what is it?

I personally like IBM’s definition: “A data scientist represents an evolution from the business or data analyst role. The formal training is similar, with a solid foundation typically in computer science and applications, modeling, statistics, analytics and math. What sets the data scientist apart is strong business acumen, coupled with the ability to communicate findings”.

In other words, in IBM’s view, a DS is simply closer to the business than a DM/A was, or is a DM/A who has moved even closer to the business.

To me, you can’t blame DM/As simply because the tools they had 10 years ago were inferior to those available today. But you can blame them if they do not evolve and embrace the new tools, techniques and capabilities these enable :). So to me, a DS is just an evolution of a DM/A, enabled by better technology and driven by the desire to learn and improve – which is inherent in any good DM/A by nature. Also, a good DM/A should be close to the business; castles built in the air tend to fall to the ground and crash spectacularly.

I love this article, which basically says the data scientist has always been around (http://www.computing.co.uk/ctg/analysis/2405050/has-the-data-scientist-always-been-around), and specifically Stephen Brobst’s defining characteristic of a DS: “Two year olds always ask ‘why, why, why’ and data scientists are the same - this is the personality trait I want to see in them, those who want to know why generally make good data scientists”. I think this ‘attitude’ or ‘soft skill’ has been overlooked in most definitions of DS, and I believe good DM/As have it too.

What I think is creating the confusion is that the combination of skills of people who call themselves DS is different from the recent skill mix of people who call themselves DM/A.


The above diagram represents 3 hard skill sets a DS should have (www.oreilly.com). Today, in ‘Data Science’, a large premium is placed on exploiting the newest technologies, and by definition this is a role that suits people with hacking/computer science skills. This is exactly how DM/A was fifteen years ago, when the data had to be extracted from various source systems, using UNIX for example… Then came the warehouse and more democratic Windows-based software such as SAS or SPSS (point and click), which allowed people whose background was in the 2 other skills to shine too, as they learnt the computer skills.

The same thing is happening in ‘Data Science’ today. Most DS have skills in Python or Hive… but software such as Aster, Revolution Analytics… is making it easier for people whose background is in the 2 other skills. Furthermore, as people realise that data from various sources can help decision making, more of these systems are being designed to keep data in a way that is easier to use, reducing the need for data wrangling.

In sum I would say that a DS is just an evolution of a DM/A; what makes a good DS or DM/A is the same thing. The unicorn is simply Tyson Tomko who has decided to evolve and adopt a costume that makes him even cooler.



Monday 16 May 2016

Grab + AXA, good, but could easily be better



In Singapore, Grab and AXA have come together to offer a variable premium commercial policy that covers Grab drivers for third party liability, including passengers and property damage (http://news.asiaone.com/news/transport/grab-hopes-cheaper-insurance-will-see-more-becoming-part-time-drivers). This comes after Grab offered a free personal accident policy for Grab drivers and passengers across South East Asia, as announced a couple of weeks ago (https://www.grab.com/sg/grab-provides-free-personal-accident-insurance-for-passengers-and-drivers-2/).
This is very good news.


One of the issues with taking rides from organisations like Uber and Grab was whether, as a passenger, you are covered in case of an accident; remember, the car you are in might not have been insured for ‘business/commercial use’. The accident policy covers this, and it seems all Grab vehicles in the region are automatically covered from the moment the booking is confirmed until the passenger is dropped off.
Another concern was what happens if you are in an accident caused by a Grab vehicle. The variable premium commercial policy covers this, assuming Grab ensures that all drivers are covered by the policy, since it seems that the premium is to be paid by the driver.


The premium is variable in the sense that the driver pays 70% of the normal premium as a base, plus a minimal amount (6 cents) per km driven as a Grab driver (the commercial part), up to a total of the full premium. This is designed to attract part-time drivers – basically, as long as you drive less than 15,000 km per year as a Grab driver, you pay a lower premium. (https://www.axa.com.sg/latest-news/2016/grab-and-axa-launch-first-usage-based-insurance)
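To make the arithmetic concrete, here is a minimal Python sketch of the premium structure as I read it from the announcement: 70% of the full premium as a base, plus 6 cents per km driven for Grab, capped at the full premium. The function names and the S$3,000 full premium used in the example are my own assumptions for illustration, not figures from AXA or Grab.

```python
def annual_premium(full_premium, grab_km, base_share=0.70, rate_per_km=0.06):
    """Usage-based premium in S$: base share plus a per-km charge, capped at the full premium.

    full_premium is the normal annual premium (hypothetical input);
    grab_km is the distance driven as a Grab driver over the year.
    """
    if full_premium < 0 or grab_km < 0:
        raise ValueError("premium and distance must be non-negative")
    uncapped = base_share * full_premium + rate_per_km * grab_km
    return min(uncapped, full_premium)


def break_even_km(full_premium, base_share=0.70, rate_per_km=0.06):
    """Distance at which the usage-based premium reaches the full premium."""
    return (1 - base_share) * full_premium / rate_per_km


if __name__ == "__main__":
    full = 3000.0  # hypothetical full annual premium in S$
    for km in (0, 5000, 15000, 20000):
        print(km, "km ->", round(annual_premium(full, km), 2))
    print("break-even:", round(break_even_km(full)), "km")
```

Under this reading, the break-even distance is simply 30% of the full premium divided by 6 cents, so a 15,000 km threshold would correspond to a full premium of about S$3,000; a driver on a different premium would reach the cap at a different distance.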
Interestingly, according to the Straits Times, the policy can also cover cases when the driver is using rival platforms, such as Uber, but the deductible would then be S$5,000 rather than the S$2,000 that applies when the driver is using Grab. This is a neat way for Grab to get a leg up on Uber, making the cost/risk to the driver lower with Grab than with Uber. (http://www.straitstimes.com/singapore/transport/new-motor-insurance-for-part-time-grab-drivers)
Hence, to me, the key is whether Grab will strictly enforce the need to be covered by the commercial auto insurance on all its drivers. Apparently, as the law stands in Singapore, this is not mandatory (http://www.straitstimes.com/askst/askst-am-i-covered-by-insurance-if-im-a-passenger-in-a-grab-or-uber-car). Both Grab and Uber require the production of a commercial insurance at the beginning of the engagement, but this doesn’t really mean that the organisations monitor the status of their drivers (https://www.grab.com/sg/driver/car/) (http://www.driveuber.sg/owncar/).
By the way, since ‘regular’ taxis are covered by the commercial vehicle license, this gives an advantage to taxi drivers who are also Grab drivers: not only do they presumably not have to pay any extra for the coverage, but the passengers also know they are covered, albeit by the taxi company’s policy.
Interestingly, Grab has also been playing with telematics in Indonesia, applying it to indicate to GrabBike riders when they are breaking the speed limit, and this has apparently led to a 35% drop in such incidents. (https://www.axa.com.sg/latest-news/2016/grab-and-axa-launch-first-usage-based-insurance)
I think the collaboration between Grab and AXA is a good thing. There is much more that can be done.
There is a project for anonymous collection of telematics data between AXA and Grab “to better understand driving speeds and patterns – these can include the average speed and distance travelled, the most traffic-congested days of the week or lull time belts throughout the day”. (https://www.axa.com.sg/latest-news/2016/grab-and-axa-launch-first-usage-based-insurance)
But what this data should be used for is to set a baseline for the driving habits of commercial passenger drivers. From there, monitoring drivers on a non-anonymised basis in exchange for behaviour-based premiums is just a small step away.
What I think is also very easy, but still missing from the offering of companies such as Grab and Uber, is simply to inform the passenger and driver of the recommended route. This applies especially to Grab, since the cost of the ride is usually fixed, which implies an approximate route has been calculated in advance based on real time traffic conditions. This would be a simple safety measure for the passenger, in case the driver deviates unnecessarily from the route.
I am not saying the service provider should track whether the driver is deviating from the most efficient route (this would probably need further investment), but simply that it could enhance the safety of passengers at virtually no cost to anyone.
It’s just a question of using existing data in a different way at virtually no cost on existing infrastructure.