Wednesday 12 October 2016

Some thoughts on the 2016 Gartner Hype Cycle for Data Science



The Gartner Data Science hype cycle diagram looks very interesting:



A disclaimer first: I didn’t spend US$1,995 to purchase the document; I am just relying on the chart and my own hands-on experience to comment. I’ll focus on a few points.

1              Video/Image Analytics vs Text Analytics

According to Gartner, both these capabilities are at the beginning of the slope of enlightenment. Interestingly, however, video/image analytics is expected to reach the plateau of productivity in less than 2 years, while text analytics would take longer, between 2 and 5 years.

At first glance this looks strange.

Tools for doing text analysis have been around for many years and work quite well; for example, I have used similarity indices for distance computations in SAS to do some fuzzy matching with pretty decent accuracy (more than 90%).
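To illustrate the general idea (not the exact SAS work mentioned above), here is a minimal Python sketch of fuzzy matching based on a string-similarity ratio; the names and the 0.9 threshold are purely illustrative assumptions.

```python
# A minimal sketch of fuzzy matching via string similarity, using only the
# Python standard library; names and threshold are illustrative.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_match(name: str, candidates: list, threshold: float = 0.9) -> list:
    """Return candidates whose similarity to `name` meets the threshold."""
    return [c for c in candidates if similarity(name, c) >= threshold]

# Example: matching slightly different spellings of the same company name
candidates = ["Acme Holdings Pte Ltd", "Acme Holding Pte. Ltd.", "Apex Holdings"]
print(fuzzy_match("Acme Holdings Pte Ltd", candidates))
```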

Video analytics, as far as I know, has until recently been less easily done by non-geeky people (I obviously can’t claim to be a geek) than text analytics. Most ‘mainstream’ analytics tools can handle still images, so you basically have to break a video down into a sequence of still images and analyse those; video analytics isn’t that mature, I’d say.
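As a rough illustration of the “sequence of still images” approach, here is a minimal Python sketch that samples frames from a video using OpenCV; the file name and sampling rate are assumptions for the example, not taken from any particular tool.

```python
# A minimal sketch of breaking a video into still images with OpenCV (cv2).
import cv2

def extract_frames(video_path: str, every_n: int = 30):
    """Yield every n-th frame of the video as an image array."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            yield frame  # each frame can now go to an image-analytics step
        index += 1
    cap.release()

# Example usage with a hypothetical file name
for i, frame in enumerate(extract_frames("sample.mp4")):
    cv2.imwrite(f"frame_{i:04d}.png", frame)
```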

On the other hand, text analytics should by now have reached the plateau of productivity, unless we are expecting different outcomes from studying text and from studying video/images. For example, it is possible that Gartner expects us to gauge emotion from text but not from video.

2              Speech Analytics: almost at the bottom of the trough of disillusionment 

Speech Analytics appears to be in a much worse place than I would have thought. I guess it has to do with expectations. Many people have been impressed by ‘Siri’ and expect the ability to instantly and accurately transcribe any piece of speech. The trick is to understand that the ‘Siri’ you use has been trained to understand you; it is likely to work for someone who speaks like you, but will not necessarily be that accurate for someone else: Singlish can be understood easily by Singaporeans, but not instantly by foreigners who move to Singapore.

However, the technology to find keywords in speech, even in real time, is already here and again is quite accurate (of course you’ll need the appropriate accent/language models as well as garbage models). So if it is simply a question of tagging discussions, or even understanding the emotion in a conversation, it’s not that difficult. Of course sarcasm is still tough, but not that many people are sarcastic, right?
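As a rough illustration of keyword tagging on transcribed speech, here is a minimal Python sketch assuming the third-party SpeechRecognition package and its free Google web recogniser; the audio file name and keyword list are made up for the example.

```python
# A minimal sketch of transcribing a short audio clip and tagging keywords.
import speech_recognition as sr

KEYWORDS = {"refund", "cancel", "complaint", "upgrade"}  # illustrative tags

recognizer = sr.Recognizer()
with sr.AudioFile("call_snippet.wav") as source:  # hypothetical file
    audio = recognizer.record(source)

try:
    transcript = recognizer.recognize_google(audio)  # accent/language matter here
except sr.UnknownValueError:
    transcript = ""

tags = {word for word in KEYWORDS if word in transcript.lower()}
print("Transcript:", transcript)
print("Tags found:", tags)
```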

I would therefore say that Speech Analytics is much closer to the plateau of productivity than what Gartner mentions.

3              Hadoop-based Data Discovery: obsolete

I agree with this. I think Hadoop is great at what it does and, as more developments take place, it will become useful in more areas. However, I also believe in horses for courses. I keep going back to Drew Conway’s idea of what a data scientist is:
 


I believe that, at this moment, people who have the skills to do discovery in Hadoop tend to have great hacking skills, but less statistical and domain knowledge. (Basically they are red, or maybe slightly purplish.)

Tools have been developed to enable people with fewer hacking skills to do discovery, and thus I would agree that Hadoop-based Data Discovery is kind of obsolete. Not because it will not work, but simply because the set of skills required to play well enough in Hadoop and the set required to do discovery are different.

4              Machine Learning: peak of inflated expectations but close to mainstream

I agree with this. Gartner already presented this point of view when they published the hype cycle for emerging technologies earlier this year.



First of all, many of the algorithms that people understand machine learning to be have been around for decades and have been used by many people; even the humble k-means and logistic regression are considered machine learning (http://stats.stackexchange.com/questions/158631/why-is-logistic-regression-called-a-machine-learning-algorithm).

Hence they have already been adopted in the mainstream.
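For a sense of how routine these algorithms are, here is a minimal scikit-learn sketch running k-means and a logistic regression on synthetic data; everything in it (data, cluster count, split) is illustrative.

```python
# A minimal sketch of "ordinary" machine learning: k-means and logistic
# regression on synthetic data, using scikit-learn.
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unsupervised: segment the training data into three clusters
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_train)

# Supervised: a plain logistic regression classifier
model = LogisticRegression().fit(X_train, y_train)

print("Cluster sizes:", [int((clusters == k).sum()) for k in range(3)])
print("Test accuracy:", round(model.score(X_test, y_test), 3))
```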

I think the main issue has to do with people’s expectations. I know of an organisation that aimed to “throw all data in a machine and the answers will come out”; this is what machine learning has been ‘sold’ as being able to do, and this is what has created the unrealistic expectations.

Machine learning has its uses, and these uses will, no doubt, be expanded over time, but expectations will have to come down and meet them for machine learning to really reach the plateau of productivity.

5              Data Lakes: peak of inflated expectations and 5-10 years from mainstream

I actually think Gartner is a bit conservative here. A data lake is basically a repository for all sorts of data, and if you believe the sales numbers of the three main Hadoop distributions (Cloudera, Hortonworks and MapR), and if you see one main function of Hadoop systems as being data lakes, then I think you would agree that mainstream adoption is much nearer than 5 to 10 years. Furthermore, if data lakes are only seen as data repositories, then expectations about them should not be that inflated.

This is especially true as more providers such as AWS (https://techcrunch.com/2015/10/07/amazons-aws-is-now-a-7-3b-business-as-it-passes-1m-active-enterprise-customers/) take the pain of maintaining the systems away from businesses and make data lakes very easy to work with.

6              Graph Analytics: 5 to 10 years away from mainstream adoption

I personally believe that the hype cycle has shortened as technology changes ever more rapidly and there are more and more vendors with different flavours of solutions/software. I have used graph analysis for customers in different industries and the results are quite promising.

The key issue with graph analysis is the computational power required; it is computationally expensive. Imagine a very simple scenario: you want to calculate the fastest way to get from station A to station B. To be exhaustive, you have to calculate the time taken by all possible paths. It’s not a trivial exercise, but it can be done; I do it all the time when deciding which train lines to take and where to change.

Imagine doing this for all pairs of points. This increases the effort, and especially the amount of information you need to ‘keep in memory’. Basically, the shortest distance between points A and B has to be kept in memory when you calculate the distance from A to C, since going through B is an option and it is easier to remember than to recompute every time (while there are algorithms such as Floyd-Warshall that make things easier, you still need a fair bit of memory).
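A minimal Python sketch of the Floyd-Warshall idea on a toy train map is below; the station names and travel times are made up, but it shows why the triple loop (and the full distance matrix it maintains) gets expensive as the graph grows.

```python
# A minimal sketch of Floyd-Warshall all-pairs shortest paths on a toy graph.
INF = float("inf")
stations = ["A", "B", "C", "D"]
# travel time in minutes between directly connected stations (illustrative)
edges = {("A", "B"): 4, ("B", "C"): 3, ("A", "C"): 9, ("C", "D"): 2}

# initialise the distance matrix: 0 on the diagonal, infinity elsewhere
dist = {(u, v): (0 if u == v else INF) for u in stations for v in stations}
for (u, v), w in edges.items():
    dist[(u, v)] = dist[(v, u)] = w  # assume lines run both ways

# the triple loop is where both compute and memory requirements come from
for k in stations:
    for i in stations:
        for j in stations:
            if dist[(i, k)] + dist[(k, j)] < dist[(i, j)]:
                dist[(i, j)] = dist[(i, k)] + dist[(k, j)]

print("Fastest A to D:", dist[("A", "D")], "minutes")  # 4 + 3 + 2 = 9
```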

Now move from a train map to visualising connections between people. LinkedIn is a great example of how graph analytics is used; how do you think the “people you may know” feature works?
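One plausible (and much simplified) way such a feature could work is to rank second-degree contacts by the number of mutual connections; the sketch below assumes a toy network and is certainly not LinkedIn’s actual algorithm.

```python
# A minimal sketch of a "people you may know" style suggestion: rank
# non-connections by the number of friends in common. Toy data only.
from collections import Counter

connections = {
    "you":   {"alice", "bob", "carol"},
    "alice": {"you", "bob", "dave"},
    "bob":   {"you", "alice", "dave", "erin"},
    "carol": {"you", "erin"},
}

def people_you_may_know(person: str, graph: dict) -> list:
    """Suggest second-degree contacts, ordered by mutual connections."""
    direct = graph.get(person, set())
    counts = Counter()
    for friend in direct:
        for fof in graph.get(friend, set()):
            if fof != person and fof not in direct:
                counts[fof] += 1
    return counts.most_common()

print(people_you_may_know("you", connections))  # e.g. [('dave', 2), ('erin', 2)]
```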

As the technology advances and the cost of memory falls, more and more organisations will realise the benefits of graph analytics. In the last couple of years I have applied it for a few customers, and I see this as a trend that will only accelerate.

I think that graph analytics is much closer to mainstream adoption than 5 to 10 years.

The only area where I believe there might still be inflated expectations is finding “influencers” in social networks. But as people realise that the definition of the right person to contact changes with the business goal, these expectations become less inflated.

7              Predictive Analytics is past the peak of inflated expectations

I actually find that many people on the business side of things (even at the C-level) do not fully realise the power of predictive analytics. In the last couple of years I have had the opportunity to meet many people in different industries and at different levels of the corporate ladder. And one of the things that surprised me most is how conservative they are about which issues can be tackled using predictive analytics. It could be due to the region in which I work, but I am not so sure; information is easy to access nowadays and there are many forums for sharing analytics/Big Data results and use cases.

I think that one of the issues facing “data science” today is that, while amazing steps are being taken in terms of technology, the application of that technology in the business context, and the derivation of RoI from data science, is lagging behind.