The Gartner Data Science hype cycle diagram looks very interesting.
A disclaimer first: I didn’t spend US$1995 to purchase the document; I am just relying on the chart and my own hands-on experience to comment. I’ll focus on a few points.
1 Video/Image Analytics vs Text Analytics
According to Gartner, both these capabilities are at the beginning of the slope of enlightenment. Interestingly, however, video/image analytics is expected to reach the plateau of productivity in less than 2 years, while text analytics would take longer, between 2 and 5 years.
At first glance this looks strange.
Tools for doing text analysis have been around for many years and work quite well; for example, I have used similarity indices for distance computations in SAS to do some fuzzy matching with pretty decent accuracy (more than 90%).
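To illustrate the idea (not the SAS routine itself), here is a minimal sketch in Python, using the standard library’s difflib as a stand-in similarity index; the candidate names are made up.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Fuzzy-match a messy name against a list of clean candidates (invented data).
candidates = ["Gartner Inc", "Garner Holdings", "Gardenia Foods"]
query = "gartner inc."
best = max(candidates, key=lambda c: similarity(query, c))
print(best, round(similarity(query, best), 2))  # Gartner Inc 0.96
```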
Video analytics, as far as I know, has until recently been less accessible to non-geeky people (I obviously can’t claim to be a geek) than text analytics. Most ‘mainstream’ analytics tools can only handle still images, so you basically have to break a video down into a sequence of still images and analyse those; video analytics isn’t that mature, I’d say.
On the other hand, text analytics should have reached the plateau of productivity, unless we are expecting different outcomes from studying text and video/images. For example, it is possible that Gartner expects emotion to be gauged from text but not from video.
2 Speech Analytics: almost at the bottom of the trough of
disillusionment
Speech Analytics appears to be in a much worse place than I would have thought. I guess it has to do with expectations. Many people have been impressed by ‘Siri’ and expect the ability to instantly and accurately transcribe a piece of speech. The trick is to understand that the ‘Siri’ you use has been trained to understand you; it is likely to work for someone who speaks like you, but not necessarily to be that accurate for someone else: Singlish can be understood easily by Singaporeans, but not instantly by foreigners who move to Singapore.
However, the technology to find keywords in speech, even in real time, is already here, and it is again quite accurate (of course, you’ll need the appropriate accent/language models as well as garbage models). So if it is simply a question of tagging discussions, and even understanding the emotion in a conversation, it’s not that difficult. Of course, sarcasm is still tough, but not that many people are sarcastic, right?
I would therefore say that Speech Analytics is much closer to the plateau of productivity than Gartner suggests.
3 Hadoop-based Data Discovery: obsolete
I agree with this. I think Hadoop is great at what it does, and as more developments take place it will become useful in more areas. However, I also believe in horses for courses. I keep going back to Drew Conway’s Venn diagram of what a data scientist is: the intersection of hacking skills, maths/statistics knowledge and substantive (domain) expertise. I believe that, at this moment, people who have the skills to do discovery in Hadoop tend to have great hacking skills, but less statistical and domain knowledge (basically they sit in the red hacking-skills circle, or are maybe slightly purplish).
Tools have been developed to enable people with less hacking skills to do discovery, and thus I would agree that Hadoop-based data discovery is somewhat obsolete: not because it will not work, but simply because the skills required to play well in Hadoop and those required to do discovery are different.
4 Machine Learning: peak of inflated expectations but close to mainstream
I agree with this. Gartner already presented this point of view when they came up with the hype cycle of emerging technologies earlier this year.
First of all, many of the algorithms that people understand machine learning to be have been around for decades and have been used by many people; even the humble k-means and logistic regression are considered machine learning (http://stats.stackexchange.com/questions/158631/why-is-logistic-regression-called-a-machine-learning-algorithm). Hence they have already been adopted in the mainstream.
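To underline how mainstream they already are, both ship in scikit-learn; a minimal sketch (using the classic Iris dataset purely for illustration) takes only a few lines.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Unsupervised "machine learning": k-means groups observations into 3 clusters.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised "machine learning": logistic regression predicts the species labels.
model = LogisticRegression(max_iter=1000).fit(X, y)

print("k-means cluster sizes:", [int((clusters == k).sum()) for k in range(3)])
print("logistic regression training accuracy:", model.score(X, y))
```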
I think the main issue has to do with people’s expectations. I know of an organisation that aimed to “throw all data in a machine and the answers will come out”; this is what machine learning has been ‘sold’ as being able to do, and this is what has created the unrealistic expectations.
Machine learning has its uses,
and these uses will, no doubt, be expanded over time, but expectations will
have to come down and meet them for machine learning to really reach the
plateau of productivity.
5 Data Lakes: peak of inflated expectations and 5-10 years from mainstream
I actually think Gartner is a bit conservative here. A data lake is basically a repository for all sorts of data, and if you believe the sales numbers of the 3 main Hadoop distributions (Cloudera, Hortonworks and MapR), and if you see one main function of Hadoop systems as being data lakes, then I think you would agree that mainstream adoption is much nearer than 5 to 10 years. Furthermore, if data lakes are only seen as data repositories, then they should not carry that many inflated expectations.
This is especially true as more providers such as AWS (https://techcrunch.com/2015/10/07/amazons-aws-is-now-a-7-3b-business-as-it-passes-1m-active-enterprise-customers/) take the pain of maintaining the systems away from businesses and make data lakes very easy to work with.
6 Graph
Analytics: 5 to 10 years away from mainstream adoption
I personally believe that the
hype cycle has been shortened as technology changes even more rapidly and there
are more and more vendors with different flavours of solutions/software. I have
used graph analysis for customers in different industries and the results are
quite promising.
The key issue with graph analysis is the computational power required: graph analysis is computationally expensive. Imagine a very simple scenario: you want to calculate the fastest way to get from station A to station B. To be exhaustive, you have to calculate the time taken by all possible paths. It’s not a trivial exercise, but it can be done; I do this all the time when deciding which train lines to take and where to change.
Now imagine doing this for all pairs of stations. This increases the effort, and especially the amount of information you need to ‘keep in memory’: the shortest distance between stations A and B has to be kept in memory when you calculate the distance from A to C, since going through B is an option and it is easier to remember the result than to recompute it every time. (While there are algorithms such as Floyd-Warshall that make things easier, you still need a fair bit of memory; see the sketch below.)
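To make this concrete, here is a minimal Floyd-Warshall sketch on a toy train map; the stations and travel times are invented for illustration. Note that the full n-by-n distance table has to stay in memory throughout.

```python
INF = float("inf")

stations = ["A", "B", "C", "D"]
# times[i][j] = direct travel time from stations[i] to stations[j] (made up)
times = [
    [0,   4,   INF, 10],
    [4,   0,   2,   INF],
    [INF, 2,   0,   3],
    [10,  INF, 3,   0],
]

n = len(stations)
dist = [row[:] for row in times]  # shortest known time between every pair

# Try each station k in turn as an intermediate stop on every route.
for k in range(n):
    for i in range(n):
        for j in range(n):
            if dist[i][k] + dist[k][j] < dist[i][j]:
                dist[i][j] = dist[i][k] + dist[k][j]

print("A -> D direct:", times[0][3])   # 10
print("A -> D shortest:", dist[0][3])  # 9 (via B and C: 4 + 2 + 3)
```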
Now move from a train map to connections between people. LinkedIn is a great example of how graph analytics is used; how do you think “people you may know” works?
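I obviously don’t know LinkedIn’s actual algorithm, but the simplest version of the idea is to rank non-connections by the number of mutual connections. Here is a toy sketch; the graph and the scoring are assumptions for illustration, and the real feature is far more sophisticated.

```python
from collections import Counter

# Toy social graph (made up): who is connected to whom.
connections = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "erin"},
    "dave":  {"bob"},
    "erin":  {"carol"},
}

def people_you_may_know(person):
    """Rank people you are not connected to by number of shared connections."""
    scores = Counter()
    for friend in connections[person]:
        for fof in connections[friend]:
            if fof != person and fof not in connections[person]:
                scores[fof] += 1
    return scores.most_common()

print(people_you_may_know("alice"))  # dave and erin, 1 mutual connection each
```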
As the technology advances and the cost of memory falls, more and more organisations will realise the benefits of graph analytics. In the last couple of years I have applied it for a few customers, and I see this as a trend that will only accelerate.
I think that graph analytics is much closer to mainstream
adoption than 5 to 10 years.
The one area where I believe expectations might still be inflated is finding “influencers” in social networks. But as people realise that the definition of the right person to contact changes with the business goal, these expectations are becoming less inflated.
7 Predictive Analytics: past the peak of inflated expectations
I actually find that many people on the business side of things (even at C-level) do not fully realise the power of predictive analytics. In the last couple of years I have had the opportunity to meet many people in different industries and at different levels of the corporate ladder, and one of the things that surprised me the most is how conservative they are about which issues can be tackled using predictive analytics. It could be due to the region in which I work, but I am not so sure; information is easy to access nowadays and there are many forums for sharing analytics/Big Data results and use cases.
I think that one of the issues facing “data science” today is that, while amazing steps are being taken in terms of technology, the application of that technology in the business context and the derivation of the RoI of Data Science are lagging behind.