“Data Science” process: Taking straight lines might solve a
puzzle but not get you out of the box
Microsoft recently published a paper outlining the “data
science process” they are proposing for Azure: “A linear method for non-linear
work”.
The diagram below is straight from that article:
From what I saw of Azure at Strata, it looks like they have
integrated Revolution Analytics well and have a comprehensive offering
and a good toolkit. However, how you use the tools is important in determining the
outcome.
Despite its title, Microsoft’s proposed “data science
process” is not totally linear. It has 2 loops: one between identifying data
sources and exploring them (this allows adding more data sources as necessary
prior to analysis), and a second between machine learning and the analytics
data set (which allows the machine to optimize given the data set and
presumably to choose between multiple machine learning algorithms).
My question is: where is ‘human learning’ in all this? It
seems that the process is devoid of human participation (except at the
penultimate step, and maybe at the very beginning).
I think that the answer lies in this diagram on the
skill-sets of a data scientist.
Microsoft’s process relies on machine learning, which in that diagram is the combination of
hacking/computer science skills and maths/stats skills. There is no place for domain knowledge.
If “data science” is seen as a science (after all, computer
science and maths/stats are scientific), then it can be seen as being unequivocal.
Therefore, making the process linear is a natural extension of that view.
However, I believe that the role I play is not a pure ‘science’.
Sure, it has rigour: understanding the algorithms used, knowing
which to apply in different circumstances, and knowing what care needs to be taken to
ensure the algorithm chosen is applicable are all ‘scientific’. But the real-life context,
the application, the implementation, and the story-telling are not.
A simple illustration of the process I kind of follow is:
This process has multiple loops and a lot of
interaction/communication with clients and SMEs. I do not mention the class of
algorithms used since I believe in ‘horses for courses’: using the
algorithm or combination of algorithms that suits the issue rather than the other
way round.
What I am trying to say is that the approach, processes, and
maybe even the tool set you adopt to solve an issue using “data science” will vary
based on how you define “data science”.
If you see “data science” as ‘throw the data into the
machine and get the answers out of it’ then a linear approach would more likely
be the one you prefer.
But to me, a linear approach without the input of domain
knowledge is unlikely to produce the best results during implementation.
The process I follow may be loopy, but, from experience, it
is flexible and delivers more than adequate results.