Tuesday 7 June 2016

“Data Science” process: Taking straight lines might solve a puzzle but not get you out of the box








Microsoft recently published a paper outlining the “data science process” they are proposing for Azure: “A linear method for non-linear work”.


The diagram below is straight from that article:





From what I saw of Azure at Strata, Microsoft appears to have integrated Revolution Analytics well and now offers a comprehensive toolkit. However, how you use the tools is important in determining the outcome.

Despite its title, Microsoft’s proposed “data science process” is not totally linear; it has two loops. The first, between identifying data sources and exploring them, allows more data sources to be added as necessary prior to analysis. The second, between machine learning and the analytics data set, allows the machine to optimize given the data set and presumably to choose between multiple machine learning algorithms.

My question is: where is ‘human learning’ in all this? The process seems devoid of human participation (except at the penultimate step, and maybe at the very beginning).

I think that the answer lies in this diagram on the skill-sets of a data scientist.



Microsoft’s process relies on machine learning, which in this diagram is the combination of hacking/computer science skills and maths/stats skills. There is no place for domain knowledge.

If “data science” is seen as a science (after all, computer science and maths/stats are scientific), then it can be seen as being unequivocal. Making the process linear is therefore a natural extension of that view.

However, I believe that the role I play is not a pure ‘science’. Sure, it has rigour: understanding the algorithms used, knowing which to apply in different circumstances, and knowing what care is needed to ensure the chosen algorithm is applicable are all ‘scientific’. But the real-life context, the application, the implementation and the story-telling are not.

A simple illustration of the process I broadly follow is:


 
This process has multiple loops and a lot of interaction and communication with clients and SMEs. I do not mention the class of algorithms used since I believe in ‘horses for courses’: using the algorithm or combination of algorithms that suits the issue, rather than the other way round.

What I am trying to say is that your approach, your processes, and maybe even the tool set you adopt to solve an issue using “data science” will vary based on how you define “data science”.

If you see “data science” as ‘throw the data into the machine and get the answers out of it’, then a linear approach is more likely to be the one you prefer.

But to me, a linear approach without the input of domain knowledge is unlikely to produce the best results during implementation.

The process I follow may be loopy, but, from experience, it is flexible and delivers more than adequate results.

