Monday, 6 August 2018

"If you don't have a PhD, don't call yourself a data scientist"


“If you don’t have a PhD, don’t call yourself a data scientist”: with this remark, the government-linked presenter set the stage for a presentation on AI and the problems that the implementation of AI suffers from today.

To flesh out his point, he argued that only a PhD gives you the rigour, and the access to large enough data sets to play with, that you need to become a data scientist.

A very interesting point of view from someone linked to the government.

Not everything he said was that controversial, at least to me.

The presenter used my favourite diagram for data science, the ‘Drew Conway’ diagram (1), acknowledging the importance of subject matter expertise. Data Science is a balanced combination of “Substantive Expertise” or subject matter expertise, “Maths and Statistics Knowledge” and “Hacking Skills” or IT skills.

Furthermore, the presenter mentioned how hard it is to find all 3 skills at the required level in 1 person, and spoke of data science teams; or, as I like to say: “Data Science is a team sport”.



The presenter was also at pains to point out that a 3-month course in data science does not make you a data scientist, even if you already hold a PhD in English Literature or Astrophysics; it takes years.


I am on the fence on this one. I think “data science”, like every subject, needs practice, and while a 3-month course will most likely not give you enough experience, it doesn’t have to take years and years. Any expertise is gained through practice.

Furthermore, the presenter is a proponent of open source, and advises everyone to eschew classes and learn online instead, paying tens of dollars rather than hundreds. I am all for learning online; I have taken classes from DataCamp (2), where I learnt a lot, as well as from Coursera (3).

But here is where it gets really weird, and please remember that the presenter is linked to the government: he then went on to “sell” his classroom courses, of around 3 months, through which he hopes to provide some practical experience.

Unless he is targeting only PhDs as students, I find what he is saying quite contradictory...

The reason I mentioned English Literature and Astrophysics is that the presenter further mentioned that one of the reasons why the country may be finding it hard to find “data scientists” is the fault of HR departments. They are looking for a unicorn with degrees in computer science (let alone PhDs). His advice was that they should loosen the criteria and accept people from different disciplines who have taken the online courses...

My view is not that dissimilar. I believe in passion and, without knowing anything else about a person, I would say that an engineer is more likely to make a good “data scientist” than a Statistician or a Computer Scientist. The reason is that, to me, “data science” is about delivering value, and the passion should be for solving problems: the end, not the means (AI/ML/Stats)...

Then this goes back to the PhD question. Do I believe you can’t be a “data scientist” without a PhD? Well, it may be self-serving since my profile states “data scientist”, but no, I do not believe a PhD is required.

In fact, quite a few organisations have found this. Basically, people with PhDs are great in their own domain, but “data science” requires a multitude of skills that they may not have (for example subject matter expertise, or statistics for computer scientists, or IT skills for statisticians) or work that they may not want to engage in: the ‘dirty’ work of cleaning and preparing the data. Hence the organisations whose “data science” department is staffed purely by PhDs find it very difficult to get a decent RoI (results, results and results).

While I am at it, I will also mention another way that organisations get their staffing wrong (hey, maybe that deserves a separate blog, but here goes): some “data scientists” delegate the data cleaning and preparation to “data preparation” staff or “data engineers”. It gets worse when the latter do not have a clear career path to the former, like sous-chefs becoming chefs... Data preparation should be done with a purpose, and unless the high and mighty “data scientist” can communicate that purpose effectively and in great detail (which probably also requires some EQ), there is a risk that the prepared data will not be fit for purpose.

Basically, I believe that data cleaning and preparation is part of the role of a “data scientist”, especially since “data science” is by nature iterative, and iterations may involve obtaining and preparing data that was not included initially.

Quite a while ago I did an easy-to-understand view of the work of a unicorn (“data scientist”); as you can see, data preparation and transformation is part of the process. I can understand that someone who is good at solving business problems may not be very good at getting data in the most efficient way from various systems, or at writing production-ready code, but surely preparing data is part of the role; after all, most people will tell you that this is 70%-80% of the work... (4)(5)



So why am I upset enough to write this blog?

Basically, I believe analytics/“Data Science” has the power to unlock enough value to create win-win(-win) situations (organisation, customer/society, and consultancy/vendor), and getting the framework for data science right is critical in that regard.

From the presentation I attended, it would seem that the government has got some things right, some wrong, and some contradicting each other. I do hope they sort things out; unfortunately, the Peter principle may be at work (6), or maybe it’s HiPPOs (7), or both, since there is often a high correlation between the two (a hippo named Peter...).

Actually ya, this might be the topic of my next blog, although I am also itching to write about AI/Automation and retraining...

P.S. Did I mention that the presenter said that SLR is part of AI?
(See, it pays to read all the way to the end... now please clean the coffee off your device)

3. https://www.coursera.org
