Wednesday, 30 November 2016

Industry lingo as the preferred common language for collaboration in data science



HBR published an article directed at people in the business: “Better questions to ask your data scientists” (1). But is this the right approach?

As an analogy, when you go to see a doctor, do you think it makes more sense for the doctor to speak your language or for you to speak in doctor-language (medical terms)?

By talking in plain language, the doctor can uncover symptoms that you might not think worth mentioning because they don’t seem important or related. Or do you know enough medicine to be able to describe your symptoms properly and completely in medical terms?



I know I don’t know enough about medicine, and even less about medical terms. Call me a chicken, but I’d rather not get paracetamol/Panadol for a headache that’s the beginning of a brain tumour. ‘Cluck!’

To me, a doctor consults with you: the diagnosis is a collaborative process between two parties, but who bridges the gap in terms of language makes a huge difference to the outcome.

Would you rather be treated for what you really have, or what you think you have?

Now, how does this relate to “data science”?

I go back to the Drew Conway definition of data science (2), which looks like this, except that here the “data scientist” is shown as a unicorn:
[Image: Drew Conway’s data science Venn diagram, with the intersection drawn as a unicorn]

A “data scientist” is someone who, on top of hacking/computer science skills and mathematics/statistics skills, has substantive domain knowledge. Basically, the “data scientist” should be able to speak to the business in business language.

One of the things that I have learnt while working with clients from various organisations and industries is that, very often, the clients don’t mention issues they don’t think “data science” can help them with. That’s not because these issues are not important to them, but simply because they do not know that “data science” can help solve them; they do not know what they do not know.

Just as an accurate medical diagnosis is needed to treat the underlying condition, I believe that the data science process is in essence consultative. The best outcomes always come from collaboration between the business and the “data scientist”.

It is the role of the “data scientist” to understand where the client is coming from, dig deeper, ask relevant questions, know the data that is required to tackle these issues, and maximise the benefits the client can get. For that, domain knowledge is critical. (And of course that’s just a small part of the “data science” process (3).)



Furthermore, the interactions and collaboration between the business and the “data scientist” are not limited to the “initial diagnostics stage”, but run throughout the whole data science process. Therefore a common language is very important to facilitate collaboration, and in my opinion, it should be the “data scientist” who speaks business language rather than the business speaking “data science language”.

 

Sunday, 27 November 2016

The brown box, or why I prefer models/algorithms that can be explained

This is one of my three dogs. She has taught me an invaluable lesson that reinforces my preference for non-“black-box” types of models/algorithms.

Each of my dogs has her own personality and behavioural patterns: the brown one (shown above) likes to pee after a meal, the white one before a meal; the black one has no set pattern I can detect, but she takes her time to eat. So if I am not in the vicinity of the girls when they are eating, the white one will aggressively try to steal the black one’s food.

That allows the brown one the freedom to pee where she chooses. And often she chooses to pee where she knows she is not allowed to (the no-pee zone). The other two have learnt, through positive and negative reinforcement, where they are and are not allowed to pee.

However, the brown one figured out that, since she only gets punished for peeing in the no-pee zone if she gets caught, the solution is not “do not pee here”, but “do not get caught peeing here”.

Hence, nowadays, she often chooses to have her post-meal pee behind a low wall (in the no-pee zone), where I can’t see her doing it, and therefore can’t catch her in the act and punish her.

What does that have to do with models/algorithms being explainable?

I see my dog as a brown box. She received data:
1. the experiments when she had a pee in the right places and was rewarded,
2. the experiments where she had a pee where she was not allowed and was punished, and
3. the experiments when I was not there to see her, and she was neither punished nor rewarded.

And from this data her algorithm came up with: “Do not get caught peeing in the wrong place”. That's obviously not what I expected.
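
If I had to caricature that brown box in code, it would be something like the toy simulation below. To be clear, this is not how my dog actually learns: the states, actions and reward numbers are entirely made-up assumptions. The only point is that a learner which optimises the rewards it actually receives will converge on “don’t get caught” all by itself.

import random

random.seed(42)

STATES = ["owner_watching", "owner_absent"]
ACTIONS = ["pee_in_no_pee_zone", "pee_in_allowed_zone"]

def reward(state: str, action: str) -> float:
    # Assumed pay-offs: the no-pee zone is intrinsically attractive (+2),
    # but being caught there carries a -10 punishment (net -8); nothing
    # bad ever happens when I am not around to see it.
    if action == "pee_in_no_pee_zone":
        return -8.0 if state == "owner_watching" else 2.0
    return 1.0  # allowed zone: mild praise, whether or not I am watching

# Simple action-value learning: track the average reward per (state, action).
totals = {(s, a): 0.0 for s in STATES for a in ACTIONS}
counts = {(s, a): 0 for s in STATES for a in ACTIONS}

for _ in range(10_000):
    state = random.choice(STATES)
    action = random.choice(ACTIONS)              # explore purely at random
    totals[(state, action)] += reward(state, action)
    counts[(state, action)] += 1

for state in STATES:
    best = max(ACTIONS, key=lambda a: totals[(state, a)] / counts[(state, a)])
    print(f"{state}: best action learned -> {best}")

# With these assumed rewards this prints:
#   owner_watching: best action learned -> pee_in_allowed_zone
#   owner_absent: best action learned -> pee_in_no_pee_zone
# i.e. the learned rule is "do not get caught", not "do not pee there".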

How can I change that outcome? 

The usual solution is to feed her more data: more cases where she is rewarded or punished, which basically means I have to watch her more often. Also, I could install cameras and fit her collar with some device, so that when she is found to be peeing where she shouldn’t, she is punished. Basically, I would build a model/algorithm to recognise the peeing position, teach it where peeing is and is not allowed, and run a rule whereby, if peeing is taking place in an area where it is not allowed, the dog is punished.
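
For illustration, such a system might look roughly like the sketch below. Every name in it is hypothetical (there is no real posture classifier or collar device here); the point is how thin the “logic” layer would be once you have the classifier: a zone lookup and a single if/then rule.

from dataclasses import dataclass

# Assumed labels for the camera zones where peeing is not allowed.
NO_PEE_ZONES = {"behind_low_wall", "living_room"}

@dataclass
class Detection:
    is_peeing: bool      # output of a (hypothetical) posture classifier
    confidence: float    # how sure the classifier is, between 0 and 1
    zone: str            # which camera zone the dog is currently in

def punish_if_caught(detection: Detection, threshold: float = 0.9) -> str:
    # The entire "rule engine": act only when the classifier is confident
    # the dog is peeing AND the zone is a no-pee zone.
    if (detection.is_peeing
            and detection.confidence >= threshold
            and detection.zone in NO_PEE_ZONES):
        return "trigger_collar"      # would call the (hypothetical) collar device
    return "do_nothing"

# Example frames: a confident detection in a no-pee zone is punished, a
# low-confidence one is not, which is exactly where the risk of confusing
# the dog with misclassifications comes in.
print(punish_if_caught(Detection(True, 0.95, "behind_low_wall")))  # trigger_collar
print(punish_if_caught(Detection(True, 0.60, "behind_low_wall")))  # do_nothing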

It can be done; it is expensive, maybe morally reprehensible (and even worse if my model doesn’t recognise the peeing position accurately enough and confuses my dog), but in any case, I can’t be sure what the brown box will come up with. Maybe she’ll find the blind spot of the camera...

I’m not the only one who’d try to feed more experiments to the box. This is what Google did after their self-driving car crashed earlier this year. They could not know for sure what caused the crash; they assumed it was because the car expected the oncoming large vehicle to slow down so as to allow the self-driving vehicle to go past, and it didn’t. Hence more examples of large vehicles not slowing down were fed to the self-driving vehicle. The hope is that the issue was diagnosed correctly and that the new data used to further train the machine will remedy it. But we can’t know for sure, since we do not know what decisions were actually taken; it’s hope: "From now on, our cars will more deeply understand that buses (and other large vehicles) are less likely to yield to us than other types of vehicles, and we hope to handle situations like this more gracefully in the future." (emphasis on “hope” not in the original text) (http://www.bbc.com/news/technology-35692845)

Basically, you can’t anticipate what issues might occur; you just have to deal with the consequences and try to train the algorithm again. But isn’t it better to know before things go awry? Knowing what’s inside would help.

Other issues with “black box” type models

Another issue I have with “black box” type models is that, since you don’t know what drives the algorithm, you cannot anticipate its reaction to a change: the parameter that changes may have no impact on the algorithm, or it might have a huge impact; unless it’s something that has happened in the past, it is very hard to know what will happen. Contrast this with a model with clearly defined drivers and coefficients: the model might be less accurate, but the general impact of a change can be ascertained in advance, and prepared for. Or you could actually change a parameter in order to get a desirable outcome.

For example, if the distance between my brown dog and me is important in her decision to pee or not (the closer I am, the less likely she is to pee in the no-pee zone), then I could influence her behaviour by moving closer to her. But the brown box doesn’t reveal her decision-making mechanism, nor the drivers of her behaviour.
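
To make that concrete, here is a minimal sketch of what an explainable version of the brown box could look like. The coefficients are invented for the sake of the example, not estimated from any real data; what matters is that the driver (distance) and its coefficient are out in the open, so I can both anticipate the impact of a change and act on it.

import math

# Assumed logistic model:
#   P(pee in no-pee zone) = sigmoid(b0 + b1 * distance_in_metres)
b0 = -3.0   # baseline log-odds when I am right next to her
b1 = 0.8    # each extra metre of distance raises the log-odds (made-up value)

def p_pee(distance_m: float) -> float:
    return 1 / (1 + math.exp(-(b0 + b1 * distance_m)))

# Because the driver and its coefficient are explicit, the lever is obvious:
# moving from 10 m away to 2 m away should cut the probability dramatically.
for d in (10, 5, 2):
    print(f"distance = {d:2d} m -> P(pee in no-pee zone) = {p_pee(d):.2f}")

# A black box would only hand me the predictions, not the sign or size of b1,
# so "move closer to the dog" would never fall out of it.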

A third reason is that when algorithms leave the ‘lab’ and encounter the real world, there are many more factors to deal with. And, especially if you are trying to use your model to select people to receive a specific offer, it makes sense to positively influence the delivery of the message. I saw this for real in one of my previous roles. We had rolled out a model but found that the response, while very good, was still below expectations. At that time, the model’s output was delivered simply as a list of customers for each call-centre agent to call, along with a script. The change we made was to brief the people doing the selling on the main drivers of the model and convince them that it made sense; their conviction helped convince the prospects, and sales increased.

This third reason doesn’t apply in my case; I just have to deal with the consequences of the algorithm using a mop and pail.