Thursday, 20 December 2018

Tale of the left handed samurai


A few days ago I was asked to prepare a piece to help people who have not really been exposed to thinking with data learn how to go about it.

Since the presentation did not take place, I decided that rather than let the slides go to waste, I would share them here. Please note though that my presentation style is very spoken, with just visual cues on the slides rather than full explanations, so do forgive the shortcomings.


Why left handed? Because many people believe that left-handed people are more creative, and I believe that analysis is a creative endeavour, at least analysis that creates an impact.

And why a samurai? Well I hope that by the end of this post, you should have your own answer.

Analysis/Analytics is a way of thinking

Everyone analyses things before making a decision, whether consciously or not. So this is nothing new.

In a business context however, it is worth thinking about whether there are business implications to the idea, whether it may have some benefit. There is no point, in a business context, in doing things just for the sake of doing them; think RoI and so on.

Once the potential business implications have been cleared up, the next step is to decide upon a metric or set of metrics we would use to decide whether the idea works or not. The metrics are usually very closely linked to the business objective; if the idea we are testing is to do with increasing sales, then it makes more sense to use sales as a metric, or sales growth rather than employee satisfaction for example.

Once the metric is chosen (or a list of metrics in order of preference, including some proxies in case the preferred metric is not available), the next step is to look for the data needed for the analysis; this covers both the length of history and the granularity.

Next is the number crunching, based on the algorithms chosen.

And finally we reach a certain conclusion.

There are a couple of things worth pointing out.
  1. The first step always has to be the business context; the assumption is that analytics is designed to help the business. This also means that there is a need for some level of subject matter expertise and some creativity, at a minimum to translate the business issues into something solvable by analysis of data.
  2. I deliberately chose metrics, data, and algorithm before any analysis starts, and in this order. We need to have a clean and objective view of the situation, pick the best, second best (... or more) approaches and tackle them in that order. It is not difficult to find a combination that gives us the answer we'd like to have, but that would be fishing, not analysing. We let numbers tell their story; we do not torture them until they say what we want to hear.
  3. While I have described the process, or at least one crank of the wheel, as linear, it is not necessarily so. For example, if we had chosen to use year-on-year growth as our metric but find out that we only have 6 months of history, then we should go back and evaluate whether 6 months of data can still give a good enough analysis leading to a conclusion; if so, we should amend the metric to monthly or quarterly sales growth, for example.
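
The fallback in point 3 can be sketched in a few lines of Python. This is only an illustration: the function names `pick_growth_metric` and `growth`, the history thresholds (24, 6 and 2 months) and the sales figures are all assumptions made for the sketch, not rules.

```python
def pick_growth_metric(months_of_history: int) -> str:
    """Pick the coarsest sales-growth metric the available history supports.

    Thresholds are illustrative assumptions: a period-on-period comparison
    needs at least two full periods of data.
    """
    if months_of_history >= 24:
        return "year-on-year"        # two full years to compare
    if months_of_history >= 6:
        return "quarter-on-quarter"  # two full quarters to compare
    if months_of_history >= 2:
        return "month-on-month"      # two full months to compare
    raise ValueError("not enough history to measure growth")


def growth(previous: float, current: float) -> float:
    """Relative growth from one period to the next."""
    return (current - previous) / previous


# Hypothetical monthly sales covering the 6 months of history in the example.
monthly_sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]

metric = pick_growth_metric(len(monthly_sales))
first_quarter = sum(monthly_sales[:3])
second_quarter = sum(monthly_sales[3:])
quarterly_growth = growth(first_quarter, second_quarter)
```

With only 6 months of data the sketch falls back from year-on-year to quarter-on-quarter growth, which is the kind of amendment described above.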

But analysis cannot exist on its own.


Analysis is part of a process, a cycle of forming hypotheses, analysing and assessing their viability, then reaching a conclusion.

The process can then be repeated so as to get better and better answers, either by testing different things or by refining ideas.

There are 2 points to bear in mind:
  1. The key is the approach of experimentation; we cannot expect to have the perfect answer the first time, every time. Another way of looking at it is that "no" is a good answer too. It may dent our ego if the hypothesis was "dear to us", but in the spirit of experimentation, a conclusive "no" is as important as a conclusive "yes". And this brings us to the second point.
  2. It is worth reiterating (I touched on it earlier) that we should be as objective as possible. While a hypothesis may be "dear to us", all hypotheses should be analysed and evaluated objectively. I am not saying passion should be excluded; passion is great in hypothesis forming, but hypothesis testing (even predicting) should be done with a cold heart and mind.

What about the process of discovery, you could ask, just blindly letting the data tell its story? I think it is perfectly acceptable, but as we interpret the data a story will be formed, whether it comes from outside (say from experience or the past) or from the data itself, and this will lead to a hypothesis, and so the cycle begins.

It is also important to remember again, that analysis, at least in a business context, cannot be purely for the sake of analysis.



I keep saying it, but analysis should be done to help the business, and we should avoid analysis-paralysis.

If, as a result of the analysis, given our choice of metrics and methodology there is no conclusive answer, then we should review. For example, we may decide to rephrase the business issue, or decide to collect more data to allow for a more definitive answer.

If we have a conclusive answer then we can exit the process and go to the next step. Remember, there is nothing wrong with deciding that the analysis of data does not support the hypothesis; experimentation is about learning and moving on.

It may not be easy to switch to an experimentation-driven methodology, but the rewards are worth it.

The ultimate aim of analysis is to take action. Analysis, in a business context, only truly comes to life when there is a resulting action and the experiment is tested in real life. To me that’s one of the most exciting times, when you really get to see whether your analysis has been accurate and how much it actually helps the business.

The other side of the coin is that if action is taken without proper analysis, then it can be likened to gambling. Of course people with experience can use this to help guide their actions, but what is experience if not an accumulation of data? The danger is that, since the memories that last longest and are easiest to recall are usually the ones associated with emotion, we may be remembering a distorted view which only an objective analysis may reveal. Also, situations change, and in fluid environments, data analysis is invaluable as a tool to guide decision making and a precursor to action.

But just taking action is not enough; the result matters.


Finally, once action is taken and the experiment is run, it is critical to gather the results and compare them with what was expected, to learn whether things went to plan.

Furthermore, analysis of results can lead to new hypotheses or the refinement of existing ones, and kick-start a virtuous cycle of improvement.

Still, why the Samurai?

Because it is very important to have the right attitude when analysing the data, and the maybe romanticised image of a Samurai as someone who is zen-like but decisive helps. It is important, when analysing data, to be able to put everything aside and focus on what the data is saying.


So who wants to be a left-handed Samurai?






Monday, 10 December 2018

Lies, Damned lies, and football statistics (with a sprinkling of fake news)


“People who don’t understand football analyse with statistics”(1), so said Jose Mourinho, 4 times world’s best coach, 2 times champions league winner, 3 times English champion, 1 time English FA cup winner, 4 times English league cup winner, 3 times Spanish champion, 3 times Spanish cup winner, 2 times Spanish super cup winner, 2 times Italian Champion, 1 time Italian Cup winner, 1 time Italian super cup winner, 2 times Portuguese champion, 2 times Portuguese cup winner, 1 time UEFA cup winner, 1 time Europa League Winner, 1 time UEFA Super Cup winner, 2 times Portuguese Super Cup Winner, 2 times English Super Cup winner (2).

On the other hand, Pep Guardiola kind of referred to statistics when arguing that his team is not dirty: “Normally when a team has 65 or 70 per cent of the ball we cannot kick the opponent. We can kick each other, okay, but we have the ball. Normally when for every 10 minutes you have the ball for seven of them there is less option to make fouls. I don't think we're a team that make a lot of fouls in games.”(3).

So Mr Guardiola, 2 times world’s best coach, 2 times champions league winner, 1 time English champion, 1 time English league cup winner, 3 times Spanish champion, 2 times Spanish Cup winner, 3 times Spanish Super cup winner, 3 times German champion, 2 times German cup winner, 3 times FIFA club world cup winner, 3 times UEFA super cup winner and 1 time English super cup winner (4) on the other hand, uses statistics (sort of) when assessing his team.

And we have the adage that my friend Ramesh reminded me of: “Lies, Damn lies, and Statistics”.

Given his past record with respect to lies (5), let us consider the argument of Mr Guardiola. Taking data from whoscored (6), I focused on the number of fouls committed per game, and made a distinction between home games and away games.

In the chart above, the teams who commit more fouls per game are on the exterior whereas those who commit fewer fouls are closer to the origin. It is easy to see that Mr Guardiola was right: Manchester City is one of the teams that make the fewest fouls. It is an undeniable fact.

So is Mr Mourinho wrong then?

This is where context and subject matter expertise are important. I am very fond of the Drew Conway data science diagram (), and it emphasises the need for subject matter expertise.

What subject matter expertise, you would ask?

Well, enough to understand that Manchester City play a possession-based strategy; basically they keep the ball for a huge chunk of each game. This element provides context.

This is football (or, as Americans call it, soccer) and players are not allowed to tackle players who do not have the ball. Basically, your opponents are much more likely to try and attack you when you have the ball; when you do not have the ball, you are unlikely to be attacked. The more time the ball spends in your possession, the less likely you are to commit a foul.

Hence, what matters is not fouls per game, but fouls per number of minutes the opposition has the ball.
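
A minimal sketch of that adjusted metric in Python follows. The function name, the 90-minute match length (stoppage time ignored) and the two example teams are assumptions made for illustration; they are not taken from the whoscored data.

```python
MATCH_MINUTES = 90  # assumption: a standard match, stoppage time ignored


def fouls_per_opposition_minute(fouls: int, own_possession: float,
                                match_minutes: float = MATCH_MINUTES) -> float:
    """Fouls committed per minute the opposition has the ball.

    own_possession is the team's own share of possession, between 0 and 1.
    """
    if not 0 <= own_possession < 1:
        raise ValueError("own_possession must be in [0, 1)")
    opposition_minutes = (1 - own_possession) * match_minutes
    return fouls / opposition_minutes


# Two hypothetical teams with the same fouls per game
# but very different shares of possession.
possession_team = fouls_per_opposition_minute(fouls=9, own_possession=0.70)
balanced_team = fouls_per_opposition_minute(fouls=9, own_possession=0.50)
```

With these made-up numbers both teams commit 9 fouls per game, yet the possession-heavy team commits about 0.33 fouls per minute the opposition has the ball, against 0.20 for the balanced team.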

Now the situation looks totally different, doesn’t it? Manchester City is not among the teams that commit fewer fouls per minute out of possession; they commit more than their fair share of fouls when the opposition has the ball. In fact, if you look only at home games, they commit more fouls adjusted for possession than any other team in the Premier League.

Interestingly, Chelsea and Liverpool also foul consistently. It makes sense: if your tactics revolve around overloading opponents, then when the opponent gets the ball (by getting past your press) you would be very overloaded in turn, hence the tactical foul to allow you to regroup and rebalance the situation.

In fact, that was what Gary Neville was referring to when he said that Manchester City is a cynical team (and that he likes that).

Actually what I find interesting is the fact that Manchester City’s triangle is very asymmetric. They foul much more at home than away. They are much more aggressive at home, having scored twice as many goals as they have away; Chelsea and Liverpool are much more balanced.

Anyway, I can happily disagree with Mr Guardiola. After all, you wouldn’t expect a coach to agree that he asks his players to commit fouls, but I would have expected him to keep quiet rather than manipulate the data in his favour. I thought the temple of “fake news” was located at the White House; apparently it has a branch at the Etihad (and I am not commenting on the FFP and other allegations by Der Spiegel (7) such as “We do what we want”).

Was this a case of “Lies, Damn Lies, and Statistics”? “Fake news”, yes; deliberate misdirection or white lying, maybe; but it is not the fault of statistics. It is the fault of the person who chose the metric (number of fouls per game rather than number of fouls adjusted for possession), not of the metric itself.

So was it an illustration of what Mr Mourinho was saying, that “people who do not understand football analyse with statistics”?

Well, since I am currently spending a lot of my energy trying to make an organisation increase its adoption and usage of statistics in decision making (not a football club though, any takers?), I would neither agree nor disagree, and hide, as usual, behind “it depends”.

It depends on what Mr Mourinho actually meant. Saying that people who do not understand football analyse with statistics is not the same as saying that people who understand football do not analyse with statistics. Please note that since we are dealing with absolutes, the intersections will be shown as changes in colour of the affected regions (for example red overlapping with yellow makes orange, and red with blue makes purple).


The world is made up of people who understand football and those who don’t.
Now let’s add people who analyse football with statistics. Jose Mourinho’s words can be seen as:
There is a perfect overlap between people who do not understand football, and those who use statistics to analyse football.
But his statement is equally valid if:
In this case there are people who understand football and do use statistics to analyse it; presumably Mr Mourinho has at least one such analyst in his team.
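
The difference between the two readings can be checked with sets. Here is a minimal sketch in Python, assuming a made-up population of four people; every name and function below is hypothetical and only serves the illustration.

```python
# Hypothetical population: the names are made up for illustration.
everyone = {"Ann", "Bob", "Cat", "Dan"}
understands = {"Ann", "Bob"}
does_not_understand = everyone - understands  # Cat and Dan


def statement_holds(uses_statistics: set) -> bool:
    """'People who don't understand football analyse with statistics':
    everyone who does not understand is among those who use statistics."""
    return does_not_understand <= uses_statistics


def stronger_claim_holds(uses_statistics: set) -> bool:
    """'People who understand football do not analyse with statistics':
    nobody who understands is among those who use statistics."""
    return understands.isdisjoint(uses_statistics)


# Reading 1: perfect overlap, only those who do not understand use statistics.
perfect_overlap = {"Cat", "Dan"}
# Reading 2: Ann understands football and also uses statistics.
with_analyst = {"Cat", "Dan", "Ann"}

for scenario in (perfect_overlap, with_analyst):
    print(statement_holds(scenario), stronger_claim_holds(scenario))
```

The quoted statement holds in both scenarios, while the stronger claim holds only under the perfect overlap, so the quote alone does not rule out analysts who understand football.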
So what am I saying?
I deliberately started this blog with a seemingly controversial statement by Mr Mourinho, who is someone with many detractors. It is possible that a proportion of people would have interpreted his words as the orange and blue diagram above, just because of how they perceive Mr Mourinho, that is, negatively. Hence they may not have seen his statement as representing the last diagram above, which is not very controversial.
On the other hand, if you just go on a search engine and look for Guardiola and Gary Neville, you will find many more articles on the response of Mr Guardiola, than on the statement by Mr Neville. Again, Mr Guardiola has a better image and people tend not to analyse his statements as critically, whereas Mr Neville can be polarising, hence the focus on the rebuttal of his statement.
But as you can see, at least in this case, Mr Guardiola was dealing in “fake news”.
Conclusion(s)
While the source of any data should be looked at, personal feelings towards the person delivering the message should not get into the picture. Data is data and should be analysed without prejudice.
That being said, if the person who does the analysis has “mis-spoken” frequently in the past, then it makes sense to review their data a little bit closer, after all frequency of mis-speaking is a characteristic…
One of the simplest ways of analysing data is to put it in a proper context, and this takes some understanding of the data, the process of data creation, some subject matter expertise, and an open but critical mind.

P.S. While I wrote this blog last week, this weekend Chelsea played and beat Manchester City. In total Chelsea committed 12 fouls and Manchester City 11, whereas possession was 39%-61%. Chelsea’s 12 fouls therefore came while the opposition had the ball for 61% of the game, and Manchester City’s 11 while the opposition had it for only 39%; hence Manchester City again made more possession-adjusted fouls than Chelsea. Chelsea won, by the way, and disrupted Manchester City, restricting them to only 4 shots on target, their average being 6.1.

  5. Pep Guardiola’s changing defense against his convictions for taking performance enhancing drugs, and the power of unstable urine in all 4 tests as proposed by his still close collaborator, Mr Manuel Estiarte http://www.sportingintelligence.com/2017/04/25/sharapova-guardiola-doping-darkness-and-light-250401/