Monday 15 August 2016

Analytics: Getting the basics right – an example of the importance of proper data preparation






I recently saw an infographic that seemed to show that travelling by Singapore Airlines was, on average, cheaper than travelling by Malaysia Airlines. More precisely, according to the infographic, the per-km cost of Singapore Airlines is lower than that of Malaysia Airlines, and by almost 50%. The full article can be found here: A deep data-driven drive into airline prices:


25 most popular airlines on Rome2rio and the median price per kilometre for their fares 





I was puzzled and decided to look a little bit deeper into the issue.

To me, unless doing a pure discovery piece, an analyst/‘data scientist’ should start by asking him/herself “what am I trying to figure out?” Then, once the data supports an answer, the question should be “how can I best convey this information graphically?” Ideally, the infographic should be self-explanatory; a picture is worth a thousand words. And to remain in the realm of clichés, I’m saying that a piece of non-discovery analytics should ‘begin with the end in mind’, the end being the answer, not the infographic.

Let’s say I want to find out what airlines are the cheapest. What metric should I use?

The median cost per kilometre as used in the article is a great one. Using the median is a quick way of getting rid of outliers caused by last minute bookings for example. Looking a bit deeper into the article, the authors mention using only economy fares and removing outliers by only including prices that are less than twice the minimum fare. Using the distance as a denominator is a nice way of making the airlines comparable, you know, the apples and oranges thing...
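To make the metric concrete, here is a rough sketch of how it could be computed. The field names and the sample bookings are made up for illustration, and the article does not say at what level the minimum fare is taken, so using the minimum per airline and route is my assumption.

```python
from collections import defaultdict
from statistics import median


def median_price_per_km(bookings):
    """Median price per km for each airline, after dropping fares that are
    at least twice the cheapest fare seen for the same airline and route."""
    # Assumption: the 'minimum fare' is computed per (airline, route) pair.
    min_fare = {}
    for b in bookings:
        key = (b["airline"], b["route"])
        min_fare[key] = min(b["price"], min_fare.get(key, b["price"]))

    per_km = defaultdict(list)
    for b in bookings:
        key = (b["airline"], b["route"])
        # Keep only prices that are less than twice the minimum fare.
        if b["price"] < 2 * min_fare[key]:
            per_km[b["airline"]].append(b["price"] / b["distance_km"])

    return {airline: median(values) for airline, values in per_km.items()}


# Made-up bookings, for illustration only.
sample = [
    {"airline": "A", "route": "X-Y", "price": 100.0, "distance_km": 1000.0},
    {"airline": "A", "route": "X-Y", "price": 120.0, "distance_km": 1000.0},
    {"airline": "A", "route": "X-Y", "price": 500.0, "distance_km": 1000.0},  # last-minute fare
    {"airline": "B", "route": "X-Y", "price": 150.0, "distance_km": 1000.0},
]
result = median_price_per_km(sample)
```

In this toy example the 500 fare is dropped before the median is taken, so it has no influence on airline A's figure.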

Now, let’s take a look at the data we have on hand. The analysis was done using millions of bookings on Rome2rio.com. Rome2rio is a website that allows door-to-door booking, using not only airplanes, but other means of transport such as trains, cars... 


So the first question is: can the Rome2rio data be used as is to represent the world?

The first thing that looks odd is the “most popular” airlines as per Rome2rio. Looking at the IATA figures:



There is a huge difference between the IATA list of airlines and Rome2rio’s. Even when we consider only Asian airlines:



It looks like the data that Rome2rio has is very different from the worldwide or Asia-wide numbers. The dataset therefore does not allow conclusions to be drawn about air travel in general, but only about the specific segment of travel that Rome2rio customers are part of. And there is nothing wrong with that.

What I am talking about is a form of self-selection. People who are looking for door-to-door type travel, and for travel to/from ‘exotic’ geographical areas (Kenya, Russia, Jordan, Iceland...), are different from others, say, someone like me who just wants a flight and will sort out the rest on site.

The title of the infographic does state that it’s based on Rome2rio data; it would just be nicer to state the difference rather than, in the article, stress the millions of data points, which tends to give an impression of universal applicability. Basically the dataset is biased. My previous post talks about biased training data, and how we can attempt to remedy this. I won’t go there in this one.


So now I realise that, given the dataset I have, I can only draw conclusions for Rome2rio travelers. I have removed all the oranges of non-Rome2rio type travelers.

The infographic is re-interpreted as saying that, for Rome2rio-type travelers, Singapore Airlines is cheaper than Malaysia Airlines. Still, it’s puzzling.

A possible solution to the puzzle can be found in the article itself. The authors also focus on selected groups: flights within particular geographies, such as Australia, and airlines serving local routes in the US, Turkey, Russia, Indonesia, Thailand and India. And this makes more sense in many ways.

Basically you need to ensure you are comparing like with like; even among apples, there are green apples and red apples and you wouldn’t think of them as similar, would you?
 

Hence breaking down the analysis regionally makes sense from a usability point of view; after all, if I am travelling in Asia, I wouldn’t care if American Airlines is cheaper than Singapore Airlines, since American Airlines is irrelevant to my route. But is that enough from an analysis point of view?

To compare things meaningfully, they should be as similar as possible. In terms of airline ticket prices, apart from the class of travel, which has already been accounted for, there are other important factors that affect the cost.

One of the most important is the number of stops, especially given that airline hubs are usually in their home countries. Aircraft fuel consumption peaks at take-off, and every stop adds a take-off, so it wouldn’t make sense to compare Singapore to London on Singapore Airlines with the same route on Malaysia Airlines, which would normally route through its own hub (although on that route Malaysia Airlines still seems cheaper).

Another consideration would be the distance traveled. Airlines usually pick planes for the routes they fly so as to take into account the cost of fuel and the load they are able to carry. It’s a balancing act: on one hand, the more fuel you carry the further you can fly, but on the other, the heavier you are, the more fuel you burn (especially at takeoff, when the load is at its maximum). Hence there is a tipping point beyond which it is better to have a stop-over and refuel, and that tipping point varies depending on the type of aircraft.

Airline ticket prices are a function of not just the cost but also the level of competition on routes. For routes to/from remote areas, the prices are likely to be much higher simply because of lack of competition; it does not make sense to look at oligopolistic markets. These should be excluded from the analysis as they are outliers that can unnecessarily skew the analysis (since we are looking for a price in general).
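One simple way to screen out low-competition routes, sketched below, is to count how many distinct airlines appear on each route and keep only routes above a threshold. The threshold of three carriers and the field names are arbitrary choices of mine for illustration, not something taken from the article.

```python
from collections import defaultdict


def competitive_routes(bookings, min_airlines=3):
    """Return the set of routes served by at least `min_airlines` carriers."""
    carriers = defaultdict(set)
    for b in bookings:
        carriers[b["route"]].add(b["airline"])
    return {route for route, airlines in carriers.items()
            if len(airlines) >= min_airlines}


# Made-up bookings, for illustration only.
sample_routes = [
    {"airline": "A", "route": "X-Y"},
    {"airline": "B", "route": "X-Y"},
    {"airline": "C", "route": "X-Y"},
    {"airline": "A", "route": "X-Z"},  # only one carrier on this route
]
kept = competitive_routes(sample_routes)
```

Bookings on routes outside `kept` would then be left out before any price comparison is made.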

Similarly the timing of the flights also has to be taken into account. The flights should be in the same time period or in a time period that accounts for seasonality and weather conditions.

The basic idea is that we should ensure we are comparing red apples with red apples, and green apples with green apples.
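In practice, comparing red apples with red apples could look something like the sketch below: put each fare into a bucket defined by number of stops, distance band and month, and only compare airlines within the same bucket. The distance cut-offs, field names and sample fares are all hypothetical and only there to illustrate the idea.

```python
from collections import defaultdict
from statistics import median


def distance_band(km):
    # Hypothetical cut-offs; sensible values would depend on the data.
    if km < 1500:
        return "short"
    if km < 4000:
        return "medium"
    return "long"


def median_by_stratum(bookings):
    """Median price per km for each airline within each
    (stops, distance band, month) bucket."""
    groups = defaultdict(list)
    for b in bookings:
        stratum = (b["stops"], distance_band(b["distance_km"]), b["month"])
        groups[(stratum, b["airline"])].append(b["price"] / b["distance_km"])

    result = defaultdict(dict)
    for (stratum, airline), values in groups.items():
        result[stratum][airline] = median(values)
    return dict(result)


# Made-up fares, for illustration only.
fares = [
    {"airline": "A", "stops": 0, "distance_km": 10000.0, "month": 8, "price": 800.0},
    {"airline": "B", "stops": 1, "distance_km": 10500.0, "month": 8, "price": 600.0},
    {"airline": "A", "stops": 0, "distance_km": 300.0, "month": 8, "price": 75.0},
    {"airline": "B", "stops": 0, "distance_km": 300.0, "month": 8, "price": 60.0},
]
strata = median_by_stratum(fares)
```

Here the two long-haul fares end up in different buckets because one is direct and the other has a stop, so they are never compared with each other; only the two short direct flights share a bucket.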


Basically there should be a lot of thinking that takes place before data is analysed. The data preparation stage should not always follow a purely mechanical approach. To me, an analyst/‘data scientist’ should get involved in data preparation because the way you prepare the data can have a huge impact on the outcome of the analysis. As you work with the data, you sometimes want to go back to the way the data was prepared and change things. Hence I prefer to get my hands on raw data and prepare it as the situation requires.



