Monday, 30 September 2024

Analytics failure costs Singapore ~$100m

In my previous blog (1), I showed the direction Singapore has chosen to take regarding the new world of AI: it chose to weaken traditional labour structures and support personal responsibility instead. I mentioned that, for these 2 structures to be comparable, there needs to be transparency and accountability across the board, so that individuals can come close to replicating the informational resources the old structure (unions) had.

I had left out the obvious fact that, if you want to adopt AI, you need to know what you are doing, else you will get hurt. Well, it only took a week for this obvious fact to slap me in the face.

One of the crown jewels of Singapore, the Mass Rapid Transit train system, has had a major failure, costing Singapore around $100m (2), and the cost is still going up as the days go by. And I place this failure firmly in the realm of an analytics failure.
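The back-of-envelope arithmetic behind that figure, detailed in note (2), can be sketched in a few lines. This is only an illustration of my own rough estimate: the function name and constants are made up for the sketch, and every input is a round-number assumption, not an official figure.

```python
# Rough inputs taken from note (2); all are round-number assumptions, not official data.
HOURLY_OUTPUT_SGD = 50        # ~SGD100,000 a year spread over ~2,000 working hours
COMMUTERS_PER_DAY = 516_000   # commuters affected each day
HOURS_LOST_EACH = 1           # assumed output lost per commuter per day
WORKING_DAYS = 4              # days of disruption so far, excluding the weekend


def disruption_cost(commuters, hours_lost, hourly_output, days):
    """Lost output in SGD: commuters x hours lost x hourly output x days."""
    return commuters * hours_lost * hourly_output * days


daily = disruption_cost(COMMUTERS_PER_DAY, HOURS_LOST_EACH, HOURLY_OUTPUT_SGD, 1)
total = disruption_cost(COMMUTERS_PER_DAY, HOURS_LOST_EACH, HOURLY_OUTPUT_SGD, WORKING_DAYS)
print(f"per day: SGD{daily / 1e6:.1f}m, over {WORKING_DAYS} days: SGD{total / 1e6:.1f}m")
```

With these inputs the estimate comes to roughly SGD25.8m a day, or a little over SGD100m for 4 working days. Change the assumed hour lost per commuter and the total moves proportionally.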


What happened?

It is very simple.

  • A train broke down at 0930 on Wednesday.
  • In order to free the tracks, the affected train was dragged towards an appropriate area.
  • However, the breakdown had caused a part of the train to be dislodged and fall onto the tracks.
  • As the train was dragged, the dislodged part damaged the tracks for at least 1.6km (3).

Why do I say it is an analytics failure then?

Before I go there, let me go back in time a little, to the '1st grand MRT failure'.



The 1st grand MRT failure

I am not going to go through the minor issues (if you are interested, you can refer to (4)(5)); I will go straight to the 2011 case, when 2 incidents occurred within 3 days.

Many fingers were pointed at the then CEO, Ms Saw Phaik Hwa, whose expertise lay in making money, having been regional president of DFS (Duty Free Shoppers) Ventures – the closest she got to managing transport was probably DFS shops at Changi airport. She launched the now ubiquitous shops at MRT stations. Everyone knows she had no engineering degree, but the aim was to monetise the assets of the SMRT.

She was criticized for not prioritizing maintenance. But during the Committee of Inquiry on the 2011 failures, while she was accused of neglecting maintenance – the cost of maintenance rose by 3% per annum (6) – she argued that she simply approved the figures proposed by the maintenance team.

She posited a few possible reasons for the 2011 failures (which affected only around 200,000 people (8), compared to half a million now) and mentioned an unexpected rise in ridership.


As the population figures (9) clearly show, Singapore's population did indeed increase quite a bit in the years leading up to 2011; the growth rates for 2006-2011 are higher than those for 2001-2005. In fact, in 2009, Singapore first crossed 5m population.

My own non-expert but numbers-driven opinion is that maintenance budgets may not have taken into account the increased wear and tear due to increased ridership.

Indeed, at that time, the focus, as highlighted by expert witness Professor Lim from NTU, was on the strength of ‘preventive maintenance’ on the part of SMRT.

Preventive maintenance is what you do with your vehicle: you maintain it on a schedule based on time (yearly, half-yearly) or usage (km driven). The basic idea is that most issues occur after a set period (of time or usage), so you maintain before that threshold to identify and fix issues before they become serious.

As I mentioned before, CEOs are an indication of the direction an organization is likely to take. Ms Saw Phaik Hwa was replaced by Mr Desmond Kuek, a career military man and an engineer (10).


So what has changed since then?

You can get some historical perspective by reading what the former Straits Times transport correspondent wrote (11). He highlights the recent completion of the 10 year renewal programme, and the issues that have cropped up since.

But to me, what is more enlightening is what he said SMRT does/did right. As the report (11) puts it: “In his post, Tan called for full transparency from the authorities, questioning why the incident occurred despite SMRT’s use of predictive maintenance systems designed to prevent such failures. ‘We have been told SMRT now practices preventive and predictive maintenance… So, what happened to that fateful train?’”

SMRT has included predictive maintenance among the tools at its disposal.

This is totally in line with Singapore adopting the best techniques; Singapore is now leading the world in GenAI adoption (12).

In fact, the SMRT Chairman, only last year, stressed the need to balance costs and reliability, to avoid “over maintenance” (13). This is exactly where predictive maintenance can help. It is not a replacement for preventive maintenance, but an additional tool that should help manage costs better.

A couple of things I’d like to point out before I go further.

The current SMRT Chairman, Mr Seah Moon Ming, is an engineer by training and had a career in MINDEF (Ministry of Defence) and ST Engineering, among other government-related posts (14).

The current CEO, Mr Ngien Hoon Ping, is also an engineer, also comes from an army background, and is the third ex-Singapore Armed Forces high-ranking officer to helm the SMRT (15).

The focus is squarely on efficiency in maintenance.

So how did this incident occur?


What caused the 2nd grand failure?

To me, it is predictive maintenance.

Yes, analytics is costing Singapore $100m and counting.

Let me explain myself.

I am not saying predictive maintenance is bad.

On the contrary, it is a potential cost saver and even a life saver. It is a very useful tool. As with all tools, how it is used matters.

Now, predictive maintenance is not new to SMRT (16); ever since the days of Mr Desmond Kuek, predictive maintenance has been in place and AI has been used to make more sense of the data generated.


To be clear, some highlights relevant to the current case:

The system SMRT installed in 2018 came from the Hong Kong Polytechnic University (17) and was a first in the world (as usual for Singapore) (18). Its capabilities: “Apart from installing an optical fibre sensing network in tracks to monitor the trains, sensors are also installed in in-service trains to monitor the tracks on which the trains run.”

The system SMRT has ‘listens’ to the trains and to the tracks, and feeds live data updates for processing: real time, trains and tracks.

The tools therefore do not seem to be a problem.

The 1st thing that SMRT publicized once the incident occurred, even before any possible cause was investigated, is that the train that broke down was 35 years old.

To me, on the contrary, this means that SMRT has an enormous amount of data on this type of train, and the predictive maintenance models for this 35-year-old train should be top notch: more accurate and reliable data means more accurate and reliable models, especially in slowly changing systems.

I am not saying the models failed. There is much, much more to implementing, using and maintaining any model with predictive capabilities than simply signing a document and taking delivery of a system.

Think 3Ps: People, Product, Processes.

Product

The collaboration between SMRT and HK Poly is still going strong: Dr Tan Chee Keong (18) is still with SMRT and was even an adjunct at Hong Kong Poly (19). Hong Kong Poly is also at the cutting edge of research and application on railways (20). Therefore, there is no reason to believe that the product, that is, the predictive maintenance system from HK Poly, has any major issues.

Rather, I think the issue has all to do with people and process.

Let me start with process.

Process

Let’s recap what happened (21):

  • Train developed fault
  • Train was being moved to the depot
  • A component, the axle box, dropped onto the tracks
  • The bogie frame dropped and caused the wheels to shift
  • This damaged rails and tracks for at least 1.6km as the damaged train was moved.


Axle box:

Predictive maintenance watches the health of axle boxes: “Sensors installed at City Hall MRT will also scan the entire North-South and East-West lines’ train fleet for defects such as wear and tear to the wheels or axle defects.”

The predictive maintenance system should have flagged potential issues with the axle box and prevented the failure.

1.6km of damage (at least)

A piece of equipment was dragged along the tracks for at least 1.6km, and there was no alert from any sensor that the noise or the vibrations coming from the track were not normal? I am pretty sure the sensors picked up the issue. But why was the damage allowed to continue for so long?

The slew of sensors along the track should have detected the damage as it was occurring and minimized the impact.

So what happened?

I think the issue is with the volume of data SMRT deals with; the system is meant to “allow SMRT to tap on multiple streams of data from all of its assets to predict the need for maintenance activities”. The process for dealing with this volume of data is likely flawed.

And this leads me to people.

People:

An analytical system is not a fire-and-forget kind of thing. Its performance has to be measured, the system within which it operates has to be evaluated, and the analytical models adjusted accordingly.

This takes some organizational commitment and some skill on the part of the analytical team. This is where most models, even if properly implemented, degrade and may fail past the short run.

Let me explain a little bit.

To predict whether a piece of equipment will fail, the options range from simple survival-analysis-type models, which are sufficient for a single component, all the way to digital twins that account for interactions between components. Every time you maintain the piece of equipment, you record data on its state and, if possible, on its waste (exhaust, or oil/lubricant), and use the chemical analysis as input. There are even systems that do predictive maintenance purely based on the sound of the equipment (22).
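To make the "simple survival analysis" end of that range concrete, here is a minimal sketch of a Kaplan-Meier survival curve for a single component. The data is entirely made up for illustration (it is not SMRT data), and the function name is my own; a real implementation would more likely use a dedicated statistics library.

```python
def kaplan_meier(durations, failed):
    """Kaplan-Meier estimate: [(time, probability of surviving beyond time)].

    durations: usage at failure, or at last inspection if the part had not failed
    failed:    True if the part failed at that usage, False if still healthy (censored)
    """
    failure_times = sorted({d for d, f in zip(durations, failed) if f})
    survival = 1.0
    curve = []
    for t in failure_times:
        at_risk = sum(1 for d in durations if d >= t)  # parts still in service at t
        failures = sum(1 for d, f in zip(durations, failed) if f and d == t)
        survival *= 1 - failures / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical usage records for six identical components (thousand km).
usage = [100, 150, 150, 200, 250, 300]
has_failed = [True, False, True, True, False, True]

for t, s in kaplan_meier(usage, has_failed):
    print(f"survival beyond {t}k km: {s:.2f}")
```

The point of the sketch is that parts which have not yet failed still carry information: they count as "at risk" until their last inspection, which matters a lot when failures are rare.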

For those of you who have been to Ikea, remember the chair testing machine (23)?


Preventive maintenance counts how many cycles the chair usually lasts before it breaks and tells you it is good for, say, 80% (depending on the risk) of that number. Predictive maintenance looks at the wear and tear on the flexible component and advises when it is deteriorating. This is the lab (showroom) world.
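The difference between the two approaches to the chair can be sketched as two small decision rules. Everything here is a hypothetical illustration: the function names, the 80% safety fraction as a default, the wear scale and the straight-line extrapolation are all my assumptions, not how any real system works.

```python
def preventive_due(cycles_used, typical_cycles_to_failure, safety_fraction=0.8):
    """Schedule-based rule: service once usage reaches a fixed share of typical life."""
    return cycles_used >= safety_fraction * typical_cycles_to_failure


def predictive_due(wear_readings, wear_limit, lookahead=3):
    """Condition-based rule: extrapolate measured wear and service if the limit
    would be reached within `lookahead` further inspections."""
    if len(wear_readings) < 2:
        return False  # not enough history to estimate a trend
    rate = (wear_readings[-1] - wear_readings[0]) / (len(wear_readings) - 1)
    projected = wear_readings[-1] + rate * lookahead
    return projected >= wear_limit


# A chair that usually breaks at 50,000 sit cycles, currently at 30,000 cycles.
print(preventive_due(30_000, 50_000))                 # schedule says: not yet

# Measured wear on the flexible component (0 = new, 1.0 = failure), made-up values.
print(predictive_due([0.10, 0.25, 0.45, 0.70], 1.0))  # wearing fast: service now
print(predictive_due([0.10, 0.11, 0.12, 0.13], 1.0))  # wearing slowly: leave it
```

The same chair can be "fine" by the schedule and "due" by its measured condition, which is exactly why the two tools complement rather than replace each other.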

Now, when the piece of equipment interacts with a changing world, the external components that affect the equipment also need to be included.

For example, let’s say this chair is at my home. If I suddenly put a large amount of weight on it, then my predictions regarding my chair go out of the window: the environment the chair exists in has changed, and I need to adjust my calculations accordingly.

A slightly more complex system has to be built.

And someone needs to know when the parameters within the model have to be adjusted.

This is what people are for, to keep the predictions usable.

Predictive maintenance suffers from the fact that it is hard to model to start with: failures are (hopefully) rare, and modeling rare occurrences has its own challenges. Now add to that the fact that, ideally, you need to model external components. People become even more crucial: they have to imagine what may affect the model, and try to improve it by testing whether these features make the model perform better.

It is a continuous process.

A model is meant to represent something. As things or the environment change, so must the model. Luckily, in analytics, testing for changes and making sure they are successfully captured and the model improved is part of the process that people should follow.
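One very basic form of that continuous check is to compare what the model sees today against what it was built on. The sketch below flags a shift in the average of an input (say, daily load on a component) relative to the baseline period. The function name, the z-score threshold of 3 and the numbers are all assumptions made for illustration; real monitoring would track many inputs as well as the model's prediction errors.

```python
from statistics import mean, stdev


def drift_detected(baseline, recent, z_threshold=3.0):
    """Flag when the recent average has moved too far from the baseline average.

    Uses a simple z-score of the recent mean against the baseline mean, with the
    baseline standard deviation scaled by the size of the recent sample.
    """
    base_sd = stdev(baseline)
    if base_sd == 0:
        return mean(recent) != mean(baseline)
    standard_error = base_sd / len(recent) ** 0.5
    z = abs(mean(recent) - mean(baseline)) / standard_error
    return z > z_threshold


# Hypothetical daily load index during the period the model was built on.
baseline_load = [98, 101, 100, 99, 102, 100, 97, 103]

print(drift_detected(baseline_load, [99, 101, 100, 100]))   # same world: no flag
print(drift_detected(baseline_load, [104, 106, 105, 107]))  # heavier use: flag it
```

A flag like this does not fix anything by itself; it tells a person that the world the model was built for has changed and that the model needs to be looked at again.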


Summary:

In sum, the case of the SMRT incident illustrates the importance of people continually thinking and improving systems and the processes around them.

It’s the people, not the technology.

While Singapore is leading GenAI adoption and is putting a structure in place to go along the chosen trajectory, it is crucial that basic steps are not missed, else the structure may crumble.

The case of SMRT has shown that even in areas where Singapore is world class (24), there still are gaps in the ability to use analytics and keep using it in the medium to long run. And in my view, this case stems from issues with people and processes.

The use of data via analytical models, whether pure statistical models, ML, AI…, needs to be thought through, and people with the right expertise and creativity are needed to ensure these models keep performing as expected.

Just having the IT skills to deploy a model, especially out of the box, or to follow documents to launch them is not sufficient.

“Use your blain!”

Let this be a $100m lesson.



  1. https://www.linkedin.com/feed/update/urn:li:activity:7244502068500045824/
  2. Singapore GDP per head is USD82,000 yearly (https://www.macrotrends.net/global-metrics/countries/sgp/singapore/gdp-per-capita), let’s say SGD100,000, or around $50 an hour. 516,000 commuters are affected a day (https://www.channelnewsasia.com/singapore/east-west-line-disruption-smrt-faulty-train-timeline-4638131), and let’s be nice and assume each loses 1 hour of output daily, so that is SGD25m a day. The issue has been going on for 4 days already (excluding weekends), hence SGD100m.
  3. https://www.lta.gov.sg/content/ltagov/en/newsroom/2024/9/news-releases/update_on_EWL_recovery_works.html
  4. https://www.straitstimes.com/singapore/transport/water-in-tunnels-human-error-other-major-train-service-disruptions-in-s-pore-s-history
  5. https://www.nlb.gov.sg/main/article-detail?cmsuuid=0888e6b3-5912-4ceb-b34e-1238a0b2ea8f
  6. https://sgtransportcritic.wordpress.com/2021/12/16/dec-2011-breakdowns-2021/
  7. https://ifonlysingaporeans.blogspot.com/2012/05/mrt-breakdown-coi-day-18.html
  8. https://sg.news.yahoo.com/saw-phaik-hwa-defends-lavish-spending-in-tnp-exclusive.html
  9. https://www.macrotrends.net/global-metrics/countries/SGP/singapore/population
  10. https://en.wikipedia.org/wiki/Desmond_Kuek
  11. https://www.theonlinecitizen.com/2024/09/27/christopher-tan-criticizes-mrt-breakdown-following-decade-long-renewal-program/
  12. https://www.asiabusinessoutlook.com/news/singapore-tops-generative-ai-adoption-worldwide-nwid-7254.html
  13. https://www.smrt.com.sg/news-publications/newsroom/smrt-in-the-news/%E2%80%98we-don%E2%80%99t-want-overmaintenance%E2%80%99-smrt-chairman-flags-need-to-balance-rail-reliability-with-costs/4
  14. https://en.wikipedia.org/wiki/Seah_Moon_Ming
  15. https://sg.news.yahoo.com/ngien-hoon-ping-third-consecutive-saf-man-smrt-ceo-070030116.html
  16. https://www.todayonline.com/singapore/smrt-taps-predictive-technology-prioritise-maintenance
  17. https://www.scmp.com/presented/news/topics/polyu-innovating-better-world/article/2065348/optical-fibre-sensing-technology
  18. https://www.smartcitiesworld.net/news/worlds-first-onboard-train-track-monitoring-system-in-singapore-1026
  19. Dr. Chee Keong Tan - Head Network Systems Maintenance - SMRT Trains | LinkedIn
  20. https://www.globalrailwayreview.com/news/135103/mtr-corporation-mtra-and-hong-kong-polytechnic-university-sign-mou/
  21. https://www.straitstimes.com/multimedia/graphics/2024/09/ewl-train-breakdown/index.html?shell
  22. https://www.tandfonline.com/doi/full/10.1080/18824889.2020.1863611
  23. https://www.youtube.com/watch?v=4s_gyzshNPQ
  24. https://www.channelnewsasia.com/today/big-read/public-transport-connectivity-mrt-lines-buses-commute-big-read-4445081
