Wednesday 25 April 2018

Yes, facebook has taken liberties with the data they collect about you, but how safe is your DNA?




A while ago, I wrote about a new insurance product launched in Singapore that required you to submit your DNA as part of the deal; you got ‘personalised’ advice in exchange. The ad ridiculously showed two identical-looking twins receiving different advice (since identical twins share the same DNA...). (1) In that blog post, I mentioned that the insurance company was at pains to stress that they had no access to the DNA, but I raised the prospect of someone buying the company that collects the DNA and not being bound by the same rules. Unfortunately, this prospect is very real.

Let’s take a step back: am I talking about DNA or facebook?

These few weeks have been exciting for people interested in data and “Big Data”, since the extent of the data collected by Cambridge Analytica via facebook, very often without the subjects being aware, came to light (2). I have been going on about the need for us to own our data, but this really takes the cake; you were not only giving away your data but that of your connections too (53 Australians took the test, and possibly gained something, but the data of 311,127 was harvested. Similarly, 10 New Zealanders did so, and data from 63,724 was harvested. I am not saying there were national boundaries, but these numbers give an idea of the pandemic).
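To put those figures in perspective, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted above; the variable and function names are mine, and the result is a plain average, not a claim about how the harvesting actually spread from person to person.

```python
# Figures quoted above: (people who took the quiz, profiles harvested).
harvest = {
    "Australia": (53, 311_127),
    "New Zealand": (10, 63_724),
}


def profiles_per_quiz_taker(takers: int, harvested: int) -> float:
    """Average number of profiles harvested for each person who took the quiz."""
    return harvested / takers


for country, (takers, harvested) in harvest.items():
    ratio = profiles_per_quiz_taker(takers, harvested)
    print(f"{country}: roughly {ratio:,.0f} profiles harvested per quiz taker")
```

In both countries that works out to roughly six thousand profiles for every person who actually took the quiz.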

Ok, so people’s surfing habits, likes, comments and photos they posted in public were accessed and used, but what use can be made of this data? As the Time magazine article (1) mentioned, one use was for Mr Trump’s presidential campaign. And as this article shows, the efforts started in 2014 (4), and were very effective as confirmed by Mr Trump himself (5):
But they had this expression ‘drain the swamp.’ And I hated it, I thought it was so hokey. I said, ‘that is the hokiest, give me a break, I am embarrassed to say it.’ And I was in Florida where 25,000 people were going wild, and I said, ‘and we will drain the swamp’ — the place went crazy. I couldn’t believe it. And then the next speech I said it again and they went even crazier. ‘We will drain the swamp… we will drain the swamp,’ and every time I said it I got the biggest applause

So we can say that the data facebook ‘allowed’ Cambridge Analytica to harvest from the subjects was, at least, ‘useful’.

So what does that have to do with DNA?

Basically if you think that someone getting their hands on your surfing history and using it for their own purposes without your consent is bad, what if they get their hands on your DNA?

The organisation that holds the DNA for myDNA from Prudential is Prenetics Limited (7). Recently I read that Alibaba and Ping An insurance are the major investors in Prenetics (8). On one hand, I find it amusing that Ping An possibly has access to data that Prudential helped collect. On the other, I find it scary that the data of these people (of course I did not purchase myDNA) is now in the hands of another insurer.

Anyway, Prenetics claims that the DNA of over 200,000 people across South East Asia, China and Hong Kong was in their hands as early as October 2017 (9).

But, I am sure some nice people will say, there is a legitimate reason to do research into DNA; hospitals and universities have been doing so to the benefit of mankind for years. Yes, but I would argue that the CEOs of hospitals and universities have different experiences as compared to the CEO of Prenetics (Mr Danny Yeung) and that may affect how the data is being used:
“Prenetics started out as ‘Multigene’ in 2009 when it span out from Hong Kong’s City University. Yeung joined the firm as CEO in 2014, after leaving Groupon following its acquisition of his Hong Kong startup uBuyiBuy, and it has been in startup mode since then. Prenetics has raised over $52 million from investors which, aside from Alibaba, include 500 Startups, Venturra Capital and Chinese insurance giant Ping An.”

This, I will admit, is pure speculation on my part. For all I know, Prenetics really wants to help mankind and bless everyone whose DNA they hold with better health and lower health care costs (prevention rather than cure). But I have other reasons to be sceptical.

Basically, even if humans ‘decoded’ the whole DNA sequence (which hasn’t been achieved yet (10)), and even if you have inherited a predisposition to a condition, nobody can tell whether you will actually be affected by it:
“Genetic testing can provide only limited information about an inherited condition. The test often can't determine if a person will show symptoms of a disorder, how severe the symptoms will be, or whether the disorder will progress over time.” (11)

And to make things more interesting, the pieces of the genetic code that have not been sequenced were considered useless or too hard to analyse given technological limitations, but are now being re-evaluated. Does that sound familiar? For people in the “Big Data” space (especially proponents of the “Data Lake”), it should.

One of the arguments for the “Data Lake” is that we do not know what data can be useful; even if we cannot extract it and use it now, we might as well keep it since it might be useful later.
When I first started in this line of work, the kind of conversations I would have would be along these lines:
Q: “What data do you need?”
A: “Just give me what you have and I’ll analyse”
Q: “That is impossible, tell me what data do you need?”
A: “Ok, can I have the list of pieces of data that you have?”
Q: “That is impossible, tell me what you want and I will see if I have it...” ad nauseam

Now technology and acceptance of the usefulness of data have advanced and it is possible to “keep all the data” in a “Data Lake” or “Data Swamp” as some friends call it (Drain it! Drain it! Sorry, I got carried away for a moment).

Pieces of data that we would have had trouble analysing a few years ago such as weblogs, or pictures, or voice recordings can now be analysed relatively easily. But these pieces of data were routinely considered to be useless.

It is the same thing with DNA data. And to make it worse, there is the uncertain link between being at risk of some condition as per your DNA profile and actually getting that condition.

Basically, there is way too much data that would be needed to transform this ‘risk’ into something that can be measured with ‘enough accuracy’. That is what insurance companies try to do when they ask questions about your lifestyle, smoking, drinking... but these are very crude.

So is it fair that you could be penalised because of a feature of your DNA make-up? Are we slaves of our DNA?

What I am getting at is not the importance of DNA data, but rather the care that must be taken when conclusions are drawn and people are penalised for things they may not be aware of.

To make things more fun, not only is Prenetics in China, Hong Kong and South East Asia, but it has recently acquired DNAFit (12). This impacts Prenetics in two ways. Firstly, geographically: DNAFit’s market presence is mainly in Europe and is expanding to the USA. Secondly, DNAFit goes direct to the consumer, whereas Prenetics tended to reach the consumer via insurance or medical companies. (In fact even LinkedIn is one of DNAFit’s customers).

The impact of direct-to-consumer DNA kits is debatable (13), but “a little learning is a dangerous thing” (14); add to this the emotional weight of ‘learning’ not necessarily pleasant things about your own self...

So what I am saying is:
  1. As individuals we should have control over the data we produce by living (web/call/messaging behaviour, surveillance footage...).
  2. But we should also have control over data we produce by existing (DNA).

I think there are many gaps between the general public (who have no issues with being facebook’s product in exchange for a quiz (15)) and those who have some idea of what can be done with such data; the same for DNA. And it is critical for people to be educated or educate themselves on this. As long as there is such an asymmetry of information, together with major issues with how people/machines use the data (people/machines, not technology or data itself), the cost of exploitation can be very high.

I would like to end this post with the poem by Alexander Pope (14):

A little learning is a dangerous thing ;
Drink deep, or taste not the Pierian spring :
There shallow draughts intoxicate the brain,
And drinking largely sobers us again.
Fired at first sight with what the Muse imparts,
In fearless youth we tempt the heights of Arts ;
While from the bounded level of our mind
Short views we take, nor see the lengths behind,
But, more advanced, behold with strange surprise
New distant scenes of endless science rise !
So pleased at first the towering Alps we try,
Mount o’er the vales, and seem to tread the sky ;
The eternal snows appear already past,
And the first clouds and mountains seem the last ;
But those attained, we tremble to survey
The growing labours of the lengthened way ;
The increasing prospect tires our wandering eyes,
Hills peep o’er hills, and Alps on Alps arise !


7 https://www.prudential.com.sg/en/prumydna/mydnapromotnc/ see point g: “‘myDNA report’ means the personalised report that Eligible Customers receive from Prenetics Limited”

Tuesday 17 April 2018

#metoo, the question of identification, understanding and biases we and AI may not be aware of


I was more than halfway through writing a “data science” blog post when I read a piece of news about a reporter who was kissed on both cheeks during a report on the Hong Kong rugby sevens (1). Another thought piece mentioned that the reporter looked humiliated afterwards. (2)

What I found most interesting is that the headline called the men “rude”. Rude? Rude is not saying hello to people, not saying “thank you” or “please”. Is kissing someone, in an obviously pre-planned way, just rude? Given that the person was apparently humiliated, it ought to be more than that, no? Where is the #metoo movement?

Then I recalled another recent incident on American Idol where a judge, hearing that a contestant had never kissed anyone before because the contestant believed that this would require being in a relationship, tricked the contestant into a kiss on the lips (3). Again, there was minor backlash, but no #metoo against the judge. (4)

When you compare this to the groundswell of the #metoo movement, you cannot help but wonder... The crux of #metoo is identification: you identify with something, someone... Maybe it’s not just sexual harassment, or even sexual harassment of a woman.

Earlier this year I read this open letter, which I highly recommend everyone read (5). Emma Watson talks about feminism, and what I like most is the idea that introspection is needed to understand your own position, especially things you may not be aware of.

When I gave my UN speech in 2015, so much of what I said was about the idea that “being a feminist is simple!” Easy! No problem! I have since learned that being a feminist is more than a single choice or decision. It’s an interrogation of self. Every time I think I’ve peeled all the layers, there’s another layer to peel. But, I also understand that the most difficult journeys are often the most worthwhile. And that this process cannot be done at anyone else’s pace or speed.

When I heard myself being called a “white feminist” I didn’t understand (I suppose I proved their case in point). What was the need to define me — or anyone else for that matter — as a feminist by race? What did this mean? Was I being called racist? Was the feminist movement more fractured than I had understood? I began...panicking.

It would have been more useful to spend the time asking myself questions like: What are the ways I have benefited from being white? In what ways do I support and uphold a system that is structurally racist? How do my race, class and gender affect my perspective? There seemed to be many types of feminists and feminism. But instead of seeing these differences as divisive, I could have asked whether defining them was actually empowering and bringing about better understanding. But I didn’t know to ask these questions.

Basically, what I am trying to say is that the stories above (the reporter at the rugby sevens and the American Idol story) didn’t create that huge an outcry, possibly because of some bias which people may or may not be aware of.

So how does that relate to “data science”?

Well, it seems blogs get tagged by keywords, and since I would rather this didn’t get tagged as a political post, I have to add a “data science” bit; for this purpose I will use the words data science without quotation marks. So here comes the data science bit...

As data science gets more and more automated, as ML and AI become more popular (especially to non data scientists), one of the hidden dangers is bias.

You are what you eat, even if you are a machine or are artificial.

It takes a lot of effort to even identify biases from a bunch of data. That’s something I mentioned before in the context of Human Resource Analytics where the impact could arguably be the worst (6).

There have been many articles on topics such as whether computers can be racist (7); the answer is yes, if the data set used to train them happened to have a bias towards or against a certain race, even if it was purely unintentional. For example, if your area of the world has virtually no orange people and one happens to apply for a role (say president) and gets accepted, a machine could pick up that orangeness makes one suited for presidency (the probability of becoming president given that the race is orange is 1, haha).
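The orange example can be made concrete with a few lines of code. This is only a sketch on made-up data: the 999-to-1 split, the labels and the function name are my own assumptions, and the "model" is nothing more than a frequency count, which is roughly what a naive learner ends up doing with a single example.

```python
# Hypothetical training data, invented for illustration:
# (colour, became_president). One orange person, who happened to win.
history = [("other", False)] * 999 + [("orange", True)]


def p_president_given(colour, records):
    """Naive frequency estimate of P(president | colour) and its sample size."""
    outcomes = [became for c, became in records if c == colour]
    if not outcomes:
        return None, 0  # colour never seen: no estimate at all
    return sum(outcomes) / len(outcomes), len(outcomes)


for colour in ("orange", "other"):
    estimate, n = p_president_given(colour, history)
    print(f"P(president | {colour}) = {estimate} (based on {n} example(s))")
```

The estimate for orange is 1.0, but it rests on a single example; a machine that only looks at the estimate and never at the sample size behind it will happily learn the wrong lesson.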

Serious thought is being given to the topic. Barocas and Selbst (8) argue that “Addressing the sources of this unintentional discrimination and remedying the corresponding deficiencies in the law will be difficult technically, difficult legally, and difficult politically. There are a number of practical limits to what can be accomplished computationally.” Ransbotham (9) from Boston College also argued that having more data doesn’t necessarily remove sampling bias.
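The point about more data not curing sampling bias is easy to check with a small simulation. Everything here is an assumption made for illustration (the true rate of 50%, the inclusion probabilities, the function name); it is not taken from the cited article. People with some trait are simply three times as likely to end up in the data set as people without it.

```python
import random


def biased_sample_estimate(n, true_rate=0.5, keep_if_trait=0.9,
                           keep_otherwise=0.3, seed=0):
    """Estimate the trait rate from n records collected with a biased filter."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        has_trait = rng.random() < true_rate          # draw from the real population
        p_keep = keep_if_trait if has_trait else keep_otherwise
        if rng.random() < p_keep:                     # biased inclusion step
            kept.append(has_trait)
    return sum(kept) / n


# The true rate is 0.5, but the biased sample converges to
# 0.5 * 0.9 / (0.5 * 0.9 + 0.5 * 0.3) = 0.75 as n grows.
for n in (100, 10_000, 100_000):
    print(n, round(biased_sample_estimate(n), 3))
```

A bigger sample makes the estimate more stable, but it stabilises around 0.75 rather than the true 0.5: more data gives more confidence in the wrong answer.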

In sum, when we, whether as data scientists or normal human beings, want to analyse an issue, it is a good idea to understand our own biases and the biases in the data we have, so that we can do justice to interpreting the information we have and generating results.

(3) https://www.youtube.com/watch?v=ce3_D3IG96w I guess people who didn't know of this case would have assumed the genders were reversed. While some men would have loved to be kissed by Katy Perry, not everyone would, especially people who believe you need to be in a relationship before you kiss someone.
(4) Just FYI the contestant did not get past the round.

Thursday 12 April 2018

Taxing teachers and national defence personnel but subsidising private companies


Recently, it has been announced that teachers will be charged for parking at their place of work (schools) (1) and so will military personnel (2) (3). On the other hand, as I highlighted in a previous post (4), public space has been reserved for parking the bikes of privately owned, for-profit bike leasing companies.

I found it a bit strange.

Yes, there is a move to decrease the number of cars and increase the use of ‘greener’ transportation methods (electric cars, car-pooling but also bicycles) (5), but is that a reason? Do we want our teachers to cycle to work so they are fitter and set an example for their students and fight obesity in schools?

Do we think that portly people in military attire are unsightly (6)? And aren’t the regular incentivised programs (7) sufficient to tackle the problem?

Even if that was the case, then why, on the other hand, subsidise the commercial profit-making businesses that own the bikes and play in the ‘bike-sharing’ space?

I know that for many people the word ‘subsidy’ is quasi-taboo, but I am not using it lightly. In fact, one of the reasons why teachers have to pay for parking at their places of work is precisely to ‘remove the subsidy’ that they had been enjoying:
"Such practices are tantamount to providing hidden subsidies for vehicle parking and are not in line with the requirements laid down in the Government Instruction Manuals," the AGO had said.(8). 

Saying teachers are being taxed is therefore an exaggeration, but to the person concerned, having to pay for something they previously did not have to pay for is equivalent to a tax; bottom line, you have less to spend.

As for the military bases, the rule doesn’t apply across the board, but only to specially chosen bases: “Due to their proximity to public amenities, the car parks in these camps are deemed to have market value,” the ministry said (9). For teachers it is across the board.

Do I really mean that the money taken from the pockets of teachers and military personnel gets transferred to the pockets of the owners of the for-profit bike-sharing companies? Of course not.

But do I find that the policies, while each standing on their own arguments, contradict each other? Yes. And that is my point.

Remember the fish-ball-stick incident? (10) It illustrated the fact that different government agencies were not coordinated and that coordination is now the job of the Municipal Services Office (MSO). 

Actually, interestingly, in one of my previous roles, we had proposed to the government agencies a system that would take the feedback received from the public and automatically distribute it to the right authority or combination of authorities to deal with. But then we were told the MSO was already on the way. (You see, analytics can even help clear fish-ball sticks, just attach a drone haha).

Ahum, so am I saying that this tax and subsidise issue could have been prevented, or at least highlighted to the relevant authorities?

To put it simply, yes. What I did was simple: I realised what the implications of the different policies were and what space in economics they occupied, then I saw that their positions were in opposition, not to say contradictory. Is it difficult to build a simple analytics-based system to do that? No, it is not that complicated, and there are quite a few algorithms that can help get the topics: stuff like LDA (11) or LSA (12), or a simple Bayesian classifier (13). Then it’s a question of comparing documents on the same topic.
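For the curious, here is a minimal sketch of the "simple Bayesian classifier" route: a tiny multinomial naive Bayes written from scratch that assigns each policy snippet to a topic, followed by a check for opposing stances within the same topic. Every snippet, topic label and stance below is invented for illustration, and the stance is assigned by hand; working out the stance automatically would be a separate (and harder) problem. It is a toy, not a description of any system the MSO or anyone else runs.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled snippets (text, topic), invented for illustration.
TRAINING = [
    ("teachers will pay for parking at schools", "parking"),
    ("car park charges at military camps", "parking"),
    ("public space reserved for parking shared bikes", "parking"),
    ("new syllabus for primary school mathematics", "education"),
    ("exam timetable released for secondary students", "education"),
]

# Hypothetical policies (text, hand-assigned stance) to be compared.
POLICIES = [
    ("schools to charge teachers for parking", "remove subsidy"),
    ("free public space for parking of shared bikes", "subsidise"),
    ("revised exam timetable for students", "neutral"),
]


def tokenize(text):
    return text.lower().split()


def train(examples):
    """Count documents per topic and words per topic."""
    topic_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, topic in examples:
        topic_counts[topic] += 1
        for word in tokenize(text):
            word_counts[topic][word] += 1
            vocab.add(word)
    return topic_counts, word_counts, vocab


def classify(text, model):
    """Return the most probable topic, using add-one (Laplace) smoothing."""
    topic_counts, word_counts, vocab = model
    total_docs = sum(topic_counts.values())
    best_topic, best_score = None, -math.inf
    for topic, n_docs in topic_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[topic].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[topic][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


def find_conflicts(policies, model):
    """Topics that contain both a subsidising and a subsidy-removing policy."""
    stances = defaultdict(set)
    for text, stance in policies:
        stances[classify(text, model)].add(stance)
    return sorted(t for t, s in stances.items()
                  if {"subsidise", "remove subsidy"} <= s)


MODEL = train(TRAINING)
print(find_conflicts(POLICIES, MODEL))
```

On this toy data the two parking policies land in the same topic with opposite stances and get flagged, while the exam timetable is left alone. A real system would need far more training text and a less crude way of deciding what each policy is actually doing.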

“Data Science” to the rescue? Anyone in the government would like to know more? :D


3 Actually in the latter case the number of workplaces where the payment is being implemented has increased; it is not a totally new policy
5 Note that companies like grab (and uber) do not make the world greener, on the contrary. The price point of such companies is lower than regular taxis but higher than buses/trains, which are more efficient means of transport. So moving people away from taxis to grab does not make a huge green dent, but moving people from buses/trains to grab moves them to a less green means of transport. That could be the topic of another blog post... :)
8 as (1), AGO means Auditor-General’s Office
9 as (2) above, the ministry being the ministry of defence.