Sunday, 18 February 2018

Know how to address your relatives for Chinese New Year, or ontologise like a "data scientist" while visiting



One of the things I usually say about “Data Science” is that a lot of what is being done today has been in existence for a long while, but is being used differently/more extensively, partially aided by technological advances.

Ontologies are no different; they have been around for a long while. Most of us have come across taxonomies.

Put simply, a taxonomy is a way of classifying stuff in a hierarchy of concepts, such as parent-child or class-sub-class. A family tree is one example, and a useful one for Chinese New Year (1). GXFC!



And since this is the lunar year of the dog, let’s look at the family Canidae:


So what’s the difference between a taxonomy and an ontology?

Basically an ontology can capture much more than parent-child type relationships. It is designed to capture ideas.

In fact the earliest known versions of ontologies date back to the ancient Greeks. Ontology is the philosophical study of the nature of being, becoming, existence or reality, as well as the basic categories of being and their relations (2).

Ontologies can be physically manifested and used as systems to capture knowledge in a certain area, and to represent the relationships between the concepts in that area. An ontology can be used as a knowledge base, and that is one form in which it can be useful in “Data Science”.
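To make the taxonomy/ontology difference concrete, here is a minimal sketch in Python. The triples and the relation names (is_a, domesticated_from, eats) are made up for illustration and are not taken from any published ontology. The point is only that the is_a links on their own form a taxonomy, while the other relations carry the extra knowledge.

```python
from collections import defaultdict

# Illustrative (subject, relation, object) triples; names are assumptions.
TRIPLES = [
    ("dog", "is_a", "canid"),
    ("wolf", "is_a", "canid"),
    ("canid", "is_a", "mammal"),
    ("dog", "domesticated_from", "wolf"),
    ("dog", "eats", "meat"),
]


def build_index(triples):
    """Index objects by (subject, relation) for quick lookup."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[(subject, relation)].append(obj)
    return index


def ancestors(index, concept):
    """Follow only is_a links upwards: this is the taxonomy part."""
    found = []
    frontier = list(index.get((concept, "is_a"), []))
    while frontier:
        parent = frontier.pop(0)
        if parent not in found:
            found.append(parent)
            frontier.extend(index.get((parent, "is_a"), []))
    return found


def related(index, concept, relation):
    """Look up any other relation: this is what a taxonomy cannot express."""
    return list(index.get((concept, relation), []))


INDEX = build_index(TRIPLES)
print(ancestors(INDEX, "dog"))                     # ['canid', 'mammal']
print(related(INDEX, "dog", "domesticated_from"))  # ['wolf']
```

A real ontology would normally be written in a dedicated language and would also constrain which relations are allowed between which classes; the list of tuples here is just the smallest thing that shows the idea.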

Another thing I always say about “data science” is that the software doesn’t matter much; most software will have the most common formulae/algorithms. What is critical is how you use the algorithms and knowing which to use when, not the software, which is just a tool.

However, not all formulations of these algorithms are identical, so for people new to a piece of software, mistakes can creep in unnoticed and bad decisions can be made.

A simple example will illustrate what I mean. Say you are doing a simple hypothesis test based on the normal distribution and you key in the mean and standard deviation as you are used to. The p-values/CI are calculated accordingly. Now imagine the software was expecting the variance rather than the standard deviation: to compute the limits it takes the square root of the ‘variance’ you supplied, so it works with the square root of your standard deviation. If your standard deviation is below 1, the limits come out too wide and you fail to reject the null more often than you should; if it is above 1, they come out too narrow and you reject too often.
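Here is a small sketch of that mix-up using only the Python standard library. The function names and the numbers are hypothetical, chosen so that the correct call and the mistaken call land on opposite sides of the usual 5% threshold.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(sample_mean, null_mean, sd, n):
    """Two-sided z-test p-value, given the population standard deviation."""
    z = (sample_mean - null_mean) / (sd / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def two_sided_p_variance_api(sample_mean, null_mean, variance, n):
    """Same test, but this (hypothetical) interface expects the variance."""
    return two_sided_p(sample_mean, null_mean, sqrt(variance), n)


# Standard deviation above 1: the mistaken call uses sqrt(4) = 2 instead of 4,
# so the p-value is too small and the null gets rejected too easily.
correct_large = two_sided_p(101.0, 100.0, 4.0, 25)
mistaken_large = two_sided_p_variance_api(101.0, 100.0, 4.0, 25)

# Standard deviation below 1: the mistaken call uses sqrt(0.25) = 0.5 instead
# of 0.25, so the p-value is too large and a real effect is missed.
correct_small = two_sided_p(100.1, 100.0, 0.25, 25)
mistaken_small = two_sided_p_variance_api(100.1, 100.0, 0.25, 25)

print(correct_large, mistaken_large)
print(correct_small, mistaken_small)
```

Nothing crashes and no warning is raised in either case, which is exactly why this kind of mistake goes unnoticed.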

Hence enabling a user to quickly reference the parameters of a formula is very useful.


But an ontology can do much more than that. It’s not just different versions of one formula, but also the ability to know which formulae/algorithms are closely related. This allows users to pick the best algorithm for the problem they are trying to solve.


Continuing the example above, what if some of the assumptions of the normal distribution have been violated? Even if you used the correct parameters, the result would likely still be invalid because the formula should not be used. Again this could lead to wrong conclusions.

This is where ProbOnto (3) can be useful. ProbOnto is an ontology of major probability distributions, clearly enumerating the parameters that each formulation expects and the relationships between the formulae.
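As a rough sketch of the kind of lookup such a knowledge base enables, here is a toy catalogue in Python. The entry names, the layout and the helper functions are my own assumptions for illustration; they are not ProbOnto's actual identifiers or file format.

```python
from math import sqrt

# Hypothetical catalogue of formulations of the normal distribution.
# Each entry lists the parameters it expects and how to convert its
# spread parameter to a standard deviation.
FORMULATIONS = {
    "normal_sd": {
        "parameters": ("mean", "standard deviation"),
        "to_sd": lambda s: s,
    },
    "normal_variance": {
        "parameters": ("mean", "variance"),
        "to_sd": lambda v: sqrt(v),
    },
    "normal_precision": {
        "parameters": ("mean", "precision"),  # precision = 1 / variance
        "to_sd": lambda t: 1 / sqrt(t),
    },
}

# Hypothetical relationships between distributions, as described in the text.
RELATIONSHIPS = [
    ("binomial", "normal", "approximation for large n"),
    ("lognormal", "normal", "log transformation"),
]


def expected_parameters(formulation):
    """What does this formulation want me to key in?"""
    return FORMULATIONS[formulation]["parameters"]


def as_standard_deviation(formulation, spread):
    """Convert the formulation's spread parameter to a standard deviation."""
    return FORMULATIONS[formulation]["to_sd"](spread)


def related_to(distribution):
    """List (other distribution, how they relate) pairs."""
    return [(src, how) for src, dst, how in RELATIONSHIPS if dst == distribution]


print(expected_parameters("normal_variance"))
print(as_standard_deviation("normal_precision", 0.25))
print(related_to("normal"))
```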


For example, if you are planning to use a binomial distribution B(n,p) and n, the sample size, is large enough, it can be approximated by a normal distribution with mean np and variance np(1-p).
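A quick way to see how good the approximation is: compare the exact binomial CDF with the normal one for increasing n. This is a sketch using the standard library only; the continuity correction (adding 0.5) and the choice of p = 0.3 are my additions, not something the text above prescribes.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """P(X <= k) under N(np, np(1-p)), with a continuity correction."""
    approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return approx.cdf(k + 0.5)


def approximation_error(n, p):
    """Absolute gap between exact and approximate CDF at k = floor(np)."""
    k = int(n * p)
    return abs(binomial_cdf(k, n, p) - normal_approx_cdf(k, n, p))


for n in (10, 100, 1000):
    print(n, approximation_error(n, 0.3))
```

The gap shrinks as n grows, which is what "large enough" means in practice; how large is large enough also depends on how far p is from 0.5.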


Or if X is a lognormal variable with parameters μ and σ, then log(X) follows a normal distribution with mean μ and standard deviation σ; note that these are the mean and standard deviation of log(X), not of X itself. Lognormal distributions are often used in pricing stocks and options; they are at the heart of the Black-Scholes model.
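The lognormal relationship can be checked with a small simulation. This is a sketch with arbitrary parameter values and a fixed seed; it also prints the mean of X itself, which is exp(μ + σ²/2) for a lognormal variable, to show that it is not the same number as the parameter μ.

```python
import random
from math import exp, log
from statistics import fmean, stdev

MU, SIGMA = 0.5, 0.4  # illustrative parameters of the underlying normal

rng = random.Random(2018)  # fixed seed so the run is repeatable
xs = [rng.lognormvariate(MU, SIGMA) for _ in range(50_000)]
logs = [log(x) for x in xs]

# log(X) should look normal with mean MU and standard deviation SIGMA.
log_mean, log_sd = fmean(logs), stdev(logs)
print(log_mean, log_sd)

# The mean of X itself is exp(MU + SIGMA^2 / 2), not MU.
sample_mean_x = fmean(xs)
theoretical_mean_x = exp(MU + SIGMA**2 / 2)
print(sample_mean_x, theoretical_mean_x)
```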

In sum, an ontology can be used as a great way of storing knowledge, as shown by ProbOnto. But is it only useful to forgetful people with some statistics background? And if so, why this blog from me?

Well, ontologies can also have very practical applications even if used purely as knowledge bases. They can be used to capture human knowledge in a systematic way and be used to solve business problems.

Let me take a simple example. If an organisation wants to automatically assess the impact of news on its business, how can it do that quickly?

Organising large volumes of text, extracting key words, arranging them in groups based on context and making it possible to measure the distance between words is what word2vec (4) does. word2vec has been used in some domains, such as genes and proteins (5) and radiology (6).
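To give a feel for what "distance between words" means, here is a toy sketch. The three-dimensional vectors below are invented by hand for illustration; a trained word2vec model learns vectors with many more dimensions from a large corpus. Cosine similarity is one common way to compare such vectors.

```python
from math import sqrt

# Hand-made vectors, purely illustrative; not the output of any trained model.
VECTORS = {
    "sugar": (0.9, 0.1, 0.3),
    "ethanol": (0.8, 0.2, 0.4),
    "radiology": (0.1, 0.9, 0.2),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word):
    """Return the other word whose vector is closest to the given word's."""
    others = (w for w in VECTORS if w != word)
    return max(others, key=lambda w: cosine_similarity(VECTORS[word], VECTORS[w]))


print(most_similar("sugar"))  # ethanol
```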

However, not all organisations have the skill, know-how, large enough dataset and time to do their own implementation. What many organisations do have is in-house know-how and experience from their own people. Ontologies can be used to capture the knowledge of these experts and to represent words within the context at hand.

For example, think of the case of news of a proliferation of Chilo infuscatellus/sugar cane shoot borer (7) in India, threatening a large part of Indian sugar output 6 months down the road.

If you are an investment advisor, this means that output of sugar will fall; you don’t care about the insect causing the issue. Prices are expected to rise, so customers should buy sugar. Also, our ontology would tell you that alternative sources of sugar could be Brazil and China, and that Cosan (8) might be a good stock to buy given its importance in the Brazilian market as well as its diversification into ethanol and other by-products (ability to quickly change output mix).

But on the other hand, if you are in the chemicals and fertiliser business, the type of insect matters; you will know that there is little that pesticides can do against this one, so given the type of infestation there is no impact on your business, unlike, say, if the infestation was of Scirpophaga excerptalis/sugar cane top borer (9).

It is difficult to expect an out-of-the-box algorithm to work equally well in both of these contexts, and that’s where ontologies built using human experience and knowledge can bridge or even adequately fill the gap.
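The sugar example can be sketched as a tiny knowledge base that is read differently depending on who is asking. Everything below is hypothetical and simply encodes the statements made in this post (including the assumption about which borer pesticides can deal with); it is not agronomic or investment advice, and the structure and function names are my own.

```python
# Expert knowledge as stated in the post, encoded by hand (illustrative only).
PESTS = {
    "Chilo infuscatellus": {"crop": "sugar cane", "pesticide_effective": False},
    "Scirpophaga excerptalis": {"crop": "sugar cane", "pesticide_effective": True},
}
CROP_TO_COMMODITY = {"sugar cane": "sugar"}
PRODUCERS = {"sugar": ["India", "Brazil", "China"]}
COMPANIES = {"Cosan": {"country": "Brazil", "products": ["sugar", "ethanol"]}}


def investment_view(pest, affected_country):
    """The advisor ignores the insect and follows the supply chain."""
    commodity = CROP_TO_COMMODITY[PESTS[pest]["crop"]]
    alternatives = [c for c in PRODUCERS[commodity] if c != affected_country]
    stocks = [
        name
        for name, info in COMPANIES.items()
        if info["country"] in alternatives and commodity in info["products"]
    ]
    return {
        "commodity": commodity,
        "price_outlook": "up",  # lower output is assumed to push prices up
        "alternative_sources": alternatives,
        "stocks_to_review": stocks,
    }


def agrochemical_view(pest):
    """The chemicals business only cares whether pesticides can help."""
    if PESTS[pest]["pesticide_effective"]:
        return "possible demand for pesticides"
    return "no impact"


print(investment_view("Chilo infuscatellus", "India"))
print(agrochemical_view("Chilo infuscatellus"))
print(agrochemical_view("Scirpophaga excerptalis"))
```

The same piece of news goes in, and two different answers come out, because each view walks a different part of the stored knowledge.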


  1. https://mustsharenews.com/addressing-relatives-cny/
  2. https://en.wikipedia.org/wiki/Ontology
  3. https://sites.google.com/site/probonto/home
  4. https://arxiv.org/pdf/1301.3781.pdf
  5. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141287
  6. https://www.sciencedirect.com/science/article/pii/S1532046417302575
  7. https://en.wikipedia.org/wiki/Chilo_infuscatellus
  8. https://en.wikipedia.org/wiki/Cosan
  9. https://en.wikipedia.org/wiki/Scirpophaga_excerptalis
 
