One of the things I usually say about
“Data Science” is that a lot of what is being done today has been in existence
for a long while, but is being used differently and more extensively, partly
aided by technological advances.
Ontologies are no different; they
have been around for a long while. Most of us have come across taxonomies.
Put simply, a taxonomy is a way of
classifying stuff in a hierarchy of concepts, such as parent-child or
class-sub-class; a family tree is one example, useful for Chinese New Year (1),
GXFC!
And since this is the lunar year
of the dog, let’s look at the family Canidae:
So what’s the difference between
a taxonomy and an ontology?
Basically an ontology can capture
much more than parent-child type relationships. It is designed to capture
ideas.
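To make the contrast concrete, here is a minimal sketch in Python. The taxonomy is just a parent lookup, while the ontology is a list of (subject, relation, object) triples with several relation types. The relation names and the non-taxonomic facts are my own illustrations and are not taken from any published ontology.

```python
# Taxonomy: every concept has exactly one kind of link, to its parent.
taxonomy_parent = {
    "Canis": "Canidae",
    "Vulpes": "Canidae",
    "grey wolf": "Canis",
    "domestic dog": "Canis",
    "red fox": "Vulpes",
}


def ancestors(concept):
    """Walk up the parent links and return the chain of ancestors."""
    chain = []
    while concept in taxonomy_parent:
        concept = taxonomy_parent[concept]
        chain.append(concept)
    return chain


# Ontology: (subject, relation, object) triples with many relation types.
# Relation names here are illustrative assumptions only.
ontology = [
    ("domestic dog", "is_a", "Canis"),
    ("grey wolf", "is_a", "Canis"),
    ("domestic dog", "domesticated_by", "human"),
    ("grey wolf", "lives_in", "pack"),
    ("grey wolf", "preys_on", "deer"),
]


def related(subject, relation):
    """Return every object linked to `subject` by `relation`."""
    return [o for s, r, o in ontology if s == subject and r == relation]


print(ancestors("domestic dog"))
print(related("domestic dog", "domesticated_by"))
print(related("grey wolf", "preys_on"))
```

The taxonomy can only answer "what is this a kind of?", whereas the triples can answer as many kinds of question as there are relation types.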
In fact the earliest known
versions of ontologies date back to the ancient Greeks. In philosophy, ontology
is the study of the nature of being, becoming, existence or reality, as well as
of the basic categories of being and the relations between them (2).
Ontologies can also be physically manifested
and used as systems to capture knowledge in a certain area, and to represent
the relationships between the concepts in that area. An ontology can be
used as a knowledge base, and that’s one form in which it can be useful in
“Data Science”, for example.
Another thing I always say about
“data science” is that the software doesn’t matter much; most software will
have the most common formulae/algorithms. What is critical is how you use the
algorithms and knowing which to use when, not the software, which is just a
tool.
However, not all formulations of these
algorithms are identical, hence for people new to some software, mistakes can
creep in unnoticed and bad decisions can be made.
A simple example will illustrate
what I mean. Say you are doing a simple hypothesis test based on the
normal distribution and you key in the mean and standard deviation as you are
used to; the p-values/CI are calculated accordingly. Now imagine the software
was expecting the variance rather than the standard deviation: for the
computation of the limits, it will take the square root of the ‘variance’ you
supplied. If your standard deviation is below 1, that square root is larger
than the value you intended, and you are more likely not to reject the null
than you should; if it is above 1, the opposite happens and you reject too
readily.
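Here is a small sketch of that mistake using only Python's standard library. The function `z_test_p_value` and all the numbers are hypothetical, picked so that the standard deviation is below 1 and the decision at the 5% level flips.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(sample_mean, null_mean, sd, n):
    """Two-sided p-value for a z-test with a known standard deviation."""
    z = (sample_mean - null_mean) / (sd / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical numbers, chosen only for illustration.
sample_mean, null_mean, n = 10.25, 10.0, 25
true_sd = 0.5

# Correct use: the calculation receives the standard deviation.
p_correct = z_test_p_value(sample_mean, null_mean, true_sd, n)

# The mistake: a tool expecting the variance takes the square root of
# whatever it is given, so it effectively works with sqrt(0.5), about 0.71.
p_mistaken = z_test_p_value(sample_mean, null_mean, sqrt(true_sd), n)

print(f"correct  p-value: {p_correct:.4f}")
print(f"mistaken p-value: {p_mistaken:.4f}")
```

With these inputs the correct p-value falls below 0.05 while the mistaken one sits above it, so the same data leads to two different decisions.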
Hence enabling a user to quickly
reference the parameters of a formula is very useful.
But an ontology can do much more than that. It can capture not just the different versions of one formula, but also which formulae/algorithms are closely related. This allows users to pick the best algorithm for the problem they are trying to solve.
Continuing the example above,
what if some of the assumptions of the normal distribution have been violated?
Even if you used the correct parameters, the result would likely still be
invalid because the formula should not be used. Again this could lead to wrong
conclusions.
This is where ProbOnto (3) can be
useful. ProbOnto is an ontology of major probability distributions, clearly
enumerating the parameters that each formulation expects and the relationships
between the formulae.
For example, if you are planning to use a binomial distribution B(n,p) and n, the number of trials, is large enough, it can be approximated by a normal distribution with mean np and variance np(1-p).
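As a quick check of that relationship, the sketch below compares the exact binomial probability with the normal approximation. The values of n, p and k are arbitrary, and the continuity correction (the `+ 0.5`) is a standard refinement that I have added; it is not part of the statement above.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k), with a continuity correction."""
    approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return approx.cdf(k + 0.5)


# Arbitrary illustrative values.
n, p, k = 100, 0.5, 55
exact = binomial_cdf(k, n, p)
approx = normal_approx_cdf(k, n, p)

print(f"exact  P(X <= {k}): {exact:.4f}")
print(f"approx P(X <= {k}): {approx:.4f}")
```

Note that `NormalDist` takes the standard deviation, not the variance, which is exactly the kind of detail discussed earlier.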
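Or if X is a lognormal variable
with parameters mu and sigma, then log(X) follows a normal distribution
with mean mu and standard deviation sigma. Note that these are the mean and
standard deviation of log(X), not of X itself, which is again the kind of
detail that is easy to get wrong. Lognormal distributions are often used in
pricing stocks and options; they are at the heart of the Black-Scholes model.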
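A short simulation makes the parameterisation visible. I am assuming the usual convention where the lognormal parameters mu and sigma describe log(X), which is also what Python's `random.lognormvariate` expects; the values of mu and sigma below are arbitrary.

```python
import random
from math import exp, log
from statistics import fmean, stdev

random.seed(0)

# Arbitrary parameters of log(X), not of X itself.
mu, sigma = 0.5, 0.4

xs = [random.lognormvariate(mu, sigma) for _ in range(100_000)]
logs = [log(x) for x in xs]

# The logs recover mu and sigma ...
print(f"mean of log(X): {fmean(logs):.3f}")
print(f"sd of log(X):   {stdev(logs):.3f}")

# ... but the mean of X itself is exp(mu + sigma^2 / 2), not mu.
print(f"mean of X:           {fmean(xs):.3f}")
print(f"exp(mu + sigma^2/2): {exp(mu + sigma**2 / 2):.3f}")
```

Feeding the mean and standard deviation of X into a routine that expects mu and sigma would silently give the wrong distribution.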
In sum, an ontology can be used
as a great way of storing knowledge, as shown by ProbOnto. But is it only
useful to forgetful people with some statistics background? And if so, why this
blog from me?
Well, ontologies can also have
very practical applications even if used purely as knowledge bases. They can be
used to capture human knowledge in a systematic way and be used to solve
business problems.
Let me take a simple example. If
an organisation wants to automatically assess the impact of news on its
business, how can it do that quickly?
Organising large volumes of text,
extracting key words, arranging them in groups based on context and
making it possible to measure the distance between words is what word2vec (4)
does. word2vec has been used in some specialised domains, such as genes and
proteins (5) and radiology (6).
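To give a feel for what "distance between words" means, here is a toy sketch using cosine similarity. The three vectors are made up by me; real word2vec embeddings are learned from a large corpus and typically have hundreds of dimensions.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only.
vectors = {
    "sugar": [0.9, 0.1, 0.2],
    "ethanol": [0.8, 0.3, 0.1],
    "insect": [0.1, 0.9, 0.3],
}

sim_sugar_ethanol = cosine_similarity(vectors["sugar"], vectors["ethanol"])
sim_sugar_insect = cosine_similarity(vectors["sugar"], vectors["insect"])

print(f"sugar vs ethanol: {sim_sugar_ethanol:.3f}")
print(f"sugar vs insect:  {sim_sugar_insect:.3f}")
```

In this toy setup "sugar" sits closer to "ethanol" than to "insect"; with a trained model, that closeness would come from how the words are used in the training text.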
However, not all organisations
have the skill, know-how, large enough dataset and time to do their own
implementation. What many organisations do have is in-house know-how and
experience from their own people. Ontologies can be used to capture the
knowledge of these experts and represent the words within the context at hand.
For example, think of the case of
news of a proliferation of Chilo infuscatellus/sugar cane shoot borer (7) in India, threatening a
large part of Indian sugar output 6 months down the road.
If you are an investment advisor,
this means that output of sugar will fall; you don’t care about the insect
causing the issue. Prices are expected to rise, so customers should buy
sugar. Also, our ontology would tell you that alternate sources of sugar could
be Brazil and China, and Cosan (8) might be a good stock to buy given their
importance in the Brazilian market as well as their diversification into
ethanol and other by-products (ability to quickly change output mix).
But on the other hand, if you are
in the chemicals and fertiliser business, the type of insect matters; you will
know that there is little that pesticides can do against this one, hence given
the type of infestation, there is no impact on your business, unlike, say, if
the infestation was of Scirpophaga excerptalis/sugar cane top borer (9).
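The two readings of the same news item can be sketched with a very small knowledge base. Everything here is hypothetical: the triples simply encode the expert statements from the example above (I have not verified the agronomy), and the relation names and the `assess` function are my own inventions.

```python
# (subject, relation, object) triples encoding the example's expert knowledge.
# These mirror the statements in the text and are illustrative only.
KNOWLEDGE = [
    ("Chilo infuscatellus", "is_a", "sugar cane pest"),
    ("Scirpophaga excerptalis", "is_a", "sugar cane pest"),
    ("sugar cane pest", "reduces_output_of", "sugar"),
    ("Chilo infuscatellus", "controllable_by_pesticide", "no"),
    ("Scirpophaga excerptalis", "controllable_by_pesticide", "yes"),
    ("sugar", "alternative_source", "Brazil"),
    ("sugar", "alternative_source", "China"),
]


def objects(subject, relation):
    """Return every object linked to `subject` by `relation`."""
    return [o for s, r, o in KNOWLEDGE if s == subject and r == relation]


def assess(pest, business):
    """Give a context-specific reading of an infestation news item."""
    if business == "investment":
        # The advisor only cares which commodity is hit and where else
        # it can be sourced.
        commodities = [
            c for kind in objects(pest, "is_a")
            for c in objects(kind, "reduces_output_of")
        ]
        parts = []
        for c in commodities:
            sources = ", ".join(objects(c, "alternative_source"))
            parts.append(f"{c} output at risk; alternative sources: {sources}")
        return "; ".join(parts) or "no impact expected"
    if business == "pesticides":
        # The pesticide seller only cares whether the pest can be treated.
        if objects(pest, "controllable_by_pesticide") == ["yes"]:
            return "possible increase in pesticide demand"
        return "no impact expected"
    raise ValueError(f"unknown business context: {business}")


print(assess("Chilo infuscatellus", "investment"))
print(assess("Chilo infuscatellus", "pesticides"))
print(assess("Scirpophaga excerptalis", "pesticides"))
```

The same facts serve both users; only the question asked of the knowledge base changes with the business context.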
It is difficult to expect an
out-of-the-box algorithm to work equally well in both of these contexts, and
that’s where ontologies built using human experience and knowledge can bridge
or even adequately fill the gap.
(1) https://mustsharenews.com/addressing-relatives-cny/
(2) https://en.wikipedia.org/wiki/Ontology
(3) https://sites.google.com/site/probonto/home
(4) https://arxiv.org/pdf/1301.3781.pdf
(5) http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141287
(6) https://www.sciencedirect.com/science/article/pii/S1532046417302575
(7) https://en.wikipedia.org/wiki/Chilo_infuscatellus
(8) https://en.wikipedia.org/wiki/Cosan
(9) https://en.wikipedia.org/wiki/Scirpophaga_excerptalis