Unstructured Data and Data Hoarding: don't let the ogre scare you
An article in the local newspaper piqued my interest,
since it is about data hoarding.
Step 1: "Discover users and resources - Determine what is important by rolling up your sleeves and digging through the piles of data" is probably the most critical step in the process. But who can tell what is useful now, and especially what will be
useful later?
There are two major contributors to the volume of data: what to
keep and for how long. (Embedded in ‘what to keep’ is also ‘in how much detail’,
but let’s skip that for now.)
For much analysis, a certain volume of data over time is
required. You do not want to invest huge resources in a fad, especially if your
time to market is not that rapid. Basically, to me, the length of time you need
for analysis depends on how fast the environment you are playing in changes.
Thus the length of time you need to keep data is linked to the business you
are in.
In terms of what to keep, there is much more debate. For
example, how many people would have, a few years ago, decided that the stream
of data constantly being output by sensors in a manufacturing process
was worth keeping? It was consumed immediately and discarded. But nowadays,
predictive maintenance is relatively easy to do using precisely
the huge volume of accurate sensor data that has been kept. And it is worth
keeping this data, since doing maintenance before a breakdown, even if it
involves taking some components off-line, is much less costly than having to
fix a broken machine and dealing with the associated impact on the production line.
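To make the predictive-maintenance idea concrete: keep the sensor stream, then flag readings that break the recent pattern. Here is a minimal sketch in Python; the `vibration` series, the window size, and the three-sigma threshold are all hypothetical choices for illustration, not a production method:

```python
from statistics import mean, stdev

def flag_anomalies(readings, window=5, threshold=3.0):
    """Flag sensor readings that deviate sharply from the recent trend.

    A reading is flagged when it lies more than `threshold` standard
    deviations from the mean of the preceding `window` readings.
    """
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        m, s = mean(history), stdev(history)
        if s > 0 and abs(readings[i] - m) > threshold * s:
            flagged.append(i)
    return flagged

# Hypothetical vibration readings: a steady machine, then a spike
vibration = [1.0, 1.1, 0.9, 1.0, 1.05, 1.1, 0.95, 1.0, 4.2, 1.0]
print(flag_anomalies(vibration))  # → [8]
```

Real deployments use far richer models, but even this crude rule only works because the historical stream was kept rather than discarded at the point of consumption.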
Another example would be the huge volume of email that
employees generate over time, and especially changes in the patterns of that
email: frequency, direction, content, sentiment… The classic Enron
email corpus is a clear example. How many organisations were analyzing employee
email for anything more than flagging insider trading or breaches of corporate policy,
as opposed to understanding the patterns and detecting malpractice and
collusion? Today this is a component of Human Resource Analytics.
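As a toy illustration of "changes in the patterns" of email, one could bucket messages per sender per week and flag sharp jumps in volume. This is a minimal sketch only; the record format, the weekly bucketing, and the 2x jump factor are assumptions made for the example:

```python
from collections import Counter

def weekly_volume(messages):
    """Count messages per (sender, week) from (sender, week_number) records."""
    return Counter((sender, week) for sender, week in messages)

def volume_shifts(messages, factor=2.0):
    """Flag senders whose weekly volume jumps by more than `factor`
    relative to the previous week -- a crude pattern-change signal."""
    counts = weekly_volume(messages)
    senders = {s for s, _ in counts}
    weeks = sorted({w for _, w in counts})
    shifts = []
    for s in senders:
        for prev, cur in zip(weeks, weeks[1:]):
            before, after = counts[(s, prev)], counts[(s, cur)]
            if before and after / before > factor:
                shifts.append((s, cur))
    return shifts
```

The same idea extends to direction (who emails whom) and, with text analysis, to content and sentiment; the point is that none of it is possible if the mail archive was never kept.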
I picked these two examples precisely because many people consider
them (machine logs, bodies of email) to be cases of unstructured data. While they
may have been considered too difficult to use in the past, they are routinely
used nowadays (hence some people prefer the term ‘semi-structured’). Today the
frontier may lie at audio/video files, but I am sure everyone knows of cases
where such data can be made very useful (and easily searchable, in the case of
audio files).
In sum, I would caution against underestimating Step 1.
Discovering what is important is not a trivial task. The beauty of discovery is
that we are only limited by our imaginations. Hence I would say: hoard as long
as you think is relevant to your environment/market, and as much as possible,
and don’t let the ogre of ‘unstructured’ scare you.