Wednesday, 13 May 2015



Unstructured Data and Data Hoarding: don't let the ogre scare you


This article in the local newspapers piqued my interest since it’s about data hoarding.

Step 1: "Discover users and resources - Determine what is important by rolling up your sleeves and digging through the piles of data" is probably the most critical step in the process. But who can tell what is useful, and especially what will be useful in the future?

There are two major contributors to the volume of data: what to keep and for how long. (Embedded in ‘what to keep’ is also ‘in how much detail’, but let’s skip that for now.)

Most analysis requires data accumulated over a certain period of time. You do not want to invest huge resources in a fad, especially if your time to market is not that rapid. To me, the length of history you need for analysis depends on how fast the environment you operate in changes. The length of time we need to keep data is therefore linked to the business you are in.

In terms of what to keep, there is much more debate. For example, how many people would have decided, a few years ago, that the stream of data constantly being output by sensors in a manufacturing process was worth keeping? The readings were consumed immediately and discarded. Nowadays, predictive maintenance is relatively easy to do, using precisely the huge volume of accurate sensor data that has been kept. And it is worth keeping this data, since doing maintenance before a breakdown, even if it involves taking some components off-line, is much less costly than fixing a broken machine and absorbing the associated impact on the production line.
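To make the point concrete, here is a minimal sketch of the kind of check that only becomes possible once sensor history is kept: compare each new reading with the readings just before it and flag the ones that stray too far. The function name `flag_drift`, the window size, the threshold and the vibration values are all made up for illustration; real predictive maintenance would rely on far more data and a proper model.

```python
from statistics import mean, stdev

def flag_drift(readings, window=5, threshold=3.0):
    """Return the indices of readings that deviate from the mean of the
    preceding `window` readings by more than `threshold` standard deviations.
    (Hypothetical helper, for illustration only.)"""
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]   # the retained past readings
        mu = mean(history)
        sigma = stdev(history)
        # Skip perfectly flat histories, where the deviation is undefined
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged

# Invented vibration readings with one obvious spike at index 7
vibration = [1.0, 1.1, 0.9, 1.0, 1.1, 1.0, 0.9, 1.6, 1.0, 1.1]
alerts = flag_drift(vibration)
print(alerts)  # [7]
```

Note that the check is useless if readings are discarded as soon as they are consumed: without the trailing window there is nothing to compare against.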

Another example would be the huge volume of emails that employees exchange over time, and especially the changes in the patterns of these emails in terms of frequency, direction, content, sentiment… The classic Enron email analysis is a clear example. How many organisations were analysing employee email for anything more than flagging insider trading or breaches of corporate policy ‘now’, as opposed to understanding the patterns over time and detecting malpractice and collusion? Today this is a component of Human Resource Analytics.
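As a rough illustration of what a change in pattern can mean, the sketch below counts messages per sender and recipient pair in two periods and reports how the volume shifted. The record layout (`month`, `sender`, `recipient`), the names and the function `pair_volume_change` are assumptions invented for this example; they are not taken from the Enron analysis or from any particular tool.

```python
from collections import Counter

def pair_volume_change(messages, before, after):
    """For each (sender, recipient) pair, return the change in message
    count between the `before` and `after` periods.
    (Hypothetical helper, for illustration only.)"""
    counts = Counter((m["month"], m["sender"], m["recipient"]) for m in messages)
    pairs = {(sender, recipient) for (_, sender, recipient) in counts}
    # A Counter returns 0 for combinations that never occurred
    return {
        (s, r): counts[(after, s, r)] - counts[(before, s, r)]
        for (s, r) in pairs
    }

# Invented email metadata: direction and frequency only, no content
messages = [
    {"month": "2015-03", "sender": "alice", "recipient": "bob"},
    {"month": "2015-04", "sender": "alice", "recipient": "bob"},
    {"month": "2015-04", "sender": "alice", "recipient": "bob"},
    {"month": "2015-04", "sender": "alice", "recipient": "bob"},
    {"month": "2015-03", "sender": "bob", "recipient": "carol"},
]
changes = pair_volume_change(messages, before="2015-03", after="2015-04")
print(changes[("alice", "bob")])   # 2
print(changes[("bob", "carol")])   # -1
```

Even this crude count needs several periods of retained email metadata to be of any use, which is the argument for keeping it in the first place.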

I picked these two examples precisely because many people consider them (machine logs, bodies of email) to be cases of unstructured data. While such data may have been considered too difficult to use in the past, it is routinely used nowadays (hence some people prefer the term ‘semi-structured’). Today the frontier may lie at audio/video files, but I am sure everyone knows of cases where such data can be made very useful (and easily searchable in the case of audio files).

In sum, I would caution against underestimating Step 1. Discovering what is important is not a trivial task. The beauty of discovery is that we are limited only by our imaginations. Hence I would say: hoard for as long as you think is relevant to your environment/market, hoard as much as possible, and don’t let the ogre of ‘unstructured’ scare you.