Big Data requires Veracity + Vector to improve predictions.

Big Data folklore reminds us to consider 3 dimensions of "big": Volume, Variety and Velocity. So: lots of data, most of it unstructured rather than easily read or interpreted by a machine or human, and arriving at ever-accelerating speed - "Big Data" in a very literal sense.

Whilst strategies and technology innovations have emerged to provide scalable "Big Data" infrastructure - NoSQL, Hadoop, Storage Virtualisation, Cloud and so on - our ability to analyse has not kept pace with the explosion in data.

Performing analytics on bad data, or on data that is irrelevant in the current situational context, is worse than having no data at all. If the data is "Big" then you will have consumed even more time, effort and resources (human and machine) than otherwise - and still end up with incorrect results, compromised conclusions and, ultimately, erroneous decisions. So Big Data turns into Bad Data rather than Good Data.

Trying to build intelligent business solutions that add a degree of predictive foresight, rather than simply extrapolating insight from historical hindsight, is predicated on Good Big Data. The nature of what is considered "Good Data" will be discussed in a future blog entry, but for now let's make this about the imperative for "Trusted Information", e.g. a "Single View of < entity >". Bad Data that has either been discounted, or filtered out of "gold-mastered" datasets, can become valuable Good Data again in another analytical context, e.g. as an indicator of potentially fraudulent activity. So don't lose data if you can't predict when, or if, it may be valuable to you in the future. Conversely, Good Data can become Bad Data just as easily when the context changes.
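
To make that point concrete, here is a minimal Python sketch (with an entirely hypothetical record shape and rejection rule) of keeping filtered-out records alongside the gold-mastered set rather than deleting them, so they remain available if the analytical context changes:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    entity_id: str
    payload: dict
    rejection_reason: Optional[str] = None  # why it failed gold-mastering, if it did

def split_for_gold_master(records):
    """Partition records rather than discarding the failures.

    Rejected records keep their rejection context, so they can be re-used
    later - e.g. repeated submissions as a fraud indicator - if the
    analytical context changes.
    """
    accepted = [r for r in records if r.rejection_reason is None]
    rejected = [r for r in records if r.rejection_reason is not None]
    return accepted, rejected

records = [
    Record("C-001", {"amount": 120.0}),
    Record("C-001", {"amount": 120.0}, rejection_reason="duplicate submission"),
]
gold, quarantine = split_for_gold_master(records)
# 'quarantine' still holds the duplicate; in a fraud-detection context a
# repeated submission may be exactly the pattern worth counting.
```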

Quite apart from having to deal with the first 3V's of Big Data, prediction is inherently difficult. Accurate prediction is heavily dependent upon judicious use of Pure and Applied Mathematics. We rely on the ability of algorithms to help us make sense of "big data", but our analytical perspective is limited by a number of factors (environment, process, experience, prejudice, preference, situation) and always by the resolution of the observed data, in context, in a particular geospatial location (i.e. place, and time). Prediction is about identifying patterns and trends in data captured and observed now, fast enough to do something about it now or in near real-time. Competitive advantage will be gained by businesses with the ability to make better decisions, faster.
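
As a rough illustration of "spot the pattern fast enough to act on it", here is a small, self-contained Python sketch of a rolling-window trend check over a simulated stream. The window sizes and threshold are arbitrary assumptions for the example, not a recommendation:

```python
import random
from collections import deque

class RollingTrend:
    """Flags when the recent mean drifts well above the longer-run mean."""

    def __init__(self, window=50, recent=10, threshold=1.2):
        self.values = deque(maxlen=window)   # only keep the last `window` observations
        self.recent = recent
        self.threshold = threshold

    def observe(self, x):
        self.values.append(x)
        if len(self.values) < self.values.maxlen:
            return False                     # not enough history yet
        long_mean = sum(self.values) / len(self.values)
        recent_vals = list(self.values)[-self.recent:]
        recent_mean = sum(recent_vals) / len(recent_vals)
        return recent_mean > self.threshold * long_mean

# Simulated stream: steady readings, then a sustained jump.
stream = [random.gauss(10, 1) for _ in range(100)] + [random.gauss(15, 1) for _ in range(20)]
detector = RollingTrend()
for i, reading in enumerate(stream):
    if detector.observe(reading):
        print(f"trend detected at observation {i} - act now, not after the batch run")
        break
```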

In terms of directing the next best action in response to external stimuli, this could be expressed as adding a Vector - combining with Velocity to give Speed with Direction - i.e. adding a 5th "V" to Big Data (sketched in code after the list below), thus:

  • Big Data = 3V's = (Velocity, Volume, Variety)
  • Good Big Data = Trusted Information = 3V's + Veracity = 4V's
  • Better Prediction = 4V's + Vector = 5V's
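
A literal-minded way to read the list above is as a checklist attached to a dataset before analysis. The Python sketch below (hypothetical field names, purely illustrative) just encodes that reading: the 3 V's describe how big the data is, Veracity says whether it can be trusted, and Vector says whether it points at a next best action:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BigDataProfile:
    volume: int            # how much data (e.g. record count)
    variety: int           # how many distinct sources/formats
    velocity: float        # arrival rate (e.g. records per second)
    veracity: bool         # is this Trusted Information?
    vector: Optional[str]  # direction: the next best action it suggests, if any

def prediction_ready(profile: BigDataProfile) -> bool:
    # Big Data (3 V's) + Veracity = Good Big Data; add a Vector and it can
    # actually direct a decision rather than just describe the past.
    return profile.veracity and profile.vector is not None

print(prediction_ready(BigDataProfile(10**9, 12, 5000.0, True, "increase credit limit")))  # True
print(prediction_ready(BigDataProfile(10**9, 12, 5000.0, False, None)))                    # False
```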

The ability to make accurate predictions is a function of our ability to count instances of observations in relevant data sets. At this point in the discussion it's worth remembering a famous "anti-pattern":

Not everything that can be counted counts, and not everything that counts can be counted.
- Albert Einstein (attributed), German-born US physicist (1879 - 1955)

Just because you can count something doesn't make it relevant to your analysis or decision making. You might be counting the right things, but looking in the wrong place. That is, your observations are being taken from a data set that is (a) no longer valid, (b) never was valid, or (c) never will be valid. The data in question could be "trusted" but still not valid (see the bi-temporal nature of data, in terms of effectivity over time), or not relevant - which is much harder to determine. Or you could be looking for the wrong things in the right places, or for the right things at the wrong time; it doesn't matter - the result is the same: Bad Analytics, leading to Bad Decisions.
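
The bi-temporal point deserves a concrete example: a record can be perfectly well mastered (trusted) and still not be in effect for the period you are analysing. A minimal sketch, assuming hypothetical valid-time and transaction-time fields:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class BiTemporalFact:
    """A fact with two timelines: when it was true in the world (valid time)
    and when the system knew about it (transaction time)."""
    value: str
    valid_from: date
    valid_to: date          # effectivity in the real world
    recorded_from: date
    recorded_to: date       # period during which this version was the system's belief

def usable_for(fact: BiTemporalFact, as_of: date, known_by: date) -> bool:
    """Trusted is not the same as valid: the record may be perfectly well
    mastered yet simply not in effect for the period being analysed."""
    in_effect = fact.valid_from <= as_of < fact.valid_to
    was_known = fact.recorded_from <= known_by < fact.recorded_to
    return in_effect and was_known

address = BiTemporalFact("12 High St", date(2010, 1, 1), date(2012, 6, 1),
                         date(2010, 1, 5), date(9999, 12, 31))
print(usable_for(address, as_of=date(2011, 6, 1), known_by=date(2011, 6, 1)))  # True: in effect and known
print(usable_for(address, as_of=date(2013, 1, 1), known_by=date(2013, 1, 1)))  # False: trusted, but no longer in effect
```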