Along with medical applications and geography-based social networks, “Big Data” is the hottest topic in tech these days. Everyone seems to be talking about it – though if you listen closely you’ll realize, as is often the case with new technologies, that nobody really knows what it is.
In fact, Big Data has been around for decades – but until now it has been in the hands of a small elite: mostly academic and government researchers manning supercomputers in secure facilities. But, as usual, Moore’s Law has done its work. . . and now billions of servers, personal computers, tablets and smartphones around the world are each generating gigabytes of raw data on an almost daily basis. And with the presence of the global broadband Internet, those mountains of data are now merging into vast data stores of literally unimaginable size.
It is this huge collection of raw information that most people think of when they mouth the words “Big Data.” But that is incorrect; that is merely A Whole Lot of Data. It only becomes Big Data when that raw material has been processed in such a way as to glean valuable meta-data that is invisible in everyday life, but which emerges when you sort through the behavior of millions or billions of data sources.
That brings me to my definition of Big Data: Any body of information that is so big it cannot be analyzed directly for profitable use in its raw form.
Let’s deconstruct this definition a bit. Note that we begin with a ‘body of information’. I’ve phrased it this way to recognize that there have always been these large information caches. William the Conqueror’s Domesday Book of the 11th century, the first great census of the Medieval world, was for its time a vast body of information. So was the Oxford English Dictionary eight centuries later. These giant repositories of knowledge were all but impenetrable to their owners because of the lack of automatic search tools. At best, they could be accessed only piecemeal, the conclusions drawn rarely more than anecdotal.
The modern census, a phenomenon of the 19th century, was at least initially another largely impenetrable body of information. Thanks to early adding and computational machines, it was possible to extract a few pieces of meta-data from these pools – such as total population. But the amount of data captured from each data point necessarily had to be small (e.g. a half-dozen questions about family size, age, etc.) or the data pool quickly grew out of control once again. Indeed, the censuses of the late 19th century grew so unmanageable that the U.S. Census Bureau nearly used up one decade (the 1880s) – thus almost starting the next census – before it completed processing the previous one. It was the desperate need to solve this problem that led Herman Hollerith to develop his Tabulator and give birth to the modern computer.
Even today’s cheapest laptop or tablet is millions of times more powerful than the Hollerith Tabulator, but the challenge of Big Data has never really gone away. That’s because while each generation of processor grows ever more powerful at crunching volumes of data, it also becomes even more efficient at creating that data. And that’s just the start, because most of those billions of processors are now connected via the World Wide Web, and that interconnection multiplies the amount of data being created. The good news is that this interconnection also allows us to multiply our processing power – and use that expanded power, with the right application tools, to crunch all of that expanded data. . . and, with luck, uncover trends and truths heretofore lost in the noise.
It is these tools that are what we really mean these days by “Big Data”. And it is their creation and early implementation that is the subject of all of those magazine articles, blog entries and conference reports that you have been reading. These Big Data folks believe two things:
1. That the right tools can be found to crunch all of this raw data in a fast and efficient way – a challenge that will grow even greater by the year as billions more processors get embedded into the natural world and start emitting data; and
2. That there really are valuable nuggets of metadata hidden out there in those mountains of data, that they can be found, and their value will be greater than the cost of their discovery.
Hence my use in the definition of the terms “analyze” and “profitable”. Whatever the breathless coverage by the media (and by entrepreneurs in the industry) about the importance of Big Data, those two core beliefs I just listed have not been proven viable. At present, we have no idea what tools we will ultimately need to drill down through all of that data out there, how far down we will have to drill to find anything useful, and once we do find something useful, whether it can be produced in a form that is useful to corporate customers, governments, healthcare professionals and everyday consumers – and finally, even if it is useful, whether it will show sufficient return on investment to justify continuing the pursuit.
This is not going to happen overnight, whatever you may have read. A year from now, we may begin seeing the first public results of this research. When will you see Big Data as part of our daily lives, the way we see past tech revolutions like the Web, social networks, GPS and smartphone apps? Maybe five years.
And we may need every day of that half-decade. There is a technology law, first proposed years ago by my friend, technology journalist and author Mike Malone. It’s not as famous as Moore’s or Metcalfe’s Laws, but I think it might be particularly useful here. It says that all technology revolutions arrive more slowly than we predict, but more quickly than we are prepared for.
My gut tells me that Big Data is real, that we will find the right tools to explore it (and that they will be available to almost everyone via the Cloud), that we will find useful results almost from the beginning – and that they will grow even richer the deeper we drill. And that a whole second generation of tools will convert those results into content that will form the basis for a whole new boom of entrepreneurial start-ups. There is too much historic precedent – such as the work of epidemiologists in the eighteenth century looking at illness tables – to argue otherwise.
But my hunch is also that all of this will take at least five years, and maybe more, to pull off. And that we will need every bit of that half-decade, and probably a whole lot more, to deal with the larger cultural implications of Big Data. Eugenics, racial theory and a whole host of other misguided, and often murderous, nonsense emerged from the misuse of the cruder forms of Big Data two centuries ago. There are a whole bunch of potential problems waiting in the wings with this newest wave of Big Data – the biggest one being privacy. Indeed, in my cynical moments, I believe that the only thing standing between us and our complete loss of personal privacy is the fact that we are still too dumb to devise the right Big Data tools. That won’t last long – indeed, you’d be amazed, and shocked, at what is being tracked by Big Data already. It is time to start worrying about these matters right now.
In my next two blog entries, I plan to look even deeper into what needs to be done to help Big Data achieve its destiny; and then, into what needs to be done to restrain Big Data once it gets there.
This blog originally appeared on Forbes.com