Saturday, January 21, 2012

The big fuss about 'big data'

There has been growing interest in the idea of big data over the past few years. As McKinsey noted in its May 2011 report on 'big data' (read here), there has been an exponential rise in the data available to businesses for making better decisions, and McKinsey projects a shortage of big-data analysts in the US (and much of the Western world).

However, amid the growing swell of interest in big data, I find all sorts of companies and people describing their data as 'big data'. This leads me to ask: "How big does a dataset have to be to qualify as 'big data'?"

According to Wikipedia: "In information technology, big data[1] consists of datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing. This trend continues because of the benefits of working with larger and larger datasets allowing analysts to 'spot business trends, prevent diseases, combat crime.' Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data."

To me, it is not just how big the data is that matters; it is critically important how fast the data is generated and how fast it needs to be analyzed.

As this blog article from ikasoft correctly puts it: "The answer is not 'Now the data is big' -- the answer is 'Now the data is fast!' Google didn't become Google because their data was big -- Google went to MapReduce so they could keep growing the number of sites-crawled while still returning results in < 100 milliseconds, and now they're going to Google Instant because even 200 milliseconds isn't fast enough anymore. Consider all the action we're seeing today in NoSQL data stores -- the point is NOT that they are big -- the point is that apps need to quickly serve data that is globally partitioned and remarkably de-normalized. Even the best web-era app isn't successful if it isn't fast."
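For readers less familiar with MapReduce, here is a minimal toy sketch (in Python, and of course nothing like Google's actual implementation) of the classic word-count example. The idea is that the map step emits simple key/value pairs and the reduce step aggregates them, which is exactly what lets the work be partitioned across thousands of machines:

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document.
    In a real cluster, each machine would run this over its own shard."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts for each word.
    In a real cluster, pairs would first be shuffled so that all
    pairs for a given word land on the same reducer."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data is fast", "fast data beats big data"]
print(reduce_phase(map_phase(docs)))
# → {'big': 2, 'data': 3, 'is': 1, 'fast': 2, 'beats': 1}
```

The point of the shape (independent map, then aggregate by key) is speed through parallelism, not size per se, which is exactly the ikasoft argument.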

To me, only companies that generate terabytes of data every second (Google, LinkedIn, Facebook, Twitter, Akamai, Yahoo, etc.) are truly in the age of 'big data'. Companies that accumulate terabyte+ databases over the course of a year can stick with their RDBMSs (and should quit calling themselves 'big data' companies).

Would love your thoughts!
