Big data is becoming the new normal. Businesses can collect large volumes of diverse data at low cost. Sources include application, operating system, and network traffic logs; enterprise applications; mobile devices; and user interaction tracking. Analyzing that data, however, is much more challenging than collecting it.
What should you do with all this data?
Is it time to analyze everything?
Analyzing everything is a tantalizing goal we will probably never realize, but it is always a good time for targeted analysis. Use a combination of exploratory analysis and hypothesis-driven approaches to identify valuable topics to investigate while passing over less promising ones.
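To make the two approaches concrete, here is a minimal sketch in Python using only the standard library. The event data, the "source" field, and the conversion hypothesis are all hypothetical, invented for illustration: an exploratory pass counts data volume per source to suggest where to look, and a hypothesis-driven pass checks a specific claim, that web sessions convert more often than mobile sessions.

```python
from collections import Counter
from statistics import mean

# Hypothetical user-interaction log events: (source, converted)
events = [
    ("mobile", True), ("mobile", False), ("mobile", False), ("mobile", True),
    ("web", True), ("web", True), ("web", False), ("web", True),
]

# Exploratory pass: how much data comes from each source?
by_source = Counter(source for source, _ in events)
print(by_source)  # data volume per source suggests where to dig deeper

# Hypothesis-driven pass: "web sessions convert more often than mobile."
def conversion_rate(source):
    outcomes = [converted for s, converted in events if s == source]
    return mean(outcomes)  # mean of True/False values is a proportion

print("web:", conversion_rate("web"))
print("mobile:", conversion_rate("mobile"))
```

A real analysis would, of course, run over far more data and apply a proper statistical test before acting on the difference; the point here is only the shape of the workflow: explore first, then test a narrow, stated hypothesis.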
Business intelligence at the scale of big data requires sound data management practices, an understanding of how to perform exploratory analysis, and the ability to generate and test hypotheses about your data.
Managing Big Data for Business Intelligence
Extracting, collecting, and integrating data have always been challenges in business intelligence. Working with big data, which by definition comes in various forms and in large volumes, presents many of the same challenges along with new ones. Let's start from the assumption that you want to "analyze everything," so you will be collecting and storing as much data as possible.
We could summarize much of the raw data and store only the summaries, but that risks losing valuable information. Nathan Marz and James Warren make the case for storing raw data in their book Big Data: Principles and Best Practices of Scalable Realtime Data Systems. They argue that storing all the raw data allows us to recover from errors, such as a bug in the summarization program, and to reuse the data in ways we might not have anticipated. Summarization of any form loses information that could otherwise be derived at a later time.
The desire to keep raw data has to be balanced against the cost of storage. Storing terabytes or petabytes of data for extended periods could bust your budget. Consider using compression and deduplicating backup programs to reduce the size of data sets when they are not being analyzed. Historical data that you would like to keep but will access infrequently could be stored in low-cost archival storage, like Amazon Glacier, but there are trade-offs, including slow retrieval times. Another option is to summarize older data while keeping the raw form of more recently generated data. This still leaves you the option to re-analyze recent data in new ways while reducing storage costs.
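A tiered retention policy like this can be encoded very simply. The sketch below, in Python, routes a record to a storage tier by age; the 90-day and 365-day thresholds are illustrative assumptions, not recommendations, and any real policy would be driven by your access patterns and budget.

```python
import datetime as dt

# Assumed, illustrative thresholds: raw data for 90 days,
# daily summaries for a year, archival storage after that.
RAW_DAYS = 90
SUMMARY_DAYS = 365

def storage_tier(record_date, today):
    """Pick a storage tier for a record based on its age in days."""
    age = (today - record_date).days
    if age <= RAW_DAYS:
        return "raw"        # full fidelity, re-analyzable in new ways
    if age <= SUMMARY_DAYS:
        return "summary"    # aggregated, much cheaper to store
    return "archive"        # lowest cost, slow retrieval (e.g. Glacier)

today = dt.date(2015, 6, 1)
print(storage_tier(dt.date(2015, 5, 1), today))   # recent: raw
print(storage_tier(dt.date(2015, 1, 1), today))   # older: summary
print(storage_tier(dt.date(2013, 1, 1), today))   # oldest: archive
```

In practice this decision would run as part of a scheduled batch job that summarizes aging partitions and pushes the oldest ones to archival storage, but the policy itself stays this small.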
Dan Sullivan is an author, systems architect, and consultant with over 20 years of IT experience with engagements in systems architecture, enterprise security, advanced analytics and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail, gas and oil production, power generation, life sciences, and education. Dan has written 16 books and numerous articles and white papers about topics ranging from data warehousing, Cloud Computing and advanced analytics to security management, collaboration, and text mining.
See here for all of Dan's Tom's IT Pro articles.