Cloud Computing Lets Us Rethink How We Use Data

By Dan Sullivan - Source: Tom's IT Pro

Inaugural post for "The Silver Lining: Piecing Together Cloud Computing," Dan Sullivan's Tom's IT Pro blog about cloud computing.

We are changing our relationship to data because of cloud computing. We are taking on different types of problems because we have more data, more storage, and more computing resources. For decades we’ve used relational databases like Oracle, MySQL and Microsoft SQL Server.

We’d never question the value of transactions, the ACID (atomicity, consistency, isolation, durability) guarantees, and the integrity of relational databases. But not everything we do in a database needs guaranteed transactional consistency.

Imagine you are charged with designing a system to collect data on temperature, air flow and electricity use in a building every few minutes from hundreds of locations. The system will be used to make the building more energy efficient. Now imagine you lose a few data points every day. The cause isn’t important, but it could be a glitch with a sensor, a dropped packet, or an incomplete write operation in the database.

Do you care?

If you consume the data for analysis, you probably don’t care. After all, no one data point will make or break your analysis. You are interested in calculating statistics for a large number of measurements. As long as the losses aren’t concentrated at a single location or at a particular time, your analysis will be essentially unaffected. Randomly distributed errors are expected. That’s why we use statistical tests in the first place.
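To make the point concrete, here is a minimal sketch in Python. The readings are simulated rather than taken from a real building, and the function names, the 1% loss rate, and the temperature distribution are all assumptions chosen for illustration. It compares the mean of a full data set with the mean after readings are dropped at random.

```python
import random
import statistics

def simulate_readings(n, seed=0):
    """Generate n hypothetical temperature readings (deg C) centered on 21.0."""
    rng = random.Random(seed)
    return [rng.gauss(21.0, 1.5) for _ in range(n)]

def drop_at_random(readings, loss_rate, seed=1):
    """Keep each reading independently with probability (1 - loss_rate)."""
    rng = random.Random(seed)
    return [r for r in readings if rng.random() >= loss_rate]

readings = simulate_readings(100_000)
survivors = drop_at_random(readings, loss_rate=0.01)  # assumed 1% random loss

full_mean = statistics.fmean(readings)
lossy_mean = statistics.fmean(survivors)

print(f"kept {len(survivors)} of {len(readings)} readings")
print(f"mean with all data: {full_mean:.3f}")
print(f"mean after loss:    {lossy_mean:.3f}")
```

The two means should agree closely because the losses are independent of the values being measured. If the dropped readings all came from one location or one time window, the same comparison could look very different.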

Realizing that not every application requires the integrity and consistency of a banking system is liberating and, more importantly, cost-effective. You do not have to pay for the overhead of relational database transaction management, concurrent read consistency, and other features we have come to expect in applications where every piece of data matters. When statistics begin to matter more than bookkeeping, the way we deal with data changes as well.

We don’t just need reports with low-level details. In fact, in data-intensive applications we don’t want details; we want a way to describe large-volume data sets. We want the average power usage in a part of the building. We want to know how it varies over time. We want to know what other events correlate with those changes. We are fighting the wrong battle if we are arguing for ACID-compliant databases to save every scrap of data. IT professionals have solved transactional systems. We know how to build those. Business does not let us rest on our laurels though. There are demands for new kinds of systems that require us to rethink how we collect and work with data.
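The three questions above (average usage by area, variation over time, and correlation with other events) can be sketched in a few lines of Python. The readings, zone names, and the choice of outdoor temperature as the "other event" are hypothetical, invented only to show the shape of the calculation.

```python
from collections import defaultdict
from math import sqrt
from statistics import fmean

# Hypothetical readings: (zone, hour of day, kilowatts, outdoor temp in deg C)
readings = [
    ("east", 9, 4.0, 18.0),
    ("east", 12, 6.0, 24.0),
    ("east", 15, 8.0, 30.0),
    ("west", 9, 3.0, 18.0),
    ("west", 12, 4.0, 24.0),
    ("west", 15, 5.0, 30.0),
]

def mean_power_by(key_index):
    """Average the kilowatt column, grouped by the column at key_index."""
    groups = defaultdict(list)
    for row in readings:
        groups[row[key_index]].append(row[2])
    return {key: fmean(values) for key, values in groups.items()}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

by_zone = mean_power_by(0)   # average power per part of the building
by_hour = mean_power_by(1)   # how usage varies over the day
corr = pearson([r[3] for r in readings], [r[2] for r in readings])

print(by_zone)
print(by_hour)
print(f"correlation of power with outdoor temperature: {corr:.2f}")
```

None of these summaries depends on any single row, which is why a handful of missing readings matters so little here.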

The distinction between storage and analysis tools is blurring. Hadoop is an implementation of MapReduce that depends on a distributed data store like HDFS. Working together, the two are more than the sum of their parts.  Hadoop is also a platform for building other tools such as Mahout, the collection of data mining tools for Hadoop.
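For readers new to MapReduce, the pattern itself is small enough to sketch in plain Python. This is not Hadoop's API and does not run on a cluster; the log format, location names, and function names are assumptions made up for illustration. It shows only the three conceptual steps: map each record to a key-value pair, group the pairs by key, and reduce each group to one result.

```python
from collections import defaultdict

# Hypothetical raw sensor log lines in the form "location,temperature_c"
log_lines = [
    "lobby,21.5",
    "server_room,27.0",
    "lobby,22.1",
    "server_room,29.4",
    "roof,15.2",
]

def map_phase(line):
    """Map: parse one log line into a (location, temperature) pair."""
    location, temp = line.split(",")
    yield location, float(temp)

def shuffle(pairs):
    """Shuffle: group values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, max(values)

mapped = (pair for line in log_lines for pair in map_phase(line))
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # highest temperature seen at each location
```

In a real deployment the map and reduce steps would run in parallel across many machines, with the distributed file system holding the input and output.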

Transactional systems require us to be good data stewards, caring for each individual piece of data. Analytic and big data applications require stewardship too, but of a different kind. We need to capture and analyze the right kind of data in sufficient volume to make reasoned inferences about customer behaviors, conditions in the physical plant, and our company’s position in the market.

Dan Sullivan is an author, systems architect, and consultant with over 20 years of IT experience with engagements in systems architecture, enterprise security, advanced analytics and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail, gas and oil production, power generation, life sciences, and education.  Dan has written 16 books and numerous articles and white papers about topics ranging from data warehousing, Cloud Computing and advanced analytics to security management, collaboration, and text mining.

