Programming for the Cloud: MapReduce and Beyond

By Dan Sullivan March 15, 2012 11:45 AM
1. Large Scale Data Analysis

One of the key selling points of Cloud Computing is reduced infrastructure costs: the applications you run in your data center now can be run on Cloud servers for less. But that is only one advantage of the Cloud. Another is the option of running applications in parallel on multiple servers.

If your application domain allows for it, you can run multiple instances of your application in the Cloud and divide the work among your virtual servers. Extraction, transformation and load (ETL) jobs are good candidates for this because there are few dependencies between data items. Sometimes, though, you get the greatest benefit by designing your applications with Cloud architecture in mind. If you have the option of developing an application for the Cloud, you can start with any number of tools, frameworks, and languages that support parallel and distributed processing.
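As a rough illustration of why ETL jobs divide so easily, here is a minimal Python sketch. It assumes each record can be transformed without looking at any other record. The sample records, the `transform` function, and the `run_etl` helper are all hypothetical, and a local thread pool stands in for a group of virtual servers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical raw input: "name,age" strings with inconsistent formatting.
raw_records = ["  alice,34 ", "BOB,27", " carol,41"]

def transform(record):
    """Clean one record; it depends on no other record, so it can run anywhere."""
    name, age = record.strip().split(",")
    return {"name": name.strip().title(), "age": int(age)}

def run_etl(records, workers=3):
    """Divide the records among workers; a local pool stands in for Cloud servers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input records.
        return list(pool.map(transform, records))

loaded = run_etl(raw_records)
```

Because no record depends on another, the number of workers can change without touching `transform`. That independence is the property that makes this kind of job a good fit for multiple servers.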

Large Scale Data Analysis with MapReduce

MapReduce is a popular framework for breaking down large, data-intensive tasks into a series of steps that are relatively easy to parallelize. The steps take two forms. Map operations calculate an output value from an input; for example, given a sales transaction, count the number of items sold. Reduce operations then work on the output of the mapping function, for example by calculating the average number of items sold per transaction. Problems are solved by defining a series of map-reduce operations, each of which solves part of the problem on a subset of the data and then combines the results, which in turn become the input to the next set of map-reduce operations.
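The sales example above can be sketched in plain Python. This is only an illustration of the idea, not how Hadoop or any particular MapReduce library works. The sample transactions, the two partitions, and the function names are assumptions made for the example.

```python
from functools import reduce

# Hypothetical data: each transaction is a list of (item, quantity) pairs.
transactions = [
    [("widget", 2), ("gadget", 1)],
    [("widget", 5)],
    [("gizmo", 1), ("gadget", 3), ("widget", 1)],
]

def items_sold(transaction):
    """Map step: one transaction in, its number of items sold out."""
    return sum(quantity for _, quantity in transaction)

def combine(a, b):
    """Reduce step: merge two (total_items, transaction_count) pairs."""
    return (a[0] + b[0], a[1] + b[1])

def process_partition(partition):
    """Run map then reduce on one subset of the data, as one server might."""
    counts = map(items_sold, partition)
    return reduce(combine, ((count, 1) for count in counts), (0, 0))

# Split the data into subsets, solve each part, then combine the partial results.
partitions = [transactions[:2], transactions[2:]]
partials = [process_partition(p) for p in partitions]
total_items, n_transactions = reduce(combine, partials, (0, 0))
average_items = total_items / n_transactions
```

The reduce step carries a (total, count) pair rather than an average because averages from different subsets cannot simply be averaged together. Totals and counts can be added in any order, which is what lets each partition be processed independently.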

MapReduce is especially useful in a Cloud environment, where you have access to a large number of servers that can be coordinated to run MapReduce applications. Hadoop, an Apache project, is probably the best-known implementation of MapReduce, but libraries are available for a number of languages, including C++, C#, Perl, Python, Ruby and others. Hadoop and related projects, such as the Hive data warehouse system, the Pig platform for data analysis, and the Mahout project for data mining, demonstrate the utility of MapReduce for large scale data analysis.

The MapReduce programming framework was popularized by Google but has its roots in functional programming techniques dating back to the late 1950s with Lisp and its map and apply functions.  Since then, it has become apparent that some of the advantages of functional programming techniques are especially useful when writing concurrent programs. 

Dan Sullivan is an author, systems architect, and consultant with over 20 years of IT experience with engagements in systems architecture, enterprise security, advanced analytics and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail, gas and oil production, power generation, life sciences, and education. Dan has written 16 books and numerous articles and white papers about topics ranging from data warehousing, Cloud Computing and advanced analytics to security management, collaboration, and text mining.

