Why Hadoop is the A-list of Big Data
An open source software framework that supports distributed and data intensive applications.
As data volumes grow, companies are looking for cost-effective ways to store and analyze this data. Relational databases can scale to meet many requirements, but often at a premium cost. From analysts to executives, we all seem to be looking for a way to leverage growing amounts of data without breaking the IT budget. This is where Hadoop comes in.
Hadoop is an open source software framework for building distributed and data intensive applications. The core components of Hadoop include the Hadoop Distributed File System (HDFS) for storing data and the MapReduce computing system for analyzing data.
As the name indicates, HDFS is designed to support file storage across multiple nodes in a cluster. Data is replicated across nodes, which provides reliability even when using low-cost non-RAID disks. (Hadoop can use several other distributed file systems, but HDFS is standard with the Apache Hadoop distribution.)
MapReduce is a programming model popularized by Google for distributing computation across a cluster of servers. The basic idea behind MapReduce is that many computational problems can be broken down into smaller problems that are solved in parallel. The results of these many small-scale computations are then combined to produce a new set of outputs, which are then processed in a similar manner. The first step of breaking down the problem and solving a sub-problem is the map part of MapReduce; combining the results of those computations is the reduce step.
Consider the problem of searching server log files for adverse events and producing a report listing each type of adverse event and the number of times each type occurred. You can solve this problem by scanning each log file for lines with a pattern indicating an adverse event. For simplicity we’ll assume that if the word ‘ERROR:’ appears at the beginning of the line then an adverse event has occurred. We’ll also assume that the word following ‘ERROR:’ indicates the type of adverse event, such as ‘LOWDISK’ to indicate low disk space or ‘FAILPING’ to indicate problems connecting to another server. The problem can be solved in two steps: filter all log files for adverse event lines and then count the number of adverse events by type.
In terms of the MapReduce model, the first step is a mapping from input log files to output lines which contain the term ‘ERROR:’ at the beginning of the line. The results are then sorted into separate lists of adverse event types. During the second step, the items in each list are counted. This ‘reduces’ the output of the first step from a list to a count of items in the list.
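The two steps described above can be sketched in plain Python. This is only an illustration of the model running in a single process, not Hadoop's API; the function names (`map_line`, `shuffle`, `reduce_group`, `count_adverse_events`) and the sample log lines are hypothetical and chosen for this example.

```python
from collections import defaultdict

ERROR_PREFIX = "ERROR:"

def map_line(line):
    """Map step: emit (event_type, 1) for a line that starts with 'ERROR:'."""
    if not line.startswith(ERROR_PREFIX):
        return []
    words = line[len(ERROR_PREFIX):].split()
    if not words:
        return []  # 'ERROR:' with no event type after it
    return [(words[0], 1)]

def shuffle(pairs):
    """Sort step: group the emitted values into one list per event type."""
    groups = defaultdict(list)
    for event_type, value in pairs:
        groups[event_type].append(value)
    return groups

def reduce_group(event_type, values):
    """Reduce step: collapse a list of values into a single count."""
    return event_type, sum(values)

def count_adverse_events(lines):
    pairs = [pair for line in lines for pair in map_line(line)]
    groups = shuffle(pairs)
    return dict(reduce_group(k, v) for k, v in sorted(groups.items()))

# Hypothetical log lines, made up for this sketch.
sample_log = [
    "ERROR: LOWDISK /var at 95%",
    "INFO: backup completed",
    "ERROR: FAILPING db02 unreachable",
    "ERROR: LOWDISK /tmp at 91%",
]
report = count_adverse_events(sample_log)
print(report)
```

Each call to `map_line` looks at one line in isolation, and each call to `reduce_group` looks at one event type in isolation, which is what makes both steps easy to spread across many machines.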
You would be right to think this kind of problem could be easily solved with Linux command line utilities like grep and sort, at least if you are working with a relatively small number of files or you are in no rush for the results. The advantage of Hadoop is the ability to implement large scale parallel processing with relatively little effort, at least when compared to some other models for parallel processing.
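To see why this kind of problem parallelizes well, consider a rough sketch in which each log file is counted independently and the partial counts are merged afterward. It uses a local thread pool from Python's standard library as a stand-in for a cluster, so it shows the structure of the computation rather than how Hadoop schedules work; the names `count_errors_in_file` and `count_errors_parallel` and the in-memory "files" are assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_errors_in_file(lines):
    """Count adverse events by type in one log file (given as a list of lines)."""
    counts = Counter()
    for line in lines:
        if line.startswith("ERROR:"):
            words = line[len("ERROR:"):].split()
            if words:
                counts[words[0]] += 1
    return counts

def count_errors_parallel(files, max_workers=4):
    """Process each file independently, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(count_errors_in_file, files):
            total.update(partial)  # adds counts for matching event types
    return total

# Hypothetical log files, each represented as a list of lines.
log_files = [
    ["ERROR: LOWDISK /var", "INFO: ok"],
    ["ERROR: FAILPING db02", "ERROR: LOWDISK /tmp"],
    ["WARN: slow response"],
]
totals = count_errors_parallel(log_files)
print(dict(totals))
```

Because no file's count depends on any other file, the per-file work can run in any order and on any worker. In a real cluster the workers would be separate servers reading data stored near them, rather than threads in one process.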
Hadoop has significantly reduced the barriers to entry to parallel processing. Designing parallel data processing programs has become a matter of identifying operations that can occur in parallel, combining results from those operations, processing those combined results, and repeating the process until the problem is solved. Hadoop is designed to run on clusters of commodity servers. You do not need specialized hardware, supercomputers, or extravagant budgets to take advantage of Hadoop. The combination of ease of use and low-cost hardware has led to wide-scale interest and growing adoption of Hadoop.
Dan Sullivan is an author, systems architect, and consultant with over 20 years of IT experience with engagements in systems architecture, enterprise security, advanced analytics, and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail, gas and oil production, power generation, life sciences, and education. Dan has written 16 books and numerous articles and white papers about topics ranging from data warehousing, cloud computing, and advanced analytics to security management, collaboration, and text mining.
(Shutterstock image credit: Chalk Drawing)