Using MapReduce Programming without Java
The MapReduce model of parallel processing lends itself to many kinds of problems. Although Java is commonly used for MapReduce programs, you don't have to be a Java guru to get the benefits of MapReduce on Hadoop.
Three alternative methods for MapReduce development are Pig, streaming MapReduce, and domain-specific languages such as Scalding.
Pig is a platform for working with big data in Hadoop without resorting to Java coding. Pig, an Apache project, compiles scripts written in Pig Latin into MapReduce jobs. Pig Latin is a high-level data flow language for working with large data sets. If you are comfortable with SQL or ETL tools, Pig Latin should be a quick study. Pig Latin is not a general-purpose programming language like Java, Python, or C. You won't use it to write complex application logic, but it is well suited to data manipulation tasks.
Pig Latin's functionality can be roughly grouped into three areas: loading data, manipulating data, and storing data.
The basic data loading command, LOAD, works with structured data, such as tab-delimited files, as well as unstructured data, such as natural language text files. The loader can also read compressed files, saving you from having to decompress them before loading.
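To make the loading step concrete, here is a minimal sketch in plain Python (not Pig Latin) of what a loader for tab-delimited files has to do, including reading a gzip-compressed file without a separate decompression step. The function name load_records and the field names are hypothetical, chosen only for illustration, and the sketch does not reproduce how Pig's own loader is implemented.

```python
import csv
import gzip


def load_records(path, field_names):
    """Read a tab-delimited file into a list of dicts.

    Hypothetical helper for illustration only. Files ending in .gz are
    decompressed on the fly, so no separate decompression step is needed.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        # Pair each column with its field name; values stay as strings.
        return [dict(zip(field_names, row)) for row in reader]
```

Calling load_records("visits.tsv.gz", ["user", "url", "bytes"]) would return one dictionary per line of the file, whether or not the file is compressed.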
Once the data is loaded, you can get to work on data transformations. Pig Latin has both relational and arithmetic operations. You can use SQL-like constructs, such as FILTER, GROUP, and JOIN. Because the language is designed for big data, it also has features not found in conventional SQL. For example, the SAMPLE operator randomly selects a subset of a data set, which is useful for computing statistics on a sample of a large data set. The arithmetic and logic operators include what you'd expect: arithmetic, Boolean, and type-casting operations.
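The relational operations named above can be sketched in a few lines of plain Python. This is only an analogy for what FILTER, GROUP, JOIN, and SAMPLE do to a data set, not Pig Latin syntax and not Pig's implementation; the function names are hypothetical, and the sampling function assumes that each row is kept independently with a given probability, so the size of the result is approximate.

```python
import random
from collections import defaultdict


def filter_rows(rows, predicate):
    """FILTER analogue: keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def group_rows(rows, key):
    """GROUP analogue: collect rows into lists that share the same key value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def join_rows(left, right, key):
    """JOIN analogue (inner join): merge rows from both sides with equal keys."""
    index = group_rows(right, key)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def sample_rows(rows, fraction, seed=None):
    """SAMPLE analogue: keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]
```

For example, sample_rows(rows, 0.01) keeps roughly one percent of the rows, which is often enough to estimate averages or distributions before committing to a full pass over the data.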
Once you load and manipulate data, you'll likely want to store the results somewhere. Pig Latin provides basic commands for saving results to the Hadoop file system (STORE) or displaying them interactively (DUMP). (At the risk of taking the platform name too literally, Pig's interactive tool is called Grunt.)
Pig is a good choice for the bulk data processing that is common when you are first analyzing data or merging multiple data sets. If you need to implement more complex logic, consider streaming MapReduce or Scalding.
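As a rough idea of what the streaming alternative looks like, here is a minimal word-count sketch in Python. It assumes the usual streaming convention that a mapper and a reducer exchange lines of text in the form key, tab, value, and that the reducer receives its input sorted by key. The function names are hypothetical, and run_locally only imitates the sort step on one machine; it is not how a Hadoop cluster schedules the work.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each word; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Imitate map, sort, and reduce in one process (illustration only)."""
    shuffled = sorted(map_lines(lines))
    return list(reduce_lines(shuffled))
```

Because the mapper and reducer are ordinary functions over lines of text, any logic you can express in the scripting language, however complex, can sit inside them.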
Dan Sullivan is an author, systems architect, and consultant with over 20 years of IT experience, with engagements in systems architecture, enterprise security, advanced analytics, and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail, gas and oil production, power generation, life sciences, and education. Dan has written 16 books and numerous articles and white papers on topics ranging from data warehousing, cloud computing, and advanced analytics to security management, collaboration, and text mining.
See here for all of Dan's Tom's IT Pro articles.