 

Hadoop Now Runs Directly Against Google Cloud Storage

By Bill Oliver - Source: Tom's IT Pro

Google continues to build on its Cloud Storage platform by offering Apache Hadoop developers a simpler way to manage big data clusters and file systems through a new Google Cloud Storage Connector for Hadoop.

Announced yesterday, the preview release of this connector is meant to let developers "focus on your data processing logic instead of managing a cluster and file system," wrote Jonathan Bingham, product manager, in a Google Cloud Platform blog post.

The open source Apache Hadoop project has two major components. The first is MapReduce, the programming framework for splitting large data sets into smaller blocks for parallel processing across large numbers of computers (a cluster). The second is the Hadoop Distributed File System (HDFS), which intelligently manages the storage of data across those machines.
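The split-and-combine idea behind MapReduce can be sketched in a few lines of Python. This is an illustrative word count in the classic map/shuffle/reduce style, not code from Hadoop or the connector; the input blocks are made up for the example.

```python
from collections import defaultdict

# Hypothetical input, already split into blocks the way Hadoop
# would distribute pieces of a large file across a cluster.
blocks = ["big data on big clusters", "big clusters process data"]

# Map phase: each block is processed independently (in Hadoop,
# in parallel on different machines), emitting (word, 1) pairs.
mapped = [(word, 1) for block in blocks for word in block.split()]

# Shuffle + reduce phase: group the pairs by key and sum the
# counts to produce the final result.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))
```

Hadoop runs the map and reduce phases across many machines at once, but the programming model is exactly this: independent per-block map work, then a grouped aggregation.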

Hadoop's MapReduce traces its history back to Google's MapReduce, and HDFS has its roots in the Google File System (GFS). Google is leveraging its understanding of the underlying file system technology to provide a connector library that lets developers run Hadoop MapReduce tasks directly on data stored in Google Cloud Storage, avoiding HDFS's perceived disadvantages.

According to Google, the benefits of using its Cloud Storage (backed by the current version of the Google File System, Colossus) include direct access to data without first transferring it into HDFS, and continued access to that data even after a Hadoop cluster is shut down. Google Cloud Storage is highly available and globally replicated, and the connector provides interoperability between Hadoop and other Google services. Because the data is directly available to Hadoop in Cloud Storage, there are no HDFS-related management or startup performance penalties.
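In practice, pointing Hadoop at Cloud Storage means installing the connector library and registering it in Hadoop's configuration. The fragment below is a sketch based on the connector's gs:// URI scheme; the exact property names and the project ID shown are placeholders to verify against Google's setup documentation.

```xml
<!-- core-site.xml sketch: register the Cloud Storage connector.
     Property names are assumptions from the connector docs;
     my-project-id is a placeholder. -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>my-project-id</value>
</property>
```

Once configured, MapReduce jobs can take gs://bucket/path URIs as input and output directly, with no copy into HDFS and no data loss when the cluster's virtual machines are deleted.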

Google also mentions cost savings for both storage and compute. Since Hadoop and HDFS run on Google Compute Engine with per-minute billing, the cost of additional virtual machines kept running solely to manage an extra copy of the data in HDFS adds up over time.

While Google is recommending the use of Google Cloud Storage for Hadoop, HDFS is still available if developers want to continue to use it as Hadoop's default file system.  

Since the Google Cloud Storage Connector for Hadoop is currently a preview release, Google cautions developers that it could make backward-incompatible changes to the connector that would not be covered by an existing SLA.


_________________________________________________________________________________________

ABOUT THE AUTHOR

Bill Oliver has worked in IT as a techie, trainer, manager, and in business roles supporting IT for 20+ years.  For the past 12 years his focus has been on the business side of IT Contracts, Software Licensing, and all things related to IT Purchasing.

