The Silver Lining: Data Analytics with Google BigQuery

By Dan Sullivan June 19, 2012 3:10 PM

Google's BigQuery is a data service that lets you run SQL-like queries on extremely large data sets.It's hard to find a discussion of data analytics these days without being overwhelmed with the word 'big.'  Big data, being analyzed on big clusters of servers in big clouds with tools designed to scale to big data sets.

Enough – we get it. Big data is common and we need tools that are easy to use and readily available. Hadoop is a popular platform for large scale data analysis but it is not for everyone. For those data analysts that do not have the interest in learning the ins and outs of Hadoop, MapReduce and Pig then Google's BigQuery is an option that might be good fit – with emphasis on the might.

Google's BigQuery is a data service that lets you run SQL-like queries on extremely large data sets.  It is designed to support small numbers of large tables (on the order of billions of rows) but it is not a conventional database management system. You can create tables with columns defined by basic data types but there is no user mechanism for creating indexes.  BigQuery makes it easy to import data and load large data sets but forget about conventional updates and deletes to rows. BigQuery tables are built for read performance not transactional processing.  You can append data to existing tables and that, combined with an emphasis on read performance, makes BigQuery a good tool for data analytics.

Once you have data loaded into BigQuery tables, you can use a SQL-like language to query it.  You can query with a syntax familiar to database developers, including SELECT statements with the usual FROM, WHERE, HAVING, GROUP BY, ORDER BY and LIMIT. Joins are supported as well but with an important restriction: only one large table is allowed in a BigQuery join.  If you need to join to large tables then you will need to restructure your design or work with a different platform.

You have a couple of  options for executing queries interactively. BigQuery includes a browser tool for working interactively with your data sets. It's a simple interface with a text box for composing queries and menu-based commands for basic functions like importing data and creating tables.  If you prefer the command line to a browser, you can install the bq command line tool that lets you work with BigQuery projects, datasets and tables. This is an especially good choice if you use Linux command line tools to preprocess data and construct files for loading into BigQuery.

For programmatic access, BigQuery has a REST API and client libraries for Java, .NET, PHP, Ruby, Python, and other languages.

The fairly substantial subset of SQL syntax supported by BigQuery along with multiple ways to interact with the service make it a good option for wide range of needs as long as you can live with the constraints that come with the platform.

BigQuery's tables are immutable.  When you need to change data you will need to delete tables and reload.  If you have a mix of fixed and changing data you might be able to separate data into two or more tables, one for immutable data and others for mutable data. 

You will organize your work into projects, datasets and tables. A project is an administrative level data structure for billing purposes.  Projects are made up of datasets which are collections of tables.  Datasets are the level at which we control access to data. Unlike most relational database management systems, you cannot restrict access to tables.  You will want to group tables into datasets with the same access control parameters.

Another consideration are the limits Google imposes on the amount of processing you can during a particular time period. There is a limit of 20,000 queries per day per project with a maximum response size of 64MB.  Imports are limited to 1,000 requests per day; each request can have up to 500 files of up to 4GB each.  Exports are limited to 50 per day.  If you want to work with BigQuery, but will exceed these limits, you will have to contact Google for premium support services.

As with other cloud services, BigQuery is charged on a per use basis. One advantage of BigQuery's pricing model is its simplicity. You are billed for storage and queries. Storage is billed at $0.12 per GB per month; queries are billed at $0.035 per GB processed but the first 100 GB per month is not charged.  BigQuery processes data by columns, not rows, so you are charged only for the amount of data that is processed in the columns referenced in a query.  This approach does not penalize you for tables with a large number of columns, but it does run up your costs for referencing columns unnecessarily.  When using an infrastructure as a service, like Amazon EC2, we control the time we leave virtual machines running to control costs. In data analysis services like BigQuery, we control the scope of the data processed to control costs.

BigQuery is well suited for ad hoc analytics.  There are no complex schemas to design. Users do not need to become well versed in tuning strategies to improve query performance.  Other providers are taking a similar approach to large data sets. 1010Data, for example, offers what they call the “Trillion-Row Spreadsheet”  which allows users to work with large data sets interactively in a spreadsheet form.  Amazon's Dynamo database offers a similar service to BigQuery with a scalable data table service with a price per usage cost model.  Dynamo is not limited to data analytics; it provides more flexible support for table updates and more OLTP like operations than BigQuery.

BigQuery is a good choice for data analytics, especially ad hoc analysis.

You do not have the full flexibility of a relational database management system but you have enough SQL for many kinds of basic selection and aggregation.  Just watch the amount of data you process in your queries and be aware of restrictions on joining large tables.

Dan Sullivan is an author, systems architect, and consultant with over 20 years of IT experience with engagements in systems architecture, enterprise security, advanced analytics and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail, gas and oil production, power generation, life sciences, and education.  Dan has written 16 books and numerous articles and white papers about topics ranging from data warehousing, Cloud Computing and advanced analytics to security management, collaboration, and text mining.

See here for all of Dan's Tom's IT Pro articles.

(Shutterstock image credit: Analytics)


Comment on this article
Comments