What Happens When Spreadsheets Meet Big Data?
Spreadsheets were an early driver for PC use because they were easy to use and fit a wide range of problems. Those same characteristics make them a good option for working with big data as well.
1010Data’s trademarked term, the “Trillion Row Spreadsheet”™, captures the aspiration of cloud-based spreadsheets. Clearly the days of a 32K row limit in a spreadsheet are in the past, but just being able to load data into a spreadsheet is not enough. A key question about big data spreadsheets is, “What can we do with the data once it’s loaded?”
For starters, moving large data sets to the cloud gives you the scalability that comes with cloud infrastructure. Even the most powerful desktops face tighter limits than a set of cloud-based servers on the amount of data they can manipulate. Since the 1010Data offering is modeled on spreadsheets, you retain the visual metaphor of rows and columns that is so familiar to many users.
Also, unlike batch-oriented processing systems such as Hadoop, spreadsheets are interactive. Interactivity can improve exploratory analysis because it encourages more what-if analysis than you might attempt in a batch-oriented system.
One of the advantages of cloud-based systems is that they promote collaboration and data sharing. The 1010Data spreadsheet supports access controls that allow users to share data with other users, including those outside the organization.
Datameer offers an analytics platform that uses Hadoop for the backend and a spreadsheet interface for the frontend. Analysts do not need to master the details of MapReduce or learn a data manipulation language such as Pig to work with Hadoop. The Datameer spreadsheet interface maps data sources to worksheets, allowing analysts to readily integrate data from multiple sources.
In addition to conventional spreadsheet features, such as a library of functions, the Datameer spreadsheet supports joins across data sources. Like the 1010Data spreadsheet service, Datameer’s product is schema-less, so end users can work across data sets and data types. This places more responsibility on the end user to ensure that columns used in joins are logical choices and that data is consistently represented in those columns.
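To make the join-consistency point concrete, here is a minimal Python sketch of the kind of check an end user might do by hand before joining two schema-less sources. It is not Datameer's or 1010Data's API; the function names (`normalize_key`, `join_on`) and the sample data are hypothetical, chosen only for illustration.

```python
def normalize_key(value):
    """Coerce a join key to a trimmed, lowercase string so that
    ' A17 ' and 'a17' are treated as the same key."""
    return str(value).strip().lower()


def join_on(left_rows, right_rows, left_key, right_key):
    """Inner-join two lists of dicts on normalized keys.

    Returns the joined rows plus the left-side key values that found
    no match, so inconsistent columns are visible rather than silent.
    """
    # Index the right-hand source by its normalized join key.
    index = {}
    for row in right_rows:
        index.setdefault(normalize_key(row[right_key]), []).append(row)

    joined, unmatched = [], []
    for row in left_rows:
        matches = index.get(normalize_key(row[left_key]))
        if not matches:
            unmatched.append(row[left_key])
            continue
        for match in matches:
            # Left-side fields win if both sources share a column name.
            joined.append({**match, **row})
    return joined, unmatched


# Hypothetical sample data from two differently formatted sources.
customers = [
    {"cust_id": " A17 ", "name": "Acme"},
    {"cust_id": "b02", "name": "Birch"},
    {"cust_id": "C33", "name": "Cedar"},
]
orders = [
    {"customer": "a17", "total": 120},
    {"customer": "B02", "total": 75},
]

joined, unmatched = join_on(customers, orders, "cust_id", "customer")
```

Without the normalization step, none of these rows would match on exact string comparison; reporting the unmatched keys is one simple way to spot columns that are not represented consistently.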
Spreadsheets are also basic visualization tools and the same holds true for their cloud-based big data analogs. Cloud-based spreadsheets for big data include components for tables, graphs, maps, tag clouds and other visualizations.
As your analysis tasks become more complex, it will be important to track the provenance of your results. For example, you may load data from your enterprise applications, integrate it with third-party demographics data, and join that to other data provided by your business partners. Once you have integrated these multiple sources, you may want to share your results with others.
No doubt there will be questions, such as: Where did this result come from? What conditions were used to filter the data? What transformation was applied to the raw data? The Datameer spreadsheet includes a feature that tracks your workflows and can display a visual representation of the steps used to derive a result.
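The idea behind workflow tracking can be sketched in a few lines of Python. This is only an illustration of recording provenance alongside the data, not a description of how Datameer implements its feature; the `TrackedTable` class, its methods, and the file name are all hypothetical.

```python
class TrackedTable:
    """A list of row dicts that remembers every step used to derive it."""

    def __init__(self, rows, source, steps=None):
        self.rows = list(rows)
        # Start the lineage with the load step unless one is passed in.
        self.steps = steps if steps is not None else [f"load: {source}"]
        self.source = source

    def filter(self, description, predicate):
        """Keep rows matching predicate and record the condition used."""
        kept = [row for row in self.rows if predicate(row)]
        return TrackedTable(kept, self.source,
                            self.steps + [f"filter: {description}"])

    def transform(self, description, fn):
        """Apply fn to every row and record the transformation."""
        changed = [fn(row) for row in self.rows]
        return TrackedTable(changed, self.source,
                            self.steps + [f"transform: {description}"])

    def lineage(self):
        """Return the recorded steps as a single readable string."""
        return " -> ".join(self.steps)


# Hypothetical data and file name, for illustration only.
sales = TrackedTable(
    [{"region": "east", "amount": 100}, {"region": "west", "amount": 40}],
    source="sales.csv",
)
result = (
    sales
    .filter("amount >= 50", lambda r: r["amount"] >= 50)
    .transform("add amount_cents", lambda r: {**r, "amount_cents": r["amount"] * 100})
)
print(result.lineage())
```

Because each operation returns a new table carrying the accumulated steps, the answers to "where did this come from?" and "what filter was applied?" travel with the result when it is shared.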
Dan Sullivan is an author, systems architect, and consultant with over 20 years of IT experience, including engagements in systems architecture, enterprise security, advanced analytics, and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail, gas and oil production, power generation, life sciences, and education. Dan has written 16 books and numerous articles and white papers on topics ranging from data warehousing, cloud computing, and advanced analytics to security management, collaboration, and text mining.
See here for all of Dan's Tom's IT Pro articles.