Online Marketplaces
Online marketplaces, such as Infochimps, Windows Azure Marketplace and Amazon’s Public Data Sets, offer access to data sets in a wide range of areas. Do they offer much for business, however?
Business executives and analysts are increasingly interested in big data analytics, especially with the promise of finer insights into customers, markets and operations. Big data analysis depends on data, a lot of it. You will find some of it internal to your organization in application logs, click stream data, and instrument data.
Social media providers, like Twitter, can be a source of data about customer attitudes to trending topics.
Internal sources and social media can be good starting points, but at some point you may realize that you need different kinds of data than you generate in house and that social media is too unstructured to provide precise information about your target subject area.
When you reach that level of demand for big data, it is time to consider what online marketplaces have to offer.
Infochimps.com combines a data analysis platform and related services with a data marketplace. A wide variety of data types is available from Infochimps, including business, demographics, health, economics, and finance.
The business listings contain widely recognized data sets like the Consumer Price Indices and the Producer Price Indices as well as industry-specific statistics such as the Aerospace and Electronic Cost Indices and the Hydrocarbon Oils Bulletin.
Some of the data sets are country specific, with hundreds of data sets focused on the United Kingdom.
There is a wide range of demographic data sets, from census data and crime statistics to consumption and fertility rate data. For marketers, there are almost 90 data sets that map Internet Protocol (IP) addresses to demographic characteristics such as employment status, housing value, occupations, and home ownership.
If you dive into the health data sets, you will find data on disease incidence, health care expenditures, physician specialties, and sites subject to environmental regulations.
In the finance and economics arena you have the choice of thousands of data sets, from stock exchange data (e.g. NASDAQ Exchange Daily 1970-2010 Open, Close, High, Low, and Volume) and European Union market access data to the national and state budgets.
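To give a sense of what working with daily exchange data of this kind might look like, here is a minimal Python sketch that computes day-over-day percentage changes in the closing price. The column names and the sample rows are assumptions made for illustration; the actual layout of any marketplace file may differ.

```python
import csv
import io

# Hypothetical sample rows in an assumed column layout (not real market data).
SAMPLE_CSV = """date,open,high,low,close,volume
2010-01-04,10.00,10.50,9.80,10.40,1200
2010-01-05,10.40,10.90,10.30,10.80,1500
2010-01-06,10.80,10.85,10.10,10.26,900
"""


def daily_returns(csv_text):
    """Return (date, percent change in close versus the previous row) pairs."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    results = []
    # Pair each row with the one before it; rows are assumed sorted by date.
    for prev, cur in zip(rows, rows[1:]):
        prev_close = float(prev["close"])
        change = (float(cur["close"]) - prev_close) / prev_close * 100
        results.append((cur["date"], round(change, 2)))
    return results


if __name__ == "__main__":
    for day, pct in daily_returns(SAMPLE_CSV):
        print(day, pct)
```

The same function could be pointed at a downloaded file by reading its text first, provided the header names are adjusted to match the real data set.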
Infochimps has created a data ecosystem that includes data directly available from Infochimps, either by download or by API access, or from offsite sources. Some of the sources are free and others are provided for a fee. Infochimps also offers Geo APIs for adding geographic location data to your data sets as well as Social APIs for accessing Twitter analytics and raw data.
If you are looking for a third-party data source, Infochimps is a good place to start.
Windows Azure Marketplace
Microsoft has created a data and application marketplace as part of its cloud offerings. The marketplace is still small, but if you are using the Windows Azure cloud, this is a logical starting point for finding third-party data sources – or could be in the future.
With about 130 data sets, Windows Azure Marketplace still has to prove that it can attract the large number of diverse data sets needed to compete with larger-scale operations like Infochimps. The Windows Azure Marketplace has breadth, with subject areas ranging from business and finance to government and weather. Some of the data sets are so specialized that the target user base is difficult to identify; see the Statistics for grade in education in Sweden as an example.
Amazon Web Services Public Data Sets
Amazon Web Services generates revenues by charging for compute and storage services, but it never hurts to give away a few free samples. Rather than charge users to store individual copies of popular data sets, Amazon makes some available for public use.
Life sciences researchers will find a substantial portion of the free data sets dedicated to their field, including the 1,000 Genomes Project, the Cannabis Sativa data set, and data about the genetics of human livers.
Text miners will also find a number of useful resources, including the Google Books Ngram data set, which includes data on the frequency of words or phrases (up to five words long) found in the text of Google Books.
Amazon makes available a set of 5 billion Web pages in the Common Crawl Corpus and the Enron email data set of over 1.2 million emails and almost 500,000 attachments, both of which are useful for developing and testing unstructured data analysis tools.
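As a rough illustration of working with phrase-frequency data such as the Google Books Ngram set, the Python sketch below totals match counts per phrase across years. It assumes a tab-separated layout of phrase, year, match count, and volume count; that layout and the sample lines are assumptions for illustration only, so check the data set's own documentation before relying on them.

```python
from collections import Counter

# Hypothetical sample lines in an assumed layout:
# phrase <TAB> year <TAB> match_count <TAB> volume_count
SAMPLE_LINES = [
    "big data\t2007\t120\t40",
    "big data\t2008\t310\t95",
    "data mining\t2008\t200\t70",
]


def total_counts(lines):
    """Sum the match counts for each phrase across all years."""
    totals = Counter()
    for line in lines:
        ngram, _year, match_count, _volumes = line.rstrip("\n").split("\t")
        totals[ngram] += int(match_count)
    return totals


if __name__ == "__main__":
    for phrase, count in total_counts(SAMPLE_LINES).most_common():
        print(phrase, count)
```

Because the function only needs an iterable of lines, an open file handle could be passed in place of the sample list to process a large file one line at a time.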
There are other data sources with economic time series, labor statistics and census data. These may be more useful for testing your analysis techniques than driving decision making since they are not kept up to date. The Federal Reserve Economic Data, for example, was last updated in June 2009.
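One simple way to act on this caveat is to check how old a data set is before using it for decisions. The Python sketch below flags a data set as stale when its last update is older than a chosen threshold. The one-year threshold, the evaluation date, and the day of the month used for the June 2009 update are all assumptions picked for illustration.

```python
from datetime import date


def is_stale(last_updated, as_of, max_age_days=365):
    """Return True when the data set is older than max_age_days at as_of."""
    return (as_of - last_updated).days > max_age_days


if __name__ == "__main__":
    # The article gives only "June 2009"; the first of the month is assumed.
    fred_last_updated = date(2009, 6, 1)
    # Hypothetical evaluation date.
    today = date(2012, 1, 1)
    print(is_stale(fred_last_updated, today))
```

A stale result does not make a data set useless; it suggests reserving it for testing analysis techniques rather than driving current decisions.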
Dan Sullivan
Dan Sullivan is an author, systems architect, and consultant with over 20 years of IT experience with engagements in systems architecture, enterprise security, advanced analytics and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail, gas and oil production, power generation, life sciences, and education. Dan has written 16 books and numerous articles and white papers about topics ranging from data warehousing, cloud computing and advanced analytics to security management, collaboration, and text mining.
(Shutterstock image credit: Statistics)