The Cloud: Don’t Fall for Turnkey Data Mining

By Dan Sullivan May 23, 2012 11:51 AM

Data Analytics

A little knowledge is a dangerous thing – especially when it is applied to data analytics.

As cloud computing matures, we are seeing more services for data mining and statistical analysis, and with them more opportunities for misuse. Google offers a prediction engine for classifying data. The new SaaS provider BigML bills its offerings as “machine learning for everyone.” Revolution Analytics has created a scalable version of the popular R statistics package that you can run yourself in Amazon EC2. These are three examples of the advances cloud computing enables.

There are definitely benefits to be gained using these and similar services, as long as we use them properly.

A simplistic view of data analytics as a service is that you upload your data, the service runs a data mining or statistical algorithm over it, and out comes a model for predicting whatever you want predicted. You could use these tools to predict creditworthiness, cross-selling opportunities, news stories of interest to executives, or just about anything that you can represent using example cases described by attributes. Once you have your model, you could just start running your new data through it to generate predictions and classifications, but that would be a mistake, possibly a big one.

There are different ways to build predictive models and they have different advantages and disadvantages.  Google’s prediction engine is something of a black box because we don’t know how predictions are made. The documentation offers some advice on improving results. How accurate are the predictions?

The better your data, the better the predictions.

When we know the algorithm used we can anticipate potential limitations. For example, decision trees are commonly used in data mining. They have several advantages, including the fact that the results are easy to understand.  Finding an optimal decision tree is an intractable problem so algorithms use heuristics to come up with good approximations of an optimal solution.  This means that the same data set can generate different decision trees depending on the heuristics used and the order in which the data is presented.
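To make the idea of a heuristic concrete, here is a minimal sketch of one common choice: at each node, greedily pick the single split that leaves the purest groups, measured by Gini impurity. The function names and the toy loan data are hypothetical, made up for illustration, and this is not a description of how any particular service builds its trees.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Greedy heuristic: choose the (feature, threshold) pair with the lowest
    weighted impurity for this one node, without looking ahead."""
    best = None  # (weighted_impurity, feature_index, threshold)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # a split must put something on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

# Hypothetical toy data: [income in $1000s, years at current job] -> repaid loan?
rows = [[25, 1], [40, 3], [60, 2], [80, 10], [95, 7], [30, 8]]
labels = ["no", "no", "yes", "yes", "yes", "no"]

score, feature, threshold = best_split(rows, labels)
print(f"split on feature {feature} at <= {threshold}, impurity {score:.2f}")
```

Note that when two candidate splits tie, this sketch simply keeps the first one it encounters. That kind of arbitrary tie-breaking is one way the ordering of the input can lead to a different tree.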

So how do we know which of all the possible trees that can be generated is best? We don’t.

That’s why one effective technique is to build multiple decision trees and use the results from all of them to make a classification. A well-known version of this approach, in which each tree is built from a random sample of the data and attributes, is the random forest. These perform quite well for many classification tasks.
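The idea of combining several trees can be sketched in a few lines. The example below is a deliberately simplified stand-in, not a full random forest: each "tree" is a single-split stump trained on a bootstrap sample of the rows, and the final answer is a majority vote. All names and the toy data are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows, labels):
    """Fit a one-split tree (a 'stump') by minimising misclassifications."""
    default = Counter(labels).most_common(1)[0][0]
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split in this sample; predict the majority class
        return lambda row: default
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest

def predict(forest, row):
    """Classify a row by majority vote across all trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: [income in $1000s, years at current job] -> repaid loan?
rows = [[25, 1], [40, 3], [60, 2], [80, 10], [95, 7], [30, 8]]
labels = ["no", "no", "yes", "yes", "yes", "no"]

forest = train_forest(rows, labels)
print(predict(forest, [70, 4]))
```

An odd number of trees avoids tied votes when there are only two classes. A real random forest would also grow deeper trees and consider a random subset of attributes at each split.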

When we use data analytics services, we ideally would know the algorithm used and have data to evaluate the results. Google’s prediction service may not disclose details about the algorithm, but it does give users information about the accuracy of the classifications in the form of a confusion matrix. This is a table that counts how many instances were classified correctly and how many were classified incorrectly, with the errors broken out as false positives and false negatives. We can use this data to understand how well the model is working.
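As a rough illustration of what a confusion matrix holds for a two-class problem, the sketch below tallies the four cells from a list of actual and predicted labels and derives a few common measures. The data and function name are made up for the example, and the output format is not that of any particular service.

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive):
    """Count true/false positives and negatives for a two-class problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return {k: counts[k] for k in ("tp", "fp", "fn", "tn")}

# Hypothetical labels: what actually happened vs. what the model predicted.
actual    = ["yes", "yes", "no", "no",  "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "yes", "no", "no", "yes"]

cm = confusion_matrix(actual, predicted, positive="yes")
accuracy = (cm["tp"] + cm["tn"]) / len(actual)   # share of all cases that were right
precision = cm["tp"] / (cm["tp"] + cm["fp"])     # of predicted "yes", how many were right
recall = cm["tp"] / (cm["tp"] + cm["fn"])        # of actual "yes", how many were found

print(cm)
print(accuracy, precision, recall)
```

Looking at the cells separately matters because the two kinds of error rarely cost the same: approving a loan that defaults is a different mistake from rejecting a customer who would have repaid.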

We can run into problems if we blindly take the results of data mining and statistical analysis without measuring accuracy and validating results. In the next post, we’ll consider methods for validating the results of data mining models.

Dan Sullivan is an author, systems architect, and consultant with over 20 years of IT experience with engagements in systems architecture, enterprise security, advanced analytics and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail, gas and oil production, power generation, life sciences, and education.  Dan has written 16 books and numerous articles and white papers about topics ranging from data warehousing, Cloud Computing and advanced analytics to security management, collaboration, and text mining.



