The Cloud: How to Validate Data Mining Models
Data mining as a service is not a “just add water” solution to your analytics problems.
Data mining is becoming more readily accessible to everyone. Service providers are offering machine learning in the cloud and all you have to do is bring your own data. I talked about the potential problems with this approach in an earlier post. Now I want to describe things you can do to validate your data mining models.
First, I want to say again I don't have any problem with data mining as a service. It's a great application for the cloud and the more options we have the better. I am more concerned with marketing material that makes data mining sound like a “just add water” solution to your analytics problems. Data mining is a practice built on algorithms for building models and techniques for evaluating those models. It's easy to talk about the former and forget about the latter. Just because a data mining algorithm spits out a model doesn't mean it's a good model for your needs.
I'm going to limit the discussion to classification or prediction services since they are probably the most likely to be used by someone just getting into data mining. Other types of algorithms, like market basket analysis, are useful for many business applications but I won't get into those here. The first thing we should keep in mind is that there are different kinds of classification algorithms and some may work better with your data than others. You need to evaluate how well the models you build actually work.
Look for two key attributes of classification models: accuracy and reliability. Accuracy is a measure of how often the model gets its predictions right. Reliability is a measure of how consistent the model is across different data sets. A model that has high accuracy on one data set but lousy accuracy on others is not much use.
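To make those two terms concrete, here is a minimal Python sketch. The function names, the labels, and the three small sets of predictions are made up for illustration, and using the spread of accuracy across data sets as a stand-in for reliability is my assumption, not a standard definition.

```python
from statistics import mean, pstdev

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def reliability_summary(accuracies):
    """Summarize how consistent accuracy is across several data sets.

    Assumption: a small spread and a decent worst case suggest a reliable model.
    """
    return {
        "mean": mean(accuracies),
        "spread": pstdev(accuracies),
        "worst": min(accuracies),
    }

# Hypothetical true labels and one model's predictions on three data sets.
actual = ["spam", "ham", "spam", "ham"]
results = [
    accuracy(["spam", "ham", "spam", "ham"], actual),   # 4 of 4 correct
    accuracy(["spam", "spam", "spam", "ham"], actual),  # 3 of 4 correct
    accuracy(["ham", "ham", "spam", "spam"], actual),   # 2 of 4 correct
]
summary = reliability_summary(results)
print(results)
print(summary)
```

A model like this one, perfect on one data set and no better than a coin flip on another, is the kind you want to catch before production.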
You build a data mining model with training data and validate it with data the model has not seen. One common method is to split your data into a training set and a test set. You could, for example, randomly pick 70% of your data for training and hold back the remaining 30% for testing. You build the model using the 70% and then run the classifier on the test data. The percentage of items in the test set that are correctly classified is your accuracy.
A single test probably won't reflect how the model will do in production. A common technique is to divide your data set into 10 distinct subsets. You then pick 9 of the subsets for training and 1 for testing. You repeat the process until each of the 10 subsets has been used once for testing. The average accuracy of the ten tests gives you an estimate of the accuracy rate of your model. This technique is known as 10-fold cross-validation.
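Here is a minimal sketch of that 10-fold procedure in Python. As before, the synthetic data, the helper names, and the nearest-neighbor classifier are assumptions made up for illustration, not part of any particular service.

```python
import random
from statistics import mean

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and deal them into k non-overlapping folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def predict_1nn(train, features):
    """Hypothetical stand-in classifier: return the label of the closest training row."""
    nearest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )
    return nearest[1]

def cross_validate(rows, k=10, seed=0):
    """Return one accuracy score per fold; each fold is the test set exactly once."""
    scores = []
    for fold in k_fold_indices(len(rows), k, seed):
        held_out = set(fold)
        train = [row for i, row in enumerate(rows) if i not in held_out]
        test = [rows[i] for i in fold]
        correct = sum(predict_1nn(train, f) == label for f, label in test)
        scores.append(correct / len(test))
    return scores

# Synthetic data (an assumption): two clusters of 50 points, labeled "a" and "b".
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(50)]
data += [((rng.gauss(4, 1), rng.gauss(4, 1)), "b") for _ in range(50)]

scores = cross_validate(data, k=10)
print(scores)
print(mean(scores))
```

Looking at the ten individual scores, not just their average, also tells you something about how much the model's accuracy moves around from one subset to the next.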
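Another method for testing is to randomly choose a subset of data for testing. You train on the remaining data and then test with the randomly selected subset. Repeat this process multiple times and average the results. Unlike 10-fold cross-validation, the test subsets can overlap from one round to the next, so some items may be tested several times and others not at all.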
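The repeated random approach might look like the following Python sketch. The 30% test fraction, the 20 repetitions, the synthetic data, and the nearest-neighbor stand-in classifier are all assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

def predict_1nn(train, features):
    """Hypothetical stand-in classifier: return the label of the closest training row."""
    nearest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )
    return nearest[1]

def repeated_random_tests(rows, test_fraction=0.3, repeats=20, seed=0):
    """Draw a fresh random test subset each round and return the accuracy of each round."""
    rng = random.Random(seed)
    n_test = round(len(rows) * test_fraction)
    scores = []
    for _ in range(repeats):
        shuffled = rng.sample(rows, len(rows))  # a new random ordering every round
        test, train = shuffled[:n_test], shuffled[n_test:]
        correct = sum(predict_1nn(train, f) == label for f, label in test)
        scores.append(correct / len(test))
    return scores

# Synthetic data (an assumption): two clusters of 50 points, labeled "a" and "b".
data_rng = random.Random(42)
data = [((data_rng.gauss(0, 1), data_rng.gauss(0, 1)), "a") for _ in range(50)]
data += [((data_rng.gauss(4, 1), data_rng.gauss(4, 1)), "b") for _ in range(50)]

scores = repeated_random_tests(data)
print(mean(scores), stdev(scores))
```

The mean gives you an accuracy estimate, and the standard deviation across rounds is a rough indication of how consistent the model is.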
Data mining is more than running some data through a classification algorithm. These algorithms produce models that will vary in quality. Part of that variance is a function of the algorithm and part is a function of the data you use. It is important to evaluate your model before deploying it in production. By all means use data mining services; just don't forget to validate the models when you do.
Dan Sullivan is an author, systems architect, and consultant with over 20 years of IT experience with engagements in systems architecture, enterprise security, advanced analytics and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail, gas and oil production, power generation, life sciences, and education. Dan has written 16 books and numerous articles and white papers about topics ranging from data warehousing, Cloud Computing and advanced analytics to security management, collaboration, and text mining.
See here for all of Dan's Tom's IT Pro articles.
Additional The Silver Lining blog posts:
A little knowledge is a dangerous thing – especially when it is applied to data analytics.
Science fiction mainstay helps explain some common but divergent views on .
Think you’re ready for Big Data analysis, right? Not necessarily.
Days of building apps with Linux, Apache, MySQL and , Python or PHP over thanks to the Cloud.
A particularly apt biological metaphor for the state of today.
Software as a service is and will be the most innovative and profitable segment of cloud computing.
Inaugural post for Dan Sullivan's Tom's IT Pro "The Silver Lining" blog about Cloud Computing.