How would you like to cut your cloud computing costs by 20%, 30% or maybe even 50%? That is the lure of using spot instances in Amazon’s EC2 cloud, where you can bid a price lower than the standard rates for virtual machine instances.
When Amazon has unused computing capacity the company makes some of that capacity available at a cost lower than on-demand or reserved instances. The lower cost, known as the spot price, varies over time. Customers offer a bid price for spot instances and as long as their bid is above the spot price they will have access to a spot instance.
This sounds like a bargain, and it is, if you can live with one key constraint: if the spot price rises above your bid price, your instance is shut down.
Spot instances are a viable option for some computing tasks, but you need to choose your bid carefully based on the applications you will run. If you expect constant uptime for your instance but still want to cut costs, you can bid at or above the on-demand price for Amazon compute instances. You are charged the spot price rather than your bid, so this strategy can cut costs while minimizing the risk of disruptions. If you have back office batch operations that can tolerate variations in availability, you can bid lower and save even more. This article discusses ways to achieve those savings while reducing the chances of losing work when your spot instances are shut down.
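To make the trade-off between a low bid and a high bid concrete, here is a minimal sketch in Python. The price list and the on-demand rate are made-up numbers for illustration, not real AWS prices, and the function name simulate_bid is hypothetical. Billing is simplified to whole hours charged at that hour's spot price.

```python
def simulate_bid(hourly_spot_prices, bid):
    """Return (hours_run, total_cost) up to the first interruption.

    Simplifying assumptions: each hour is billed at that hour's spot
    price, and the instance is shut down as soon as the spot price
    rises above the bid.
    """
    hours_run = 0
    total_cost = 0.0
    for price in hourly_spot_prices:
        if price > bid:
            break  # spot price exceeded the bid: instance is reclaimed
        hours_run += 1
        total_cost += price  # charged the spot price, not the bid
    return hours_run, total_cost


# Hypothetical prices in dollars per hour, for illustration only.
prices = [0.03, 0.04, 0.03, 0.09, 0.03]
on_demand = 0.10

low_hours, low_cost = simulate_bid(prices, bid=0.05)
high_hours, high_cost = simulate_bid(prices, bid=on_demand)
```

With these invented numbers, the low bid is interrupted when the price spikes in the fourth hour, while the bid at the on-demand rate runs all five hours and still costs less than five hours of on-demand time.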
There are several ways to design your applications to be fault tolerant. A common approach is checkpointing. As the application runs, it saves state information to persistent storage, such as a relational database or an S3 bucket. Database developers use this approach routinely with transactions. One design decision application architects face is how frequently to write checkpoint information to persistent storage.
If your application runs for extended periods without writing state information you risk losing a substantial amount of work that has been done since the last checkpoint. If you write persistent data too frequently you may find that you are incurring more overhead than necessary to preserve your work.
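One way to balance these two risks is to checkpoint on a time interval rather than after every operation. The sketch below is an illustration under stated assumptions, not a definitive implementation: the names save_checkpoint and run_job, the JSON state format, and the 300-second default are all hypothetical. It also writes to a local file path for simplicity; on a spot instance that path would need to live on storage that survives the instance, such as a database or S3.

```python
import json
import os
import tempfile
import time


def save_checkpoint(path, state):
    """Write state atomically so an interruption never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def run_job(items, process, checkpoint_path,
            interval_seconds=300, clock=time.monotonic):
    """Process a list of items, checkpointing when interval_seconds elapse.

    Returns the number of checkpoints written, including the final one.
    """
    last_saved = clock()
    checkpoints_written = 0
    for index, item in enumerate(items):
        process(item)
        if clock() - last_saved >= interval_seconds:
            # Record the index of the next item that still needs processing.
            save_checkpoint(checkpoint_path, {"next_index": index + 1})
            last_saved = clock()
            checkpoints_written += 1
    # Always record completion, even if the interval never elapsed.
    save_checkpoint(checkpoint_path, {"next_index": len(items)})
    return checkpoints_written + 1
```

A shorter interval_seconds reduces the work lost on interruption at the cost of more writes; a longer one does the opposite.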
If Amazon reclaims your spot instance you will not be charged for the hour in which the instance was reclaimed. This means if you checkpoint at least once an hour you will lose at most an hour of work but at least you will not be charged for that hour. If spending another hour re-running a job will adversely affect your workflow, then you should consider increasing the frequency of your checkpoints.
You will also need to design a method to detect when a job has been interrupted and to resume it from the last saved checkpoint. When a job starts, it can read checkpoint data to determine the last completed operation. In some cases this can be a small amount of data, such as the name of the last file processed. Assuming you have a natural ordering to your input files, such as ordering by filename and creation time, you can readily determine the next file to process and start the processing. If you do not have a natural ordering for your inputs, you will need to write more complex data structures to ensure you do not miss or re-run parts of your job.
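The restart logic for the simple case, where input files have a natural filename ordering, might look something like the following sketch. The function names and the one-line checkpoint format are hypothetical, and the checkpoint is assumed to be stored outside the input directory, on storage that outlives the instance.

```python
import os


def load_last_processed(checkpoint_path):
    """Return the name of the last completed file, or None if none exists."""
    try:
        with open(checkpoint_path) as f:
            name = f.read().strip()
    except FileNotFoundError:
        return None
    return name or None


def remaining_files(input_dir, last_processed):
    """List input files that sort after the last completed file."""
    names = sorted(
        n for n in os.listdir(input_dir)
        if os.path.isfile(os.path.join(input_dir, n))
    )
    if last_processed is None:
        return names  # fresh start: everything is pending
    return [n for n in names if n > last_processed]


def run(input_dir, checkpoint_path, process):
    """Process pending files in order, recording each one once it completes."""
    last = load_last_processed(checkpoint_path)
    for name in remaining_files(input_dir, last):
        process(os.path.join(input_dir, name))
        # Write to a temporary file, then rename, so the checkpoint
        # is never left half-written if the instance is reclaimed.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            f.write(name)
        os.replace(tmp, checkpoint_path)
```

Because the checkpoint is written only after a file finishes, an interruption in the middle of a file means that file is processed again on restart, so process should be safe to repeat for the same input.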
Dan Sullivan is an author, systems architect, and consultant with over 20 years of IT experience with engagements in systems architecture, enterprise security, advanced analytics and business intelligence. He has worked in a broad range of industries, including financial services, manufacturing, pharmaceuticals, software development, government, retail, gas and oil production, power generation, life sciences, and education. Dan has written 16 books and numerous articles and white papers about topics ranging from data warehousing, Cloud Computing and advanced analytics to security management, collaboration, and text mining.