Picture the scene: you’re in charge of a development team at a start-up, and you’re using AWS to provide your infrastructure so that you can expand your capacity as your company grows. Everything is going well until the CFO wanders over and asks what’s going on with the AWS bill. You assume the CFO is overreacting, but just to keep them happy you check over the bill, and discover it’s gone pretty quickly from “reasonable” to “tragic”. How did that happen? And how can you fix it, before people start looking at you as a handy cost saving?
Firstly, it’s worth thinking about how this whole situation came about, so that it doesn’t happen again. It’s sensible enough to give developers the authority to create cloud computing resources as they need them, but they have to be careful not to get carried away, creating resources that never get shut down and so keep costing you money. At least you didn’t believe any developers who said things like “Memory and CPU are cheap!”, did you? I mentioned this in System Performance – Part 1 – Outrun The Bear – saying this sort of thing is a red flag you need to be aware of. If anyone told you this, and you believed them, think of it as a “learning experience”, never believe anything they say ever again, and don’t tell anyone else.
So what are you going to do about the bill? Here are some steps you may find useful:
AWS gives you the ability to add “tags” to almost any kind of resource – servers, lambdas, databases, load balancers. Tag everything, so that you can see what every resource, and every type of resource, is costing you.
You’ll need some sort of schema, or pattern, so that you can work out the costs by resource type, department, environment, and so on. This will make it more obvious what exactly is costing you money. One simple schema that you might want to base your tagging on is below. Assume your company is called MyCompany, and use that as a prefix:
| Tag | Meaning |
| --- | --- |
| MyCompany:Department | The department of the company that is responsible for the resource. For example “Development”, “Sales”, “Accounting”, etc. |
| MyCompany:Environment | The development environment where the resource belongs. This might be something like “Development”, “Test”, “Staging” or “Production”. |
| MyCompany:Name | A readable name for the resource. |
| MyCompany:Owner | The name or email of the person responsible for the resource. A development machine will probably have a developer’s details here; an accounting machine might have the CFO’s. |
| MyCompany:System | The system to which the resource belongs. For example “Build”, “CRM”, “Accounting”, etc. |
| MyCompany:Purpose | The specific purpose of the resource. “DB”, “Video Server”, “Application Server”. |
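Once a schema like this exists, it’s easy to audit resources against it. Here’s a minimal sketch of such a check in Python – the `REQUIRED_TAGS` set and `missing_tags()` helper are illustrative names, not AWS APIs, and in practice you’d feed in tags fetched via the AWS SDK or CLI:

```python
# Illustrative tag-schema check, assuming the MyCompany:* tags described above.
REQUIRED_TAGS = {
    "MyCompany:Department",
    "MyCompany:Environment",
    "MyCompany:Name",
    "MyCompany:Owner",
    "MyCompany:System",
    "MyCompany:Purpose",
}

def missing_tags(resource_tags: dict) -> set:
    """Return the required tags that a resource is missing."""
    return REQUIRED_TAGS - set(resource_tags)

# Example: a half-tagged instance
tags = {"MyCompany:Name": "build-server-1",
        "MyCompany:Owner": "dev@mycompany.com"}
print(sorted(missing_tags(tags)))
```

Run against every resource in an account, a report like this quickly shows you which resources can’t yet be attributed to a department or owner.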
Once you’ve decided on the tags you’re going to assign to each resource, you need to start assigning them. If you’re already using something like Terraform to create and update your system, start adding the tags there. Otherwise, if you’re still manually creating and updating resources, you need to start adding tags manually as well. Don’t be tempted to put this off; if it needs to go on your product backlog, it goes at the top, because this is something that can actually stop your company hemorrhaging money.
It’s worth mentioning at this point that you can create Lambda functions to automatically tag resources as they are created manually. If your resources are created manually, add a task to implement these near the top of your backlog; until you do, you’ll be adding tags by hand.
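As a sketch of what such a Lambda might do: the event shape below is a simplified CloudTrail-style record (the real structure has more fields), and the actual tagging call – for example `ec2.create_tags` via boto3 – is deliberately left out, so only the pure logic is shown:

```python
# Sketch: derive tags from a (simplified) resource-creation event.
# In a real Lambda you would pass the result to the AWS SDK, e.g.
# ec2.create_tags(Resources=[instance_id], Tags=[...]) -- omitted here.

def tags_for_event(event: dict) -> dict:
    """Derive owner/environment tags from a creation event record."""
    identity = event.get("userIdentity", {})
    return {
        "MyCompany:Owner": identity.get("userName", "unknown"),
        # "environment" is an assumed field for illustration only
        "MyCompany:Environment": event.get("environment", "Development"),
    }

event = {"userIdentity": {"userName": "alice@mycompany.com"}}
print(tags_for_event(event))
```

Tagging at creation time like this means the weekly cost reports never show anonymous resources.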
While you’re adding tags to manually-created EC2 instances, you should make use of the recommendations that AWS can give you. Opt your instances in to AWS Compute Optimizer and AWS will start to analyze them. It takes around two weeks of metrics, but then you’ll start to get recommendations based on the CPU and network activity of each instance.
If you need 10 servers to run your system at maximum demand, there’s no point in paying to run that many when there’s nothing much going on in your system. Using Elastic Beanstalk allows AWS to start and stop extra machine instances when the CPU usage on the current machines exceeds a given level. This means you can optimize your computing resources and minimize your costs.
If you organize your development and test systems to work in the same way, you can minimize their costs too. After all, it’s highly unlikely your development or test systems are needed 24 hours a day, 7 days a week, so make sure you don’t pay for them to run that way.
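The core of this kind of scaling is target tracking: pick a CPU level you want the fleet to sit at, and scale the instance count to hit it. A rough sketch, with illustrative numbers (the real work is done by Elastic Beanstalk or an Auto Scaling group, not code you write yourself):

```python
import math

def desired_instances(current: int, avg_cpu: float,
                      target_cpu: float = 60.0,
                      minimum: int = 1, maximum: int = 10) -> int:
    """Scale the fleet so average CPU lands near target_cpu percent."""
    wanted = math.ceil(current * avg_cpu / target_cpu)
    return max(minimum, min(maximum, wanted))

# 4 instances running hot at 90% CPU -> scale up
print(desired_instances(4, 90.0))   # 6
# 10 instances idling at 12% CPU -> scale down
print(desired_instances(10, 12.0))  # 2
```

The same arithmetic explains why idle dev/test fleets are so wasteful: at 12% average CPU you’re paying for roughly five times the capacity you need.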
AWS lets you create billing alerts which email you when your bill goes up by more than a given amount in a day. Create one, set it to trigger on a few percent increase (you can adjust it later), and use it! If your development team is constantly creating more instances or increasing the resources they use, you need to know about it.
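The threshold logic itself is trivial, which is one more reason there’s no excuse not to have it. A sketch (in practice you’d use CloudWatch billing alarms or AWS Budgets rather than rolling your own; the 5% default here is an arbitrary illustration):

```python
def should_alert(yesterday: float, today: float,
                 threshold_pct: float = 5.0) -> bool:
    """Fire when the daily bill grows by more than threshold_pct percent."""
    if yesterday <= 0:
        return today > 0  # any spend from a zero baseline is news
    return (today - yesterday) / yesterday * 100 > threshold_pct

print(should_alert(100.0, 108.0))  # True: an 8% jump
print(should_alert(100.0, 103.0))  # False: within tolerance
```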
AWS database servers don’t give you the same “AWS Compute Optimizer” options that EC2 instances do, so you have to do it manually.
To start, all of your databases should be accessed using aliases: something like mysalesdatabase.mycompany.com. Why? Because every time you recreate a database server, AWS gives it a new endpoint address. With an alias, you just update the target in the alias’s DNS entry; without one, you need to change the configuration of every system that accesses that database. That’s a level of risk that doesn’t make sense to tolerate. If you’re not using aliases, create them and then get the software updated to use them.
You’ll also want to enable “Enhanced Monitoring” on your database instances; just use the default parameters for the moment. This lets you see much more information about the operation of the database. Also, if someone has changed the “Provisioned IOPS” settings to greater than the default, check to see whether you actually need more than the minimum. If you’re not sure, set an alert to trigger when the IOPS level exceeds 1000. If it never fires, you can reduce the limit.
Once you’ve made these changes, you can start checking up on the database servers in your system. As a rough guide, if your database CPU rarely exceeds 50%, and there are no query queues even during busy periods, you should be able to move down to the next RDS instance size. Development and test servers should also be quite a lot smaller than production servers, and production servers often don’t need to be as big as they are. Even if servers DO need to be big, things don’t always stay that way. A software update, or refactoring a dreadfully inefficient piece of business software, can dramatically affect the demand on the database server. I’ve seen DB servers drop from 80% CPU usage to 15% when a particularly poor piece of software was rewritten and reconfigured. If that happens to you, try reprovisioning the server so it’s one, or even two, sizes smaller. Take it one step at a time, and you can halve your server bill with each step.
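That rough guide can be written down as a one-line rule, which is handy if you want to fold it into an automated weekly report. The thresholds below are the ones from the text, not an AWS recommendation:

```python
def can_downsize(peak_cpu_pct: float, query_queue_seen: bool) -> bool:
    """Rule of thumb from the text: rarely over 50% CPU, no query queues."""
    return peak_cpu_pct < 50.0 and not query_queue_seen

# The rewritten-software example from the text: 80% CPU dropped to 15%
print(can_downsize(15.0, False))  # True: a candidate for a smaller instance
print(can_downsize(80.0, False))  # False: leave it alone
```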
While you’re changing the database server settings, make sure your allocated storage is a reasonable size. You might be using hundreds of GB, but if someone has reserved terabytes of space, you’re paying too much.
Amazon API Gateway is also worth a look. Among other things, it gives you:

- Several different API types, including low-cost HTTP APIs
- Security provided by the API Gateway’s use of IAM, JWT and AWS Cognito User Pools
- Auto-scaling to handle any level of demand, up to a limit you specify
API Gateways save you the trouble, and cost, of maintaining separate EC2 instances and related infrastructure. They also increase and decrease the available resources to exactly fit demand. If you have a service that can be contained in a Lambda, the API Gateway lets you make it externally accessible with all of the security and authorization that you need.
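For a sense of how little code that service can be: with the Lambda proxy integration, API Gateway hands your function an event dict (including fields like `queryStringParameters`) and expects a `statusCode`/`body` response. A minimal sketch – the greeting endpoint is invented for illustration:

```python
import json

def handler(event: dict, context=None) -> dict:
    """Minimal Lambda-proxy-style handler: greet a caller by query-string name."""
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello {name}"}),
    }

print(handler({"queryStringParameters": {"name": "MyCompany"}}))
```

No EC2 instance, no load balancer, no patching – and you pay only per request.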
Paying for on-demand AWS resources is great when you’re starting out, but it gets expensive fast. Certainly by the time you have a production system, you should be buying capacity in advance. Remember that EC2 instances bought in advance can usually be upgraded to other types, although you should check with AWS before you make too many changes. Prepaid RDS servers, on the other hand, often cannot be changed, so if you need to change the RDS instance type part way through the prepay period, you may want to use the original for testing or development.
Several years ago Netflix created a utility called Janitor Monkey. Its job is to search through your AWS estate for resources that are not being used; once it finds them, it lets you know, and can be configured to shut them down. Cloud Custodian does a similar job.
Both of these utilities, and many like them, can automate much of your cost-saving work. I’ve worked for companies that set up automatic weekly emails listing servers and other resources that are not being used, who owns them, and how much they cost. There’s nothing like a weekly reminder, with figures attached, to persuade people to switch off unused systems.
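The heart of such a weekly report is just a filter over resource metadata. A sketch, assuming you’ve already gathered last-used timestamps and costs (the field names and the 14-day cutoff are illustrative, not part of either tool):

```python
from datetime import datetime, timedelta

def idle_report(resources: list, now: datetime, idle_days: int = 14) -> list:
    """Return (name, owner, monthly_cost) for resources idle past the cutoff."""
    cutoff = now - timedelta(days=idle_days)
    return [(r["name"], r["owner"], r["monthly_cost"])
            for r in resources
            if r["last_used"] < cutoff]

now = datetime(2024, 6, 1)
resources = [
    {"name": "old-demo", "owner": "bob@mycompany.com",
     "monthly_cost": 312.0, "last_used": datetime(2024, 3, 1)},
    {"name": "build-server", "owner": "ci@mycompany.com",
     "monthly_cost": 95.0, "last_used": datetime(2024, 5, 31)},
]
print(idle_report(resources, now))  # only "old-demo" appears
```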
If you manually allocate resources, then every change you make involves manually updating every resource that depends on that changed resource.
Instead, if you define the system in Terraform, or perhaps in Docker and CloudFormation, you can define dependencies, tags, accounts and so on. Any change you make affects only that resource, and the parts you don’t change – tags, etc. – stay the same and don’t need to be updated. This saves a lot of development time, and simplifies deployments, since only the differences you’ve made get applied.
It’s easy to do a lot of work to reduce your AWS bill, get a reasonable result, and then move on to the next thing. The problem with this is that your team are probably constantly making changes to the environment, and your changes will gradually be overwritten. The best way to stop this happening is to set aside some time every week, ideally at the same time, to run through all of the above steps again. You’ll get into the habit of re-examining the systems, and you’ll get to know them better. Ideally you should involve members of the development team, so that they can see the effect their changes are having, and to emphasize the need to reduce wasted resources and costs.
These steps should be enough to solve your immediate cost problems, and to give you a steady footing to build your future AWS strategy on. My personal record is a 30% reduction in an overall bill over the course of a few months, and that was done with occasional updates, not weekly ones!
If you have any questions about any of these methods of reducing AWS costs, or if you have any other questions – especially if you’re in Canada – feel free to contact me.