Cost-Optimized Cloud Development Environments at Siemens

Setting the scene

Development environments exist so teams can experiment, learn, implement, test, and validate code before it is promoted on its way from development to production. Along the way, the need for capacity, performance, and availability increases, and so does the cost.

Kathleen DeValk, Chief Architect and Head of Global Architecture at Siemens, shared with us some of the techniques the team uses to lower its cloud costs without sacrificing the freedom to innovate.

Tag Early and Often

To understand your costs, all resources need to be tagged. Tags let you easily track costs and attribute them to the software or development team that incurred them. Tagging is a small effort up front versus a high effort to "fix it" later. Teams invariably end up going back to add tags once costs increase and finance asks why the AWS bill is so high. Why not just do it in the first place? Even a preliminary tagging concept is better than no tags. A simple script can detect instances that have no tags, and custom scripts can verify that specific tags are present.
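
To make the untagged-instance check concrete, here is a minimal sketch in Python using boto3; the required tag keys are assumptions, so substitute your own tagging concept:

    import boto3

    # Tag keys expected on every instance (hypothetical; use your own tagging concept)
    REQUIRED_TAGS = {"CostCenter", "Team", "Project"}

    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_instances")

    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"] for t in instance.get("Tags", [])}
                missing = REQUIRED_TAGS - tags
                if missing:
                    print(f"{instance['InstanceId']} is missing tags: {sorted(missing)}")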

EC2 Instance Pricing and Right-Sizing

At the heart of AWS compute is Amazon Elastic Compute Cloud (Amazon EC2). Many services run in a container on an EC2 instance. However, not all instances are created equal. You pay for these resources by the hour or by the second (see the EC2 pricing page), and the charges can be tracked in the AWS billing console.

It is important to understand that compute in the cloud can be purchased in different ways with different cost considerations.

On-Demand Instances

On-Demand resources are the most common, and when your development team is just getting started, this is often the first thing developers try. An On-Demand resource is available on request and remains "yours" until you shut it down or terminate it. They are the most flexible option and a good choice for running something for a limited time with guaranteed availability. However, it's important to ask whether your development instance really has an uptime requirement. What would happen if the environment were unavailable for a short while? Would all work come to a halt? Usually not; although some development activities may occasionally be mission critical, the development process can typically tolerate a failure.

If your development environment is an isolated deployment, the impact of an unavailable node is minimal. Where proper CI/CD is used, there may be a window of time in which the software is under test; that window may be small but recur multiple times per day. In this case, a cost optimization to consider is Spot Instances.

Spot Instances

Spot Instances are transient resources that can be purchased at a significantly reduced price (up to a 90% discount compared to On-Demand prices), but there is a caveat: when EC2 needs the capacity back, you receive a two-minute interruption notice before you lose your instance. When considering "I might lose my instance," many people worry that they would have significant problems if an instance were ever lost. However, if you are designing stateless services and applications that are intended to run in a distributed, highly available, load-balanced configuration, you are already designing services that can tolerate such incidents. EC2 instances should always be considered ephemeral. They can fail, reboot, or be lost even when using On-Demand or Reserved Instances. Something can and will go wrong, which is why you design for resilience, redundancy, and failover. Designing for these failures is essential in the cloud, and your Spot Instance usage benefits from this best-practice design. Learn best practices for handling EC2 Spot Instance interruptions from this blog.
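
One way to act on that notice is to watch the instance metadata service for a pending interruption; the spot/instance-action endpoint returns 404 until an interruption is scheduled. A minimal polling sketch (the drain logic at the end is a placeholder for your own shutdown steps):

    import time
    import urllib.error
    import urllib.request

    TOKEN_URL = "http://169.254.169.254/latest/api/token"
    ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    def spot_interruption_pending():
        # IMDSv2: fetch a short-lived session token first
        token_req = urllib.request.Request(
            TOKEN_URL, method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        token = urllib.request.urlopen(token_req, timeout=2).read().decode()
        action_req = urllib.request.Request(
            ACTION_URL, headers={"X-aws-ec2-metadata-token": token},
        )
        try:
            urllib.request.urlopen(action_req, timeout=2)
            return True   # 200 means an interruption is scheduled
        except urllib.error.HTTPError:
            return False  # 404 means nothing is scheduled yet

    while not spot_interruption_pending():
        time.sleep(5)
    print("Interruption notice received; draining")  # deregister, checkpoint, exit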

Reserved Instances (RIs)

Reserved Instances (RIs) are purchased as a commitment to run an instance 24x7 for a period of time (a 1- or 3-year term), with the highest savings available on a 3-year term. However, you pay the hourly price whether the instance is running or idle. In exchange, Reserved Instances provide a discount (up to 72%) compared to On-Demand Instance pricing.

RIs are shared within an account. What if I have a reserved m4.8xlarge instance but only need to run it 8 hours per day, 5 days a week? It may still make sense to purchase the reservation if my team operates 24x7, because when I am not using the reserved instance, someone else can. If I have three m4.8xlarge instances and each runs 8 hours with no overlap, I can reserve one instance and share it, because whichever m4.8xlarge is running is billed at the RI price.

The break-even point for an RI is calculated by comparing the On-Demand price of the total required hours against the RI price of running 24x7. RIs are not always the right choice; it depends on the instance type, how long you plan to run it, and whether the RI can be shared.

Here is an example calculation to help determine when to use an RI, followed below by a scripted version of the same check. Let's take US East (N. Virginia) as an example, for a t3.xlarge standard 1-year RI with no upfront payment:

  • No Upfront total cost: $0.104 per hour * 24 hours * 365 days = $911.04
  • On-Demand cost: $0.1664 per hour
  • Number of hours for On-Demand to reach the No Upfront total cost: $911.04 / $0.1664 = 5,475 hours = 228.125 days = 62.5% of one year; beyond that utilization, the RI is cheaper
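
A minimal sketch of the same break-even check in Python, assuming you look up the two hourly rates on the EC2 pricing page:

    def ri_break_even_fraction(ri_hourly, on_demand_hourly, hours_per_year=24 * 365):
        """Fraction of the year an instance must run On-Demand to match the RI cost."""
        ri_total = ri_hourly * hours_per_year          # an RI bills 24x7
        break_even_hours = ri_total / on_demand_hourly
        return break_even_hours / hours_per_year

    # t3.xlarge, US East (N. Virginia), 1-year standard, no upfront
    print(f"{ri_break_even_fraction(0.104, 0.1664):.1%}")  # -> 62.5%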

Using RIs in development is not always the most cost-effective option. Development is not production: instances do not have to handle production capacity. A smaller instance is often acceptable for the majority of development and test work unless you are executing scalability, boundary, load, or performance tests, and even those do not have to run continuously in development.

Development involves experimentation and testing; some testing may include trying different instance types to find the optimal resource. On AWS this is known as "right-sizing": the process of identifying the most cost-optimized compute resource needed to execute your tasks. If your code is well written, highly optimized, and memory- and CPU-efficient, you may be able to use very inexpensive resources. By understanding your compute resource needs and creating a scalability calculation based on load characteristics, you gain a better picture of the required resources at various scale points and under various conditions. You may be able to scale back your development resources to lower costs, or even run tasks with limited High Availability (HA) configurations except when running HA tests. AWS Cost Explorer Rightsizing Recommendations in your Billing and Cost Management console will help identify rightsizing opportunities for your idle and underutilized instances.
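
The same recommendations are available programmatically through the Cost Explorer API; a minimal sketch (the fields printed are a small subset of what the API returns):

    import boto3

    ce = boto3.client("ce")  # Cost Explorer

    resp = ce.get_rightsizing_recommendation(Service="AmazonEC2")
    for rec in resp.get("RightsizingRecommendations", []):
        current = rec.get("CurrentInstance", {})
        print(
            current.get("ResourceId"),
            rec.get("RightsizingType"),  # e.g. "MODIFY" or "TERMINATE"
            current.get("MonthlyCost"),  # estimated current monthly cost
        )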

Keep your environment configuration separate from your infrastructure provisioning code. Automate provisioning of your compute and deployment of your code. This allows you to quickly and easily change the configuration, re-provision, and redeploy your services to test different scenarios without locking yourself into an expensive compute contract; one simple shape for this separation is sketched below. For example, you may be able to spend pennies per hour in development on T2 instances, a family of very low-cost AWS EC2 compute resources.
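
As a sketch of that separation, per-environment sizing can live in data and be handed to whatever provisioning tool you use; the instance types and counts here are illustrative, not recommendations:

    # Per-environment sizing lives in data, not in the provisioning logic
    ENVIRONMENTS = {
        "dev":  {"instance_type": "t2.medium",  "count": 1, "multi_az": False},
        "test": {"instance_type": "t3.xlarge",  "count": 2, "multi_az": True},
        "prod": {"instance_type": "m5.4xlarge", "count": 3, "multi_az": True},
    }

    def provision(env_name):
        cfg = ENVIRONMENTS[env_name]
        # Hand cfg to your provisioning tool (CloudFormation, Terraform, CDK, ...)
        print(f"Provisioning {cfg['count']} x {cfg['instance_type']} "
              f"(multi-AZ: {cfg['multi_az']}) for {env_name}")

    provision("dev")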

You can also use a hybrid of these approaches. Consider reserving a small pool of lower-cost compute resources where you need an always-on environment, and using Spot resources for specialized test scenarios involving larger compute instances or auto scaling. Using RIs or On-Demand Instances for the HA configuration and Spot Instances for elastic auto scaling may also be an option if your code is designed to fail over or your infrastructure is self-healing.

Turning out the lights when you leave the room

Make sure that you shut down resources when they are not in use. It seems obvious, but it can appear to be a daunting task; with proper automation, however, it becomes easy and reproducible. The fear of "turning it off" comes from the way we are used to interacting with physical machines and servers. It took weeks to set up the development environment on my laptop; if I had to wipe it and start over, it would be a nightmare. That is why reimaging, upgrading, or replacing my laptop is a daunting task.

The cloud is different. There is no computer. There is no server. There is a configuration and a script that creates servers and puts all my favorite tools and code there. It may go away at any time, and I don't care, because with the click of a button I can bring it back with minimal effort. This is the key to happiness in the cloud.

With the cloud, I can easily "turn on" the development environment. Consider an IDE such as Eclipse. With it you write your code, but to test that code you build and start a runtime. With the runtime you can execute a unit test, launch an app server, and perform interactive debugging. The cloud development "environment" is the same concept. If you have a service and you want to test it, you build, deploy, and test. The catch is that the deploy step assumes there is something to deploy to. What if the thing you are deploying to doesn't exist, or simply isn't running? Automate the process and include the ability to provision your resources on deploy, so you can turn them on only when you need them.

I often hear arguments such as: "We are a global team operating 24x5, therefore someone always needs my environment running." What about the other two days a week? Those two days account for about 28% of total time over a year. If your infrastructure cost for development is $1 million per year and you can save 28%, you could potentially realize savings of $280k per year.
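
Capturing those hours takes only a small scheduled job. As a sketch, the following could run from a cron job or scheduled Lambda function on Friday evening and stop every running instance tagged as development (the tag key and value are assumptions):

    import boto3

    ec2 = boto3.client("ec2")

    # Find running instances tagged as development
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Environment", "Values": ["dev"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    ids = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    if ids:
        ec2.stop_instances(InstanceIds=ids)
        print(f"Stopped {len(ids)} development instances: {ids}")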

An excellent way to determine the window of time in which your development environment can be de-provisioned or hibernated is to review usage data. AWS provides tools that tell you how well utilized your resources are. For example, AWS Trusted Advisor will tell you if a resource is underutilized, and Amazon CloudWatch lets you review utilization for specific time periods.

Look at your most expensive compute resource, go to CloudWatch, and observe the metrics for last weekend. Was it in use? Now take your hourly cost for that resource and multiply it by 48 weekend hours. Consider that money spent. If it were coming out of your vacation budget or your Christmas fund, would you really want to waste it? Every situation is different, and maybe there is some need that justifies the cost; only you can decide. The goal here is to raise awareness of where your infrastructure costs are going and to provide some tips on how you can cut costs with a little creative thinking.
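
To pull that same view programmatically rather than through the console, a minimal sketch (the instance ID is a placeholder):

    from datetime import datetime, timedelta, timezone
    import boto3

    cw = boto3.client("cloudwatch")

    # Hourly average CPU utilization for one instance over the past 7 days
    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=datetime.now(timezone.utc) - timedelta(days=7),
        EndTime=datetime.now(timezone.utc),
        Period=3600,  # one datapoint per hour
        Statistics=["Average"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], f"{point['Average']:.1f}%")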

Servers are not Pets

Repeat the mantra: "treat servers like amoebas, not pets." If your pet dies, it is very sad; I can attest to this after losing my 19-year-old feline. If your server dies, it should not matter. It's not your pet; it's not even a real server, just a bunch of bits in memory. If I could clone my cat and know she was exactly the same, I might do just that. Instead I adopted a new kitten, and she is nothing like the old one. In the cloud, if I configure immutable servers through automation, I can recreate the exact same server with no differences, repeatedly, at any time, with no risk.

There might be specialized, complex configurations or dependencies on third-party software that are simply not designed to work this way because they were not designed for the cloud. Even if it is possible to fully automate provisioning, bootstrapping, server configuration, and code deployment for such a system, the process may take significant time, and that time may not be efficiently spent on every deploy-and-test cycle in your CI/CD process. In this case there is an alternative: you can hibernate your systems, or create snapshots or golden images, to simplify the "turn on" step so it executes quickly. A golden image is also a recommended time saver when using auto scaling or Spot Instances. The trade-off is that a new image is needed any time the infrastructure is updated, including OS patching and other operational maintenance.
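
Creating a golden image from a fully configured instance is a single API call; a sketch, with placeholder IDs and names:

    import boto3

    ec2 = boto3.client("ec2")

    # Bake a golden AMI from a configured dev node (ID and name are placeholders)
    image = ec2.create_image(
        InstanceId="i-0123456789abcdef0",
        Name="dev-golden-image-v1",
        Description="Pre-baked dev node: OS patches, runtime, and agents installed",
        NoReboot=True,  # skip rebooting the source; trade-off: filesystem may not be quiesced
    )
    print("AMI created:", image["ImageId"])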

If there is an image of the "working node", then when an instance is lost, a new one can be provisioned quickly along with the complete software bundle and configuration, which speeds up auto scaling and self-healing. Teams working on auto-scaling rules need to set the scale-out threshold with the time it takes to provision, bootstrap, deploy, and configure a new node in mind. If this process takes 10 minutes, I need to start early so I don't exceed my node capacity limits while waiting for the scale-out to complete. If I can cut this time to 1 minute, I can respond faster and ensure that I scale out when I really need to.

High availability in development

High availability (HA) requires redundancy of resources. If I have to achieve four nines of availability (99.99%), I must deploy my service with multiple instances spanning multiple Availability Zones (AZs) so I am isolated from single points of failure at the node, instance, physical server, rack, and data center level. This can quickly double or triple your costs, depending on your HA model. Say I have an HA ECS cluster spanning 3 AZs with 1 instance per AZ and a 2-node (2N) load requirement, meaning I want at least 2N in an HA failover mode so that I do not have reduced capacity. In this case the minimum configuration is 3 instances across 3 AZs, 1 per AZ. That means 3x the cost in development if I run the same HA configuration. You may have to test the failover scenarios, but these tests can be separated from everyday system use: reduce the cost of functional testing and debugging, and incur the additional cost only when executing the performance, scalability, failover, and HA tests. The idea is to lower the cost of what runs all the time and pay extra only when a specific HA test scenario is executed. Before the test, use the magic button to provision and deploy an HA configuration; run the test, track results, and learn; then redeploy the low-cost configuration.

You may now notice a recurring theme: the fully automated "magic deploy button" is the key to everything. Without it, every recommendation becomes a painful manual process that is impossible to replicate. With automation, a single click does everything for you, with perfectly reproducible results.

Reproducing production events

I once observed a team experience a scalability issue in production. During root cause analysis, the team established a few hypotheses that needed further testing to verify. As a result, the team decided to execute a test of their scalability configuration in development. After one week, an alert was sent from the cost monitoring tool indicating a sudden spike in development costs for that team's AWS account.

The team's solution was to increase system capacity and execute a test to reproduce the production issue. This resulted in many more, and larger, nodes running in development than is normally expected. That is fine if you shut off the resources when they are not in use and only provision such an environment for the duration of the test; otherwise, replicating a production-like environment in development can become very expensive. Establish a test scenario, provision for the test, execute, collect results, and return to the low-cost deployment.

Summary

A little thought and effort in the beginning will provide significant long-term benefits. You just need to take the first step to start realizing these benefits.

There are many ways to reduce the cost of cloud development environments. They include:

  • Automate processes that provision, deploy, test, and destroy
  • Turn off resources when you don't need them, especially nights and weekends
  • Maximize resource utilization and don't over-provision
  • Separate high performance, scalability and HA tests from every-day tasks
  • Design for the loss of an instance or resource to improve system resilience

Applying these principles can reduce your costs and provide additional benefits:

  • Enable experimentation so your architecture can evolve as requirements change
  • Provide repeatable results every time you provision and deploy to an environment
  • Prevent divergence in environments by promoting immutable infrastructures
  • Reduce risk when promoting code to production: automation ensures it is always the same

Learn other cost optimization best practices here.