
Using Automation to Support a Legacy App in the Cloud

As a cloud engineer, one of the most interesting things to hear is that an app was lifted and shifted from on-prem to the cloud. If done properly it can be extremely beneficial, but even when done right there are areas that still need constant attention. Without a solid set of tools to help automate the small things, it can feel like you’re running around without a head trying to keep everything monitored and free of issues. In the following paragraphs I will cover how we use some of our custom tools to help support a legacy application in the cloud.

The biggest service in use when supporting a legacy app will most likely be EC2. As with all AWS resources, EC2 requires monitoring to ensure app availability and consistent uptime. Monitoring EC2 instances can be relatively easy with AWS CloudWatch (CW). CW pumps in instance metrics that can be used in a number of ways: you can create alerts based off the metrics, run custom queries against them to graph important data, or even anticipate an issue before it happens with the right combination of metrics. However, sometimes you encounter issues that CW can’t solve on its own. Consider for a moment what happens when an EC2 system status check fails. When you launch an EC2 instance it must pass two health checks: an instance status check (reachability) and a system status check. It is imperative that both are passing for an instance to be in good standing. You may get an alert, but then you’d need to reboot the instance manually, and when seconds count that may not always be the best solution. One of the tools we leverage here is a custom Lambda function that runs on a cron schedule managed by EventBridge. Every five minutes it checks the status checks on our production application servers. If they pass, no action is taken; if they fail, the instance is rebooted automatically with a boto3 API call.
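As a rough sketch of what that Lambda does (the tag used to find the prod servers here is just a placeholder, and the real function has more guard rails), the boto3 calls look something like this:

```python
import boto3

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    # Find the production app servers by tag (the tag key/value are assumptions).
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:App", "Values": ["prod-legacy"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instance_ids:
        return {"rebooted": []}

    # Pull both the system and instance status checks for those servers.
    statuses = ec2.describe_instance_status(
        InstanceIds=instance_ids, IncludeAllInstances=True
    )["InstanceStatuses"]

    failed = [
        s["InstanceId"]
        for s in statuses
        if s["SystemStatus"]["Status"] == "impaired"
        or s["InstanceStatus"]["Status"] == "impaired"
    ]

    # Healthy instances are left alone; anything failing a check gets rebooted.
    if failed:
        ec2.reboot_instances(InstanceIds=failed)
    return {"rebooted": failed}
```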
 
One of the other tools we use, and probably the one with the biggest impact, is our custom SSM execution tool. If you have experience with SSM you will know it is a very impressive and powerful service. If not, SSM allows interaction with your instances, whether that is remote sessions or command execution, via an agent that runs on each instance. While this is a great service, one of the pitfalls is scale: what if you have more than ten servers? For one of our clients we maintain ~112 servers, so manually copying instance IDs into a list, or into the SSM console execution page, would be a tiresome effort. With our tool, you can simply use the filter parameters to invoke an SSM job on a group of similarly tagged instances, run a custom job of any kind using the Single Execution mode, or even run your own custom SSM jobs with any number of parameters using a custom execution template schema we created. This trims execution across any number of servers down to seconds instead of the minutes it would take to arrange manually, drastically speeding up maintenance.
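Under the hood, the tag-filtered path boils down to a boto3 send_command call with tag-based targets. The sketch below is a simplified stand-in for our tool, with a placeholder tag, document, and command:

```python
import boto3

ssm = boto3.client("ssm")

def run_on_tagged_instances(tag_key, tag_value, commands):
    # Target every managed instance carrying the tag instead of listing IDs by hand.
    response = ssm.send_command(
        Targets=[{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        DocumentName="AWS-RunShellScript",  # stock document; custom docs work the same way
        Parameters={"commands": commands},
        MaxConcurrency="10",                # how many instances run at once
        MaxErrors="1",                      # stop early if things start failing
    )
    return response["Command"]["CommandId"]

# Example: restart the app service on every instance tagged Role=web.
command_id = run_on_tagged_instances("Role", "web", ["sudo systemctl restart httpd"])
```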
 
Whenever an on-prem app is converted and deployed to the cloud, there are several considerations that must be made. The most important is ensuring that changes are not lost when things go haywire. For example, if the app you are supporting uses an Auto Scaling group, the instances and their storage volumes essentially become ephemeral. If an instance is replaced for any reason, the changes are blown into the void. A manual change to an instance may seem minor, but losing it can have consequences of varying severity: if a customer updates their app, and something then causes the instance to go down and a new one is stood up, those updates are lost. An effective way around this is EC2 Image Builder. If you are familiar with Packer, it does the same thing, except it is a native AWS product. We use the pipeline with custom build/update SSM documents that run during the build phase, and we set up an automation to help our customers get used to the process. When a customer is ready to update their environment, they simply drop the new files/objects into an S3 bucket. An S3 event notification triggers a Lambda that runs the Image Builder pipeline for the respective server type (web server files are dropped in a web directory, and so on). This removes the customer from the upgrade operation and ensures that the current instances stay in service until the update is ready to deploy.
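A stripped-down version of that Lambda might look like the following; the bucket prefixes and pipeline ARNs are placeholders, and the real automation does more validation:

```python
import boto3

imagebuilder = boto3.client("imagebuilder")

# Map the directory a customer drops files into to the pipeline for that server type.
PIPELINES = {
    "web/": "arn:aws:imagebuilder:us-east-1:123456789012:image-pipeline/web-server",
    "app/": "arn:aws:imagebuilder:us-east-1:123456789012:image-pipeline/app-server",
}

def lambda_handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        for prefix, pipeline_arn in PIPELINES.items():
            if key.startswith(prefix):
                # Kick off a fresh image build; running instances stay in
                # service until the new AMI is rolled out.
                imagebuilder.start_image_pipeline_execution(
                    imagePipelineArn=pipeline_arn
                )
```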

Automation and tooling make supporting apps and services a nicer experience, but you can also leverage them during the build process to enhance your own experience as an engineer. In my experience, customers coming from on-prem often have a set range of IPs they are allowed to use, and that range does not always line up with the IPs actually available in the subnet. It was frustrating to create infrastructure and then have to go back and redo it because an IP was out of range. So we created a tool that takes in the target subnet to deploy into and the desired range provided by the customer, then spits out a list of IPs that fall within that range and are also available in the target subnet, taking that effort off our plate so it can be directed elsewhere.
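Conceptually, the tool intersects the customer's allowed range with the addresses the subnet actually has free. Here is a minimal sketch under those assumptions, with a placeholder subnet ID and range:

```python
import ipaddress
import boto3

ec2 = boto3.client("ec2")

def free_ips_in_range(subnet_id, range_start, range_end):
    # Collect every private IP already claimed by an ENI in the subnet.
    used = set()
    paginator = ec2.get_paginator("describe_network_interfaces")
    for page in paginator.paginate(
        Filters=[{"Name": "subnet-id", "Values": [subnet_id]}]
    ):
        for eni in page["NetworkInterfaces"]:
            for addr in eni["PrivateIpAddresses"]:
                used.add(addr["PrivateIpAddress"])

    # Walk the subnet CIDR, skipping the addresses AWS reserves
    # (the network address, the next three hosts, and the broadcast address).
    cidr = ipaddress.ip_network(
        ec2.describe_subnets(SubnetIds=[subnet_id])["Subnets"][0]["CidrBlock"]
    )
    reserved = {cidr.network_address + i for i in range(4)} | {cidr.broadcast_address}

    start = ipaddress.ip_address(range_start)
    end = ipaddress.ip_address(range_end)
    return [
        str(ip)
        for ip in cidr.hosts()
        if start <= ip <= end and ip not in reserved and str(ip) not in used
    ]

print(free_ips_in_range("subnet-0123456789abcdef0", "10.0.1.50", "10.0.1.99"))
```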

Automation can be used for account cleanliness too. For example, sometimes network interfaces are created without the delete-on-termination flag set. This causes the number of ENIs in the account to grow, and it also reduces the number of available IP addresses in the subnet. If this goes on long enough you end up with hundreds, even thousands, of unused ENIs and claimed private IP addresses. Our fix was a tool that iterates through all ENIs in a given region and deletes them if they are not in use. Deleting an unattached ENI doesn’t destroy anything still in service; it simply releases its private IP address so it can be allocated to a different resource (a new EC2 instance, load balancer, etc.). This was a fun problem to solve, and one notable finding is that if you invoke this as a Lambda, depending on how many ENIs you have, it may time out or outright fail to finish (a local run with your AWS credentials works around this).
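A minimal sketch of that cleanup pass, keyed off the ENI "available" status (meaning the interface is not attached to anything):

```python
import boto3

def delete_unused_enis(region):
    ec2 = boto3.client("ec2", region_name=region)
    deleted = []
    paginator = ec2.get_paginator("describe_network_interfaces")
    # Status "available" means the ENI is not attached to any resource.
    for page in paginator.paginate(
        Filters=[{"Name": "status", "Values": ["available"]}]
    ):
        for eni in page["NetworkInterfaces"]:
            ec2.delete_network_interface(
                NetworkInterfaceId=eni["NetworkInterfaceId"]
            )
            deleted.append(eni["NetworkInterfaceId"])
    return deleted

print(delete_unused_enis("us-east-1"))
```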
 
To further the point in the previous paragraph, what about EBS snapshots? They can pile up if there is no retention lifecycle defined (sometimes this is not at the forefront of a customer’s mind, or they only get back to it after deployment). The fix going forward is easy: add a lifecycle policy. But what about the snapshots that are already there? This effort required a tool that would let us make sure we were deleting the proper snapshots and not anything with current data. The tool we created takes a list of up to one thousand snapshots (this is the per-call limit in the boto3 API, and deleting in batches is a good idea anyway so we can monitor what is being deleted) and compares each creation date to the current date. If a snapshot is older than a customer-defined length of time, it is deleted.
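A minimal sketch of the snapshot cleanup, using an example 90-day retention window and skipping snapshots that are still in use (for example, backing an AMI) rather than force-deleting them:

```python
from datetime import datetime, timedelta, timezone

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")

def delete_old_snapshots(retention_days=90):
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    # One batch of up to 1,000 snapshots owned by this account.
    snapshots = ec2.describe_snapshots(OwnerIds=["self"], MaxResults=1000)["Snapshots"]
    deleted = []
    for snap in snapshots:
        if snap["StartTime"] < cutoff:
            try:
                ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])
                deleted.append(snap["SnapshotId"])
            except ClientError:
                # Skip snapshots that are still referenced (e.g., by an AMI).
                continue
    return deleted

print(delete_old_snapshots(retention_days=90))
```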
 
Overall, building, supporting, and maintaining a legacy application in the cloud can be quite the benefit, but you may hit some pitfalls, especially when it comes to the small things. Freeing up the time those small tasks eat keeps focus on the things that matter most. Using automation and custom tooling can make all the difference between an app that is good and an app that is great.
 