For one of our customers, a digital healthcare company that provides a patient engagement solution, we’ve developed an AWS RDS integrity testing tool based on a serverless approach. We used a combination of AWS Step Functions, AWS Lambda, AWS SSM, and AWS KMS. In short, this is not solely a story about a technological concept but also about a way of dealing with such problems in a startup that finds itself under the pressure of other priorities.
It’s all about the data you store
We all live in a world where data has become a new currency. Any potential loss or database downtime means loss of income. Which brings us to the key question: how will you, as an organization, ensure that, once a failure occurs, your data is still complete and of good quality? Essentially, we are all aware that failures happen, but how many of us have ever tested a restoration from previously created AWS RDS snapshots? Not 100% of us, certainly. Personally, I’ve seen many examples of a “completely believing in the magic power of the cloud” attitude and have often heard people say: “we do not need to worry about the data, we’ve got regular system snapshots done by AWS”. Once the failure happens (a developer “accidentally” deletes a tiny part of the data), it’s too late for guessing. I won’t dwell on how it feels when you have to act fast, especially when dealing with a problem you have never tested for. Let’s just say that your heart races like a speeding train.
Better safe than sorry
“An old Russian proverb says: ‘Trust, but verify’. For our internal and external ISO 27001 audits, we needed a clear report of the integrity of our encrypted backups. Chaos Gears implemented a method to make the integrity of our backups easy to demonstrate to internal and external auditors.”
Up to this point, our client had RDS databases launched in 8 different AWS regions. Keeping the data consistent and being able to restore it from a particular backup in case of an outage always top the list of our priorities. Unfortunately, in this case, there was no habit of regular internal testing. So the right time came with an ISO audit.
Keep it simple
The balance between task automation and the time spent on building automated flows is a question that always sparks long debates. Our team decided to avoid developing something that would introduce additional maintenance overhead and, more importantly, something that would inflate our monthly bill just because we wanted to save some time. That’s why we leveraged AWS Step Functions to keep the whole workflow in a single place.
NOTE: For those who have never worked with AWS Step Functions, it is a serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple AWS services into business-critical applications.
If costs are fine, then you’re on the right track
When you use Step Functions, you are charged based on the number of state transitions required to execute your application. After the free tier, which includes 4,000 state transitions per month, you pay $0.025 per 1,000 state transitions. Our client was willing to accept an automated approach as long as it combined a short development period with low running costs.
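To get a feel for the numbers, here is a rough back-of-the-envelope sketch using the rates quoted above. The workload (daily tests in 8 regions, about 60 state transitions per run) is purely an assumption for illustration, and Lambda, temporary RDS instance, and S3 costs are not included.

```python
# Rates as quoted in the text; check the current AWS pricing for your region.
FREE_TRANSITIONS_PER_MONTH = 4_000
PRICE_PER_1000_TRANSITIONS_USD = 0.025


def monthly_cost_usd(transitions_per_run: int, runs_per_month: int) -> float:
    """Estimate the monthly Step Functions charge for state transitions only."""
    total = transitions_per_run * runs_per_month
    billable = max(0, total - FREE_TRANSITIONS_PER_MONTH)
    return billable / 1000 * PRICE_PER_1000_TRANSITIONS_USD


if __name__ == "__main__":
    # Hypothetical workload: 8 regions tested once a day for 30 days,
    # about 60 transitions per run (mostly the status-check / wait loop).
    cost = monthly_cost_usd(transitions_per_run=60, runs_per_month=8 * 30)
    print(f"Estimated state transition cost: ${cost:.2f} per month")
```

Under these assumptions the orchestration itself costs well under a dollar per month, so the temporary RDS instances are likely to dominate the bill, not Step Functions.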
Let’s design the workflow
Step Functions is a great way to build and step through a series of AWS services in a matter of minutes. In this case, it also addressed a few concerns that our client specified as highly important from their perspective:
- the new solution cannot lose its state
- errors and timeouts must be easy to handle (NOTE: Step Functions includes built-in retry and error handling, so we can set how many times a function is retried before the execution moves to a failed state)
- little time should go into building the flow and operating it afterwards
- a human-friendly report must be produced after each test and stored securely
- the solution has to be auditable
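The retry and error handling mentioned in the list above is configured per state in the Amazon States Language. Below is a minimal sketch of a single Task state, written as a Python dictionary and serialized to JSON. The state names, the Lambda ARN, and all numeric values are placeholders chosen for illustration, not the client's actual configuration.

```python
import json

# Hypothetical Task state; names, ARN and numbers are illustrative only.
restore_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:restore-rds",
    "TimeoutSeconds": 60,
    # Retry up to 3 times, waiting 10 s, 20 s, then 40 s between attempts.
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed", "States.Timeout"],
            "IntervalSeconds": 10,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    # Once retries are exhausted, hand the error over to a cleanup state
    # instead of failing the whole execution.
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": "Cleanup",
        }
    ],
    "Next": "CheckStatus",
}

if __name__ == "__main__":
    print(json.dumps(restore_state, indent=2))
```

Keeping retries in the state definition means the Lambda code itself can stay free of retry loops.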
Below, you’ll find the initial version of the flow we’ve developed, which is now on our roadmap for further development.
1. Extract the credentials from the event and save them in AWS SSM, encrypted with AWS KMS
2. Try to restore the RDS instance from the latest snapshot
3. Check the RDS status during the restoration
4. Choice step:
   - If the new RDS instance is not restored yet, jump to the WAIT step
   - If the new RDS instance is successfully restored, continue with point 6
5. WAIT step: wait for n seconds and then go back to point 3
6. Get the RDS endpoint
7. Fetch the credentials from AWS SSM based on the input parameter name, then establish a connection with the new RDS endpoint and run the following simple SQL queries:
   - Get the Postgres version
   - Retrieve the number of tables
   - Retrieve the number of users
8. Save the report in an S3 bucket
9. Parallel step: destroy the previously created RDS instance and send a Slack notification
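The SQL checks and the report could look roughly like the sketch below. The exact queries are assumptions, since the steps above only name what is checked, and `run_checks` and `build_report` are hypothetical helper names. The function accepts any DB-API 2.0 connection, so the Postgres driver is left out of the example.

```python
import json
from datetime import datetime, timezone

# Assumed queries; the flow only states what is checked, not the exact SQL.
CHECKS = {
    "postgres_version": "SELECT version()",
    "table_count": (
        "SELECT count(*) FROM information_schema.tables "
        "WHERE table_schema NOT IN ('pg_catalog', 'information_schema')"
    ),
    "user_count": "SELECT count(*) FROM pg_catalog.pg_user",
}


def run_checks(connection) -> dict:
    """Run every check on a DB-API 2.0 connection and keep the first value."""
    results = {}
    cursor = connection.cursor()
    try:
        for name, sql in CHECKS.items():
            cursor.execute(sql)
            results[name] = cursor.fetchone()[0]
    finally:
        cursor.close()
    return results


def build_report(instance_id: str, results: dict) -> str:
    """Serialize the results into a JSON document that can be stored in S3."""
    report = {
        "instance": instance_id,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }
    return json.dumps(report, indent=2)
```

In a Lambda function, the connection would typically come from a Postgres driver such as psycopg2, and the resulting string could be uploaded with the boto3 S3 client's `put_object` call.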
IMPORTANT: All of the steps mentioned above run as jobs with a try-catch approach. In a nutshell, if any of steps 1 to 9 fails, the execution is immediately directed to the “Destroying RDS Instance / Slack Notification” step.
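To make the shape of the flow concrete, here is a hedged sketch of how such a state machine definition might be assembled. Every state name, function name, and ARN is a placeholder, the 60-second wait is arbitrary, and the `$.status` field assumes that the status-check Lambda returns the instance status in its output. This is not the client's actual definition.

```python
import json

CLEANUP = "DestroyAndNotify"  # hypothetical name of the final Parallel state


def lambda_arn(function_name: str) -> str:
    # Placeholder ARN; substitute a real region and account ID.
    return f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{function_name}"


def task(function_name: str, next_state: str) -> dict:
    """Lambda Task state that jumps straight to cleanup on any error."""
    return {
        "Type": "Task",
        "Resource": lambda_arn(function_name),
        "Catch": [
            {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": CLEANUP}
        ],
        "Next": next_state,
    }


def final_task(function_name: str) -> dict:
    """Lambda Task state that ends a Parallel branch."""
    return {"Type": "Task", "Resource": lambda_arn(function_name), "End": True}


definition = {
    "StartAt": "SaveCredentials",
    "States": {
        "SaveCredentials": task("save-credentials", "RestoreSnapshot"),
        "RestoreSnapshot": task("restore-snapshot", "CheckStatus"),
        "CheckStatus": task("check-status", "IsRestored"),
        # Poll until the restored instance reports the "available" status.
        "IsRestored": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "available", "Next": "GetEndpoint"}
            ],
            "Default": "WaitForRestore",
        },
        "WaitForRestore": {"Type": "Wait", "Seconds": 60, "Next": "CheckStatus"},
        "GetEndpoint": task("get-endpoint", "RunQueries"),
        "RunQueries": task("run-queries", "SaveReport"),
        "SaveReport": task("save-report", CLEANUP),
        # Destroy the temporary instance and notify Slack at the same time.
        CLEANUP: {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "DestroyInstance",
                 "States": {"DestroyInstance": final_task("destroy-instance")}},
                {"StartAt": "NotifySlack",
                 "States": {"NotifySlack": final_task("notify-slack")}},
            ],
            "End": True,
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
```

Routing every `Catch` to the same cleanup state is what keeps a failed test from leaving a restored instance running and billing in the background.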
Basically, our company is full of enthusiasts who like to take small steps but get tangible results. So here’s what our next priorities look like:
- define more complex SQL queries for tests
- stop passing credentials in the event (point 1) and replace that with something more reliable
- deal with the Step Functions execution history retention limit, which is 90 days. After that, you can no longer retrieve or view the execution history. There is no further quota for the number of closed executions that Step Functions retains.
Problem-solving is all about seeking the simplest solutions, provided they exist. Honestly, I wouldn’t treat serverless AWS services as a remedy for every problem. However, speaking from my own experience, they definitely allow us to quickly validate an idea with a working solution. With the example described above, I was able to provide a working version for tests after just 2 days, and I didn’t have to think about irrelevant setup issues.