11 min read 29 September 2021

How we built a multi-regional first-line of support for one of our clients

How we put together a serverless application running AWS Lambda under the baton of AWS Step Functions.

AWS Lambda
AWS Step Functions
AWS Cloud Development Kit
TypeScript

Rafał Król Cloud Site Reliability Engineer

At Chaos Gears, we help our clients, companies of all shapes and sizes, utilize the AWS cloud to its full potential so they can focus on evolving their business while AWS does the heavy lifting for them.

One of these customers, a startup from the medical industry, has gained a global reach and thus, to serve its clients, operates in multiple AWS Regions (currently ten, with more scheduled to come) spanning many time zones.

At the center of each region, there’s an EC2 instance that, by design, occasionally maxes out on the CPU. When that happens, a member of the operations team runs a set of checks to determine whether the instance is reachable; if it is not, it gets restarted. Once the instance is restarted and back online, which takes a few minutes, the same round of checks recommences.

More often than not, this proves sufficient, and the on-call engineer handling the issue can get back to work, sleep, or whatever else they were doing when the situation arose. Being a startup lacking the resources to man a follow-the-sun operations team, our client came to us requesting a simple, adjustable, and cost-effective solution that would relieve their engineers from this operational burden.

This post looks at such a solution, a multi-regional first-line of support.

Infrastructure as code

In today’s world of agile software development, we treat everything as code, or at least we should be doing that. Hence, the first decision we made was to bet on Cloud Development Kit (CDK), a multi-language software development framework for modelling cloud infrastructure as reusable components, as our infrastructure as code (IaC) tool.

Our client’s software engineers were already familiar with TypeScript (the language we chose to build out the infra with), which meant they’d comprehend the final solution quickly. Moreover, we avoided the steep learning curve of mastering a domain-specific language (DSL) and the additional burden of handling an unfamiliar codebase.

The recent introduction of CDK integration with AWS SAM, which is now in public preview, allows developing serverless applications seamlessly within an AWS CDK project. On top of all of that, we could reuse the existing software tooling like linters and apply the industry’s coding best practices.

Serverless

The adage says that “No server is better than no server,” and with that in mind, we turned our heads towards AWS Step Functions, a serverless orchestrator for AWS Lambda functions and other AWS services.

The challenge at hand was perfect for an event-driven architecture and we had already envisioned the subsequent steps of the verification process (URL health check, Route 53 health check, SSH check, restart, etc.) as distinct Lambda functions passing the event object between themselves.
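To make the idea of functions "passing the event object between themselves" more concrete, here is a minimal sketch. It is not the code from the repository: the `CheckEvent` shape and the `withCheckResult` and `runCheck` helpers are hypothetical names picked for illustration, and the real functions may structure their payloads differently.

```typescript
// Hypothetical payload shape; the field names are assumptions for illustration.
interface CheckEvent {
  instanceId: string
  region: string
  results: Record<string, boolean>
}

// Return a copy of the event with one check's outcome recorded,
// so that the next step in the workflow can read it.
function withCheckResult(event: CheckEvent, checkName: string, passed: boolean): CheckEvent {
  return { ...event, results: { ...event.results, [checkName]: passed } }
}

// A check step in this style takes the event in and hands the enriched event on.
// The probe is injected so that the sketch stays free of network calls.
function runCheck(
  event: CheckEvent,
  checkName: string,
  probe: (instanceId: string) => boolean,
): CheckEvent {
  return withCheckResult(event, checkName, probe(event.instanceId))
}
```

Each step leaves its input untouched and returns a new object, which keeps the individual functions easy to test in isolation.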

We needed the glue, and with AWS Step Functions, we effortlessly combined all those pieces without worrying about server provisioning, maintenance, retries, and error handling.

Managed services

We had the backbone figured out, but we still had to decide how to monitor the CPU usage on the EC2 instances and pass the knowledge of a breach to AWS Step Functions State Machine.

It screamed of Amazon CloudWatch alarms for the metric monitoring bit and EventBridge (formerly CloudWatch Events) for creating a rule for routing the alarm event to the target (a state machine in our case).
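As a rough illustration of the routing part, the sketch below builds an EventBridge-style event pattern for a CloudWatch alarm entering the ALARM state and checks an event against it locally. The `buildAlarmPattern` and `matches` functions and the alarm name are hypothetical, and the matcher is a heavily simplified stand-in for what EventBridge does on its side.

```typescript
// The subset of a CloudWatch alarm state-change event that this sketch looks at.
interface AlarmStateChangeEvent {
  source: string
  'detail-type': string
  detail: { alarmName: string; state: { value: string } }
}

// Build a pattern that matches one named alarm entering the ALARM state.
// The alarm name is supplied by the caller; any concrete value is an assumption.
function buildAlarmPattern(alarmName: string) {
  return {
    source: ['aws.cloudwatch'],
    'detail-type': ['CloudWatch Alarm State Change'],
    detail: { alarmName: [alarmName], state: { value: ['ALARM'] } },
  }
}

// Simplified local matcher: each field must equal one of the values listed in the pattern.
function matches(
  event: AlarmStateChangeEvent,
  pattern: ReturnType<typeof buildAlarmPattern>,
): boolean {
  return (
    pattern.source.indexOf(event.source) !== -1 &&
    pattern['detail-type'].indexOf(event['detail-type']) !== -1 &&
    pattern.detail.alarmName.indexOf(event.detail.alarmName) !== -1 &&
    pattern.detail.state.value.indexOf(event.detail.state.value) !== -1
  )
}
```

Filtering on the `ALARM` state value means the rule stays quiet when the alarm returns to `OK`, so the state machine is started only when there is something to check.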

Business logic

When the CPUUtilization metric for a given instance reaches 100%, a CloudWatch alarm enters the alarm state. This state change gets picked up by an Amazon EventBridge rule that triggers the AWS Step Functions state machine. Upon receiving the event object from the Amazon EventBridge rule, the state machine orchestrates the following workflow:

Three checks run, one after another (a URL check, a Route 53 check, and an SSH check).
If all checks succeed during the first run, the execution ends silently (the `All good` step followed by the `End` field).
When a check fails, the EC2 instance is restarted, and we recommence from the beginning with a second run.
If all checks succeed during the second run, a Slack notification is sent, and the execution ends (the `Slack` step followed by the `End` field).
When a check fails during the second run, the OpsGenie alert is created, and the execution ends (the `OpsGenie` step followed by the `End` field).
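The decision flow described in the steps above can be modelled in a few lines of plain TypeScript. This is only a simplified, synchronous sketch for reasoning about the outcomes, with hypothetical names (`Outcome`, `Check`, `runWorkflow`); the actual workflow is defined as a Step Functions state machine, and the real checks are Lambda invocations with a wait after the restart.

```typescript
// The three possible ways an execution can end.
type Outcome = 'all-good' | 'slack' | 'opsgenie'

// A check returns true when the instance looks healthy.
type Check = () => boolean

// Run the checks one after another and stop at the first failure.
function allPass(checks: Check[]): boolean {
  for (const check of checks) {
    if (!check()) return false
  }
  return true
}

function runWorkflow(checks: Check[], restart: () => void): Outcome {
  // First run: if everything passes, the execution ends silently.
  if (allPass(checks)) return 'all-good'
  // A check failed: restart the instance and run the checks a second time.
  restart()
  // Second run: success means a Slack notification, failure means an OpsGenie alert.
  return allPass(checks) ? 'slack' : 'opsgenie'
}
```

For example, a check that fails before the restart and passes afterwards yields `'slack'`, which corresponds to the case where the restart alone fixed the problem.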

Here’s the diagram depicting the complete solution:

All the resources, plus the Lambda functions, an S3 bucket for the Lambda code packages, and all the necessary IAM roles and policies are created and managed by AWS CDK and AWS SAM.

Furthermore, this solution can be deployed effortlessly to multiple regions using AWS CDK’s environments.

Dissecting the code

I clone the repo and enter it:

git clone https://github.com/rafalkrol-xyz/cpu-check-cdk.git && cd cpu-check-cdk

In the project’s root directory, I see the tsconfig.json file responsible for configuring the TypeScript compiler and the .eslintrc.json file holding the configuration for ESLint, a popular JavaScript linter. These two configuration files serve the entire project since we use TypeScript for both the infrastructure and application layers. AWS CDK’s support for many popular general-purpose languages (TypeScript, JavaScript, Python, Java, C#, and, in developer preview, Go) enables and encourages the DevOps culture by making the end-to-end development experience more uniform, as you can use familiar tools and frameworks across your whole stack.

Now, let’s take apart the bin/cpu-check-cdk.ts file, the point of entry to our CDK app, whence all stacks are instantiated:

We import all the necessary dependencies. (In CDK v2, which is in developer preview at the time of writing, all the CDK libraries are consolidated in one package.)

#!/usr/bin/env node
import 'source-map-support/register'
import * as cdk from '@aws-cdk/core'
import * as iam from '@aws-cdk/aws-iam'

We check whether all the necessary environment variables have been set:

import { SLACK_TOKEN, SLACK_CHANNEL_ID, TEAM_ID, API_KEY, R53_HEALTH_CHECK_ID } from '../config'

if (!SLACK_TOKEN) {
  throw new Error('SLACK_TOKEN must be set!')
}
if (!SLACK_CHANNEL_ID) {
  throw new Error('SLACK_CHANNEL_ID must be set!')
}
if (!TEAM_ID) {
  throw new Error('TEAM_ID must be set!')
}
if (!API_KEY) {
  throw new Error('API_KEY must be set!')
}
if (!R53_HEALTH_CHECK_ID) {
  throw new Error('R53_HEALTH_CHECK_ID must be set!')
}
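The five checks follow the same pattern, so they could also be folded into one helper that reports every missing setting at once. The sketch below is one possible variation, not what the repository does; `findMissing` and `assertAllSet` are hypothetical names.

```typescript
// Collect the names of all settings that are undefined or empty.
function findMissing(config: Record<string, string | undefined>): string[] {
  return Object.keys(config).filter((key) => !config[key])
}

// Throw a single error that lists every missing setting.
function assertAllSet(config: Record<string, string | undefined>): void {
  const missing = findMissing(config)
  if (missing.length > 0) {
    throw new Error(`${missing.join(', ')} must be set!`)
  }
}
```

It would be called with an object built from the imported values, for example `assertAllSet({ SLACK_TOKEN, SLACK_CHANNEL_ID, TEAM_ID, API_KEY, R53_HEALTH_CHECK_ID })`. The trade-off is that TypeScript no longer narrows each value to `string` the way the individual `if` statements do.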

We initialize the CDK app construct:

const app = new cdk.App()

We grab the regions to which to deploy along with corresponding instance IDs to monitor from CDK’s context:

// Context values arrive as plain JSON objects, so Record (rather than Map) describes the shape
const regionInstanceMap: Record<string, string> = app.node.tryGetContext('regionInstanceMap')
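For orientation, here is what such a context value might look like, together with a small sanity check. The region names and instance IDs are made-up examples, `invalidEntries` is a hypothetical helper, and the two regular expressions are loose heuristics rather than an official validation routine.

```typescript
// Hypothetical example of the context value: region name -> instance ID.
const exampleRegionInstanceMap: Record<string, string> = {
  'eu-west-1': 'i-0123456789abcdef0',
  'us-east-1': 'i-0fedcba9876543210',
}

// Loose format checks (assumptions): regions look like "eu-west-1",
// instance IDs are "i-" followed by 8 or 17 hexadecimal characters.
const REGION_PATTERN = /^[a-z]{2}(-[a-z]+)+-\d+$/
const INSTANCE_ID_PATTERN = /^i-([0-9a-f]{8}|[0-9a-f]{17})$/

// Return the region keys whose name or instance ID fails the format checks.
function invalidEntries(map: Record<string, string>): string[] {
  return Object.keys(map).filter(
    (region) => !REGION_PATTERN.test(region) || !INSTANCE_ID_PATTERN.test(map[region] || ''),
  )
}
```

Running a check like this before the deployment loop would surface a typo in the context before any stack is synthesized.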

We create a tags object with the app’s version and repo’s URL taken directly from the package.json file:

import { repository, version } from '../package.json'

const tags = {
  version,
  repositoryUrl: repository.url,
}

Finally, we loop through the map of regions and corresponding instance IDs that we read from the CDK context earlier. In each region, we produce eight stacks: one for each of the six Lambda functions, one for the state machine, and one for the metric, alarm, and rule combo.

import { StateMachineStack } from '../lib/state-machine-stack'
import { LambdaStack } from '../lib/lambda-stack'
import { MetricAlarmRuleStack } from '../lib/metric-alarm-rule-stack'

for (const [region, instanceId] of Object.entries(regionInstanceMap)) {
  const env = {
    region,
    account: process.env.CDK_DEFAULT_ACCOUNT,
  }

  const lambdaStackUrlHealthCheck = new LambdaStack(app, `LambdaStackUrlHealthCheck-${region}`, {
    tags,
    env,
    name: 'url-health-check',
    policyStatementProps: {
      effect: iam.Effect.ALLOW,
      resources: ['*'],
      actions: ['ec2:DescribeInstances'],
    },
  })

  const lambdaStackR53Check = new LambdaStack(app, `LambdaStackR53Check-${region}`, {
    tags,
    env,
    name: 'r53-check',
    policyStatementProps: {
      effect: iam.Effect.ALLOW,
      resources: [`arn:aws:route53:::healthcheck/${R53_HEALTH_CHECK_ID}`],
      actions: ['route53:GetHealthCheckStatus'],
    },
    environment: {
      R53_HEALTH_CHECK_ID,
    },
  })

  const lambdaStackSshCheck = new LambdaStack(app, `LambdaStackSshCheck-${region}`, {
    tags,
    env,
    name: 'ssh-check',
    policyStatementProps: {
      effect: iam.Effect.ALLOW,
      resources: ['*'],
      actions: ['ec2:DescribeInstances'],
    },
  })

  const lambdaStackRestartServer = new LambdaStack(app, `LambdaStackRestartServer-${region}`, {
    tags,
    env,
    name: 'restart-server',
    policyStatementProps: {
      effect: iam.Effect.ALLOW,
      resources: [`arn:aws:ec2:${region}:${process.env.CDK_DEFAULT_ACCOUNT}:instance/${instanceId}`],
      actions: ['ec2:RebootInstances'],
    },
  })

  const lambdaStackSlackNotification = new LambdaStack(app, `LambdaStackSlackNotification-${region}`, {
    tags,
    env,
    name: 'slack-notification',
    environment: {
      SLACK_TOKEN,
      SLACK_CHANNEL_ID,
    },
  })

  const lambdaStackOpsGenieNotification = new LambdaStack(app, `LambdaStackOpsGenieNotification-${region}`, {
    tags,
    env,
    name: 'opsgenie-notification',
    environment: {
      TEAM_ID,
      API_KEY,
      EU: 'true',
    },
  })

  const stateMachineStack = new StateMachineStack(app, `StateMachineStack-${region}`, {
    tags,
    env,
    urlHealthCheck: lambdaStackUrlHealthCheck.lambdaFunction,
    r53Check: lambdaStackR53Check.lambdaFunction,
    sshCheck: lambdaStackSshCheck.lambdaFunction,
    restartServer: lambdaStackRestartServer.lambdaFunction,
    slackNotification: lambdaStackSlackNotification.lambdaFunction,
    opsGenieNotification: lambdaStackOpsGenieNotification.lambdaFunction,
  })

  new MetricAlarmRuleStack(app, `MetricAlarmRuleStack-${region}`, {
    tags,
    env,
    instanceId,
    stateMachine: stateMachineStack.stateMachine,
  })
}

Thanks to a basic programming concept, the for loop, we avoided unnecessary duplication and kept things DRY. Nice and easy, and all in one go, regardless of the number of regions to which we would want to deploy; mind you, at the time of writing there are 25 available (with six more to come).

I won’t be going through all the CDK and Lambda files (though I strongly encourage you to give the code a thorough review). Still, let’s see how easy it is to define a stack class in CDK by looking at the lib/lambda-stack.ts file:

We import the dependencies:

import * as cdk from '@aws-cdk/core'
import * as iam from '@aws-cdk/aws-iam'
import * as lambda from '@aws-cdk/aws-lambda'
import { capitalizeAndRemoveDashes } from './helpers'

You’ll have noticed that there’s also a helper function called capitalizeAndRemoveDashes amongst the CDK libs. Since CDK uses a general-purpose programming language, we can introduce any amount of custom logic, just as we would in a regular application.

The lib/helpers.ts file looks as follows:

/**
 * Take a kebab-case string and turn it into a PascalCase string, e.g.: my-cool-function -> MyCoolFunction
 *
 * @param kebab
 * @returns string
 */
export function capitalizeAndRemoveDashes(kebab: string): string {
  const kebabSplit = kebab.split('-')
  for (const i in kebabSplit) {
    kebabSplit[i] = kebabSplit[i].charAt(0).toUpperCase() + kebabSplit[i].slice(1)
  }
  return kebabSplit.join('')
}
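To see what the helper produces for the function names used in this project, here is an equivalent sketch written with `map` instead of a `for...in` loop. The `toPascalCase` name is hypothetical and used only to keep the example self-contained; it is meant to behave the same way as the helper above.

```typescript
// Equivalent sketch: capitalize each dash-separated segment and join the segments.
function toPascalCase(kebab: string): string {
  return kebab
    .split('-')
    .map((segment) => segment.charAt(0).toUpperCase() + segment.slice(1))
    .join('')
}

// The result is what ends up inside construct IDs such as `Role${resourceName}`.
const roleId = `Role${toPascalCase('url-health-check')}` // "RoleUrlHealthCheck"
```

Deriving the construct IDs from the single `name` property keeps the naming consistent across the role and the function without repeating it in every stack definition.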

We extend the default stack properties (like description) with our own, setting some as mandatory and some as optional:

interface LambdaStackProps extends cdk.StackProps {
  name: string,
  runtime?: lambda.Runtime,
  handler?: string,
  timeout?: cdk.Duration,
  pathToFunction?: string,
  policyStatementProps?: iam.PolicyStatementProps,
  environment?: {
    [key: string]: string
  },
}

We start the declaration of the LambdaStack class with a lambdaFunction read-only property and a constructor:

export class LambdaStack extends cdk.Stack {
  readonly lambdaFunction: lambda.Function

  constructor(scope: cdk.Construct, id: string, props: LambdaStackProps) {
    super(scope, id, props)

We create a resource name out of the mandatory name property that will be passed in during the class initialization:

const resourceName = capitalizeAndRemoveDashes(props.name)

We create an IAM role that the Lambda service can assume. We add the service-role/AWSLambdaBasicExecutionRole AWS-managed policy to it, and, if provided, a custom user-managed policy:

const role = new iam.Role(this, `Role${resourceName}`, {
  assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
})
role.addManagedPolicy(iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSLambdaBasicExecutionRole'))
if (props.policyStatementProps) {
  role.addToPolicy(new iam.PolicyStatement(props.policyStatementProps))
}

We initialize a construct of the Lambda function using the role defined in the previous step and the stack properties, or arbitrary defaults if the stack properties were not provided:

const lambdaFunction = new lambda.Function(this, `LambdaFunction${resourceName}`, {
  role,
  runtime: props.runtime || lambda.Runtime.NODEJS_12_X,
  handler: props.handler || 'app.handler',
  timeout: props.timeout || cdk.Duration.seconds(10),
  code: lambda.Code.fromAsset(`${props.pathToFunction || 'src'}/${props.name}`),
  environment: props.environment,
})
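The fallback logic is easy to reason about in isolation. The sketch below mirrors it without any CDK dependency; `FunctionSettings` and `resolveSettings` are hypothetical names, and the timeout is a plain number of seconds here rather than a `cdk.Duration`.

```typescript
// Hypothetical, CDK-free mirror of the optional props used above.
interface FunctionSettings {
  name: string
  handler?: string
  timeoutSeconds?: number
  pathToFunction?: string
}

// Apply the same fallbacks as the stack: 'app.handler', 10 seconds, and the 'src' directory.
function resolveSettings(settings: FunctionSettings) {
  return {
    handler: settings.handler || 'app.handler',
    timeoutSeconds: settings.timeoutSeconds || 10,
    assetPath: `${settings.pathToFunction || 'src'}/${settings.name}`,
  }
}
```

With only a name given, for example `resolveSettings({ name: 'ssh-check' })`, the asset path resolves to `src/ssh-check`. Note that `||` also falls back for falsy values such as `0` or an empty string; the `??` operator would fall back only for `null` and `undefined`.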

Finally, we expose the Lambda function object through the read-only property we declared at the top of the class (and we’re sure to close our brackets to avoid the implosion of the universe):

    this.lambdaFunction = lambdaFunction
  }
}

Conclusion

In this blog post, I showed how we put together a serverless application running AWS Lambda under the baton of AWS Step Functions to relieve our client’s engineers from some of their operational burdens so that they could focus more on evolving their business. The described approach could be adapted to serve other needs or cover different cases as AWS Step Functions’ visual workflows allow for a superquick translation of business requirements to technical ones.

By using AWS CDK as the IaC tool, we were able to write all of the code in TypeScript, which puts us in an excellent position for future improvements. We avoided the trap of introducing unnecessary complexity and kept things concise, with a codebase that is approachable and comprehensible to all team members.

Lastly, please be sure to check out the GitHub repository.