18 min read 14 May 2019

Meet Charlize, our Slack bot intern

How to get started with SlackOps to improve DevOps flows in a small team working within many AWS environments.

Amazon EC2
Amazon API Gateway
Amazon DynamoDB
AWS Tools and SDKs
Boto3 (AWS SDK for Python)
Slack
Python

Karol Junde CTO & Co-Founder

We launched Chaos Gears as a bunch of nerds eager to help companies adopt cloud solutions as an accelerator for their innovations. In doing so, we automate as much as possible, to save time and stay DRY.

Moving from our prior jobs was not only a technological shift, but — probably more importantly — a shift in culture. We realized that the main enemy in just about everything we do is time itself. Time can be spent on innovation and internal development, and this translates to a bigger chance that you’ll succeed.

As a startup which — amongst others — augments external teams in different kinds of AWS projects, we aim to automate even tiny workflows, tasks, actions — so long as they are repetitive. We continuously improve those automations and expand them.

This approach isn’t a silver bullet, but being able to pinpoint the best candidates for automation goes a long way. And so we did.

Building a serverless Slack bot

Humans are always the bottleneck in processes, no matter how big the team is. Every company working with technology struggles with failures, and we noticed that we kept repeating the same “remedy” tasks whenever something broke.

Instead of having to ask other colleagues about those AWS environments, or requesting others to invoke a Lambda via API, restart an instance or do some other job, our team decided to leverage AWS services and combine them with Slack to build a bot intended to help us easily access and use the automations we put in place.

We’ve called her Charlize — and she was bound to become our new synthetic team member.

Registering a Slack app

Creating a Slack app (which will end up being a bot) begins with 3 easy steps:

Go to https://api.slack.com/apps and click the big green Create new app button. Enter a name for your app and click the next prominent green button.
The Basic Information link on the left-hand side of your app’s settings page contains information you’ll need, such as the CLIENT_ID and CLIENT_SECRET needed to authenticate all requests made by your app.

Basic configuration of a Slack application

Then it’s time to set up a Redirect URL for your app. This is the endpoint to which Slack passes a unique temporary code whenever a user installs your app in their workspace. Your server will then send back this code, along with your CLIENT_ID and CLIENT_SECRET, exchanging that code for an access token.
The Redirect URL must be accessible via https (i.e. TLS). Slack (incorrectly) insists that localhost by itself is not secure. For local development/testing, this means you will need either:

a public proxy (something as simple as redirectmeto.com can do the trick),
a backend supporting TLS connections,
or a public tunnel which supports TLS connections (software like Ngrok).

We used API Gateway as the entry point to our AWS backend, with a Lambda handler meant to handle all install events:

serverless.yml

functions:
  install:
    name: ${self:custom.app_function}-install
    description: Install Slack integration
    handler: gear_install.handler
    role: LambdaRole
    environment:
      tablename: ${self:custom.app_function}-tokens-${self:custom.stage}
    events:
      - http:
          path: /install
          method: get

While there are many OAuth scopes you could request for a full-blown integration with Slack, simple integrations typically get by with just the incoming-webhook access scope — but keep in mind that it only allows for unidirectional communication.

Note:

The incoming-webhook scope is designed to allow you to request permission to post content into the user’s Slack workspace. It intentionally does not grant read access, making it perfect for services that want to send posts or notifications to Slack workspaces that might not want to give read access to messages.

We, however, are going to build an installable Slack App that is not tied to a specific workspace — and we will be requesting the actually needed scopes in a slightly different (i.e. installation) flow, explained further down.

For now, this is our starter config:

When a user clicks your Add to Slack button, your CLIENT_ID gets sent along with the request to Slack’s servers. Slack then redirects the user back to your Redirect URL, along with a single-use code parameter in the query string that we need to process.
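The Add to Slack button is essentially a link to Slack’s authorization page. Below is a minimal sketch of how such a link could be assembled, assuming the legacy https://slack.com/oauth/authorize endpoint that pairs with oauth.access. The helper name, the client ID and the redirect URL are placeholders, and the scopes are only an example, not necessarily what our app requests.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumption: the legacy (v1) OAuth authorization endpoint.
AUTHORIZE_URL = "https://slack.com/oauth/authorize"


def build_install_url(client_id, redirect_uri, scopes):
    """Build the link placed behind an 'Add to Slack' button (hypothetical helper)."""
    query = urlencode({
        "client_id": client_id,
        "scope": ",".join(scopes),
        "redirect_uri": redirect_uri,
    })
    return f"{AUTHORIZE_URL}?{query}"


# Placeholder values, for illustration only.
install_url = build_install_url(
    client_id="1234567890.0987654321",
    redirect_uri="https://example.com/dev/install",
    scopes=["identify", "bot"],
)

# Decode the query string again to show what Slack will receive.
received = parse_qs(urlparse(install_url).query)
print(received["scope"][0])
```

Slack appends the temporary code to the redirect_uri given here, which is why it has to match the Redirect URL configured for the app.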

We’ve coded a Lambda function which leverages AWS SSM Parameter Store to keep the CLIENT_ID and CLIENT_SECRET, well, secret and centralized. It sends a request with them and the code to Slack’s oauth.access RPC endpoint. Once we have the token, we store it in an AWS DynamoDB table via put_dynamo_items().

import datetime
import json
import logging
import os
import urllib.parse
import urllib.request

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event, context):
    logger.info("------Event: {0}".format(event))
    code = event['queryStringParameters']['code']
    tablename = os.environ['tablename']
    token_response = get_token(code, tablename=tablename)
    return token_response


def get_token(code, tablename):
    logger.info("------Getting token..")
    if code == '':
        logger.error("------Code value is null")
        output = {
            "statusCode": 400,
            "body": "---Error. Code value is null"
        }
        return output
    else:
        # Exchange the temporary code for an access token
        url = 'https://slack.com/api/oauth.access'
        payload = {
            "client_id": get_param('CLIENT_ID'),
            "client_secret": get_param('CLIENT_SECRET'),
            "code": code
        }
        data = urllib.parse.urlencode(payload).encode("utf-8")
        req = urllib.request.Request(url)
        info = urllib.request.urlopen(req, data)
        logger.info("------Getting info from url: %s", info.geturl())
        response = json.loads(info.read().decode('utf-8'))
        if response['ok'] == False:
            logger.error("------Problem with the token: %s", response['error'])
            output = {
                "statusCode": 400,
                "body": response['error']
            }
            return output
        else:
            logger.info("------Putting token into the DynamoDB table: %s", tablename)
            table = SlackArmyDynamo(tablename)
            table.put_dynamo_items(
                item=response['user_id'],
                team=response['team_name'],
                team_id=response['team_id'],
                token=response['access_token'],
                bot=response['bot']
            )
            output = {
                "statusCode": 200,
                "body": "Token validated. Put into DynamoDB"
            }
            return output


def get_param(item):
    # Read a (possibly encrypted) value from SSM Parameter Store
    try:
        ssm = boto3.client('ssm')
        parameter = ssm.get_parameter(Name=item, WithDecryption=True)
        param = str(parameter['Parameter']['Value'])
        return param
    except ClientError as err:
        logger.critical("----Client error: {0}".format(err))
        logger.critical("----HTTP code: {0}".format(err.response['ResponseMetadata']['HTTPStatusCode']))


# gearDynamo is a DynamoDB helper base class defined elsewhere (not shown here)
class SlackArmyDynamo(gearDynamo):
    def put_dynamo_items(self, item, team, team_id, token, bot):
        try:
            response = self.table.put_item(
                Item={
                    'team_id': team_id,
                    'team': team,
                    'customer_id': item,
                    'date': datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
                    'bot': bot,
                    'token': token
                }
            )
            return response
        except ClientError as err:
            logger.critical("----Client error: {0}".format(err))
            logger.critical("----HTTP code: {0}".format(err.response['ResponseMetadata']['HTTPStatusCode']))

In other words, this is more or less a typical OAuth flow — we simply exchange the temporary code for a proper access token, which comes in a response with the following shape:

{
  "ok": true,
  "access_token": "TOKEN",
  "scope": "identify,bot",
  "user_id": "USER_ID",
  "team_name": "Chaos Gears",
  "team_id": "TEAM_ID",
  "bot": {
    "bot_user_id": "BOT_USER_ID",
    "bot_access_token": "BOT_ACCESS_TOKEN"
  }
}

Now we can use this token to make API calls.
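As a quick sanity check, the freshly obtained token can be sent to Slack’s auth.test method, which reports which workspace and user a token belongs to. The sketch below only builds the request; the helper names and the token value are placeholders, and call_slack() would perform the actual network call.

```python
import json
import urllib.request


def build_auth_test_request(token):
    """Prepare a POST to auth.test carrying the token as a Bearer header."""
    return urllib.request.Request(
        "https://slack.com/api/auth.test",
        data=b"",
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )


def call_slack(request):
    """Send a prepared request and decode Slack's JSON answer (needs network access)."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Placeholder token, for illustration only.
request = build_auth_test_request("xoxb-placeholder")
print(request.full_url, request.get_method())
```

Calling call_slack(request) with a real token should return a JSON object whose ok field tells you whether the token is valid.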

Note:

If you encounter a code_expired error response instead, you could try:

In your Slack configuration, within OAuth & Permissions, click Reinstall App,
Reinstalling your app in your workspace via Add to Slack.

Integrating our bot with Slack’s Events API

First of all, Charlize has to somehow become aware of the things happening — i.e. events — in a Slack workspace in order to be able to react to them. Slack provides several protocols that allow us to achieve that, including real-time bidirectional communication. However, those require a permanent running service, while the purpose of Charlize was to occasionally support a relatively small team of engineers.

The word “occasionally” is a good indicator that you might want to consider an asynchronous operating model instead. Slack accommodates this case via an Events API which triggers your selected endpoint(s) via HTTP POST requests whenever any of the events your app is actively subscribed to happen.

We opted for precisely this path — it ties in neatly with our existing infrastructure and allows us to deal with the “occasional” nature of those calls via serverless functions.

As such, in our case, the Request URL points to an API Gateway endpoint sitting in front of our Lambdas, which we will cover in a moment.

Important:

This endpoint must correctly echo the challenge sent to it by Slack in order for the installation to succeed.

Our gear_events Lambda performs verification before calling any other internal “action” (another serverless function) and responds with the value of challenge.

events:
  name: ${self:custom.app_function}-events
  description: Provides verification before invoking internal functions
  handler: gear_events.handler
  role: LambdaRole
  environment:
    tablename: ${self:custom.app_function}-tokens-${self:custom.stage}
    function_ec2: ${self:custom.app_function}-ec2-actions
    table_instances: ${self:custom.t_instances}
    region: ${self:custom.region}
  events:
    - http:
        path: /events
        method: post

And here’s the part of gear_events which is responsible for echoing challenge:

def get_challenge(body):
    logging.info("------Checking event challenge value from slack")
    if body['type'] == 'url_verification':
        logging.info("------Sending challenge back to Slack")
        response = {
            "statusCode": 200,
            "body": body['challenge']
        }
        return response
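To make the handshake more tangible, here is a small self-contained sketch of what happens when Slack verifies the Request URL. It assumes an API Gateway proxy event whose body carries Slack’s url_verification payload; the token and challenge values are made up, and echo_challenge() is a simplified stand-in for get_challenge() above.

```python
import json


def echo_challenge(body):
    """Return the challenge as plain text, or None for any other event type."""
    if body.get("type") != "url_verification":
        return None
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "text/plain"},
        "body": body["challenge"],
    }


# What API Gateway would hand to the Lambda (values are placeholders).
api_gateway_event = {
    "body": json.dumps({
        "token": "placeholder-verification-token",
        "challenge": "placeholder-challenge-value",
        "type": "url_verification",
    })
}

response = echo_challenge(json.loads(api_gateway_event["body"]))
print(response["body"])
```

Slack accepts the Request URL only once it receives the same challenge value back with an HTTP 200 response.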

We’ve opted to subscribe to the app_mention event, so whenever Charlize gets mentioned (i.e. someone writes @charlize in a Slack channel), Slack will notify our endpoint about that:

In essence, we wanted mentions like…

… to result in the bot recognizing them as commands and executing the respective actions necessary to fulfil a given request.

Let’s walk through some of the code necessary to get that going. First up is our Lambda handler which receives event notifications from Slack and defines the flow:

gear_events.py

import json
import logging
import os
import re
import urllib.parse
import urllib.request

import boto3
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event, context):
    logger.info("------Event: {0}".format(event))
    body = json.loads(event['body'])
    tablename = os.environ['tablename']
    region = os.environ['region']
    t_instances = os.environ['table_instances']
    function_ec2 = os.environ['function_ec2']
    if token_verification(body)['statusCode'] == 200 and body['type'] != 'url_verification':
        response = {
            "statusCode": 200
        }
        data = get_service_action(body, tablename, t_instances)
        if len(data['data']) > 0:
            logging.info("------Command successfully extracted")
            if data['data'][0] == 'ec2':
                # Hand the command over to the EC2 action Lambda
                function = Lambda(region)
                logging.info("------Invoking Lambda function: %s", function_ec2)
                function.invoke_function(functioname=function_ec2, payload=data)
                return response
            else:
                logging.info("------AWS service not bound with any function")
                return response
        else:
            logging.info("------Problem with extraction of the data")
            return response
    elif token_verification(body)['statusCode'] == 200:
        logging.info("------Verification of the url slack challenge")
        response = get_challenge(body)
        return response

Before we actually handle anything, we need to check whether the right token values have been given to us in the request payload coming through API Gateway. In other words, we’re checking whether the message we got legitimately comes from Slack’s Events API.

gear_events.py

def token_verification(body, param='VERIFICATION_TOKEN'):
    logging.info("------Checking Verification Token")
    if body['token'] != get_param(param):
        raise ValueError('InvalidToken')
    else:
        response = {
            "statusCode": 200,
            "body": "TokenVerified"
        }
        return response
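Slack also supports a stronger check than comparing verification tokens: every request carries an X-Slack-Signature header computed from the app’s signing secret, the X-Slack-Request-Timestamp header and the raw request body. The sketch below is not part of Charlize. It is a minimal illustration of that scheme, with a made-up secret and a hypothetical function name.

```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret, timestamp, raw_body, signature,
                           max_age=300, now=None):
    """Check a request signature; rejects stale timestamps to limit replays."""
    if not signature or not timestamp:
        return False
    try:
        sent_at = int(timestamp)
    except (TypeError, ValueError):
        return False
    now = time.time() if now is None else now
    if abs(now - sent_at) > max_age:
        return False
    # Slack signs the string "v0:<timestamp>:<raw body>" with HMAC-SHA256.
    base = f"v0:{timestamp}:{raw_body}".encode("utf-8")
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


# Illustration with placeholder values: sign a body ourselves, then verify it.
secret = "placeholder-signing-secret"
ts = "1557820800"
body = '{"type": "event_callback"}'
good_signature = "v0=" + hmac.new(
    secret.encode("utf-8"), f"v0:{ts}:{body}".encode("utf-8"), hashlib.sha256
).hexdigest()
print(verify_slack_signature(secret, ts, body, good_signature, now=int(ts)))
```

In a Lambda behind API Gateway, the timestamp and signature would come from the request headers and raw_body from the unparsed event body.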

Next we check whether the message we got in Slack actually is a command we understand and can translate into a known action. Depending on the action, we also search through EC2 instance metadata in a DynamoDB table containing information about EC2 instances deployed in our clients’ environments (which helps us simplify operations on them).

gear_events.py

def get_service_action(body, tablename, t_instances):
    message = body['event']['text']
    logging.info("------Got the message from Slack: '%s'", message)
    # Look the team up once and reuse the result
    team = get_team_id(body, tablename)
    botUserId = team['bot']['bot_user_id']
    botAccessToken = team['bot']['bot_access_token']
    # Everything after the bot mention is treated as the command
    command = (re.split(('<@'+str(botUserId)+'>'), message))[1].lower()
    print(command)
    temp, flag = action_selector(command, t_instances)
    response = {
        "command": command,
        "bot_user_id": botUserId,
        "bot_access_token": botAccessToken,
        # sendResponse() reads the channel from body['event'], so pass the event along
        "event": body['event'],
        "data": temp
    }
    logging.info("------Bot %s is mentioned in: %s", botUserId, message)
    if len(temp) == 0 and flag == '1':
        text = "Excuse me Sir, I didn't understand the command: " + command
    elif len(temp) == 0 and flag == '5':
        text = "Excuse me Sir. Instance you've mentioned is not recognized"
    elif len(temp) == 0 and flag == '2':
        text = "Excuse me Sir. No AWS service found in the command"
    else:
        text = "Hello Sir. I've got the command: " + str(command)
    sendResponse(response, text, data=command)
    return response


def get_team_id(body, tablename):
    table = gearDynamo(tablename)
    logging.info("------Getting info from DynamoDB")
    item = table.get_item('team_id', body['team_id'])
    return item['Item']

sendResponse() is then responsible for actually invoking Slack’s RPC API (chat.postMessage) to send Charlize’s response to the channel the request originated in.

At this stage, the message we send is not the final response yet — it’s merely an ACK of sorts, whether Charlize understood the command or not, and whether it’s going to be processed or not.

gear_events.py

def sendResponse(body, text, data):
    params = {
        # In a form-encoded request, attachments must be a JSON-encoded string
        "attachments": json.dumps([{
            "title": "Charlize's response",
            "author_name": "ChaosGears",
            "text": text,
            "color": "#2eb886"
        }]),
        'token': body['bot_access_token'],
        'channel': body['event']['channel'],
    }
    url = 'https://slack.com/api/chat.postMessage'
    logging.info("------Requesting: '%s'", url)
    payload = urllib.parse.urlencode(params).encode("utf-8")
    req = urllib.request.Request(url)
    info = urllib.request.urlopen(req, payload)
    logging.info("------Getting info from url: %s", info.geturl())
    response = json.loads(info.read().decode('utf-8'))
    print(response)

Finally, if our handler() gets any further data to process, it passes that along to other action Lambdas asynchronously. We use a helper class to wrap and simplify their invocation.

gear_events.py

class Lambda(object):
    def __init__(self, region, service='lambda'):
        try:
            self.region = region
            self.client = boto3.client(service, self.region)
        except ClientError as err:
            logging.error("------ClientError: %s", err)

    def invoke_function(self, functioname, payload, invoke_type='Event'):
        try:
            self.client.invoke(FunctionName=functioname, InvocationType=invoke_type, Payload=json.dumps(payload))
        except ClientError as err:
            logging.error("------ClientError: %s", err)

In essence, we had appropriate code paths for numerous actions on EC2 instances of varying states:

{
  "services": [
    {
      "name": "ec2",
      "actions": ["check", "stop", "kill", "terminate", "restart", "reboot", "find"],
      "states": ["stopped", "running", "terminated"]
    }
  ]
}
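A definition like this can drive command recognition directly. The sketch below is not our action_selector(), which is not shown in this article. It is a simplified, hypothetical parse_command() that looks for a known service, action and state among the words of a command.

```python
import re

# Same shape as the service definition above.
CONFIG = {
    "services": [
        {
            "name": "ec2",
            "actions": ["check", "stop", "kill", "terminate", "restart", "reboot", "find"],
            "states": ["stopped", "running", "terminated"],
        }
    ]
}


def parse_command(command, config=CONFIG):
    """Map free text to a service/action/state triple, or None if no service matches."""
    words = re.findall(r"[a-z0-9]+", command.lower())
    for service in config["services"]:
        if service["name"] not in words:
            continue
        # Take the first word that is a known action or state, if any.
        action = next((w for w in words if w in service["actions"]), None)
        state = next((w for w in words if w in service["states"]), None)
        return {"service": service["name"], "action": action, "state": state}
    return None


print(parse_command("find running EC2 instances"))
print(parse_command("do something for me"))
```

Keeping the vocabulary in data rather than in code means a new service or action can be added by extending the definition.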

But Charlize isn’t really chatty about anything else:

@Charlize, do something for me

With our gateway handler in place and able to translate recognized commands into actions, it’s time we took a closer look at how we implement such action handlers.

action-ec2:
  name: ${self:custom.app_function}-ec2-actions
  description: Provides actions regarding EC2 service
  handler: gear_ec2.handler
  environment:
    regions: ${self:custom.regions}
    tagkey: ${self:custom.tagkey}
    tagvalue: ${self:custom.tagvalue}
    table_roles: ${self:custom.t_roles}
    table_instances: ${self:custom.t_instances}
    dest_account: ${self:custom.dest_account}
  role: EC2Role

The function runs with EC2Role, which is allowed to assume a cross-account IAM role whenever we want to do something on a customer’s account. The role names, along with account IDs and customer metadata, are stored in DynamoDB, which we query to get the name of the respective role.

Pretty simple IAM role for an example action Lambda.

Policies:
  - PolicyName: GearSlack-Army-EC2
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Action:
            - sts:AssumeRole
          Resource:
            - arn:aws:iam::${self:custom.dest_account}:role/${self:custom.dest_rolename}
        - Effect: Allow
          Action:
            - dynamodb:PutItem
            - dynamodb:GetItem
          Resource:
            - arn:aws:dynamodb:${self:custom.region}:*:table/${self:custom.app_function}-roles-${self:custom.stage}
            - arn:aws:dynamodb:${self:custom.region}:*:table/${self:custom.app_function}-instances-${self:custom.stage}

The action Lambda is able to find EC2 instances and act on them within the right environment and with the right permissions.
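Our gear_ec2 code is not reproduced here, but the general pattern can be sketched as follows: assume the customer’s role via STS, create an EC2 client from the temporary credentials, and filter instances by state. All function names, the account ID, the role name and the region below are placeholders. The helpers take the clients as arguments so they can be exercised without AWS access.

```python
def build_role_arn(account_id, role_name):
    """Compose the ARN of the role to assume in the customer's account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def assume_role_credentials(sts_client, account_id, role_name, session_name="charlize"):
    """Assume the cross-account role and return kwargs usable by boto3.client()."""
    response = sts_client.assume_role(
        RoleArn=build_role_arn(account_id, role_name),
        RoleSessionName=session_name,
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def find_instance_ids(ec2_client, state):
    """List IDs of instances in the given state, following pagination."""
    paginator = ec2_client.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": [state]}]
    )
    return [
        instance["InstanceId"]
        for page in pages
        for reservation in page["Reservations"]
        for instance in reservation["Instances"]
    ]


def main():
    """Example wiring with real clients; needs boto3 and valid AWS credentials."""
    import boto3

    creds = assume_role_credentials(boto3.client("sts"), "123456789012", "example-role")
    ec2 = boto3.client("ec2", region_name="eu-west-1", **creds)
    print(find_instance_ids(ec2, "running"))
```

Because the credentials returned by STS are temporary, nothing long-lived from the customer’s account has to be stored on our side.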

Once the action is performed, we finish up with a response sent to Slack:

def sendResponse(body, text, data):
    attachment = {
        "title": "Charlize is saying:",
        "text": text,
        "color": "#2eb886"
    }
    if data != '':
        # Show the result of the action as a separate field
        attachment["fields"] = [{
            "title": "Response",
            "value": data,
            "short": False
        }]
    params = {
        # In a form-encoded request, attachments must be a JSON-encoded string
        "attachments": json.dumps([attachment]),
        'token': body['bot_access_token'],
        'channel': body['event']['channel']
    }
    url = 'https://slack.com/api/chat.postMessage'
    logging.info("------Requesting: '%s'", url)
    payload = urllib.parse.urlencode(params).encode("utf-8")
    req = urllib.request.Request(url)
    info = urllib.request.urlopen(req, payload)
    logging.info("------Getting info from URL: %s", info.geturl())
    response = json.loads(info.read().decode('utf-8'))
    print(response)

With all of that implemented and running, we can now ask Charlize:

“@charlize find number of running EC2 instances”
“@charlize check number of stopped EC2 instances”
“@charlize find running EC2 instances”
“@charlize find running EC2 instances in customer environment”

An example of asking for stopped instances. Actually none was found.

An example of asking for running instances. Dictionary returned.

Until next time, Charlize

We’ve shown one task that Charlize is capable of doing and which saves us time. In our case, quite a lot of it. Obviously this is just one potential idea for how you can integrate your AWS environments with Slack in order to boost your productivity.

SlackOps/ChatOps aren’t a new thing per se. We’ve used solutions akin to Charlize ever since the days of IRC, and even earlier, and they are certainly here to stay, especially with the advent of natural language processing breakthroughs driving a new generation of AI that is bound to revolutionize the way we interact with “bots”. We haven’t even touched on this and all the possibilities it brings.

That said, start small. Start with a simple task like we did, and then evolve your own Charlize-bot with new “skills”. You’re going to have a lot of fun and gain a new team member capable of performing tasks the humans in your team are tired of.