Let’s do some CHaos in your cloud

The Story

In my previous post about “the usage of Chaos as an internal culture toward greater self-confidence,” I presented some thoughts about the general concept of Chaos Engineering. This time I’d like to present some practice on the topic, so I’ve prepared my own implementation of a chaos monkey based on a serverless concept and the AWS environment. If that sounds good to you, my dear reader, don’t waste time and let’s start the engine.

Serverless, my Dear Fellow 

As in previous implementations, I’ve used the Serverless Framework to set up the whole environment. To make it simple and easier to develop, I’ve divided the solution into 3 separate Lambda functions. In the next section I’m going to describe each of them, but first, for the sake of explanation, imagine the following scenario:

Basically, I wanted, similarly to Chaos Monkey from Netflix, to inject some randomness with an impact on our cloud environment. I considered different options, but to make it happen I chose the following approach:

     implement a function, let’s call it “settimer”, which randomly selects a set of dates with hours and puts that information somewhere external via an API call

     let’s make “settimer” automatically invokable

     create a DynamoDB table containing the items mentioned above

     implement another function, let’s call it “scheduler”, which will be responsible for gathering dates from the external storage and updating the execution times of the third function, called “executor”

     let’s make “scheduler” automatically invokable

     let’s make “executor” automatically invokable

     create a DynamoDB table containing terminated instances’ info – to keep a history record

     finally, implement notifications – I hate not being informed about events

Here’s how the implementation took shape. Everything starts within serverless.yml. I’ve defined 3 Lambda functions and some external resources like DynamoDB tables and the IAM roles needed for those functions. The attentive reader might notice the part titled “events”:

events:
  - schedule:
      rate: cron(0 * ? * MON-SUN *)
      name: event-${self:custom.app_acronym}-${self:custom.stage}-scheduler
      description: Default event cron for chaosbox scheduler maintainer

It allows you to easily define AWS CloudWatch Events rules without additional resource creation. The IAM roles are used by the Lambdas to get access to external resources like DynamoDB tables, logs, CloudWatch Events and the EC2 service.

functions:
  chaosbox-executor:
    name: ${self:custom.stack_name}-executor-${self:custom.stage}
    description: CHaos Box for randomly selected ec2 instances termination
    handler: chaosbox.lambda_handler
    role: ChaosBoxRole
    timeout: 60
    environment:
      slack_url: ${self:custom.slack_url}
      slack_channel: ${self:custom.slack_channel}
      aws_region: ${self:custom.region}
      tablename: ${self:custom.dynamodb_terminated}
      tagkey: ${self:custom.tagkey}
      tagvalue: ${self:custom.tagvalue}
    events:
      - schedule:
          rate: cron(0 8 ? * MON-FRI *)
          name: event-${self:custom.app_acronym}-executor-${self:custom.stage}
          description: Execute CHaos events based on schedule
    tags:
      Name: ${self:custom.app_acronym}-terminator-${self:custom.stage}
      Project: chaosbox
      Environment: dev
  chaosbox-scheduler:
    name: ${self:custom.stack_name}-scheduler-${self:custom.stage}
    description: CHaos Box scheduler
    handler: events.lambda_handler
    role: ChaosBoxSchedulerRole
    timeout: 15
    environment:
      event_name: ${self:custom.event_name}
      aws_region: ${self:custom.region}
      dynamodb_schedule: ${self:custom.dynamodb_schedule}
    events:
      - schedule:
          rate: cron(0 * ? * MON-SUN *)
          name: event-${self:custom.app_acronym}-scheduler-${self:custom.stage}
          description: Default event cron for chaosbox scheduler maintainer
    tags:
      Name: ${self:custom.app_acronym}-scheduler-${self:custom.stage}
      Project: chaosbox
      Environment: dev
  chaosbox-settimer:
    name: ${self:custom.stack_name}-settimer-${self:custom.stage}
    description: CHaos Box hours selector
    handler: dynamo_schedule.lambda_handler
    role: ChaosBoxSchedulerRole
    timeout: 15
    environment:
      aws_region: ${self:custom.region}
      dynamodb_schedule: ${self:custom.dynamodb_schedule}
      intervals_per_day: ${self:custom.intervals_per_day}
      interval_days: ${self:custom.interval_days}
      starthour: ${self:custom.starthour}
      stophour: ${self:custom.stophour}
    events:
      - schedule:
          rate: cron(0 1 */5 * ? *)
          name: event-${self:custom.app_acronym}-settime-${self:custom.stage}
          description: Choose hours of CHaos events and put them into Dynamodb
    tags:
      Name: ${self:custom.app_acronym}-settime-${self:custom.stage}
      Project: chaosbox
      Environment: dev

resources:
  - ${file(resources/roles.yml)}
  - ${file(resources/dynamo.yml)}

plugins:
  - serverless-python-requirements
  - serverless-pseudo-parameters
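
I haven’t pasted resources/roles.yml here, but to give you an idea, the executor’s role could look more or less like the sketch below, in standard CloudFormation syntax. Treat the policy name and the broad Resource: '*' scoping as illustrative assumptions on my side, not the actual file – in practice you’d scope the resources down:

Resources:
  ChaosBoxRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: chaosbox-executor-policy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              # CloudWatch Logs for the Lambda itself
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: '*'
              # History table writes plus the actual chaos work
              - Effect: Allow
                Action:
                  - dynamodb:BatchWriteItem
                  - autoscaling:DescribeAutoScalingGroups
                  - ec2:TerminateInstances
                  - sns:Publish
                Resource: '*'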

After deploying it with sls deploy --aws-profile YOUR_PROFILE, we can move on to the code of the particular functions.

Settimer

The first action is to fill the DynamoDB table called “chaosbox-schedule” with the dates and hours of the particular chaos launches.

Additionally, I’m storing the date of the last update and simple owner information. We’ll probably restructure this table and add/remove some columns. Anyway, treat it as your source of particular invocations.
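
To give you an idea, a single item in that table might look more or less like this. The chaosdate attribute is the one the scheduler later scans on; the name of the last-update attribute is my assumption, since the post doesn’t pin the schema down:

# An illustrative "chaosbox-schedule" item (attribute names partly assumed)
item = {
    'chaosdate': '2019-02-08 10:17:42',   # when the executor should strike
    'updated-on': '2019-02-06 01:00:03',  # last "settimer" run
    'owner': 'chaosgears'
}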

Generally, “settimer” is invoked once per 5 days (0 1 */5 * ? *). It uses the randomDate function, which starts with interval_days (the number of days ahead for which we should set chaos actions), calculates the proper values and then puts them all into DynamoDB. I’ve added some variables like the starthour and stophour flags to narrow the period of the calculated execution hours, and intervals_per_day to set the number of invocations per day.

import datetime
import logging
import os
import random

from botocore.exceptions import ClientError

from dynamo import Dynamo  # my helper class for DynamoDB API actions; module name assumed


def randomDate(region, tablename, interval_days, intervals_per_day, starthour, stophour):
    day_list = []
    chaos_dates = []
    dynamo = Dynamo('dynamodb', region)
    # Start from today at midnight and add random offsets on top of it
    d_today = datetime.datetime.today()
    d_today = d_today.replace(hour=0, minute=0, second=0, microsecond=0)

    for days in range(interval_days):
        for item in range(intervals_per_day):
            hours = random.randint(starthour, stophour)
            minutes = random.randint(0, 59)
            seconds = random.randint(0, 59)  # randint is inclusive, so 0-59
            new_d = d_today + datetime.timedelta(days=days, hours=hours,
                                                 minutes=minutes, seconds=seconds)
            day_list.append(new_d.strftime("%Y-%m-%d %H:%M:%S"))
    try:
        day_list[0]  # raises IndexError if nothing was generated
        logging.info("------Chaos dates selection process")
        for i in day_list:
            # to extract date from hour: i[:10] / i[11:]
            chaos_dates.append(i)
        logging.info("------Chaos dates selected: {0}".format(chaos_dates))
        dynamo.clear(tablename)
        dynamo.batch_write(tablename, chaos_dates)
        logging.info('------Data successfully written into Dynamodb table')
    except IndexError as err:
        logging.critical("------Index error: {0}".format(err))
    except ClientError as err:
        logging.critical("------Client error: {0}".format(err))


def lambda_handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.info("Event: {0}".format(event))
    region = os.environ['aws_region']
    dynamodb_schedule = os.environ['dynamodb_schedule']
    intervals_per_day = os.environ['intervals_per_day']
    interval_days = os.environ['interval_days']
    starthour = os.environ['starthour']
    stophour = os.environ['stophour']
    randomDate(interval_days=int(interval_days), intervals_per_day=int(intervals_per_day),
               starthour=int(starthour), stophour=int(stophour),
               region=region, tablename=dynamodb_schedule)
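
The Dynamo class is my own small wrapper around boto3 and isn’t shown in this post. A minimal sketch of the two methods used above could look like this – treat the schema (chaosdate as the partition key) as an assumption based on how the scheduler queries the table:

import boto3


class Dynamo:
    # A minimal sketch, not the full helper class used in the project
    def __init__(self, service, region):
        self.service = service
        self.region = region
        self.resource = boto3.resource(self.service, region_name=region)

    def clear(self, tablename):
        # Drop every previously scheduled chaos date before writing the new set
        table = self.resource.Table(tablename)
        scan = table.scan(ProjectionExpression='chaosdate')
        with table.batch_writer() as batch:
            for item in scan['Items']:
                batch.delete_item(Key={'chaosdate': item['chaosdate']})

    def batch_write(self, tablename, items):
        # One item per selected chaos date, plus simple owner metadata
        table = self.resource.Table(tablename)
        with table.batch_writer() as batch:
            for chaos_date in items:
                batch.put_item(Item={'chaosdate': chaos_date, 'owner': 'chaosgears'})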

Scheduler

Basically, this function is responsible for setting the CloudWatch Events rule for the Lambda called “executor”. What I’m actually doing is checking the current date, extracting the part used in the DynamoDB scan function (imported from the class I use for DynamoDB API actions) and finally updating the event rule with properly formatted values like 0 13,14,15 ? * * *.
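
Here’s the transformation in a nutshell, on made-up sample dates:

# Today's chaos dates scanned from DynamoDB (sample values)
chaos_today = ['2019-02-06 13:12:08', '2019-02-06 15:40:21', '2019-02-06 14:03:55']
# Slice the hour out of each "YYYY-MM-DD HH:MM:SS" string and sort
hours = sorted(d[11:13] for d in chaos_today)          # ['13', '14', '15']
frequency = 'cron(0 ' + ','.join(hours) + ' ? * * *)'  # cron(0 13,14,15 ? * * *)

The Events class below builds exactly this expression and pushes it to CloudWatch Events via put_rule: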

import datetime
import logging
import os

import boto3
from botocore.exceptions import ClientError

from dynamo import Dynamo  # my helper class for DynamoDB API actions; module name assumed


class Events(Dynamo):
    def __init__(self, service, region):
        self.service = service
        self.region = region
        try:
            self.client = boto3.client(self.service, region_name=region)
        except ClientError as err:
            logging.critical("------Client error: {0}".format(err))

    def update_event(self, name, tablename):
        frequency = ''
        try:
            dynamo = Dynamo('dynamodb', self.region)
            # Scan for all chaos dates scheduled for today (date part only)
            chaos_today = dynamo.scan_dynamo_param(
                tablename=tablename, attr='chaosdate',
                value=datetime.datetime.today().strftime('%Y-%m-%d %H:%M:%S')[:10])
            # Slice the hour part ("HH") out of each "YYYY-MM-DD HH:MM:SS" string
            chaos_today = [item[11:13] for item in chaos_today]
            temp = sorted(set(chaos_today))  # deduplicate repeated hours
            cron = ','.join(temp)
            frequency = 'cron(0 ' + cron + ' ? * * *)'
            logging.info("------Set frequency value: {0}".format(frequency))
            # e.g. 0 13,14,15 ? * * *
            self.client.put_rule(
                Name=name,
                ScheduleExpression=frequency,
                Description='Execute CHaos events based on schedule',
                State='ENABLED',
                )
            logging.info("------Successfully changed rule: {0}".format(name))
        except ClientError as err:
            logging.critical("------Client error: {0}".format(err))


def lambda_handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.info("Event: {0}".format(event))
    region = os.environ['aws_region']
    name = os.environ['event_name']
    dynamodb_schedule = os.environ['dynamodb_schedule']
    event = Events('events', region)
    event.update_event(name, tablename=dynamodb_schedule)
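
scan_dynamo_param also belongs to the omitted Dynamo helper. A minimal sketch, assuming the chaos dates are stored as full “YYYY-MM-DD HH:MM:SS” strings and the scan filters on the date prefix passed in as value:

from boto3.dynamodb.conditions import Attr

def scan_dynamo_param(self, tablename, attr, value):
    # Return the values of `attr` for every item whose `attr` begins with
    # `value` -- here: all chaos dates scheduled for today
    table = self.resource.Table(tablename)
    response = table.scan(FilterExpression=Attr(attr).begins_with(value))
    return [item[attr] for item in response['Items']]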

Executor

The last, but generally the core, function of this solution is the Lambda called “executor”. I’ve pasted only the main part, without the parent class (Instance), which has some handy methods for working with the AWS EC2 service (a minimal sketch of its key method follows the code below). Anyway, you can use the AutoScaling class without inheritance and simply add your own methods. One important note: def batch_write has been taken and overwritten from another class called Dynamo. As with the Instance class, you might write your own version, or catch me on LinkedIn and I’ll send you more details, but for the sake of a clear view and focus on the chaos-termination case I’ve intentionally omitted it. The “executor” searches for auto scaling groups with a pre-defined tag key/value pair and, having those dictionaries, can randomly terminate one instance from each ASG. Obviously, if you don’t set the key/value pair on an ASG, that group will be skipped. All instance ids which have been selected by ChaosBox are saved into a DynamoDB table with additional values like:

       termination date

       ASG name

       owner (might be different, but I had no better idea at the time)

       instance-id

import datetime
import logging
import os
import random

import boto3
from botocore.exceptions import ClientError

# Module names below are assumptions; adjust to your own layout.
from dynamo import Dynamo                                  # DynamoDB helper (not shown)
from instance import Instance                              # parent class, intentionally omitted
from notifications import send_notification, slack_info    # SNS/Slack helpers (slack_info shown below)


class AutoScaling(Instance):
    def __init__(self, service, region):
        self.service = service
        self.region = region
        try:
            self.client = boto3.client(self.service, region_name=region)
        except ClientError as err:
            logging.critical("----Client error: {0}".format(err))

    def find_chaos_asg(self, tagkey, tagvalue):
        autoscale_groups = []
        chaos_candidates = {}
        try:
            response = self.client.describe_auto_scaling_groups()['AutoScalingGroups']
            # Collect the names of all ASGs carrying the chaos tag
            for item in response:
                for i in item['Tags']:
                    if i['Key'] == tagkey and i['Value'] == tagvalue:
                        autoscale_groups.append(item['AutoScalingGroupName'])
            # Map each tagged ASG name to the list of its instance ids
            for found in autoscale_groups:
                temp = []
                for i in response:
                    if found == i['AutoScalingGroupName']:
                        for instances in i['Instances']:
                            temp.append(instances['InstanceId'])
                            chaos_candidates[found] = temp
            return chaos_candidates
        except ClientError as err:
            logging.critical("----Client error: {0}".format(err))
        except KeyError as err:
            logging.critical("----Key error: {0}".format(err))

    def batch_write(self, tablename, items, terminated, owner='chaosgears', region='eu-west-1'):
        # Record every terminated instance in the history table
        dynamo = Dynamo('dynamodb', region)
        table = dynamo.resource.Table(tablename)
        with table.batch_writer() as batch:
            for key in items:
                for val in items[key]:
                    if val in terminated:
                        batch.put_item(Item={
                                'instance-id': val,
                                'autoscaling-group': key,
                                'terminated-on-date': datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
                                'owner': owner
                        })
        return True

    def chaos_box(self, slack_url, slack_channel, region, tablename, tagkey, tagvalue,
                  topic='arn:aws:sns:eu-west-1:123456789000:chaosbox', owner='chaosgears'):
        chaos_victims = []
        count = 0
        try:
            chaos_group = self.find_chaos_asg(tagkey, tagvalue)
            if not chaos_group:
                logging.info("------No CHaos today")
                response = {
                    'Info': '----No CHaos autoscaling groups found with ChaosBox=True tag',
                    'Status': 'NO CHANGE' }
                send_notification(response, topic, self.region)
                slack_info(slack_url, slack_channel, owner, region, num_ec2='0',
                           asg_ids='none', chaos_victims='none', flag='nochange',
                           dirname='slack_messages')
            else:
                response = {
                    'Info': 'Found ' + str(len(chaos_group)) + ' autoscaling groups',
                    'Data': chaos_group,
                    'Status': 'SUCCESS' }
                send_notification(response, topic, self.region)
                ec2 = Instance('ec2', region)
                for key in chaos_group.keys():
                    # Pick one random victim per autoscaling group
                    chg_victim = random.randint(0, len(chaos_group[key]) - 1)
                    chaos_victims.append(chaos_group[key][chg_victim])
                    # Terminate only the newly selected victim
                    ec2.terminate_instance(instance=[chaos_group[key][chg_victim]])
                    logging.info("------Terminated instance: {0}".format(chaos_group[key][chg_victim]))
                    count += 1
                self.batch_write(tablename, chaos_group, chaos_victims, owner, region)
                responseBody = {
                        'Info': 'Terminated ' + str(count) + ' instances',
                        'Data': chaos_victims,
                        'Status': 'SUCCESS' }
                send_notification(responseBody, topic, self.region)
                slack_info(slack_url, slack_channel, owner, region, num_ec2=count,
                           asg_ids=','.join(chaos_group),
                           chaos_victims=','.join(chaos_victims), flag='success',
                           dirname='slack_messages')
        except ClientError as err:
            logging.critical("----Client error: {0}".format(err))


def lambda_handler(event, context):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.info("Event: {0}".format(event))
    slack_channel = os.environ['slack_channel']
    slack_url = os.environ['slack_url']
    region = os.environ['aws_region']
    tablename = os.environ['tablename']
    tagkey = os.environ['tagkey']
    tagvalue = os.environ['tagvalue']
    asg = AutoScaling('autoscaling', region)
    asg.chaos_box(slack_url, slack_channel, region, tablename, tagkey, tagvalue)
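
As promised, here’s a minimal sketch of the one Instance method the executor relies on, assuming it simply wraps the EC2 TerminateInstances call. The real class has more handy methods that I’ve left out:

import logging

import boto3
from botocore.exceptions import ClientError


class Instance:
    # A minimal sketch of the intentionally omitted parent class
    def __init__(self, service, region):
        self.service = service
        self.region = region
        self.client = boto3.client(self.service, region_name=region)

    def terminate_instance(self, instance):
        # `instance` is a list of instance ids selected by chaos_box
        try:
            return self.client.terminate_instances(InstanceIds=instance)
        except ClientError as err:
            logging.critical("----Client error: {0}".format(err))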

Notification

Without being notified about events, we couldn’t monitor and improve them; therefore I’ve added Slack notifications to one of our channels. We’re enclosing:

     name of the auto scaling group

     region in scope

     number of terminated instances

     instance-ids

There are many ways to implement this, but honestly, just make the system able to inform you about the actions it takes, no matter whether it is Slack or any other system.

import datetime
import logging

import requests

# Invocation from the executor:
slack_info(slack_url, slack_channel, owner, region, num_ec2=count,
           asg_ids=','.join(chaos_group),
           chaos_victims=','.join(chaos_victims), flag='success',
           dirname='slack_messages')


def custom_message(num_ec2, asg_ids, owner, region, chaos_victims, dirname, pattern):
    # Load the Slack message template matching the pattern and fill in the fields
    message = import_all_data(pattern, dirname)
    now = datetime.datetime.now()
    date_now = now.strftime('%Y-%m-%d %H:%M:%S')
    message['attachments'][0]['fields'][0]['value'] = owner
    message['attachments'][0]['fields'][1]['value'] = region
    message['attachments'][0]['fields'][2]['value'] = date_now
    if pattern == 'success':
        message['attachments'][0]['fields'][3]['value'] = asg_ids
        message['attachments'][0]['fields'][4]['value'] = num_ec2
        message['attachments'][0]['fields'][5]['value'] = chaos_victims
    else:
        message['attachments'][0]['fields'][3]['value'] = 'No CHaos autoscaling groups found with ChaosBox=True tag'
    return message


def slack_info(slack_url, slack_channel, owner, region, num_ec2, asg_ids,
               chaos_victims, flag, dirname='slack_messages'):
    try:
        if flag == 'success':
            slack_message = custom_message(num_ec2, asg_ids, owner, region, chaos_victims,
                                           pattern='success', dirname=dirname)
        else:
            slack_message = custom_message(num_ec2, asg_ids, owner, region, chaos_victims,
                                           pattern='nochange', dirname=dirname)
        req = requests.post(slack_url, json=slack_message)
        if req.status_code != 200:
            print(req.text)
            raise Exception('Received non 200 response')
        else:
            print("Successfully posted message to channel: ", slack_channel)
    # Specific requests exceptions must come before the generic RequestException,
    # otherwise the more specific handlers are unreachable
    except requests.exceptions.HTTPError as err:
        logging.critical("----HTTP request error: {0}".format(err))
    except requests.exceptions.ConnectionError as err:
        logging.critical("----Connection error: {0}".format(err))
    except requests.exceptions.Timeout as err:
        logging.critical("----Timeout error: {0}".format(err))
    except requests.exceptions.RequestException as err:
        logging.critical("----Client error: {0}".format(err))
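
The import_all_data helper isn’t shown either; a minimal sketch, assuming each pattern (“success”, “nochange”) is a Slack message template stored as a JSON file under the slack_messages directory:

import json
import os


def import_all_data(pattern, dirname):
    # Load the Slack message template matching the given pattern
    with open(os.path.join(dirname, pattern + '.json')) as template:
        return json.load(template)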

Is it the end or just the beginning?

Having a solution able to terminate randomly selected instances is just the first step in the whole story. Chaos Engineering is about taking small steps, provided you already have a steady state which is under control. Whenever you run any chaos action, you should have a hypothesis in mind about what the outcome of the experiment will be. Think about how the steady-state behavior is going to change whenever you inject different types of events into your system. This is the pure beauty of “chaos engineering”: always treat the environment like it’s “day one”, during which you’ll find some new areas to investigate. Anyway, turning off a server is cheap and easy, which makes it a good place to start, especially with serverless. As we run these exercises more frequently, terminations will be perceived as more “normal”, and that’s the goal. Taming chaos means making it a daily, monthly internal habit. We’re not stopping, not even slowing down. We’ve started this machine. Stay tuned for more.