
Data Migration Service in S3 (Part 1)

Foreword

Data migration is a very broad subject. It often requires an individual approach, depending on the services and tools used for data processing and storage, as well as on the requirements specified by the client.

In this and the following article, we would like to present our own, original solution for migrating data stored in AWS Simple Storage Service (S3). Part one serves as a brief introduction to S3 and a review of the scenarios that occur most frequently during a data migration. In the second part, we will show the capabilities of our method in detail, along with comparative tests measuring the duration of the data migration process performed by a set of widely available tools.

Introduction to S3

Amazon Simple Storage Service (S3) is one of the oldest and, at the same time, one of the most popular services in the AWS public cloud portfolio. It is also one of the best-integrated services within AWS, allowing direct access to data gathered in buckets.

Thanks to its simplicity and reliability, this object storage service can be used for:

  • backup copies and archiving,
  • providing static content for applications and websites,
  • big data analysis,
  • disaster recovery scenarios,

and many more.

Data are put into unique buckets located in specified regions. Depending on the chosen S3 storage class, data are kept in one, two, three, or even more isolated data center locations (AZ = Availability Zone). This allows us to increase the availability of the stored objects. The service was designed to provide 99.999999999% (eleven nines) durability. Files (objects) are identified by a key, so we can access any object using the bucket name, the key and, optionally, the object's version identifier. Keep in mind that S3 bucket naming is based on DNS, so bucket names must be unique across all AWS regions.
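
To illustrate this addressing scheme, the minimal boto3 sketch below fetches a single object by bucket name and key; the bucket and key names are made up for this example, and VersionId is only relevant when versioning is enabled on the bucket.

    import boto3

    s3 = boto3.client("s3")

    # Fetch an object by bucket name and key; pass VersionId as well
    # when a specific version is needed (names below are placeholders).
    response = s3.get_object(
        Bucket="example-bucket",
        Key="reports/summary.csv",
        # VersionId="...",
    )
    data = response["Body"].read()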

S3 is a service with unlimited space and no minimum entry threshold: you use as much space as you need. It scales automatically with growing demand, without any advance space allocation. A high level of durability and availability (99.99%), combined with multiple options for encrypting data in transit and at rest, makes S3 a service that can be used in numerous business applications.

You can access it through:

  • AWS console,
  • SDK library,
  • API,
  • CLI.

In the end, the cost of using S3 comes down to the disk space used and the outgoing transfer. Additionally, we can combine S3 with the Amazon Glacier service and automatically move rarely used data to cheaper, archival storage through the built-in object lifecycle management mechanisms.
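
As an example of such a rule, the sketch below configures a lifecycle policy with boto3 that moves objects to Glacier after 90 days; the bucket name and prefix are assumptions made purely for illustration.

    import boto3

    s3 = boto3.client("s3")

    # Minimal lifecycle rule: move objects under the "archive/" prefix
    # to the GLACIER storage class 90 days after creation.
    # Bucket name and prefix are placeholders.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-old-objects",
                    "Filter": {"Prefix": "archive/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 90, "StorageClass": "GLACIER"},
                    ],
                }
            ]
        },
    )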

Why we decided to create our own migration tool

Browsing through the available solutions (both free and commercial), we were not able to find one capable of performing a fully automated data migration in all of these scenarios. Nevertheless, a few widely available tools are worth mentioning, with the AWS CLI being the most obvious starting point.

For example, we can copy/synchronize data with the AWS CLI, following these steps:

  • first, we configure access credentials and a default region with the command:
    aws configure
  • assuming we have a properly configured AWS CLI and two buckets (s3_source_bucket and s3_destination_bucket), we can synchronize the data with the command:
    aws s3 sync s3://S3_SOURCE_BUCKET s3://S3_DESTINATION_BUCKET
  • next, we need to verify that all the data were copied to the destination bucket. To do that, we run the two commands below and compare their summaries (the number and total size of objects should be identical); a Python sketch of the same check follows this list:
    aws s3 ls --recursive s3://S3_SOURCE_BUCKET --summarize
    aws s3 ls --recursive s3://S3_DESTINATION_BUCKET --summarize
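
A minimal version of that verification step in Python with boto3 could look as follows; the bucket names are placeholders matching the CLI example above, and the check simply counts objects and sums their sizes on both sides.

    import boto3

    def bucket_summary(bucket_name):
        """Return (object_count, total_size_in_bytes) for a bucket."""
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        count, size = 0, 0
        for page in paginator.paginate(Bucket=bucket_name):
            for obj in page.get("Contents", []):
                count += 1
                size += obj["Size"]
        return count, size

    # Placeholder bucket names, as in the CLI example above.
    source = bucket_summary("s3_source_bucket")
    destination = bucket_summary("s3_destination_bucket")
    print("source:", source, "destination:", destination)
    assert source == destination, "Buckets differ - synchronize again"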

In the case of applications performing many write operations, we need to switch them to the new bucket and then synchronize the data once more at the S3 bucket level, to pick up objects written during the first pass.

Data synchronization with the CLI is fairly simple, but it gets complicated once we try to modify ACLs or change metadata, as there are no switches covering this part of a data migration. The second weakness of the CLI lies in the speed of the copy process. Migrating sizeable S3 buckets takes a long time, which can result in data being inaccessible while applications switch to the new buckets, or in an unnecessarily long maintenance window that lasts until the whole synchronization finishes. What is more, automating repetitive processes with the AWS CLI becomes complicated as the number of different scenarios grows, which increases the workload. In other words, the AWS CLI is ideal for simple migrations with no or very limited requirements. For managing repetitive processes and their automation, you have to look elsewhere. And that is exactly what we did, creating our own migration tool.

Our main goal was to automate the whole process – access verification, verification of the source and destination S3 buckets, implementation of the proper policies, transfer of object ownership – right from the moment of providing the input data (AWS accounts, bucket names and access credentials). Additionally, we wanted to speed up the process, so our tool could be used for synchronization/migration of big S3 buckets, regardless of their size or the number of objects. We also paid attention to usability, concentrating on logging standards, notifications explaining the particular stages of the migration process, and error handling.
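
To give an idea of the kind of pre-flight verification we mean, the sketch below uses boto3 and STS to confirm that the provided credentials work and that both buckets are reachable. The profile and bucket names are placeholders; this is not the actual interface of our tool.

    import boto3
    from botocore.exceptions import ClientError

    def verify_access(profile_name, bucket_names):
        """Check that credentials are valid and each bucket is reachable."""
        session = boto3.Session(profile_name=profile_name)

        # Confirm the credentials themselves work.
        identity = session.client("sts").get_caller_identity()
        print("Running as:", identity["Arn"])

        s3 = session.client("s3")
        for name in bucket_names:
            try:
                # head_bucket succeeds only if the bucket exists
                # and we are allowed to access it.
                s3.head_bucket(Bucket=name)
                print("OK:", name)
            except ClientError as error:
                print("Cannot access", name, "-", error)
                return False
        return True

    # Placeholder values for illustration only.
    verify_access("source-account", ["s3_source_bucket", "s3_destination_bucket"])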

We wanted a tool able to copy/migrate data to new buckets or, even more importantly, to keep their original names; in other words, a solution capable of transferring data between different accounts or regions.

Unfortunately, none of the solutions/tools we tested fulfilled the aforementioned requirements. That is why we decided to create our own tool to simplify and automate data migration.

Most frequently occurring scenarios

In this part of the article, we describe the most frequently occurring data migration scenarios. This does not mean that the list is complete; on the contrary, you will probably encounter different circumstances. We have only summarized our experience from projects using the S3 service.

1. Migration/copy within the same AWS account (different S3 bucket names):

a. one shared AWS region

b. two different AWS regions

In both cases, the data are transferred/copied to a new S3 bucket within the same account, either within one region or between two different regions.

2. Migration/copy between two different AWS accounts (different S3 bucket names):

a. one shared AWS region

b. two different AWS regions

In both cases, the data are transferred/copied to a new S3 bucket between different AWS accounts, either within one region or between two different regions.
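
For the cross-account cases, the copying identity needs rights on both sides. One common setup is a bucket policy on the source bucket that grants the destination account read access, after which a server-side copy can be run from the destination account. The sketch below only illustrates that idea, with made-up account IDs, profile names and bucket names, and assumes both buckets are in the same region (scenario 2a).

    import json
    import boto3

    SOURCE_BUCKET = "s3_source_bucket"            # example names
    DESTINATION_BUCKET = "s3_destination_bucket"
    DESTINATION_ACCOUNT_ID = "111122223333"       # made-up account ID

    # 1) On the source account: allow the destination account to read the bucket.
    source_s3 = boto3.Session(profile_name="source-account").client("s3")
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{DESTINATION_ACCOUNT_ID}:root"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{SOURCE_BUCKET}",
                f"arn:aws:s3:::{SOURCE_BUCKET}/*",
            ],
        }],
    }
    source_s3.put_bucket_policy(Bucket=SOURCE_BUCKET, Policy=json.dumps(policy))

    # 2) On the destination account: server-side copy of every object.
    #    The managed copy() call also handles objects above 5 GB.
    dest_s3 = boto3.Session(profile_name="destination-account").client("s3")
    paginator = dest_s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET):
        for obj in page.get("Contents", []):
            dest_s3.copy({"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
                         DESTINATION_BUCKET, obj["Key"])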

3. Data migration (keeping original S3 bucket name):

a. one AWS account, two different regions

b. two different AWS accounts, two different regions (or one shared region)

In both cases, the data are first transferred to a temporary S3 bucket. The source bucket is then deleted and created again with the same name, but in the target location. The data are transferred from the temporary bucket to the destination bucket, and the temporary bucket is deleted. This method allows us to skip an otherwise required reconfiguration (the data remain available under the same address, and our tool transfers object ownership). The only thing you have to consider is a temporary inaccessibility of the data in the period between the deletion of the source bucket and the release of its name by AWS, which allows it to be reused. It is worth mentioning that AWS does not guarantee that you will be able to use the same name a second time. When you delete an S3 bucket from your account, its name can be used again once it disappears from the S3 service, but it can be used by anybody else as well. Therefore, you should add a prefix or a hash to bucket names, and use an internal or external DNS domain with records pointing at the S3 buckets.
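
To make the sequence concrete, the sketch below outlines scenario 3 with boto3 for a single-account, cross-region move. It is a simplified illustration with placeholder bucket names and regions, not the code of our tool (which additionally parallelizes the copy, verifies the result, transfers ownership and handles errors), and it assumes versioning is not enabled on the buckets.

    import time
    import boto3
    from botocore.exceptions import ClientError

    BUCKET = "example-app-bucket"          # the name we want to keep (placeholder)
    TEMP_BUCKET = "example-app-bucket-tmp" # temporary staging bucket (placeholder)
    SOURCE_REGION = "eu-central-1"         # example regions
    TARGET_REGION = "eu-west-1"

    source_s3 = boto3.client("s3", region_name=SOURCE_REGION)
    target_s3 = boto3.client("s3", region_name=TARGET_REGION)

    def copy_all(list_client, copy_client, source, destination):
        # list_client must point at the source bucket's region,
        # copy_client at the destination bucket's region.
        paginator = list_client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=source):
            for obj in page.get("Contents", []):
                copy_client.copy({"Bucket": source, "Key": obj["Key"]},
                                 destination, obj["Key"])

    def empty_bucket(client, bucket):
        # Delete every object so the bucket itself can be removed.
        paginator = client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                client.delete_object(Bucket=bucket, Key=obj["Key"])

    # 1) Stage the data in a temporary bucket created in the source region,
    #    then remove the original bucket.
    source_s3.create_bucket(
        Bucket=TEMP_BUCKET,
        CreateBucketConfiguration={"LocationConstraint": SOURCE_REGION},
    )
    copy_all(source_s3, source_s3, BUCKET, TEMP_BUCKET)
    empty_bucket(source_s3, BUCKET)
    source_s3.delete_bucket(Bucket=BUCKET)

    # 2) Recreate the bucket under the same name in the target region.
    #    AWS needs time to release the name and does not guarantee that it
    #    will still be available, hence the retry loop.
    while True:
        try:
            target_s3.create_bucket(
                Bucket=BUCKET,
                CreateBucketConfiguration={"LocationConstraint": TARGET_REGION},
            )
            break
        except ClientError:
            time.sleep(60)

    # 3) Move the data into the recreated bucket and clean up.
    copy_all(source_s3, target_s3, TEMP_BUCKET, BUCKET)
    empty_bucket(source_s3, TEMP_BUCKET)
    source_s3.delete_bucket(Bucket=TEMP_BUCKET)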

Summary

We have introduced you to the AWS S3 service and the data migration process. We also looked at widely available tools for working with objects in S3, and at the scenarios we have encountered while working for our clients. We explained why we decided to develop our own tool, and sketched its functionality.

In the second part of our article, we will present, in detail, the stages and the code used in our application for the automated data migration/copy process in a given scenario. We will also show how we managed to parallelize and accelerate the whole process, simplifying data migration operations in S3.

Follow/read our blog and social channels. See you soon in the second part.