Migrating Clouds Quickly

In August 2020, ShipperBee migrated the bulk of its business from Google Cloud Platform (GCP) to Microsoft Azure. It took us under 3 months start to finish, with about 4 days of panic. We finished the year with 99.5% uptime on our core services (not amazing, but really good!).
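For a sense of scale, here is a back-of-the-envelope conversion of that uptime figure into hours of downtime. It is only a sketch: it assumes a 365-day year measured around the clock, which may not match how the 99.5% was actually calculated, and `downtime_hours` is a helper name made up for this example.

```python
# Assumption: a 365-day year, with uptime measured 24/7.
HOURS_PER_YEAR = 365 * 24


def downtime_hours(uptime_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Convert an uptime percentage into hours of downtime over a period."""
    return period_hours * (1 - uptime_percent / 100)


print(f"{downtime_hours(99.5):.1f} hours of downtime per year")
```

Under those assumptions, 99.5% works out to a little under two days of downtime across the year.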

I think this was well executed, especially for a business with a few hundred daily clients, several hundred delivery drivers, and a team of about 70 people using and operating the platform 6 days per week.

ShipperBee’s landscape included 2 REST API services, 4 web portals, an IoT/hardware support stack, an Apple + Android notification system, PostgreSQL databases, Redis caches, and a Shopify app already running in Azure.

Here is what we did, and what I learned.

Hire + Trust the Experts

Our dev team had the will, but not quite the in-house Azure experience, to execute a migration like this. We determined that hired guns on a short engagement would be cheaper than hiring a full-time Azure infrastructure expert (or two) … and they definitely were.

Through Arrow, our Microsoft CSP, we hired their subsidiary company, eInfoChips from India, to help with the lift. It went OK! eIC were knowledgeable about what Microsoft has to offer in Azure, about proven configurations, and about which products/settings to avoid.

The most controversial example of following expert advice was using VM Scale Sets rather than Docker containers for our core APIs. On the advice of eIC’s system architects, we avoided the Docker ecosystem in Azure, since it wasn’t as well tooled for scaling, repair, price reservations, etc.

Do DevOps Yourself

I wish we hadn’t relied on our experts to set up our DevOps / build / CI / CD pipelines. Ultimately, those should be owned and managed by the people closest to the product code.

Nuances like naming conventions and ownership of service accounts matter for future upkeep. A contractor won’t be as concerned with those details as your team might be.

We migrated our deployment process to Azure Pipelines, rather than staying with our existing Bitbucket pipelines, which had been hardened over the years. There was no real need to move our deployment pipeline along with our cloud provider; it was extra scope that ultimately caused a bunch of confusion and problems.

Skip the Great Inventory

Whenever you decide to move Clouds, inevitably someone will suggest you do a complete inventory of your software system, and how it interacts with every other piece (including your IT and DevOps stack).

I believed we had a thorough inventory of every GCP component we used, and that moving it would just involve checking things off a list. It wasn’t that simple. Unless you’ve turned off the Console/Portal and everything you’ve ever put in the cloud was done explicitly via Terraform, you’ll learn more than you ever wanted to know about VNets, service accounts, storage keys, and deleted users. And you’ll miss things.

Do enough of an inventory to understand the big pieces of the lift, but expect each instance of each service to be migrated independently, with care for all the small stuff around it (e.g. resource groups, networking, IAM).
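One cheap way to keep a "good enough" inventory honest is to compare the list you think you have against whatever the provider actually reports. The sketch below assumes you have already exported both lists as simple (kind, name) pairs. The `Resource` type, the `diff_inventory` helper, and the sample resource names are all hypothetical, and no real cloud API is called.

```python
from typing import Iterable, NamedTuple


class Resource(NamedTuple):
    kind: str  # e.g. "database", "service-account" (illustrative labels)
    name: str


def diff_inventory(documented: Iterable[Resource], discovered: Iterable[Resource]) -> dict:
    """Compare the inventory on paper with what is actually deployed."""
    documented_set, discovered_set = set(documented), set(discovered)
    return {
        # Running in the cloud, but nobody wrote it down.
        "undocumented": sorted(discovered_set - documented_set),
        # Written down, but no longer exists.
        "stale": sorted(documented_set - discovered_set),
    }


# Hypothetical sample data, standing in for a spreadsheet and a provider export.
on_paper = [Resource("database", "orders-db"), Resource("cache", "sessions")]
in_cloud = [Resource("database", "orders-db"), Resource("service-account", "old-ci-bot")]

report = diff_inventory(on_paper, in_cloud)
print("Undocumented:", report["undocumented"])
print("Stale:", report["stale"])
```

The "undocumented" bucket is where the surprises tend to live: things someone clicked into existence in the Console years ago.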

Aggressive Timelines are Key

Our COO decided that we had to be done with the migration by September 25th for some business reason. That gave us under 3 months to complete the project. Frankly, without that arbitrary target, we’d still be “multi-cloud” and struggling to get pieces over the line many quarters later.

Having an aggressive target date helped us identify two critical “all hands on deck” weekends, which became our key milestones. The first was getting a working Demo environment. Two weekends after that, we’d move the Production environment.

Those weekends were predictably terrible: 18-hour days, non-stop War Rooms, and unexpected emergencies. Since we had the next stage scheduled for the following weekend and business resuming on Monday, we just kept “failing forward” instead of responsibly reverting.

As long as things were 90% working by start of business on Monday, we explained the problems away to the COO and resolved the remaining 10% quietly during the week.

Reset your IAM and Subscriptions

My favourite part of the transition to Azure was updating our access control lists and permissions (we went to a group-based permission model after years of adding trusted devs as Editors or Owners on individual pieces), and taking advantage of the Subscription-level segregation in Azure.

Subscriptions are one level higher than Projects in GCP. I created Subscriptions for Development, Production, and Integrations, which made cost analysis for those budget items much easier. It was easy to lock down Production access since it didn’t impact the design of the Development and Integrations resources.
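To make the group-based model concrete, here is a minimal sketch of the idea: people belong to groups, and groups (never individuals) are assigned a role per Subscription. The group names, user names, and specific assignments are invented for illustration and are not our actual setup. The role names borrow Azure’s built-in Owner / Contributor / Reader labels, but nothing here calls Azure.

```python
# Hypothetical group membership: people are only ever added to groups.
GROUP_MEMBERS = {
    "platform-admins": {"alice"},
    "developers": {"alice", "bob"},
    "integration-partners": {"carol"},
}

# Hypothetical role assignments: one mapping of group -> role per Subscription.
ROLE_ASSIGNMENTS = {
    "Production": {"platform-admins": "Owner", "developers": "Reader"},
    "Development": {"developers": "Contributor"},
    "Integrations": {"developers": "Contributor", "integration-partners": "Reader"},
}


def roles_for(user: str, subscription: str) -> set:
    """Return every role a user holds in a subscription via group membership."""
    assignments = ROLE_ASSIGNMENTS.get(subscription, {})
    return {
        role
        for group, role in assignments.items()
        if user in GROUP_MEMBERS.get(group, set())
    }


print("bob in Production:", roles_for("bob", "Production"))
print("bob in Development:", roles_for("bob", "Development"))
```

Because access hangs off the group rather than the person, tightening Production means editing one mapping, and Development and Integrations are untouched.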

Hope that helps someone! I’m now a reluctant fan of Microsoft Azure. More than anything, moving clouds convinced me of the importance of good Infrastructure and DevOps talent in modern software development.