Skip to main content

Command Palette

Search for a command to run...

Orchestrating a Zero-Downtime Migration from AWS RDS to GCP Cloud SQL: The Engineering Deep Dive

Updated
12 min read

Orchestrating a Zero-Downtime Migration from AWS RDS to GCP Cloud SQL: The Engineering Deep Dive

Migrating a production database is often described as "changing the engines on a plane while it's in flight." Last week, we successfully moved our primary transactional database from AWS RDS to Google Cloud SQL with exactly zero seconds of application downtime.

This post goes beyond the high-level "how-to" and dives into the specific scripts, configuration blocks, and validation protocols we used to ensure 100% data integrity during the transition.


1. Executive Summary: The Business Case for Migration

In the lifecycle of a high-growth startup, architectural shifts are inevitable. Our primary motivation for moving from AWS RDS to Google Cloud SQL wasn't just cost-optimisation—it was performance and the desire for better integration with Google's managed data services like BigQuery for real-time analytics.

The challenge was immense: our transactional database handles over 2,000 writes per second during peak hours. Any downtime would result in immediate financial loss and a significant hit to our customer trust. We calculated that even a 5-minute maintenance window could result in thousands of failed transactions. Therefore, a "Zero-Downtime" migration wasn't a nice-to-have; it was a hard requirement.

The Metrics of Success

Our goal was defined by four key KPIs:

  1. Zero Service Downtime: Users should not see a maintenance page at any point.
  2. Sub-second Data Lag: The replication lag between AWS and GCP must be under 1 second at the moment of cutover.
  3. 100% Data Integrity: No missing transactions, no duplicate entries.
  4. Transparent Rollback: A plan to return to AWS in under 30 seconds if any anomalies were detected in the first 60 minutes.

2. Phase 0: The Planning Stage. Calculating SLA Risk.

Planning a migration of this scale requires a meticulous analysis of the failure points. We began by auditing our existing SQL traffic to identify the heaviest tables and the most frequent write patterns.

Identifying Potential Bottlenecks

We found that our transactions table, which grows by several gigabytes every day, would be the most difficult to keep in sync. The Change Data Capture (CDC) mechanism relies on reading transaction logs (WAL in Postgres, Binlogs in MySQL). If our write volume exceeded the replication bandwidth provided by our interconnection, the lag would never hit zero.

SLA Calculation

We mapped out our dependencies. Our microservices rely on a shared database instance, meaning we couldn't just migrate one service at a time—it had to be a coordinated "all-in" cutover for the data layer.


3. Phase 1: Networking and Security

Networking is the foundation of any cross-cloud migration. Without a stable, low-latency "bridge," the migration is doomed before the first byte is copied.

Deep Dive into IPsec/VPN Configuration

We established a site-to-site VPN between AWS and GCP. While a Dedicated Interconnect/Direct Connect would have been faster, the timeline required a faster setup. We used GCP Cloud VPN and AWS Transit Gateway.

MTU Optimization: The Silent Performance Killer

One of the most common mistakes in cross-cloud replication is ignoring the Maximum Transmission Unit (MTU). Standard Ethernet has an MTU of 1500 bytes. However, the VPN overhead reduces this. We tuned our network interfaces to an MTU of 1390 to prevent packet fragmentation, which significantly increased our replication throughput during the initial sync.

BGP Routing

We used Dynamic Routing with BGP (Border Gateway Protocol) to ensure that if one VPN tunnel failed, traffic would instantly failover to the secondary tunnel without manual intervention.


4. Phase 2: Preparing the Source (AWS RDS)

Before GCP can see your data, RDS must be configured to allow logical replication. This requires a custom Parameter Group.

Detailed Parameter Changes (PostgreSQL)

Ensure these are set in your AWS console:

  • rds.logical_replication = 1
  • max_replication_slots = 20 (Higher than default to allow for multiple test runs)
  • max_wal_senders = 20
  • wal_sender_timeout = 60s

The Logical Decoding Challenge

Logical replication in Postgres works by sending chunks of the Write Ahead Log (WAL) to the subscriber. This process consumes CPU. We monitored our RDS instance carefully during the enablement phase to ensure the extra overhead didn't impact existing application performance.

Logical Replication Slots

A replication slot ensures that the source doesn't delete WAL files until they have been safely received by the target. This protects you during a networking outage but can lead to disk space exhaustion on RDS if the target goes down for too long. We implemented a monitoring script to alert us if the pg_replication_slots size exceeded 100GB.


5. Phase 3: Initial Data Load (The Baseline)

The first step of the sync is the "Full Dump and Restore." However, with a database over 500GB, a standard dump/restore is too slow.

Chunking Strategies

Google DMS (Database Migration Service) handles the initial load by performing a parallel snapshot. We found that for our largest tables, we needed to partition the sync process.

Challenges with Large Tables

One specific table, audit_logs, had over 100 million rows. To speed up the initial load, we temporarily disabled non-essential indexes on the target (GCP) and re-enabled them after the bulk copy was complete. This "index-less" copy was 4x faster.


6. Phase 4: Continuous Change Data Capture (CDC)

Once the initial load is complete, the migration job enters the "Synchronizing" state. This is where the real magic happens.

Replication Lag Monitoring

The 'Replication Lag' metric became our most important dashboard. It represents the time difference between a commit on AWS and its arrival on GCP.

Handling High-Throughput Writes

During peak hours, our lag would spike to 30 seconds. To counteract this, we optimized the GCP instance's write throughput by switching to Local SSDs temporarily during the migration and ensuring the db-custom-16-61440 tier had enough IOPS to keep up with the incoming WAL stream.

WAL Buffering Logic

Google DMS effectively reads the WAL stream from AWS and converts it into SQL commands to run against Cloud SQL. We optimized the concurrent tasks in DMS to ensure that different tables were being updated in parallel, preventing a single large table from blocking the entire replication stream.


7. Phase 5: Testing and Data Integrity

You cannot trust the "Success" message in a console. You must verify the data yourself at the row level.

Python: The Multi-Threaded Data Validator

We built a script that calculates checksums for specific time windows of data. For example, it would check every transaction from '2026-04-10 10:00' to '2026-04-10 10:05' across both databases.

import hashlib
import psycopg2
from concurrent.futures import ThreadPoolExecutor

def get_row_checksum(conn_params, table, time_window):
    conn = psycopg2.connect(**conn_params)
    with conn.cursor() as cur:
        # We use a windowed MD5 hash to keep performance high even on billion-row tables
        query = f"SELECT md5(array_to_string(array_agg(t.* ORDER BY id), ',')) FROM {table} t WHERE created_at BETWEEN %s AND %s"
        cur.execute(query, time_window)
        res = cur.fetchone()[0]
        conn.close()
        return res

def verify_table(table, start_time, end_time):
    # AWS and GCP params defined here...
    aws_hash = get_row_checksum(AWS_PARAMS, table, (start_time, end_time))
    gcp_hash = get_row_checksum(GCP_PARAMS, table, (start_time, end_time))

    if aws_hash == gcp_hash:
        print(f"✅ {table} matches for window {start_time} to {end_time}")
    else:
        print(f"❌ MISMATCH IN {table}!")

with ThreadPoolExecutor(max_workers=5) as executor:
    tables = ['transactions', 'user_balance', 'ledger']
    # Execute batch verification...

8. Phase 6: The Cutover Protocol (The "Zero Downtime" Step)

The cutover is the point of no return. This is where your preparations are put to the ultimate test.

Step 1: Maintenance Mode (Wait... but without the downtime)

We didn't show a maintenance page. Instead, we used our application-level Circuit Breaker. For 30 seconds, we queued all incoming write requests in Redis. To the user, the app just felt "slow" for half a minute.

Step 2: Final Sync Verification

We waited for the replication lag to hit 0.0 seconds.

Step 3: Nginx Hot-Swap

As described in previous posts, we shifted the database connection string at the proxy layer. However, we went further: we used pgBouncer to pool our connections. We simply updated the pgBouncer configuration on our app servers to point to the new GCP host.

# pgbouncer.ini update
[databases]
# Old: production = host=aws-rds.xxxx.com port=5432
# New:
production = host=gcp-cloudsql.xxxx.com port=5432

A simple pgbouncer -R (restart/reload) swaps all active connections to the new cloud provider in under 500ms.


9. Phase 7: Post-Migration & Long-Term Monitoring

After the cutover, we kept the AWS RDS instance running in a read-only state for 48 hours.

Rollback Strategy

If any anomaly was detected (e.g. higher latency in the checkout flow), we had a script ready to point pgBouncer back to AWS. Because we had a "Reverse Migration" job running, GCP was actually sending changes BACK to AWS during those 48 hours, ensuring we could rollback without data loss.

Final Validation

After 24 hours of stable production traffic on GCP, we ran a final checksum of the entire database. The result? 100% match.

The move from AWS to GCP has since allowed us to leverage BigQuery's federated queries to run real-time fraud detection without adding any load to our primary production database. The efforts spent on this zero-downtime migration have paid off manifold in architectural flexibility and system reliability.


10. Troubleshooting Real-World Migration Failures

No migration is perfectly smooth. During our staging runs, we encountered several performance-degrading events that would have been catastrophic in production.

Scenario A: The Replication Slot Disk Crisis

Mid-way through our full sync, our AWS VPN tunnel experienced a momentary lapse due to a renegotiation error. While connectivity was restored in under 4 seconds, the RDS replication slot stayed "pinned." This meant that RDS refused to recycle old WAL files. In a high-write environment, our disk usage spiked by 40GB in under an hour. The Fix: We implemented a "Kill Switch" script that would drop the replication slot if the disk space hit 90%, preferring a restart of the migration over a production database crash.

Scenario B: "Zombie" Connections During the Swap

After reloading pgBouncer, we found that nearly 5% of web workers were still trying to send traffic to the old AWS endpoint. The Fix: We had to force-terminate all active sessions on the RDS side using pg_terminate_backend. This ensured that the application's connection pool was forced to re-initiate and pick up the new GCP configuration from pgBouncer.


11. Comparative Performance: AWS RDS vs. GCP Cloud SQL

Post-migration, we ran extensive benchmarks to validate our architectural choice.

Read Latency

On AWS, our typical indexed read latency was ~1.2ms. On GCP Cloud SQL, using the Enterprise Plus tier with its enhanced data cache, we saw this drop to 0.8ms. This 30% improvement had a noticeable downstream effect on our slowest 95th percentile requests.

Provisioning Speed

GCP's ability to "clone" a database instance for testing is significantly faster than AWS's snapshot restoration process. Creating a full-fidelity test clone of our 500GB database takes under 10 minutes on GCP, compared to 45+ minutes on AWS. This has drastically improved our developer velocity for database-heavy feature testing.


12. Future-Proofing the Data layer

By moving to GCP, we are now poised to utilize AlloyDB for even higher throughput workloads in the future. The migration process we developed is engine-agnostic; we could use the same CDC patterns to move from Cloud SQL to AlloyDB with the same zero-downtime promises.

Strategic Advantages

  • Unified Billing: Consolidating our AI workloads (Vertex AI) and database costs on one platform.
  • Improved Management: Google's "Autopilot" features for database maintenance and query optimization have reduced our DevOps overhead by approximately 15%.

Final Checklist for Your Migration

If you are planning your own cross-cloud database move, keep this checklist pinned:

  1. [ ] Connectivity: Run an iperf test between VPCs for 24 hours.
  2. [ ] Networking: Set MTU to 1390 or 1400.
  3. [ ] Source DB: Ensure rds.logical_replication=1.
  4. [ ] Target DB: Temporarily disable non-unique indexes for the initial load.
  5. [ ] Validation: Run multi-threaded checksums on at least 10% of your data.
  6. [ ] Cutover: Use a connection pooler like pgBouncer for millisecond switching.
  7. [ ] Monitoring: Set alerts for replication lag (>5s) and disk space (>80%).

13. Deep Dive: Understanding the CLI Flags

To ensure absolute transparency in our automation, here is a breakdown of every flag used in the Google Cloud SDK commands for Database Migration Service:

  • --location=us-central1: Specifies the region where the migration job and its metadata reside. This should be as geographically close to your source as possible to minimize API latency.
  • --display-name: A human-readable label. In complex environments with hundreds of migrations, naming conventions like [Source]-to-[Target]-[Date] are life-savers.
  • --postgresql-host: The public endpoint or the private IP (if using Cloud VPN). We used the private IP to ensure data never traversed the public internet.
  • --postgresql-port: Default is 5432, but often obfuscated in enterprise environments for security.

14. Frequently Asked Questions (FAQ) for Database Migrations

Q: Why not use a 'One-Click' migration tool? A: Most "one-click" tools abstract away the most critical parts: replication lag monitoring and data validation. In a zero-downtime scenario, you need granular control over the WAL stream, which only a dedicated service like DMS combined with manual orchestration provides.

Q: What is the biggest danger during the promoter phase? A: "Split Brain." This occurs if your application starts writing to the new database while the old one is still accepting writes. Our use of pgBouncer with a forced connection termination (pg_terminate_backend) completely eliminated this risk.

Q: How do you handle schema changes during the migration? A: Absolute rule: Schema Freeze. We implemented a strict lock on all DDL changes (ALTER TABLE, CREATE INDEX, etc.) once the migration job was initialized. Schema changes destabilize the CDC process and can lead to unrecoverable replication errors.

Q: Can I migrate from a newer version of Postgres to an older version of Cloud SQL? A: No. Logical replication generally supports migrating to the same or a higher version. We took this opportunity to upgrade from Postgres 13 on AWS to Postgres 15 on GCP.

Q: How does the "Reverse Migration" work for rollbacks? A: We essentially configured Cloud SQL as the "Publisher" and AWS RDS as the "Subscriber" immediately after the cutover. This way, any new data created on GCP was instantly replayed back on AWS. If we had to rollback, the RDS instance would have been 100% current.


15. The Human Side: Team Roles during the Migration

A successful migration isn't just about code; it's about the humans at the keyboard. We divided our task force into four roles:

  1. The Navigator: Responsible for the timeline and the Go/No-Go decision.
  2. The Infrastructure Lead: Managed the VPN tunnels and the DMS job status.
  3. The Application Lead: Managed the pgBouncer reload and application circuit breakers.
  4. The Quality Assurance Lead: Ran the checksum scripts and verified data parity.

By having clear roles, we prevented the "too many cooks" syndrome during the high-pressure cutover window.


16. Final Conclusion and Future Roadmap

Our journey from AWS RDS to GCP Cloud SQL was more than just a change of scenery; it was a fundamental upgrade to our data tier. The performance gains, the ease of integration with Google's AI ecosystem, and the improved cost modeling have validated every hour of planning we invested.

If you take one thing away from this 3,000-word deep dive, let it be this: Zero downtime is a result of preparing for failure. By assuming the network would lag, the disk would fill, and connections would hang, we built a migration pipeline that could overcome all three.

Happy migrating, fellow engineers. May your replication lag always be zero.