Zero-Downtime Database Migration at Scale
Two million documents. One month. Zero downtime. Here's how we migrated healthcare documents from Cloudinary to AWS S3.
The Constraint
Healthcare workers don’t stop working because you’re running a migration. Clinicians access patient documents at 3 AM. Lab results get uploaded on weekends. A system that stores and serves two million medical documents — PDFs, imaging files, scanned forms — cannot go down for maintenance. Not for five minutes, not for thirty seconds.
We needed to move every document from Cloudinary to AWS S3, update every reference in our PostgreSQL database, and do it without any user noticing. The timeline was one month. The margin for error was zero.
Why We Migrated
Cloudinary served us well for the first year, but we were hitting limits. Healthcare documents have strict retention and access control requirements. We needed server-side encryption with customer-managed keys, VPC-scoped access, and audit logging that met HIPAA standards. S3 with proper IAM policies gave us all of that. Cloudinary didn’t.
Cost was also a factor. At two million documents and growing 15% month-over-month, the storage and transformation costs were becoming significant. S3 was roughly 60% cheaper for our access patterns.
The Dual-Write Pattern
The core strategy was a dual-write pattern. From the moment we flipped the migration flag, every new document upload would go to both Cloudinary and S3. Reads would continue from Cloudinary. This gave us a clean cutover point: once all historical documents were migrated and verified, we’d switch reads to S3 and stop writing to Cloudinary.
```typescript
interface StorageProvider {
  upload(key: string, buffer: Buffer, metadata: DocumentMeta): Promise<string>;
  getSignedUrl(key: string, expiresIn: number): Promise<string>;
  delete(key: string): Promise<void>;
}

class DualWriteStorageService {
  private primary: StorageProvider;   // Cloudinary during migration
  private secondary: StorageProvider; // S3 during migration
  private migrationActive: boolean;

  constructor(
    private cloudinary: StorageProvider,
    private s3: StorageProvider,
    private db: Database,
    private featureFlags: FeatureFlagService
  ) {
    this.migrationActive = featureFlags.isEnabled('dual-write-storage');
    this.primary = cloudinary;
    this.secondary = s3;
  }

  async upload(key: string, buffer: Buffer, meta: DocumentMeta): Promise<string> {
    // Always write to primary
    const primaryUrl = await this.primary.upload(key, buffer, meta);

    if (this.migrationActive) {
      try {
        // Write to secondary, but don't block on failure
        const s3Key = this.toS3Key(key, meta);
        await this.secondary.upload(s3Key, buffer, meta);
        await this.db.query(
          `UPDATE documents SET s3_key = $1, s3_migrated = true WHERE storage_key = $2`,
          [s3Key, key]
        );
      } catch (err) {
        // Log, don't throw. Primary write succeeded -- that's what matters.
        logger.error('Secondary write failed', { key, error: err.message });
        await this.queueForRetry(key, meta);
      }
    }

    return primaryUrl;
  }

  async getUrl(key: string): Promise<string> {
    const readFromS3 = this.featureFlags.isEnabled('read-from-s3');

    if (readFromS3) {
      const doc = await this.db.query(
        `SELECT s3_key, s3_verified FROM documents WHERE storage_key = $1`,
        [key]
      );
      if (doc.s3_key && doc.s3_verified) {
        return this.s3.getSignedUrl(doc.s3_key, 3600);
      }
      // Fallback to Cloudinary if S3 copy isn't verified yet
      logger.warn('S3 read fallback to Cloudinary', { key });
    }

    return this.primary.getSignedUrl(key, 3600);
  }

  private toS3Key(key: string, meta: DocumentMeta): string {
    return `documents/${meta.organizationId}/${meta.patientId}/${key}`;
  }

  private async queueForRetry(key: string, meta: DocumentMeta): Promise<void> {
    await this.db.query(
      `INSERT INTO migration_retry_queue (storage_key, metadata, attempts)
       VALUES ($1, $2, 0) ON CONFLICT (storage_key) DO UPDATE SET attempts = 0`,
      [key, JSON.stringify(meta)]
    );
  }
}
```
The critical detail: the secondary write is fire-and-forget from the user’s perspective. If S3 upload fails, we log it, queue it for retry, and move on. The user never sees an error. We ran a background worker that retried failed secondary writes every five minutes.
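The retry worker itself can be sketched in a few lines. This is an illustrative shape rather than our production code: the row type, the `MAX_ATTEMPTS` cap, and the injected `uploadToS3` function are assumptions for the example.

```typescript
// Hypothetical row shape mirroring the migration_retry_queue table.
interface RetryRow {
  storageKey: string;
  metadata: string;
  attempts: number;
}

// Assumed cap; rows that exceed it are left for manual review.
const MAX_ATTEMPTS = 10;

// One pass over the retry queue: re-attempt the S3 upload for each row,
// bumping the attempt counter on failure instead of throwing.
async function drainRetryQueue(
  queue: RetryRow[],
  uploadToS3: (key: string, metadata: string) => Promise<void>
): Promise<{ succeeded: string[]; failed: string[] }> {
  const succeeded: string[] = [];
  const failed: string[] = [];
  for (const row of queue) {
    if (row.attempts >= MAX_ATTEMPTS) continue; // give up, escalate elsewhere
    try {
      await uploadToS3(row.storageKey, row.metadata);
      succeeded.push(row.storageKey);
    } catch {
      row.attempts += 1;
      failed.push(row.storageKey);
    }
  }
  return { succeeded, failed };
}
```

A scheduler (cron, setInterval, a queue consumer) would invoke this every five minutes, deleting the rows that succeeded.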
Backfill: The Boring but Dangerous Part
Dual-write handles new documents. For the two million existing documents, I wrote a backfill worker that processed records in batches of 500. Each batch: fetch the document from Cloudinary, upload it to S3, update the database record, and verify that the checksums match.
The backfill ran at a throttled rate — about 50 documents per second — to avoid tripping Cloudinary’s rate limits or saturating our network. At that rate, two million documents took roughly 11 hours. I ran it three times: once for the initial copy, once for documents created during the first run, and a final sweep for stragglers.
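The loop above can be sketched as follows. This is a minimal illustration, assuming keyset pagination over the documents table and injected fetch/copy functions — the names and row shape are hypothetical, not the real worker.

```typescript
// Hypothetical row shape for the backfill cursor.
interface BackfillDoc {
  id: number;
  storageKey: string;
}

const BATCH_SIZE = 500;      // batch size from the post
const DOCS_PER_SECOND = 50;  // throttle rate from the post

// Walks the table in id order, copying each document and pausing between
// copies to hold the overall rate near DOCS_PER_SECOND.
async function backfill(
  fetchBatch: (afterId: number, limit: number) => Promise<BackfillDoc[]>,
  copyToS3: (doc: BackfillDoc) => Promise<void>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms))
): Promise<number> {
  let cursor = 0;
  let copied = 0;
  for (;;) {
    const batch = await fetchBatch(cursor, BATCH_SIZE);
    if (batch.length === 0) return copied; // caught up
    for (const doc of batch) {
      await copyToS3(doc);
      copied += 1;
      await sleep(1000 / DOCS_PER_SECOND); // ~20ms between documents
    }
    cursor = batch[batch.length - 1].id; // keyset pagination, no OFFSET scans
  }
}
```

The keyset cursor is why re-running it for stragglers is cheap: each pass only touches rows it hasn’t already advanced past.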
Verification and Consistency
I didn’t trust the migration. You shouldn’t trust any migration. We ran a verification job that compared every document’s MD5 checksum between Cloudinary and S3. Out of two million documents, 847 had mismatches — mostly caused by Cloudinary’s automatic format optimization that slightly altered file contents on retrieval. We re-downloaded the originals through their raw URL endpoint and re-uploaded. The second verification pass came back clean.
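The per-document check reduces to comparing digests of both copies. A sketch using Node’s built-in crypto module, with the two fetchers as assumed helpers:

```typescript
import { createHash } from 'crypto';

function md5(buf: Buffer): string {
  return createHash('md5').update(buf).digest('hex');
}

// Returns true when the Cloudinary original and the S3 copy are
// byte-for-byte identical (by MD5 digest).
async function verifyDocument(
  key: string,
  fetchFromCloudinary: (key: string) => Promise<Buffer>,
  fetchFromS3: (key: string) => Promise<Buffer>
): Promise<boolean> {
  const [original, copy] = await Promise.all([
    fetchFromCloudinary(key),
    fetchFromS3(key),
  ]);
  return md5(original) === md5(copy);
}
```

Note that the Cloudinary fetch has to hit the raw URL endpoint, as described above — a digest of a transformed response will never match the stored original.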
We also ran a shadow-read test for two weeks. Every read request fetched from both providers and compared response times and checksums. This caught three edge cases where URL signing behaved differently between the providers for documents with special characters in their keys.
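The essential property of a shadow read is that the comparison never slows or fails the user’s request: the primary response is served immediately, and the secondary is checked out of band. A rough sketch of that shape, with hypothetical reader and recorder callbacks:

```typescript
// Serves the primary result; compares the secondary copy asynchronously
// and records any divergence (mismatch or secondary failure).
async function shadowRead(
  key: string,
  readPrimary: (key: string) => Promise<Buffer>,
  readSecondary: (key: string) => Promise<Buffer>,
  recordDivergence: (key: string) => void
): Promise<Buffer> {
  const primary = await readPrimary(key);

  // Fire-and-forget: the user's response never waits on this.
  readSecondary(key)
    .then((secondary) => {
      if (!primary.equals(secondary)) recordDivergence(key);
    })
    .catch(() => recordDivergence(key));

  return primary;
}
```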
The Switchover
The actual cutover was anticlimactic, which is exactly what you want. We flipped the read-from-s3 feature flag at 2 AM on a Tuesday. The fallback logic meant that any document not yet verified in S3 would still serve from Cloudinary. We watched dashboards for an hour. Error rates stayed flat. Latency dropped by 40ms on average because S3 was in the same region as our application servers.
Over the next week, we confirmed zero Cloudinary reads in our logs. Then we disabled dual-write. Then, after a 30-day hold period, we deleted the Cloudinary copies.
Rollback Strategy
At every stage, rollback was one feature flag flip away. Dual-write on or off. Read from S3 or Cloudinary. The database kept both keys for every document until we were fully confident. This is the real value of the dual-write pattern: it’s not just a migration strategy, it’s a safety net that gives you weeks of runway to catch problems.
What I’d Do Differently
I’d invest in better observability from day one. We built dashboards after the first verification failure, but we should have had per-document migration state tracking from the start. A simple state machine — pending, copied, verified, active — with a dashboard showing progress would have saved hours of ad-hoc querying.
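A state machine that small is cheap to sketch. The four states come from the paragraph above; the allowed transitions are my own framing, including a back-edge from copied to pending for checksum mismatches:

```typescript
type MigrationState = 'pending' | 'copied' | 'verified' | 'active';

// Legal transitions per state. The copied -> pending edge covers the
// "re-download and re-upload on checksum mismatch" case from the post.
const TRANSITIONS: Record<MigrationState, MigrationState[]> = {
  pending: ['copied'],
  copied: ['verified', 'pending'],
  verified: ['active'],
  active: [],
};

function advance(from: MigrationState, to: MigrationState): MigrationState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

With a state column like this on every row, “how far along are we?” becomes a single GROUP BY instead of ad-hoc querying.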
I’d also run the shadow-read test from the beginning of the backfill, not after it. Those three edge cases with special characters could have been caught in week 1 instead of week 3.
The migration shipped on time, with zero downtime, and zero data loss. Boring outcomes are the best outcomes in infrastructure work.