
Availability

Ensuring high availability in a system means minimizing downtime and maintaining functionality even if components fail. One common method is to implement a failover mechanism for the database using read replicas. This ensures that if the primary database becomes unavailable, a replica can take over as the new primary.


Step 1: Designing a Failover Mechanism

A failover mechanism involves the following components:

  1. Primary Database: The main database instance handling both read and write operations.
  2. Read Replicas: Secondary database instances synchronized with the primary, typically used for read operations but capable of being promoted to primary during failover.
  3. Failover Manager: A monitoring and management system (e.g., Amazon RDS Multi-AZ, Patroni for PostgreSQL, Orchestrator for MySQL, or custom scripts) that detects failures and switches traffic to a replica.
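
To make the failover manager's role concrete, here is a minimal sketch of its detection logic. It assumes a policy that is not stated above: the primary is treated as failed only after several consecutive failed health checks, so that a single dropped probe does not trigger a promotion. The class name, the check callable, and the threshold of 3 are all hypothetical.

```python
from typing import Callable


class FailureDetector:
    """Declares the primary failed only after N consecutive failed checks."""

    def __init__(self, check: Callable[[], bool], threshold: int = 3) -> None:
        self.check = check          # hypothetical probe: True if the primary answers
        self.threshold = threshold  # assumed number of consecutive failures
        self.failures = 0

    def poll(self) -> bool:
        """Run one health check; return True when failover should be triggered."""
        if self.check():
            self.failures = 0       # any successful check resets the counter
            return False
        self.failures += 1
        return self.failures >= self.threshold


# Simulated probe results: one success followed by three failures.
results = iter([True, False, False, False])
detector = FailureDetector(lambda: next(results), threshold=3)
print([detector.poll() for _ in range(4)])  # [False, False, False, True]
```

In a real deployment the check callable would open a connection to the primary, and a True result from poll() would start the promotion steps described later in this guide.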

Example Architecture:


Step 2: Implementing Read Replicas

Using PostgreSQL:

  1. Set Up Primary Database:

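    • Configure the primary database to allow replication. The commands below assume a role named replication_user that has the REPLICATION attribute and a matching replication entry in pg_hba.conf.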
    • Edit postgresql.conf:

      wal_level = replica
      max_wal_senders = 3
      max_replication_slots = 3
      
    • Restart PostgreSQL:

      sudo systemctl restart postgresql
      
  2. Set Up a Replica:

    • Clone the primary database:

      pg_basebackup -h <primary_host> -D /var/lib/postgresql/replica -U replication_user -Fp -Xs -P
      
    • Configure the replica:

      • Edit postgresql.conf:

        hot_standby = on
        
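      • Create a recovery.conf file (PostgreSQL 11 and earlier only; from version 12 onward recovery.conf is no longer supported, so put primary_conninfo in postgresql.conf and create an empty standby.signal file in the data directory instead):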

        standby_mode = 'on'
        primary_conninfo = 'host=<primary_host> port=5432 user=replication_user password=<password>'
        trigger_file = '/tmp/failover.trigger'
        
  3. Start the Replica:

    sudo systemctl start postgresql
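
Once the replica is running, it is worth confirming that it really started as a standby. The sketch below runs PostgreSQL's pg_is_in_recovery() function, which returns true on a standby and false on a primary. The function name is_standby is hypothetical, and conn is assumed to be any DB-API 2.0 connection to the server (for example, one opened with psycopg2).

```python
def is_standby(conn) -> bool:
    """Return True if the server behind `conn` is running as a standby.

    `conn` is assumed to be a DB-API 2.0 connection to PostgreSQL.
    """
    cur = conn.cursor()
    try:
        # pg_is_in_recovery() is true while the server is replaying WAL
        # from a primary, i.e. while it has not been promoted.
        cur.execute("SELECT pg_is_in_recovery()")
        (in_recovery,) = cur.fetchone()
        return bool(in_recovery)
    finally:
        cur.close()
```

The same check is useful after a failover: a promoted replica should report False.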
    

Using AWS RDS:

  1. Launch a primary RDS instance.

  2. Create read replicas via the AWS Console or CLI:

    aws rds create-db-instance-read-replica \
        --db-instance-identifier db-replica-1 \
        --source-db-instance-identifier db-primary \
        --region us-east-1
    
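  3. Enable Multi-AZ for automatic failover (RDS then maintains a standby in another Availability Zone, separate from any read replicas, and fails over to it automatically):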

    • In the AWS Console, choose the primary database.
    • Enable Multi-AZ in the settings.

Step 3: Implementing Failover

Manual Failover Testing:

  1. Simulate Primary Database Failure:

    • Stop the primary database to simulate downtime:

      sudo systemctl stop postgresql
      
  2. Promote a Replica:

    • Create the trigger file on the replica to promote it:

      touch /tmp/failover.trigger
      
    • Alternatively, use pg_ctl:

      pg_ctl promote -D /var/lib/postgresql/replica
      
    • For AWS RDS:

      • Use the AWS Console or CLI to promote a read replica:

        aws rds promote-read-replica \
            --db-instance-identifier db-replica-1
        
  3. Update Application Configuration: Update the application’s database connection string to point to the new primary.
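
One way to avoid editing the connection string by hand after every failover is to list several hosts in it. libpq (PostgreSQL 10 and later) accepts a comma-separated host list and, with target_session_attrs=read-write, skips servers that are still read-only standbys. The sketch below only builds such a URI. The helper name build_dsn, the database name appdb, and the user app_user are hypothetical, and the password is deliberately left out on the assumption that it comes from a password file or the environment.

```python
from typing import Sequence, Tuple


def build_dsn(hosts: Sequence[Tuple[str, int]], dbname: str, user: str) -> str:
    """Build a libpq multi-host URI that only accepts a writable server."""
    if not hosts:
        raise ValueError("at least one host is required")
    host_list = ",".join(f"{host}:{port}" for host, port in hosts)
    # target_session_attrs=read-write makes libpq try the next host in the
    # list when the current one is a read-only standby.
    return f"postgresql://{user}@{host_list}/{dbname}?target_session_attrs=read-write"


# Hypothetical addresses for the primary and one replica.
print(build_dsn([("192.168.1.1", 5432), ("192.168.1.2", 5432)], "appdb", "app_user"))
```

With a URI like this, a client that reconnects after the replica has been promoted should reach the new primary without a configuration change, provided the driver is built on libpq.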


Automatic Failover:

  1. Monitoring Tools:

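    • Use tools like Patroni, repmgr, or AWS RDS Multi-AZ for automatic failover. PgBouncer is a connection pooler: it can help redirect clients after a failover but does not detect failures or promote replicas by itself.
    • These tools monitor the health of the primary database and promote a replica if a failure is detected.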
  2. Configure DNS or Load Balancer:

    • Use a load balancer (e.g., HAProxy) to route traffic to the current primary.
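    • Example HAProxy configuration (the backup server receives traffic only after the primary's health check fails, and it must be promoted before it can accept writes):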

      backend db_cluster
          mode tcp
          option tcp-check
          server db-primary 192.168.1.1:5432 check
          server db-replica-1 192.168.1.2:5432 check backup
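
When more than one replica is available, an automatic failover tool has to decide which one to promote. The sketch below illustrates one plausible rule, assumed here for illustration and not taken from any particular tool: ignore unhealthy replicas and pick the one with the least replication lag, since it stands to lose the least data. The Replica type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Replica:
    name: str
    healthy: bool
    lag_bytes: int  # how far the replica trails the primary, in bytes of WAL


def pick_promotion_candidate(replicas: Iterable[Replica]) -> Optional[Replica]:
    """Return the healthy replica with the least lag, or None if none is healthy."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda r: r.lag_bytes)


candidates = [
    Replica("db-replica-1", healthy=True, lag_bytes=4096),
    Replica("db-replica-2", healthy=True, lag_bytes=128),
    Replica("db-replica-3", healthy=False, lag_bytes=0),
]
print(pick_promotion_candidate(candidates).name)  # db-replica-2
```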
      

Step 4: Observing and Testing Availability

  1. Simulate Increasing Traffic: Use a script or ApacheBench (ab) to send requests to endpoints that read from and write to the database:

    ab -n 1000 -c 100 http://your-application.com/api
    
  2. Perform Manual Failover:

    • Stop the primary database.
    • Promote a replica.
    • Observe whether the application continues to function with minimal downtime.
  3. Monitor Metrics:

    • Query the pg_stat_replication view on the primary to monitor replication lag:

      SELECT * FROM pg_stat_replication;
      
    • Use cloud monitoring tools like AWS CloudWatch for metrics such as replication lag and instance health.
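
The pg_stat_replication view reports WAL positions (for example, sent_lsn and replay_lsn in PostgreSQL 10 and later) as strings such as 0/3000060, so lag has to be derived from them. The sketch below converts two such values into a byte count on the client side. The function names are hypothetical, and the sample LSNs are made up for illustration. On the server, pg_wal_lsn_diff() performs the same subtraction.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '0/3000060' to a byte position.

    The part before the slash holds the high 32 bits and the part after
    it the low 32 bits, both written in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL sent by the primary but not yet replayed on the replica."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)


print(lag_bytes("0/3000060", "0/3000000"))  # 96
```

A value that keeps growing during the load test suggests the replica cannot keep up, which increases the amount of data at risk if it has to be promoted.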


Expected Results


Takeaways