Availability

Ensuring high availability in a system means minimizing downtime and maintaining functionality even if components fail. One common method is to implement a failover mechanism for the database using read replicas. This ensures that if the primary database becomes unavailable, a replica can take over as the new primary.

Step 1: Designing a Failover Mechanism

A failover mechanism involves the following components:

Primary Database: The main database instance handling both read and write operations.
Read Replicas: Secondary database instances synchronized with the primary, typically used for read operations but capable of being promoted to primary during failover.
Failover Manager: A monitoring and management system (e.g., Amazon RDS Multi-AZ, Patroni for PostgreSQL, Orchestrator for MySQL, or custom scripts) that detects failures and switches traffic to a replica.

Example Architecture:

Primary Database: db-primary
Read Replica 1: db-replica-1
Read Replica 2: db-replica-2
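
To make the roles above concrete, the following is a minimal in-memory sketch of the example architecture. The `Node` class, the `fail_over` helper, the lag figures, and the rule "promote the healthy replica with the least lag" are illustrative assumptions, not part of any particular failover tool; a real failover manager would also fence the old primary and repoint the remaining replicas.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    role: str            # "primary", "replica", or "failed"
    healthy: bool = True
    lag_bytes: int = 0   # replication lag; only meaningful for replicas


def pick_promotion_candidate(nodes: List[Node]) -> Optional[Node]:
    """Return the healthy replica with the least lag, or None if there is none."""
    replicas = [n for n in nodes if n.role == "replica" and n.healthy]
    return min(replicas, key=lambda n: n.lag_bytes, default=None)


def fail_over(nodes: List[Node]) -> Optional[Node]:
    """Return the node that should act as primary, promoting a replica if needed."""
    primary = next((n for n in nodes if n.role == "primary"), None)
    if primary is not None and primary.healthy:
        return primary                      # primary is fine: nothing to do
    candidate = pick_promotion_candidate(nodes)
    if candidate is None:
        return None                         # no healthy replica left to promote
    if primary is not None:
        primary.role = "failed"
    candidate.role = "primary"
    return candidate


# Hypothetical state of the example cluster (lag values are made up).
cluster = [
    Node("db-primary", "primary"),
    Node("db-replica-1", "replica", lag_bytes=4096),
    Node("db-replica-2", "replica", lag_bytes=512),
]
```

Marking `db-primary` as unhealthy and calling `fail_over(cluster)` would select `db-replica-2` here, because it has the smaller assumed lag.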

Step 2: Implementing Read Replicas

Using PostgreSQL:

Set Up Primary Database:
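- Configure the primary database to allow replication. This also requires a role with the REPLICATION attribute (replication_user in the commands that follow) and a matching replication entry in pg_hba.conf.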
- Edit postgresql.conf:
```
wal_level = replica
max_wal_senders = 3
max_replication_slots = 3
```
- Restart PostgreSQL:
```
sudo systemctl restart postgresql
```

Set Up a Replica:

Clone the primary database:

pg_basebackup -h <primary_host> -D /var/lib/postgresql/replica -U replication_user -Fp -Xs -P

Configure the replica:

Edit postgresql.conf:
```
hot_standby = on
```

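Create a recovery.conf file (PostgreSQL 11 and earlier; from version 12 onward, recovery.conf is no longer supported, so create an empty standby.signal file in the data directory and set primary_conninfo in postgresql.conf instead):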

standby_mode = 'on'
primary_conninfo = 'host=<primary_host> port=5432 user=replication_user password=<password>'
trigger_file = '/tmp/failover.trigger'

Start the Replica:
```
sudo systemctl start postgresql
```

Using AWS RDS:

Launch a primary RDS instance.

Create read replicas via the AWS Console or CLI:

aws rds create-db-instance-read-replica \
    --db-instance-identifier db-replica-1 \
    --source-db-instance-identifier db-primary \
    --region us-east-1

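Enable Multi-AZ for automatic failover (this maintains a separate standby in another Availability Zone; RDS fails over to that standby, not to a read replica):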
- In the AWS Console, choose the primary database.
- Enable Multi-AZ in the settings.

Step 3: Implementing Failover

Manual Failover Testing:

Simulate Primary Database Failure:
- Stop the primary database to simulate downtime:
```
sudo systemctl stop postgresql
```
Promote a Replica:
- Create the trigger file on the replica to promote it:
```
touch /tmp/failover.trigger
```
- Alternatively, use pg_ctl:
```
pg_ctl promote -D /var/lib/postgresql/replica
```
- For AWS RDS:
  - Use the AWS Console or CLI to promote a read replica:
    aws rds promote-read-replica --db-instance-identifier db-replica-1
Update Application Configuration: Update the application’s database connection string to point to the new primary.
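
For self-managed PostgreSQL, one way to avoid editing the connection string by hand after every failover is a multi-host connection URI. The sketch below assumes the application uses a libpq-based driver (libpq 10 or later), which accepts several hosts and the `target_session_attrs=read-write` parameter so that the client skips servers that are still read-only standbys. The database name `appdb` and user `app` are hypothetical.

```python
from typing import Sequence
from urllib.parse import quote


def build_dsn(hosts: Sequence[str], dbname: str, user: str, port: int = 5432) -> str:
    """Build a libpq multi-host URI that only accepts a writable server."""
    host_list = ",".join(f"{h}:{port}" for h in hosts)
    return (
        f"postgresql://{quote(user)}@{host_list}/{quote(dbname)}"
        "?target_session_attrs=read-write"
    )


# Hypothetical usage with the host names from the example architecture.
dsn = build_dsn(["db-primary", "db-replica-1", "db-replica-2"], "appdb", "app")
print(dsn)
```

With such a URI the driver tries each host in order and keeps the first one that accepts writes, so a promoted replica is picked up on the next connection attempt.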

Automatic Failover:

Monitoring Tools:
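- Use tools like Patroni, repmgr, or AWS RDS Multi-AZ for automatic failover; a connection pooler such as PgBouncer can sit in front of them to redirect clients but does not perform failover itself.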
- These tools monitor the health of the primary database and switch to a replica if a failure is detected.
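
The health monitoring described here can be sketched as a small failure detector. This is an illustration of the general idea, not how Patroni or RDS is implemented: the `FailureDetector` class, the injected `probe` callable, and the threshold of three consecutive failures are all assumptions. Requiring several failed probes in a row is a common way to avoid failing over on a single dropped packet.

```python
from typing import Callable


class FailureDetector:
    """Declare the primary down only after N consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool], threshold: int = 3):
        self.probe = probe              # returns True when the primary answers
        self.threshold = threshold
        self.consecutive_failures = 0

    def tick(self) -> bool:
        """Run one probe; return True when failover should be triggered."""
        try:
            ok = self.probe()
        except Exception:
            ok = False                  # a probe that raises counts as a failure
        if ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

In practice `tick()` would be called on a fixed interval, and the probe would open a connection to the primary and run a trivial query.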

Configure DNS or Load Balancer:

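Use a load balancer (e.g., HAProxy) to route traffic to the current primary. Note that a plain TCP check only confirms that the port is open, not that the node is the primary, so the backup server must already be promoted before it can accept writes.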

Example HAProxy configuration:

backend db_cluster
    mode tcp
    option tcp-check
    server db-primary 192.168.1.1:5432 check
    server db-replica-1 192.168.1.2:5432 check backup
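
Because a TCP check cannot tell a primary from a standby, routing logic often asks PostgreSQL directly. The sketch below uses the built-in `pg_is_in_recovery()` function, which returns false on a primary and true on a standby. The `find_primary` helper and the `run_query` callable (which would wrap a real database driver) are hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable, Optional

# pg_is_in_recovery() is false on a primary and true on a standby.
ROLE_QUERY = "SELECT pg_is_in_recovery()"


def find_primary(hosts: Iterable[str],
                 run_query: Callable[[str, str], bool]) -> Optional[str]:
    """Return the first host that reports it is not in recovery, else None."""
    for host in hosts:
        try:
            in_recovery = run_query(host, ROLE_QUERY)
        except Exception:
            continue                    # unreachable host: try the next one
        if not in_recovery:
            return host
    return None
```

A health-check script built on this idea could be exposed to the load balancer so that only the node currently reporting itself as primary is marked up.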

Step 4: Observing and Testing Availability

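Simulate Increasing Traffic: Use a script or Apache Benchmark (ab) to send requests. By default ab issues GET requests, so exercising writes requires a POST body (-p with -T) or a separate script: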
```
ab -n 1000 -c 100 http://your-application.com/api
```
Perform Manual Failover:
- Stop the primary database.
- Promote a replica.
- Observe whether the application continues to function with minimal downtime.
Monitor Metrics:
- Query pg_stat_replication on the primary to monitor replication lag:
```
SELECT * FROM pg_stat_replication;
```
- Use cloud monitoring tools like AWS CloudWatch for metrics such as replication lag and instance health.
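
The pg_stat_replication view reports WAL positions as LSN strings rather than as a lag figure. As a rough sketch, assuming PostgreSQL 10 or later (where the columns are named `sent_lsn` and `replay_lsn`), the byte lag can be derived as follows. The helper names are hypothetical; the same subtraction can be done server-side with `pg_wal_lsn_diff`.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    # An LSN is a 64-bit position printed as two 32-bit hexadecimal halves.
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL sent by the primary but not yet replayed on the replica."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)


# Made-up sample values in the format returned by pg_stat_replication.
print(lag_bytes("0/3000060", "0/3000000"))
```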

Expected Results

Manual Failover:
- Minimal downtime when switching to a replica.
- Applications may need a restart to reconnect to the new primary.
Automatic Failover:
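- A brief interruption, typically seconds to a few minutes depending on failure-detection timeouts, with no manual intervention.
- Applications continue to operate as long as the failover mechanism is correctly configured and clients reconnect to the new primary.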
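
Whether an application needs a restart often comes down to how it handles a dropped connection. A minimal sketch of reconnecting with exponential backoff is shown below; the `connect_with_retry` helper, the five attempts, and the 0.5 second base delay are assumptions, and `connect` stands in for whatever call the application's database driver provides.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(connect: Callable[[], T], attempts: int = 5,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return connect()
        except Exception:
            if attempt == attempts - 1:
                raise                   # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the behavior can be exercised in tests without actually waiting.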

Takeaways

Read replicas enhance availability and scalability by offloading read operations.
Failover mechanisms (manual or automatic) keep the database available when the primary fails; with asynchronous replication, transactions not yet replicated at the moment of failure can be lost.
Regular testing and monitoring are essential to validate failover configurations.
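
One way to turn a failover test into a number is to probe the application at a fixed interval during the test and compute the longest outage afterwards. The sketch below assumes the probe results have been recorded as (timestamp in seconds, success) pairs; the `longest_outage` helper and the sample data are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Longest time in seconds from the first failed probe of an outage
    to the next successful probe. An outage still open at the end of the
    samples is not counted."""
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp            # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None                 # outage ends
    return longest


# Made-up probe log: one outage of 3 s and one of 1 s.
probe_log = [(0.0, True), (1.0, False), (2.0, False), (4.0, True),
             (5.0, False), (6.0, True)]
print(longest_outage(probe_log))
```

Comparing this figure across manual and automatic failover runs gives a concrete basis for the expected results listed earlier.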