Tuesday, March 13, 2018

Scale an Aurora cluster's writer node

While there is no "autoscaling" for RDS, for adjusting an instance on a schedule, the AWS CLI can be used.  For an Aurora cluster, when you resize the primary node (the writer), AWS fails writes over to one of the readers but never switches that role back (see the AWS doc for details on the failover settings and logic).  This would leave the cluster in a state where the up-sized node becomes a "reader" and one of the smaller nodes would remain a "writer".

In this use-case, I needed to increase the capacity of the writer for a few hours a day for a known ingestion event (the fleet of readers could remain the same size and number), but then decrease the writer afterwards.  The standard "aws rds modify-db-instance" call works as expected, but after scaling the instance, Aurora still leaves a smaller reader in the cluster as the primary (writer).

Below is a script that does the resizing, then waits until the change has taken effect, then switches the "writer" back to the newly resized node.

Example usage:
scriptname.sh db.m4.xlarge
Where the parameter is the new shape to apply.

The cluster ID and node names used can be found in the RDS / Cluster page in the AWS console.

check_for='"PendingModifiedValues": {},'

aws rds modify-db-instance --db-instance-identifier "${primary_node}" --db-instance-class "${instance_size}" --apply-immediately --region "${region}"

echo "Checking status for ${primary_node}..."
until [ "${pending_status}" != "" ]; do
    pending_status=$(aws rds describe-db-instances --db-instance-identifier "${primary_node}" --region "${region}" --output json | grep "${check_for}")
    echo "${primary_node} still pending changes, waiting."
    sleep 10

echo "Failing back to ${primary_node}"
# sometimes it seems pending status is removed but node is not yet ready for failback (likely pending-reboot but not shown in CLI response)
# so if an error occurs, keep trying (there's probably a better way to do this)
while [ $? -ne 0 ]; do
    aws rds failover-db-cluster --db-cluster-identifier "${cluster_id}" --target-db-instance-identifier "${primary_node}" --region "${region}"
    sleep 5

view the gist

This can run as either a cron or part of the ingestion process.

There are likely better ways to actually check for the status.