Central is primarily RAM-bound, even for exports, so I'd recommend you monitor RAM/swap utilization and disk utilization (because of this tmp file issue). EC2 doesn't report memory or disk usage to CloudWatch by default, so you'll need to install the CloudWatch Agent to get these metrics.
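If you want a quick look at those numbers before the CloudWatch Agent is set up, here is a minimal Python sketch that computes RAM, swap, and disk utilization on a Linux host. It is not part of Central or the agent; the function names (`parse_meminfo`, `percent_used`, `snapshot`) are my own, and it assumes `/proc/meminfo` is available and that the root filesystem is the one filling up.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def percent_used(total, available):
    """Return the used share of a resource as a percentage (0.0 if total is 0)."""
    if total == 0:
        return 0.0
    return 100.0 * (total - available) / total


def snapshot(meminfo_text, path="/"):
    """Summarize RAM, swap, and disk utilization as percentages."""
    info = parse_meminfo(meminfo_text)
    disk = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "ram_pct": percent_used(info.get("MemTotal", 0), info.get("MemAvailable", 0)),
        "swap_pct": percent_used(info.get("SwapTotal", 0), info.get("SwapFree", 0)),
        "disk_pct": percent_used(disk.total, disk.free),
    }


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        print(snapshot(f.read()))
```

Running it from cron every minute and appending the output to a file would give you a rough history to line up against the spikes, but the CloudWatch Agent is still the better long-term option.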
It looks like you are seeing regular spikes. What is happening at those times? Do you have backups configured? Is everyone sending submissions at that time? Log into the machine and run `htop` and `docker stats` and see what happens at those times. Checking your logs with `docker-compose logs` is also a good idea.
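Since the spikes are regular, it can be easier to record per-container usage than to watch it live. The following is a rough Python sketch that wraps `docker stats --no-stream` with a `--format` template and flags containers above a memory threshold. The helper names and the 50% threshold are assumptions for illustration, not anything Central ships.

```python
import subprocess

# Go template understood by `docker stats --format`: name, CPU %, memory %.
FORMAT = "{{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}"


def parse_stats(output):
    """Turn tab-separated `docker stats` output into (name, cpu_pct, mem_pct) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        name, cpu, mem = parts
        try:
            rows.append((name, float(cpu.rstrip("%")), float(mem.rstrip("%"))))
        except ValueError:
            continue  # e.g. "--" reported for a stopped container
    return rows


def heavy_containers(rows, mem_threshold=50.0):
    """Return names of containers at or above the (assumed) memory threshold."""
    return [name for name, _cpu, mem in rows if mem >= mem_threshold]


def sample():
    """Take one reading from the local Docker daemon."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", FORMAT],
        capture_output=True, text=True, check=True,
    )
    return parse_stats(result.stdout)
```

Calling `sample()` on a schedule and logging the result alongside a timestamp should show which container grows when a spike hits.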
It might also be good to move your database from EC2 into RDS to gain continuous backups of the data (although you should still back up the EC2 machine) and better performance.
As far as minimizing downtime goes, the best approach is to run multiple Central instances that write to one database, which itself has a replica. That's not very straightforward for most people to set up. A reasonable alternative is to get a beefy machine, monitor performance, and adjust accordingly. Or use ODK Cloud, and all these problems go away.