System Load
If you receive any of the following System warnings in Slack, the cause is often a PHP error or an attack.
NOTE: SYS_LOAD_AVG is not an issue until 200%
First, try to identify the cause using top, htop, or both:
- How to use the top command to monitor system processes
- How to use the htop command to monitor system processes
Pressing shift + h will condense the htop view and make it easier to identify a specific site or process causing the issue.
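As a scriptable complement to the htop view, the sketch below counts php-fpm worker processes per pool. It assumes worker process titles follow the "php-fpm: pool <name>" pattern and that the lines come from something like `ps -eo args`. The function name and the sample lines are illustrative only.

```python
import re
from collections import Counter

# Matches worker process titles such as "php-fpm: pool example.com".
POOL_RE = re.compile(r"php-fpm: pool (\S+)")


def count_pool_processes(ps_lines):
    """Count php-fpm worker processes per pool name, busiest pool first."""
    counts = Counter()
    for line in ps_lines:
        match = POOL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common()


# Sample lines for illustration; on a server, pass in the output of
# `ps -eo args` split into lines instead.
sample = [
    "php-fpm: master process (/etc/php/8.1/fpm/php-fpm.conf)",
    "php-fpm: pool example.com",
    "php-fpm: pool example.com",
    "php-fpm: pool example.org",
    "nginx: worker process",
]
print(count_pool_processes(sample))
# prints [('example.com', 2), ('example.org', 1)]
```

A pool with far more workers than the others is a likely candidate for the site causing the load.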
Most SYS_LOAD or SYS_CPU_USAGE alerts are due to an automated attack.
Diagnosing and resolving an automated attack
Most attacks are easily dealt with by banning the offending IPs.
- Use htop to diagnose which site or sites are currently under attack.
  - Sort the table by pressing F6 and selecting a column, or by clicking the column header.
  - Use Shift + H to condense the table and make it easier to recognize the problem site.
  - Often a single site will have multiple php-fpm: pool {{domain.com}} processes running and crashing the server.
  - If the server is down and other sites are impacted, suspending the offending site in GridPane's site settings panel keeps the other sites available while we ban the offending IPs. Having one site down is preferable to multiple clients experiencing downtime.
- Once the offending site is identified, open the site's settings panel in GridPane and check the site access logs and nginx error logs. Often a single IP or a small set of IPs will stand out as the offender.
- If we control the site's DNS via Cloudflare, or the site is managed via Cloudflare for SaaS, log into Cloudflare and add the offending IP to the WAF block rules for that domain (or SaaS).
- Find WAF rules under Domain -> Security -> Security Rules. The first rule is typically the rule containing all blocks.
- Log the IP so it can be added to every domain's WAF ruleset at a later time.
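The log-review step above can be approximated with a short script when the access log is available as text. This is a minimal sketch that assumes the client IP is the first whitespace-separated field on each line, as in nginx's default combined format. The actual log format on a given server may differ, and the function name and sample lines are hypothetical.

```python
from collections import Counter


def top_client_ips(log_lines, limit=10):
    """Return the most frequent client IPs as (ip, request_count) pairs."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            # Assumption: the first field of each line is the client IP.
            counts[fields[0]] += 1
    return counts.most_common(limit)


# Illustrative sample using documentation-only IP ranges.
sample_log = [
    '203.0.113.7 - - [01/Jan/2024:00:00:01 +0000] "POST /xmlrpc.php HTTP/1.1" 200 0',
    '203.0.113.7 - - [01/Jan/2024:00:00:01 +0000] "POST /xmlrpc.php HTTP/1.1" 200 0',
    '203.0.113.7 - - [01/Jan/2024:00:00:02 +0000] "POST /wp-login.php HTTP/1.1" 200 0',
    '198.51.100.4 - - [01/Jan/2024:00:00:03 +0000] "GET / HTTP/1.1" 200 512',
]
print(top_client_ips(sample_log, limit=2))
# prints [('203.0.113.7', 3), ('198.51.100.4', 1)]
```

An IP whose request count dwarfs the rest, especially against login or XML-RPC endpoints, is a reasonable candidate for the WAF block rule.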
Next, if it’s not an attack,
Warning examples
System Load Average
Note that “per core” refers to the system’s load average: 100% is equal to one vCPU, 200% is equal to two vCPUs, and so on.
Monitor after these alerts:
SYS_LOAD_AVG_15 70% CPU Warning {host} {serverIP} 15 Minute Load average has been running at over 70% per core for over 1 hour.
SYS_LOAD_AVG_15 100% CPU Warning {host} {serverIP} 15 Minute Load average has been running at over 100% per core for over 30 minutes.
Take action if the alert hits 200%:
SYS_LOAD_AVG_15 200% CPU Warning {host} {serverIP} 15 Minute Load average has been running at over 200% per core for over 10 minutes.
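To make the percentages above concrete, here is a rough sketch of how a load average could be turned into these figures. It assumes that a load of 1.0 corresponds to 100% (one vCPU's worth of work) and that the alert then divides by the number of cores. How the monitoring agent actually computes its value is not stated here, so treat both the formula and the helper names as assumptions. The duration conditions (1 hour, 30 minutes, 10 minutes) are ignored.

```python
import os


def load_percent(load_avg, cores):
    """Return (total_percent, per_core_percent) for a load average.

    Assumption: a load of 1.0 equals 100%, i.e. one fully busy vCPU.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    total = load_avg * 100.0
    return total, total / cores


def classify_load(per_core_percent):
    """Map a per-core percentage onto the alert tiers listed above."""
    if per_core_percent >= 200:
        return "take action"
    if per_core_percent >= 70:
        return "monitor"
    return "ok"


def read_load_15():
    """Read the live 15-minute load average (Unix only)."""
    return os.getloadavg()[2]


# Worked example: a 15-minute load average of 4.0 on a 2-vCPU server.
total, per_core = load_percent(4.0, 2)
print(total, per_core, classify_load(per_core))
# prints 400.0 200.0 take action
```

On a live server, `load_percent(read_load_15(), os.cpu_count())` would give the current figures under the same assumptions.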
System Memory Usage
Monitor after these alerts:
SYS_MEM_USAGE 70% RAM Warning {host} {serverIP} System Memory utilisation has exceeded 70% RAM for over 1 hour.
SYS_MEM_USAGE 80% RAM Warning {host} {serverIP} System Memory utilisation has exceeded 80% RAM for over 30 minutes.
Take action if the alert hits a sustained 90%+:
SYS_MEM_USAGE 90% RAM Warning {host} {serverIP} System Memory utilisation has exceeded 90% RAM for over 10 minutes.
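For a quick manual check against these thresholds, memory utilisation can be estimated from /proc/meminfo. The sketch below assumes "used" means MemTotal minus MemAvailable. The alert's own formula is not documented here and may differ, and the function name and sample values are illustrative.

```python
def mem_used_percent(meminfo_text):
    """Estimate memory utilisation (%) from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            # The first token after the colon is the numeric value (in kB).
            values[key.strip()] = int(parts[0])
    total = values["MemTotal"]
    available = values["MemAvailable"]
    # Assumption: used memory is everything not reported as available.
    return 100.0 * (total - available) / total


# Illustrative sample; on a server, read the text from /proc/meminfo.
sample_meminfo = """MemTotal:        8000000 kB
MemFree:          500000 kB
MemAvailable:    2000000 kB
"""
print(mem_used_percent(sample_meminfo))
# prints 75.0
```

In this sample the server would be in the "monitor" range (above 70%, below 90%).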
System CPU Usage
Monitor after these alerts:
SYS_CPU_USAGE {user} 30-70% CPU Warning {host} ${serverIP} {user} CPU utilisation has exceeded 30-70%. This is nothing to be too concerned about yet…
Take action if the alert hits a sustained 70%+:
SYS_CPU_USAGE {user} 70+% CPU Warning {host} ${serverIP} {user} CPU utilisation has exceeded 70+%.
System Swap Memory
SYS_SWAP_MEM {aspect}% Usage Warning {host} {serverIP} System Swap Memory usage has exceeded {aspect}% of allocation!