Munin, the monitoring tool, surveys all your computers and remembers what it saw. It presents all the information in graphs through a web interface. Using Munin, you can easily monitor the performance of your computers, networks, SANs, applications, weather measurements, and whatever else comes to mind. It makes it easy to determine "what's different today" when a performance problem crops up, and to see how you're doing capacity-wise on any resource.
OnApp CDN provides Munin graphs for all edge servers to help operators monitor the status of their edge servers. The Munin graphs are accessible through the CDN Debug page. The following are examples of good and bad graphs:
Each graph type below is described with a good example followed by a bad example.
CPU Usage
The Y-axis represents CPU usage percentage. Avoid high iowait and high steal. Also watch for unusual trends, such as system CPU usage and user CPU usage growing rapidly.
Good: The CPU has low iowait and low steal.
Bad: The CPU has high iowait and some steal. Possible actions: upgrade storage to high-performance disks (e.g. SSD) and/or upgrade the CPU.
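To check the same iowait and steal figures outside of the graph, the percentages can be derived from two readings of the aggregate `cpu` line in `/proc/stat` on Linux. The following is a minimal sketch, not the Munin plugin's own code; the helper names `parse_cpu_line` and `cpu_percentages` are hypothetical, and only the first eight counters of the line are used.

```python
# Sketch: derive CPU state percentages from two samples of the aggregate
# "cpu" line in /proc/stat. Helper names are hypothetical.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse a line such as 'cpu 100 0 50 800 40 0 0 10 0 0' into a dict."""
    parts = line.split()
    return dict(zip(FIELDS, (int(p) for p in parts[1:1 + len(FIELDS)])))


def cpu_percentages(before, after):
    """Return the share of time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * value / total for name, value in deltas.items()}
```

Take two samples a few seconds apart and compare the `iowait` and `steal` entries with what the graph shows. This page does not define a numeric limit for either value, so any alert threshold is your own choice.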
Disk usage in percent
The Y-axis represents disk usage percentage. Ignore the /dev, /run, /run/lock, /run/shm and /boot partitions. Cache partitions may fill up to 90% of disk space.
Good: There is a lot of free space on the /, /mnt/nginx/bay-* and /var/cache/nginx-hls partitions.
Bad: There is little free space on the / partition. Generally, no partition should grow beyond 90% of disk space. Possible actions: further investigation is required.
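The 90% rule and the list of partitions to ignore can be expressed as a small filter. This is an illustrative sketch only: the function name `partitions_over_limit` is hypothetical, and the input is assumed to be a mapping of mount point to usage percentage that you have already collected (for example from `df` output).

```python
# Sketch: flag partitions above the 90% guideline, skipping the
# partitions this page says to ignore. Names are hypothetical.
IGNORED_MOUNTS = {"/dev", "/run", "/run/lock", "/run/shm", "/boot"}


def partitions_over_limit(usage_percent, limit=90.0):
    """Return the sorted mount points whose usage exceeds `limit` percent."""
    return sorted(
        mount
        for mount, percent in usage_percent.items()
        if mount not in IGNORED_MOUNTS and percent > limit
    )
```

For example, a mapping with `/` at 95% and `/boot` at 99% reports only `/`, because `/boot` is on the ignore list.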
Utilization per device
The Y-axis represents the percentage of time the disk is busy. Disk utilization should stay below 80% on average.
Good: Disk utilization is minimal.
Bad: Disk utilization is high, reaching 90%. Possible actions: upgrade storage to high-performance disks (e.g. SSD) and/or add more disks.
Disk IOs per device
Positive values show write operations and negative values show read operations. A value of zero indicates no IO operations on the server.
Good: Sustained read and write operations indicate that the edge is working well.
Bad: Reads and writes are almost zero (the middle line), which indicates that the edge is idle and not processing any requests. The spike appears when the edge is resumed to ACTIVE. Possible actions: further investigation is required.
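An idle stretch like the one described above can be spotted in exported samples by looking for a long run of values at or near zero. The sketch below assumes the samples are a plain list of numbers using the graph's sign convention (writes positive, reads negative); the function name `longest_idle_run` and the `epsilon` tolerance are hypothetical.

```python
# Sketch: length of the longest run of (near-)zero IO samples.
# Positive samples are writes, negative samples are reads.
def longest_idle_run(samples, epsilon=0.0):
    """Return the longest count of consecutive samples with |value| <= epsilon."""
    longest = 0
    current = 0
    for value in samples:
        if abs(value) <= epsilon:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest
```

How long a run must be before it is worth investigating depends on the sampling interval and on the traffic the edge normally receives.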
Disk latency per device
Good: Disk latency is below 10 milliseconds on average. This is reasonable for an edge with SSD storage. For an edge with HDD storage, disk latency below 50 milliseconds is acceptable.
Bad: Disk latency is high (more than 10 milliseconds). Possible actions: upgrade storage to high-performance disks (e.g. SSD).
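The two latency guidelines above (10 milliseconds for SSD, 50 milliseconds for HDD) can be captured in a small lookup. This is a sketch with hypothetical names; it assumes the average latency is already expressed in milliseconds, so convert first if your data is in seconds.

```python
# Sketch: compare an average disk latency with the guideline for the
# storage type. Limits come from the text above; names are hypothetical.
LATENCY_LIMIT_MS = {"ssd": 10.0, "hdd": 50.0}


def latency_acceptable(avg_latency_ms, storage_type):
    """Return True if the average latency is below the limit for the storage type."""
    return avg_latency_ms < LATENCY_LIMIT_MS[storage_type.lower()]
```

An average of 25 milliseconds would therefore be acceptable on an HDD-backed edge but not on an SSD-backed one.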
Load average
The Y-axis represents CPU load. Higher numbers may indicate a problem or an overloaded machine.
Good: Load is below 6 on average. Generally, a load of 6 to 8 is normal for edge servers of average specification with 4 to 8 CPU cores.
Bad: Load is high, above 10 on average. However, a higher CPU load is acceptable on a high-specification server. Possible actions: upgrade the CPU so that it can handle larger loads.
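The load figures above can be turned into a rough classification. This is a sketch under stated assumptions: the limits of 8 and 10 come from the text, the "elevated" label for values in between is our own, and dividing load by the core count is a common rule of thumb rather than something this page prescribes. The function names are hypothetical.

```python
# Sketch: classify an average load using the guideline figures above.
def classify_load(load, normal_max=8.0, high_min=10.0):
    """Return 'normal', 'elevated' or 'high' for an average load value."""
    if load <= normal_max:
        return "normal"
    if load > high_min:
        return "high"
    return "elevated"  # between the two documented figures (our label)


def load_per_core(load, cores):
    """Normalize load by the number of CPU cores (rule of thumb)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores
```

On a live Linux server, the inputs could come from `os.getloadavg()` and `os.cpu_count()`. Because higher loads are acceptable on high-specification servers, adjust the limits to suit the hardware.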
Memory usage
The Y-axis represents memory usage in bytes.
Good: Memory usage is steady.
Bad: Memory usage grows quickly towards capacity, leaving little unused memory. Possible actions: further investigation is required.
Nginx
The Y-axis represents connections or requests. A higher value indicates that the edge is handling more connections and requests.
Good: The edge handles a lot of user requests.
Bad: The edge handles only a small number of user requests. Possible actions: ensure the edge has a good specification so that DNS redirects more requests to it.
Ping & Packet Loss
The Y-axis represents both packet loss and ping. Positive values represent ping time in milliseconds. Negative values represent the percentage of packet loss. Ping is tested against our CDN Monitoring servers.
Good: The connection between the edge and our CDN Monitoring servers is consistent, with no packet loss (no negative values).
Bad: There is packet loss (negative values) and the edge is unreachable from our CDN Monitoring servers. Possible actions: ensure the server's Internet connection is stable.
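Because this graph plots two quantities on one axis, it can help to separate them when working with exported values. The sketch below assumes a single list of numbers that follows the sign convention described above; the function name `split_ping_series` is hypothetical, and real exports may already keep the two series apart.

```python
# Sketch: split mixed samples into ping times (positive values, in ms)
# and packet loss percentages (negative values, returned as positive).
def split_ping_series(values):
    """Return (ping_ms, loss_percent) from samples using the graph's sign convention."""
    ping_ms = [value for value in values if value > 0]
    loss_percent = [-value for value in values if value < 0]
    return ping_ms, loss_percent
```

A healthy edge yields an empty loss list, while a sample of -100 corresponds to 100% packet loss.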
Throughput per device
The Y-axis represents bytes read and written per second. Positive values indicate data writes and negative values indicate data reads. A value of zero indicates no data operations on the storage devices.
Good: The storage devices are actively handling user requests.
Bad: Throughput on the storage devices is near zero, which indicates that the edge might be idle and not handling user requests. Possible actions: further investigation is required.
Uptime
The Y-axis represents uptime in days.
Good: Server uptime increases steadily over time, which indicates no downtime.
Bad: A sudden drop in server uptime indicates that the server experienced downtime recently. Possible actions: reduce server downtime and ensure the network connection is good.
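A drop in uptime is easy to detect programmatically, because uptime only decreases when the server restarts. The following sketch assumes the uptime samples are listed in chronological order, in days; the function name `reboot_indices` is hypothetical.

```python
# Sketch: find the positions where uptime drops, which mark restarts.
def reboot_indices(uptime_days):
    """Return the indices of samples whose uptime is lower than the previous sample."""
    return [
        index
        for index in range(1, len(uptime_days))
        if uptime_days[index] < uptime_days[index - 1]
    ]
```

An empty result over the period of interest corresponds to the steadily increasing line of the good graph.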
In conclusion, the examples above give a brief interpretation of the graphs for edge servers. If you notice an unusual trend or pattern on any edge, feel free to contact us for more detailed clarification and recommended actions to optimize your edge server performance.
External links:
- http://munin-monitoring.org/
- https://en.wikipedia.org/wiki/Munin_(software)
- http://kaminario.com/company/blog/whats-an-acceptable-io-latency/
- http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
- http://blog.scoutapp.com/articles/2013/07/25/understanding-cpu-steal-time-when-should-you-be-worried
- https://www.cyberciti.biz/tips/linux-disk-performance-monitoring-howto.html