Monitoring starts with metrics. How to set up infrastructure monitoring

How we set up monitoring for Scalesta hosting solution

You can't manage what you can't measure. In the context of IT projects, this means you need to monitor all parts of the project: from CPU utilization, load speed and fault tolerance to business indicators such as the number of orders and checkouts in your online store or banner impressions on a website.

To keep your project stable and your technical support efficient, you need to collect metrics, visualize their dynamics (in dashboards or charts), and handle incidents before they turn into real disasters. Monitoring systems continuously check the infrastructure behind your business so you can be sure that all processes are stable and well optimized. This includes:

✔️ Making sure that the SQL database works well
✔️ Evaluating the amount of free storage space
✔️ Checking that Nginx responds to requests correctly
✔️ Evaluating the state of the server, etc.
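
As a rough illustration, two of these checks can be sketched with nothing but the Python standard library. This is a minimal sketch, not how any particular monitoring tool implements them: the function names `free_space_percent` and `responds_ok` and the 5-second timeout are assumptions made for the example.

```python
import shutil
import urllib.error
import urllib.request


def free_space_percent(path: str = "/") -> float:
    """Return the share of free space on the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def responds_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to `url` answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Refused connection, timeout, DNS failure, or an HTTP error status.
        return False
```

A scheduler would call such functions periodically and store the results as metrics; a real setup would normally delegate this to a monitoring agent rather than hand-written scripts.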

The system must be "adequate" and not overcomplicated. It should satisfy two basic principles:

1) track the metrics that are needed for decision-making and 2) do not spam with false alarms.

It sounds simple and logical, but in practice, finding a balance is not always easy.

Alert Concept

You don’t need to monitor all of the parameters of your systems 24/7. Only a few metrics are crucial for the sustainable work of your website, for instance, the availability of the web server. Others, like the number of open file descriptors, don't need to be monitored constantly and can be reviewed during routine checks from time to time. Alerting, that is, automatic notification that a metric has reached a threshold value, is built on the concepts of SLI, SLA and SLO.

SLI (Service Level Indicator) is a quantitative measure of how a service performs, usually tied to user satisfaction with an application or service over a given period of time (month, quarter, year). More specifically, it is a user experience indicator that tracks one of the many possible metrics and is expressed as a percentage, where 100% means a great user experience and 0% means a terrible one.

SLA (Service Level Agreement) defines the level of service a customer expects from a supplier, the metrics by which that service is measured, and the remedies or penalties if service levels are not achieved. An SLA is an external obligation to the end user or client.

SLO (Service Level Objective) is a set of target, “desired” SLI values; going beyond them can lead to a violation of the SLA of a particular service or component. The maximum allowable deviation from the "ideal" indicators in this concept is called the Error Budget (the right to make a mistake). Examples include the maximum number of 500 errors in 5 minutes, the maximum time a web page may be unavailable, or the maximum allowable load on the processor.
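
To make the relationship between SLI, SLO and Error Budget concrete, here is a minimal sketch for an availability-style indicator. It assumes the SLI is the share of successful requests and that the error budget is the gap between 100% and the SLO; the function names and the sample numbers are hypothetical.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI as the percentage of successful events (100.0 when there was no traffic)."""
    if total_events == 0:
        return 100.0
    return good_events / total_events * 100


def error_budget_remaining(good_events: int, total_events: int, slo_percent: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, a negative value means
    the SLO has already been missed.
    """
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    allowed_bad = total_events * (100 - slo_percent) / 100
    if allowed_bad == 0:
        return 1.0  # no traffic, so nothing has been spent
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Hypothetical month: 1,000,000 requests, 500 of them failed, SLO of 99.9%.
sli = availability_sli(999_500, 1_000_000)
remaining = error_budget_remaining(999_500, 1_000_000, slo_percent=99.9)
print(f"SLI: {sli:.2f}%, error budget left: {remaining:.0%}")
```

With these numbers the service is allowed roughly 1,000 failed requests and has used about half of them, so an internal alert at this point would still leave room to act before the SLA is at risk.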

In general, the SLO is the threshold at which you need to set alerts. But the SLO is a "desired" state, and not everything that differs from it is necessarily an abnormal situation. Put simply, an alert should fire neither when “everything is already very bad” nor on every bit of “interference”, but when a problem has arisen and something can still be fixed. In our experience, this balance is best achieved by setting internal alerts to a value between the SLO and the Error Budget: the system behavior can still be called normal, but if nothing is done, there is a risk of going beyond the SLA.

Usually, this applies to limit values for CPU usage, free storage space, and RAM, and to the availability of all nodes. For example, the low-priority threshold for CPU usage can be 85%, and the high-priority threshold for the same metric will be around 100%, because your CPU shouldn't run at full capacity for long periods.
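
The two-level CPU example can be sketched as a small classifier over a window of recent samples. The 85% value comes from the text; the 99% cutoff standing in for "around 100%", the use of a sample window to express "for long periods", and the function name `cpu_alert_level` are assumptions for illustration.

```python
from statistics import mean
from typing import Optional, Sequence

LOW_PRIORITY_CPU = 85.0   # from the example above
HIGH_PRIORITY_CPU = 99.0  # assumption: stands in for "around 100%"


def cpu_alert_level(samples: Sequence[float]) -> Optional[str]:
    """Classify a window of CPU usage samples (in percent).

    "high" if every sample in the window is at or above the high threshold,
    "low" if the window average reaches the low threshold, otherwise None.
    """
    if not samples:
        return None
    if min(samples) >= HIGH_PRIORITY_CPU:
        return "high"
    if mean(samples) >= LOW_PRIORITY_CPU:
        return "low"
    return None
```

Looking at a window rather than a single reading means a one-off spike to 100% does not page anyone, while a CPU pinned at full capacity for the whole window does.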

Here is another example. If the server is used to distribute content, sudden fluctuations in the load on the channel may signal an anomaly, for example, a DDoS attack. In our practice, alerts on network interfaces are most often set at the request of customers. They fire when the incoming load exceeds 90% of the limit and when there is a sharp (and unexpected) jump in incoming or outgoing traffic.
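
A minimal sketch of these two network conditions is shown below. The 90% figure comes from the text; treating a "sharp jump" as traffic exceeding a multiple of a recent baseline, the default `jump_factor` of 3, and all names are assumptions, not a description of the actual alert configuration.

```python
from typing import List


def network_alerts(
    current_mbps: float,
    baseline_mbps: float,
    limit_mbps: float,
    jump_factor: float = 3.0,  # hypothetical definition of a "sharp jump"
) -> List[str]:
    """Return the names of the alerts triggered for one traffic direction."""
    alerts = []
    # Load above 90% of the channel limit.
    if current_mbps > 0.9 * limit_mbps:
        alerts.append("saturation")
    # Traffic far above the recent baseline (skipped when there is no baseline).
    if baseline_mbps > 0 and current_mbps > jump_factor * baseline_mbps:
        alerts.append("spike")
    return alerts
```

For instance, `network_alerts(950, 900, 1000)` reports only saturation, while `network_alerts(600, 100, 1000)` reports only a spike, since the channel is far from its limit but traffic is six times the baseline.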

It makes little sense to monitor metrics when it is unclear what to do with them, or when collecting and processing them costs more than the potential losses from not monitoring them. And alerts should not fire every two seconds and merge into white noise.


How to skip all that hassle and still have servers that never fail


That all looks like a lot of work! And, frankly, it is. There is a way to avoid all of that hassle by outsourcing the maintenance of your servers to professionals. At Scalesta we handle all the technical issues for our clients.

To control the performance and security of your project, we use Zabbix, an advanced monitoring tool that can track the dynamics of servers and network equipment, quickly respond to emergency situations and prevent possible load problems. Special configurations for online stores allow us to check your server performance, security integrity and backup procedures every minute.

The monitoring system tracks 300 service and server indicators and 50 hardware indicators. That adds up to 1,440 web parameter checks and 250,000 measurements per server per day. Based on this data, our systems make predictions that allow us to respond to incidents before your customers notice that something is going wrong. In other words, we're evolving our monitoring to respond to incidents that haven't happened yet.
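
These figures are easy to sanity-check. The short calculation below assumes that the 1,440 web checks correspond to one check per minute and that the 250,000 daily measurements are spread evenly across all 350 indicators; the even spread is an assumption made only for this back-of-the-envelope estimate.

```python
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60

indicators = 300 + 50            # service/server plus hardware indicators
measurements_per_day = 250_000   # per server, from the text

# Assumption: measurements are distributed evenly across indicators.
per_indicator = measurements_per_day / indicators
avg_interval_seconds = SECONDS_PER_DAY / per_indicator

print(MINUTES_PER_DAY)              # 1440, i.e. one web check per minute
print(round(per_indicator))         # 714 measurements per indicator per day
print(round(avg_interval_seconds))  # 121, roughly one sample every two minutes
```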

Efficient process automation, tested on hundreds of our clients' servers, guarantees 99% uptime (availability of your project). In case of an emergency, the on-duty specialist will begin to solve your problem within 15 minutes, as stipulated in the Service Level Agreement (SLA). According to our statistics, the response time in 2022 was 6 minutes 17 seconds.