Monitoring Effectively – Steve David

Monitoring is all about finding and addressing problems before your customers do. But what conditions should you be watching for?

I suggest a tiered approach, consisting of the following three strategies, in priority order: 1) prevention, 2) monitoring for causes, and 3) monitoring for symptoms.

Prevention

Where possible, we should strive to prevent problems from occurring in the first place.

For example, to prevent a web service from going offline from the perspective of your customers, implement redundancy. This is preferable to a monitor that continually verifies your services are running — though I’d argue you’d want both in this case.

As another example, to prevent invalid data from being added to a database table, do the following:

Implement validation rules on the front-end, in your business tier, or both.
For discrete column values, employ foreign key constraints. This ensures the data is valid at the time the record is inserted or updated; otherwise, the database rejects the statement and an exception is thrown.
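
As a minimal sketch of the foreign key approach, the following uses Python's built-in sqlite3 module. The status and appointment tables and the add_appointment helper are hypothetical names chosen for illustration; other databases enforce foreign keys without the PRAGMA step.

```python
import sqlite3

# Hypothetical schema: every appointment must reference a known status code.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.execute("CREATE TABLE status (code TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE appointment ("
    " id INTEGER PRIMARY KEY,"
    " status_code TEXT NOT NULL REFERENCES status(code))"
)
conn.executemany(
    "INSERT INTO status (code) VALUES (?)",
    [("scheduled",), ("cancelled",)],
)
conn.commit()  # make the lookup values permanent before any inserts are attempted


def add_appointment(status_code: str) -> bool:
    """Insert an appointment; return False if the database rejects the value."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO appointment (status_code) VALUES (?)",
                (status_code,),
            )
        return True
    except sqlite3.IntegrityError:
        # The constraint stopped the invalid row at write time.
        return False
```

Here the database itself refuses a status code that is not in the lookup table, so the bad row never exists and there is nothing to monitor for later.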

Effective prevention measures reduce the incidence of problems, and with it the scope and type of monitoring needed.

Monitoring

While we should always have processes in place to minimize problems and maximize quality, we will never be able to prevent all problems from occurring — they are a fact of life.

What’s important is how we detect and handle them. In fact, when issues occur and are addressed in a proactive, positive manner, this can increase customer loyalty.

In our monitoring, we’ll want to watch for two categories of conditions: causes and symptoms. Causes lead to symptoms, which are what ultimately get noticed and affect the customer experience. Both types of conditions can and should be monitored.

Here’s a chain of events to illustrate:

A bug in the code creates a runaway process under certain circumstances
Excessive data is written to disk
Disk fills to 100% capacity
Database on same machine can no longer operate and hangs
Processes can no longer connect to the database
Users can no longer log in to your portal or view your website

Ultimately, what directly affects the user experience is the inability to use the portal. The remaining items in the chain are hidden from customer view. However, these items can still be monitored as early warning signs of an impending problem.
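
To illustrate watching one link in this chain, here is a rough sketch of a disk usage check built on Python's standard shutil.disk_usage. The function names and the 80% and 95% thresholds are assumptions picked for the example, not recommended values.

```python
import shutil


def disk_usage_percent(path: str = ".") -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(percent: float, warn_at: float = 80.0, critical_at: float = 95.0) -> str:
    """Map a usage percentage to 'ok', 'warning', or 'critical'."""
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


def check_disk(path: str = ".", warn_at: float = 80.0, critical_at: float = 95.0) -> str:
    """Classify the current disk usage for the filesystem holding `path`."""
    return classify(disk_usage_percent(path), warn_at, critical_at)
```

Run on a schedule, a check like this would flag the third step in the chain well before the database hangs and users are locked out.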

Monitor for Causes

If we can identify the root cause of a problem right away, we may be able to take corrective action, automatically or manually, before it starts affecting the user experience.

Here are some examples of causes and actions:

Problem Cause	Possible Action(s)
low / out of disk space	compress and/or archive unneeded files, preferably with an automated script
database deadlock	implement code to wait and retry the transaction
hardware / server failure	automatic failover to alternate cluster node
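
The "wait and retry" action for deadlocks in the table above might look something like the sketch below. DeadlockError is a stand-in for whatever exception your database driver actually raises, and run_with_retry, the attempt limit, and the delays are all assumptions made for illustration.

```python
import random
import time


class DeadlockError(Exception):
    """Stand-in for the driver-specific exception raised on deadlock (assumption)."""


def run_with_retry(transaction, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `transaction()`; on DeadlockError, wait and try again.

    The wait doubles after each failed attempt, with random jitter added so
    that two competing processes are less likely to collide again in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # out of attempts: let the error surface so it gets noticed
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

The final failure is re-raised rather than swallowed, so a deadlock that does not clear on its own still reaches whatever alerting sits above this code.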

Monitoring ‘causes’ is limited to what we predict can go wrong; we are still left with the unknown. To cover the unknown, we can implement monitoring that keeps an eye on things that would affect the user experience: the symptoms.

Monitor for Symptoms

Symptoms are often easier to identify, as they are what the user would notice.

Some symptoms can be programmatically detected, and you should implement monitoring for as many of these as possible. Others manifest as incidents that you won’t know about until the customer reports them to you. In either case, detecting a symptom is likely to result in further investigative action. At Talksoft, some examples are:

Symptom	Possible Action(s)
Customer unable to access portal	Send alert to support team for investigation
Customer’s reminders not going out	Send alert to support team for investigation
Low or no messaging volume from a server	Send alert to support team for investigation
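
As one way to detect the last symptom in the table programmatically, the sketch below counts recent messages per server. It assumes a hypothetical message_log table with a server name and a sent_at column holding Unix timestamps; the function name, the one-hour window, and the minimum count are likewise illustrative.

```python
import sqlite3
import time


def low_volume_servers(conn, servers, window_seconds=3600, minimum=1, now=None):
    """Return the servers that sent fewer than `minimum` messages in the window.

    Assumes a hypothetical table message_log(server TEXT, sent_at REAL),
    where sent_at is a Unix timestamp in seconds.
    """
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT server, COUNT(*) FROM message_log"
        " WHERE sent_at >= ? GROUP BY server",
        (now - window_seconds,),
    ).fetchall()
    counts = dict(rows)
    # A server that sent nothing is absent from the query result, so the list
    # of expected servers is what lets us notice one that has gone silent.
    return [name for name in servers if counts.get(name, 0) < minimum]
```

Any server names returned would then be passed to whatever mechanism alerts the support team.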

It’s important to recognize these incidents as opportunities, and to do whatever follow-up is needed to catch the problem if it happens again or, better yet, to prevent it from occurring in the first place. Over time, observing symptoms will uncover more and more potential problem areas where preventative measures and detection can be implemented, resulting in an ever-increasing degree of system stability.

Implementing Monitoring

Some ideas for developing your monitoring capabilities include:

Heartbeat Checkers – Processes that run continuously are vulnerable to resource blockages or thread hangs. Have them write to a file or update a database table periodically. Then, have an independent thread or process send an alert if those updates cease to occur. For example, worker thread A runs in a loop processing customer files, updating a heartbeat file every time through. Heartbeat checker thread B has one responsibility, and that is to check the date/time of the heartbeat file updated by thread A.
Scheduled Scripts – If there are conditions in your database that signal a problem, you can write a query that checks for those conditions and sends an alert when they occur. For example, excessive login attempts recorded in your audit trail table could indicate that a brute force attack is occurring, and can generate an alert.
Log / Alert Console – As your system grows in complexity, there will inevitably be multiple log files or tables to keep tabs on. Consolidating them into a single view, preferably with business rules to elevate visibility of important events, can make monitoring much more efficient and effective.
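
The heartbeat checker idea above can be sketched roughly as follows, using the heartbeat file's modification time as the "last seen" marker. The function names and the 120-second limit are assumptions for illustration; the worker would call beat() each time through its loop, and the independent checker would call is_stalled() on its own schedule.

```python
import time
from pathlib import Path


def beat(heartbeat_file: Path) -> None:
    """Worker side: refresh the heartbeat file's modification time."""
    heartbeat_file.touch()


def heartbeat_age(heartbeat_file: Path, now=None) -> float:
    """Seconds since the last heartbeat; infinity if the file does not exist."""
    now = time.time() if now is None else now
    try:
        return now - heartbeat_file.stat().st_mtime
    except FileNotFoundError:
        return float("inf")


def is_stalled(heartbeat_file: Path, max_age_seconds: float = 120.0, now=None) -> bool:
    """Checker side: True when the worker has not updated the file recently."""
    return heartbeat_age(heartbeat_file, now) > max_age_seconds
```

Treating a missing file as stalled means a worker that never started is reported the same way as one that hung partway through.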

Conclusion

Monitoring is a key component in developing stable production software. The cycle of monitoring, problem detection, and improvement leads to continuously improving software and satisfied, loyal customers.