Monitoring is all about finding and addressing problems before your customers do. But what conditions should you be watching for?
I suggest a tiered approach, consisting of the following three strategies, in priority order: 1) prevention, 2) monitor for causes and 3) monitor for symptoms.
Prevention
Where possible, we should strive to prevent problems from occurring in the first place.
For example, to prevent a web service from going offline from the perspective of your customers, implement redundancy. This is preferable to a monitor that continually verifies your services are running — though I’d argue you’d want both in this case.
As another example, to prevent invalid data from being added to a database table, do the following:
- Implement validation rules on the front-end, in your business tier, or both.
- For discrete column values, employ foreign key constraints. This ensures data is valid at the time the record is inserted or updated; otherwise the database rejects the change and an exception is thrown.
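To make the foreign key idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `order_status` and `orders` tables and the `insert_order` helper are made up for illustration; your schema and database engine will differ.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a lookup table and a constrained table."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE order_status (code TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO order_status (code) VALUES (?)",
        [("NEW",), ("SHIPPED",)],
    )
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL REFERENCES order_status(code))"
    )
    conn.commit()
    return conn


def insert_order(conn: sqlite3.Connection, status: str) -> bool:
    """Return True if the row was accepted, False if the constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO orders (status) VALUES (?)", (status,))
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place, `insert_order(conn, "NEW")` succeeds, while an unknown status such as `"BOGUS"` is rejected by the database itself, regardless of what the front-end let through.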
Effective prevention measures reduce the incidence of problems, and with it the scope and type of monitoring needed.
Monitoring
While we should always have processes in place to minimize problems and maximize quality, we will never be able to prevent all problems from occurring — they are a fact of life.
What’s important is how we detect and handle them. In fact, when issues occur and are addressed in a proactive, positive manner, this can increase customer loyalty.
In our monitoring, we'll want to watch for two categories of conditions: causes and symptoms. Causes lead to symptoms, which are what ultimately get noticed and affect the customer experience. Both types of conditions can and should be monitored.
Here’s a chain of events to illustrate:
- A bug in the code creates a runaway process under certain circumstances
- Excessive data is written to disk
- Disk fills to 100% capacity
- Database on same machine can no longer operate and hangs
- Processes can no longer connect to the database
- Users can no longer log in to your portal or view your website
Ultimately, what directly affects the user experience is the inability to use the portal. The remaining items in the chain are hidden from customer view. However, these items can still be monitored as early warning signs of an impending problem.
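The "disk fills to 100% capacity" link in that chain is one of the easier ones to watch. Below is a rough sketch using `shutil.disk_usage` from the Python standard library. The 80% and 95% thresholds and the function names are assumptions for illustration, not recommendations.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the volume containing `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_alert_level(percent_used: float,
                     warn_at: float = 80.0,
                     critical_at: float = 95.0) -> str:
    """Map a usage percentage to an alert level (thresholds are illustrative)."""
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return "ok"
```

A scheduled job could call `disk_alert_level(disk_usage_percent("/"))` and raise an alert on anything other than `"ok"`, well before the database on that machine hangs.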
Monitor for Causes
If we can identify the root cause of a problem right away, we may be able to take corrective action, automatically or manually, before it starts affecting the user experience.
Here are some examples of causes and actions:
| Problem Cause | Possible Action(s) |
| low / out of disk space | compress and/or archive unneeded files, preferably with an automated script |
| database deadlock | implement code to wait and retry the transaction |
| hardware / server failure | automatic failover to alternate cluster node |
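As an illustration of the "wait and retry" action for deadlocks, here is a small sketch. `DeadlockError` is a hypothetical stand-in for whatever exception your database driver raises on deadlock, and the attempt count and delays are arbitrary.

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the driver-specific deadlock exception."""


def run_with_retry(transaction, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run `transaction()`, retrying with exponential backoff on deadlock.

    The last failure is re-raised so it can still surface as an alert.
    """
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == attempts:
                raise
            # Wait 0.1s, 0.2s, 0.4s, ... before the next try.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists only so the waiting can be swapped out in tests. Re-raising after the final attempt matters: a deadlock that never clears is a symptom worth alerting on, not something to swallow.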
Monitoring for causes is limited to what we can predict will go wrong, which still leaves us with the unknown. To cover the unknown, we can implement monitoring that keeps an eye on the things that affect the user experience: the symptoms.
Monitor for Symptoms
Symptoms are often easier to identify, as they are what the user would notice. Some symptoms can be detected programmatically, and you should implement monitoring for as many of these as possible. Others manifest as incidents that you won't know about until the customer reports them to you. In either case, detecting a symptom is likely to result in further investigative action.
At Talksoft, some examples are:
| Symptom | Possible Action(s) |
| Customer unable to access portal | Send alert to support team for investigation |
| Customer’s reminders not going out | Send alert to support team for investigation |
| Low or no messaging volume from a server | Send alert to support team for investigation |
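A symptom like "low or no messaging volume from a server" can be checked with very little code. The sketch below assumes you can already obtain a count of messages sent per server for the last interval; the function name, server names, and minimums are invented for illustration.

```python
def low_volume_servers(sent_counts, expected_minimums):
    """Return (server, sent, minimum) for each server below its expected minimum."""
    flagged = []
    for server, minimum in expected_minimums.items():
        # A server that reported nothing at all counts as zero messages.
        sent = sent_counts.get(server, 0)
        if sent < minimum:
            flagged.append((server, sent, minimum))
    return sorted(flagged)
```

Note that the loop runs over the expected servers rather than the reported ones, so a server that has gone completely silent is still flagged.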
It's important to recognize these incidents as opportunities, and to do whatever follow-up is needed to catch the problem if it happens again or, better yet, to prevent it from occurring in the first place. Over time, observing symptoms will uncover more and more potential problem areas where preventative measures and detection can be added, resulting in an ever-increasing degree of system stability.
Implementing Monitoring
Some ideas for developing your monitoring capabilities include:
- Heartbeat Checkers – Processes that run continuously are vulnerable to resource blockages or thread hangs. Have them write to a file or update a database table periodically. Then, have an independent thread or process send an alert if those updates stop. For example, worker thread A runs in a loop processing customer files, updating a heartbeat file every time through. Heartbeat checker thread B has one responsibility: to check the date/time of the heartbeat file updated by thread A.
- Scheduled Scripts – If there are conditions in your database that signal a problem, you can write a query that checks for those conditions and sends an alert when they occur. For example, excessive login attempts recorded in your audit trail table could indicate that a brute force attack is under way, and should generate an alert.
- Log / Alert Console – As your system grows in complexity, there will inevitably be multiple log files or tables to keep tabs on. Consolidating them into a single view, preferably with business rules to elevate visibility of important events, can make monitoring much more efficient and effective.
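The heartbeat checker idea from the list above might look something like this. It is a minimal sketch: the function names are made up, and the staleness threshold is whatever suits your worker's loop time.

```python
import time
from pathlib import Path
from typing import Optional


def beat(heartbeat_file: Path) -> None:
    """Called by the worker (thread A) on every pass through its loop."""
    heartbeat_file.touch()  # creates the file or updates its modification time


def is_stale(heartbeat_file: Path,
             max_age_seconds: float,
             now: Optional[float] = None) -> bool:
    """Called by the checker (thread B). A missing file counts as stale."""
    if now is None:
        now = time.time()
    try:
        last_beat = heartbeat_file.stat().st_mtime
    except FileNotFoundError:
        return True
    return now - last_beat > max_age_seconds
```

The checker would call `is_stale` on a timer and send an alert when it returns `True`. Keeping the checker in a separate process, or even on a separate machine, means it keeps working when the worker's process hangs.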
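For the scheduled script idea, here is a sketch of a query that flags accounts with many recent failed logins. The `audit_trail` table, its columns, and the threshold of five failures are all assumptions; adapt them to your own audit trail.

```python
import sqlite3

# Hypothetical schema: audit_trail(username TEXT, event TEXT, occurred_at REAL)
SUSPICIOUS_LOGINS_SQL = """
SELECT username, COUNT(*) AS failures
FROM audit_trail
WHERE event = 'login_failed' AND occurred_at >= ?
GROUP BY username
HAVING COUNT(*) >= ?
ORDER BY failures DESC, username
"""


def suspicious_logins(conn: sqlite3.Connection,
                      since_epoch: float,
                      threshold: int = 5):
    """Return (username, failures) rows at or above the threshold since `since_epoch`."""
    return conn.execute(SUSPICIOUS_LOGINS_SQL, (since_epoch, threshold)).fetchall()
```

Run from cron or a similar scheduler every few minutes, a script like this would send an alert whenever the result is non-empty.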
Conclusion
Monitoring is a key component in developing stable production software. The cycle of monitoring, problem detection, and improvement leads to continuously improving software and satisfied, loyal customers.
