Your suitcases are packed and your plane tickets neatly put away with your passport. This time, you haven’t forgotten the sunscreen or the mosquito repellent. Everything is just perfect.
At work you’ve thought about everything too. A little while ago, you hired “Bob”, a former NSA hacker turned enterprise security expert. He keeps watch over your data like Cerberus over the gates of hell. In case “Bob” happened to vanish into thin air, you’ve left crystal clear instructions for your teams. But nothing will happen anyway, because your servers are buried in a fallout shelter in Siberia guarded by a squad of Ninja Green Berets.
EVERYTHING. You’ve thought about everything and your email — your company’s most critical service — is under control.
That’s when your mobile starts to buzz.
One text message, then another, then another… Dozens of messages start pouring in. There’s no time to read one before the next arrives. There’s James from Accounts, asking “When will it be back on? I need to finish something urgent”; Stephanie from Sales, “Need I remind you that the contract has to be out today?”; and Peter from R&D, “What do I do? I’m supposed to deliver a patch this afternoon!?”…
That’s when Bob, hacker extraordinaire, calls.
– “We have a problem”, he says.
As sweat drips down your forehead, Bob breaks the news:
– “The email service is down!”
– “What!? What happened?! Have we been attacked? Malware? An issue with our datacentre? Our virtualisation platform?”
“No, nothing like that…
The main hard disk is full”
In the era of digital technology, hybrid clouds, AI all around, digital revolution, process monitoring and data and solutions observability, you might have thought that the leading cause of an email outage would be a power issue at your datacentre or a poorly calibrated hybrid platform. Well, no. Unlikely as it may sound, the run-of-the-mill “disk full” issue from prehistoric times has struck!
That’s right… your email server’s disk is out of space.
Most likely the culprit is the mail spool, where the messages are stored. The mail service wobbles then collapses, operations fail, the server spews error messages and finally crashes. Users can no longer use the service and the webmail sends an endless flow of error messages until it just stops responding.
More rarely, the “disk full” issue might come from the system itself, but the other typical cause is logs. Your email system generates a lot of logs. If they aren’t stored on a separate partition, and often they aren’t, they can fill the system disk and bring the whole system down!
BlueMind’s TICK monitoring tool helps get to the bottom of things! Its dashboards show you the state of your users’ disk space almost in real time. Data history is kept for 7 days, so you can go back to the system’s state before and at the time of the incident. You can also use it to set up a disk space alert!
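To make the idea of a disk space alert concrete, here is a minimal Python sketch. It is not BlueMind’s monitoring tool and doesn’t use its API: it simply relies on the standard library’s shutil.disk_usage. The mount points and the 90% threshold are assumptions chosen for illustration, so adjust them to wherever your spool and logs actually live.

```python
import os
import shutil


def usage_percent(used, total):
    """Return used space as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def over_threshold(paths, threshold=90.0, probe=shutil.disk_usage):
    """Return (path, percent) for each path whose filesystem is at or above threshold.

    `probe` defaults to shutil.disk_usage; it is a parameter so the check
    can be exercised without touching a real disk.
    """
    alerts = []
    for path in paths:
        usage = probe(path)  # object with .total and .used, in bytes
        percent = usage_percent(usage.used, usage.total)
        if percent >= threshold:
            alerts.append((path, round(percent, 1)))
    return alerts


if __name__ == "__main__":
    # Hypothetical mount points; only those present on this machine are checked.
    candidates = [p for p in ("/", "/var/log", "/var/spool") if os.path.exists(p)]
    for path, percent in over_threshold(candidates, threshold=90.0):
        print(f"WARNING: {path} is {percent}% full")
```

Run from a scheduler every few minutes, a check like this would have warned Bob well before the spool or the logs filled the disk.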
At last, you can relax. It’s nothing major, and it could have been avoided, but it can be fixed. You’re still in time to catch your plane and save your holidays.
But Bob the Hacker clears his throat. He has more bad news…
“Boss, someone’s stealing our CPU”
Mail systems’ virtual machines (VMs) share resources with other instances on virtualised platforms. The CPU (Central Processing Unit) — i.e. the engine that runs all platform systems and program instructions — is one of those resources.
CPU demand on the system is usually fairly low, besides a few usage spikes, and the whole point of virtualisation is to share CPU between systems and VMs. However, if there are too many VMs or excess load on some of them, CPU demand may overload the platform, which can then no longer allocate enough CPU to meet your VMs’ needs.
“Steal time” is the percentage of time a VM has to wait for CPU to be available when it has made a request – i.e. the time when the platform is too busy serving other VMs to respond to yours.
Steal time should be watched because it can cause major problems. For tasks that need to be executed in near-real time, such as responding quickly to multiple web or database requests, a drop in performance may lead to a backlog of requests which will slow the system down and possibly end in errors or failures.
If your virtual machine shows a high steal time, this means that CPU time is being taken away from your VM for other purposes. You may be using too many processor resources or the physical server may be overloaded. Try giving your virtual machine more processor resources if the platform isn’t bursting at the seams, or move the VM to another physical server.
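For the curious, here is a rough sketch of how steal time can be measured by hand on a Linux guest. It assumes the aggregate “cpu” line of /proc/stat, whose eighth counter is steal, and compares two samples taken a moment apart. The function names are made up for this example and it is not how BlueMind’s tool collects the figure.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into its first eight counters.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    The trailing guest counters are skipped because they are already
    included in user and nice.
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    counters = [int(value) for value in fields[1:9]]
    if len(counters) < 8:
        raise ValueError("no steal counter on this line")
    return counters


def steal_percent(before, after):
    """Share of CPU time stolen between two samples, as a percentage."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total  # index 7 is the steal counter


def read_cpu_counters(path="/proc/stat"):
    """Read the current counters; the aggregate line is the first line of the file."""
    with open(path) as stat:
        return parse_cpu_line(stat.readline())
```

Calling read_cpu_counters() twice a few seconds apart and passing both results to steal_percent() gives the steal percentage over that interval. A value that stays high across many samples is the signal worth acting on, not a single spike.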
– “Ok, Bob, you can fix this, right?” you ask nervously.
– “Sure, Boss. BlueMind monitors steal time history, so we’ve been able to fix the issue.”
As you’re about to hang up – drenched in sweat but reassured – Bob strikes the final blow:
– “One last thing …
We’ve been blacklisted, our passwords are too weak!”
Who hasn’t grumbled about some website prompting them to choose a password that contains 70% consonants, 3 uppercase letters, 2 prime numbers and one ancient Greek character? Yet these requirements aren’t pointless.
A weak password, or a default one that was never changed, can compromise the security of your account, which can then be used to send spam in bulk. Spam is the scourge of email. It will flood your server but, worse still, you’ll be put on a blacklist of spam senders and your emails will be rejected by recipient servers (anti-spam programs will block your server’s flow). Once the issue has been corrected, it will usually take at least a few days for the situation to go back to normal.
Be careful, though: forcing your users to change passwords too often may encourage them to use increasingly simple passwords (so that they can remember them), or to write them on huge post-it notes stuck on their computer screen.
With BlueMind, you can set up a forced password policy to avoid this.
If you’re using an external password manager — e.g. through your LDAP directory — the most effective option is to generate a complex password or encourage your users to choose a different password for each account.
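As a rough illustration of what screening for weak or default passwords could look like, here is a short Python sketch. It is not BlueMind’s password policy: the list of common passwords, the 12-character minimum and the other rules are assumptions picked for the example, and a real deployment would use a far larger list.

```python
# Illustrative sample only; every value here is an assumption for this sketch.
COMMON_PASSWORDS = {"password", "123456", "admin", "changeme", "qwerty", "letmein"}


def password_problems(password, login="", min_length=12):
    """Return the reasons a password looks weak (empty list if none were found)."""
    problems = []
    lowered = password.lower()
    if len(password) < min_length:
        problems.append(f"shorter than {min_length} characters")
    if lowered in COMMON_PASSWORDS:
        problems.append("common or default password")
    if login and login.lower() in lowered:
        problems.append("contains the login")
    if len(set(password)) < 5:
        problems.append("too few distinct characters")
    return problems


if __name__ == "__main__":
    for candidate in ("admin", "correct-horse-battery-staple"):
        issues = password_problems(candidate, login="jdoe")
        print(candidate, "->", issues or "no obvious problem")
```

An empty result only means none of these few rules fired, not that the password is strong. The point is to catch the default and trivially guessable passwords that get accounts hijacked for spam.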
“Relax and enjoy your holidays”
You’d been so busy planning ahead for the most complex disasters that you’d forgotten the most important thing: monitoring a set of simple indicators on a regular basis can help you avoid quite a few problems which can be as basic as they can be incapacitating for users.
The moral of the story is, no matter how good “Bob” is, he will never replace a good monitoring system!
If you want to know more about monitoring your email system, please read our blog article about monitoring a BlueMind installation or watch our monitoring tool’s presentation video (French only), and contact us so that we can discuss it!