How to lose 2 years of trust in 2 hours
15/04/2017
The facts
08:44 A client of EURL Barme reports that a service is unavailable.
08:55 Creation of an incident ticket:
Hello,
Whether from outside (ping barme.fr 195.154.60.32) or via RPN (ping 10.x.x.x), the server does not respond.
Is it a network problem?
Thank you.
08:58 support response:
Hello,
I'll do some checks and get back to you as soon as possible.
Best regards,
09:01 (support):
Hello,
I see logs on the kvm that may indicate a Kernel Panic.
Do you authorize me to switch the server to rescue mode to do some tests? I'll get back to you as soon as the tests are finished.
Best regards,
09:08 (EURL Barme):
Yes, switch to rescue mode.
Note, I switched my failover IP to my backup server.
09:36 (support):
Hello,
After the tests I performed, it appears that the server is functional and has no hardware issues.
I invite you to look at the Software/OS side and installation or configuration.
I remain at your disposal for any other request.
Best regards,
09:40 (EURL Barme):
ping x.x.x.x: 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3
^C
You call that functional?
09:43 (support):
Hello,
I did the tests in rescue mode and it doesn't use the server configuration (the one you installed) but boots on an OS located on one of our storage servers so I invite you to review the configuration of the OS you installed and the settings you made.
I remain at your disposal for any other request.
Best regards,
09:45 (EURL Barme):
Thank you, I know what rescue mode is.
So don't touch anything anymore; I'll investigate myself.
09:47 (EURL Barme):
Note the "Monitoring":
Monitoring
Service: Server Ping
Status: Available (1 year ago)
!!
09:54 (support):
Hello,
Very well.
Best regards,
10:00 (EURL Barme):
My server sd-x is operational again, directly without _any_ modification to its configuration...
A simple reboot was enough to bring it back into service. Blaming the client's configuration is customary, but how do you justify it here?
10:16 (support):
Hello,
I see that the issue on this server is resolved after a simple reboot.
I invite you to close the ticket and rate it by leaving a comment.
Have a good day,
10:20 (EURL Barme):
In syslog, I have:
...
Apr 11 21:17:01 ol1 /USR/SBIN/CRON[46128]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Apr 12 09:40:42 ol1 kernel: imklog 5.8.11, log source = /proc/kmsg started.
...
So my server has been down since yesterday 21:17:01 and operational again since 09:40:42 this morning.
And yet your monitoring did not alert me, and claims the server has been available since 09/30/2015 09:52:18!?
Isn't there a problem on your side?
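The downtime window quoted in this message is simply the longest silence between two consecutive syslog lines. A minimal sketch of that reading, under stated assumptions: the lines use the classic "Mon DD HH:MM:SS" prefix, month names are parsed in an English or C locale, and the year (absent from the log) is supplied by hand. The names `parse_ts` and `largest_gap` are illustrative, not part of any tool.

```python
from datetime import datetime

def parse_ts(line, year):
    """Parse the 15-character 'Mon DD HH:MM:SS' prefix of a classic syslog line."""
    # Assumption: the log carries no year, so the caller provides it.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def largest_gap(lines, year):
    """Return (gap, line_before, line_after) for the longest silence, or None."""
    stamps = [(parse_ts(l, year), l) for l in lines if l.strip()]
    best = None
    for (t0, l0), (t1, l1) in zip(stamps, stamps[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best

# The two lines quoted in the ticket.
log = [
    "Apr 11 21:17:01 ol1 /USR/SBIN/CRON[46128]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)",
    "Apr 12 09:40:42 ol1 kernel: imklog 5.8.11, log source = /proc/kmsg started.",
]
gap, before, after = largest_gap(log, year=2017)
print(gap)  # 12:23:41
```

A silence of more than twelve hours on a machine that runs an hourly cron job is hard to explain by anything other than the machine being down.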
10:44 (support):
Hello,
I forward your request to the concerned department and we will get back to you as soon as possible.
Best regards,
10:53 (EURL Barme):
Does it amuse you to switch my server to rescue mode without warning?
11:03 (EURL Barme):
Your technicians took my site down even though I asked them, at 04/12/2017 09:45:53, not to touch it anymore.
Yet this server is not a toy!
11:09 (support):
I have notified the colleagues in charge.
Best regards,
11:30 (support):
Hello,
I'm taking over this ticket now that it has been flagged.
A misunderstanding and an overly hasty exchange during the escalation did indeed lead to a second switch to rescue mode.
I confirm that the server is operational on my side.
It responds to ping in normal mode.
I have updated the monitoring on the server, in the hope that you will be alerted next time.
Nothing prevents you from setting up your own monitoring.
Best regards - Premium Technician
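Taking the technician at his word, an independent check is not much code. What follows is a minimal sketch, not what EURL Barme actually runs. A TCP connection attempt stands in for ping here, because ICMP needs raw sockets and therefore privileges. The port, the threshold of three consecutive failures, the 60-second interval and the host name `sd-x.example` are all assumptions, and the script only makes sense if it runs on a machine outside the monitored server's network.

```python
import socket
import time

def tcp_probe(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False

class DownDetector:
    """Report a state change only after `threshold` consecutive failed probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.down = False

    def update(self, ok):
        """Feed one probe result; return 'DOWN' or 'UP' on a change, else None."""
        if ok:
            self.failures = 0
            if self.down:
                self.down = False
                return "UP"
            return None
        self.failures += 1
        if not self.down and self.failures >= self.threshold:
            self.down = True
            return "DOWN"
        return None

def watch(host, notify, probe=tcp_probe, interval=60, rounds=None, sleep=time.sleep):
    """Probe `host` every `interval` seconds and call `notify` on state changes."""
    detector = DownDetector()
    done = 0
    while rounds is None or done < rounds:
        event = detector.update(probe(host))
        if event:
            notify(f"{host} is {event}")
        done += 1
        sleep(interval)
```

Calling `watch("sd-x.example", print)` would loop forever and print one line per outage and one per recovery; in practice `notify` would send a mail or an SMS. Requiring several consecutive failures avoids an alert on a single lost packet, at the cost of a few minutes of delay.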
Comments
Incidents always happen at the worst moment: in this case during a trip, with limited technical means to intervene and reduced availability.
Naturally, it was a client who discovered and reported the problem. Normally it would have been detected much earlier, and EURL Barme would have investigated directly from the server console before involving support.
Support was very responsive: the ticket got a first reply within 3 minutes. Its initial diagnosis (Kernel Panic) was confirmed by the "Premium" technician who stepped in once the problem had been escalated.
Support error: reporting back without checking that the server was operational after leaving rescue mode. It would have been better to wait for the reboot to complete.
Support slip: arbitrary blaming of the client ("I did the tests in rescue mode and it doesn't use the server configuration (the one you installed) but boots on an OS located on one of our storage servers so I invite you to review the configuration of the OS you installed and the settings you made.")
Systematically blaming the client is a great classic of support desks everywhere. Admittedly, most support requests come from people with modest IT skills, who are often aggressive because the problem they are calling about is a painful experience. But that is no reason to lump everyone together and assume, a priori, that the problem originates on the client's side. The right approach is to check first that nothing is wrong on the provider's side; nothing then prevents support from confronting the client with their responsibilities.
Most serious support error: switching a production server (a working one!) to rescue mode without the client's authorization. That is the height of carelessness and totally unacceptable from a professional service.
Then comes the escalation, and the problem lands on a "Premium" technician who had nothing to do with it and can no longer do anything about it. The term "Premium", with its connotation of superior competence, raises a smile. Technically it is probably justified, but in this context it is communication skills that are required.
This "Premium" technician did take the trouble to call and report that the server was back in normal mode, which is commendable and courageous of him. But his condescending tone, toward someone with 36 years of IT experience and in such a context, was the icing on the cake… Some basic communication training would do him a lot of good.
And for the future…
Linux servers are very reliable but not infallible. That is precisely why a dedicated server running a service, critical or not, must be backed by a clone that is automatically and continuously synchronized with the live server. An isolated kernel panic is an ordinary hazard that calls neither the server nor its configuration into question.
Similarly, one mishandled incident does not call a provider or its professionalism into question.
EURL Barme will therefore not change the servers involved in this incident, and ultimately wishes for only one thing: that support desks drop the habit of systematically blaming the client.
Epilogue
When another technical issue (a minor one this time) came up on my servers at Online, I dealt with a high-quality service, in both technical skill and communication.
It depends a lot on who you deal with: some are disappointing, but others are very good.
Phew, my servers are definitely well managed and my providers well chosen :-)