Mar 2, 2016

How to Make a Server More Reliable

Written by Katie Gomez

There are lots of things you have to do to make a server reliable.  Thinking about the real problems I encounter on a regular basis, two solutions stand out from the rest.

The first change has been happening all around us.  If you’re not building your own computers, you might not notice it, but people are constantly switching from traditional hard disks to SSDs.  SSDs are so much more reliable than traditional hard disks.  And, in the last few years, SSDs have become much more affordable.  Like a lot of data centers, we use SSDs in all of our new machines.  And we can often bring new life to an old server just by replacing the disk.

SSD vs traditional hard disk in a Linux server

I’ve been working with computers for a long time.  The disk is by far the most common piece of hardware to fail.  Fixing this means so much less down time, and so much less money spent repairing and rebuilding our servers.

The second big change we made is something we came up with at Trade-Ideas.  I haven’t heard of anyone else doing this, so I’ll share it with you.  We made our data center more reliable by adding a new type of custom monitoring.

We do a lot of monitoring.  If we have 3 computers doing a job, and one of them fails, the users typically never know the difference.  But we need to know, so we can fix that computer.  If we just had extra computers with no monitoring, we’d only improve things a little bit.  Eventually the second computer would fail, and the last one would slow down from all the extra work.  Eventually they’d all fail.  We protect our users because we have extra computers, and we monitor them closely, so we can fix problems before most people know about them.

There are a lot of ways to do monitoring.  There are different types of monitoring built into a lot of our custom software.  But today I’m talking about something more generic, something that could be helpful in any server room.

One simple type of monitoring is to ping a server.  Send a request over the network and make sure the computer responds.  That sounds good, but in practice it doesn’t work as well as it should.  Ping is implemented way down in the core of the operating system.  A computer will continue to answer pings even when it’s in bad shape.  Of course, if the network is down, the ping will fail.  But I’ve seen machines continue to ping even when they are thrashing like crazy, running incredibly slowly because they are using the disk when they should be using memory.  I’ve seen machines ping when the disk is completely broken.  The idea is right, but we need something more sensitive.
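For reference, a basic ping check might look like the sketch below.  This is a minimal example, not our production code; it assumes the standard Linux `ping` command is installed, and the host address is made up.

```python
import subprocess

def build_ping_command(host, timeout_seconds=2):
    """Build a ping command: one echo request with a reply deadline."""
    # -c 1: send a single packet; -W: seconds to wait for a reply (Linux ping)
    return ["ping", "-c", "1", "-W", str(timeout_seconds), host]

def host_answers_ping(host, timeout_seconds=2):
    """Return True if the host answered a single ping within the deadline."""
    result = subprocess.run(build_ping_command(host, timeout_seconds),
                            capture_output=True)
    return result.returncode == 0

# Example (hypothetical address): host_answers_ping("10.0.0.5")
```

As the rest of this section explains, a machine can pass this check while it is barely functioning, which is exactly why we wanted something more sensitive.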

Secure Shell (ssh)

The Trade-Ideas solution is a simple script which tries to log into each machine.  It logs in using ssh.  ssh uses encryption, and it is the preferred way to access a Unix server over the internet.  I can make my script more or less sensitive by changing the timeout.  If a machine is slow (maybe some process is using too much memory), my script will time out.  Once a minute my script tries to log into each of my machines, and it will notify me as soon as one starts acting up.

The nicest part of this solution is that I often use ssh to inspect and fix the servers.  If ssh is broken, I’m blind.  At that point I can only do very crude things, like rebooting the machine.  But this script tells me as soon as things start to get slow.  So as long as I respond in a reasonable amount of time, I can log in and see more.  Maybe it’s just one process that’s using too much memory.  I can restart that one process.  That’s faster than rebooting the entire machine, and it won’t hurt other processes running on the same machine.  And I saw exactly what software was acting up, so I can submit a bug report and make sure the original problem gets fixed!

This works well because, in practice, so many problems start small and gradually get worse.  We can address a problem as soon as it starts, before a customer notices it, before it becomes hard to fix.

These are just my two favorite ways to make servers more reliable.  There are a lot more.  There are a lot of variations on what I’ve described.  For example, if you have a machine that locks up on you a lot, go to a different machine in the same server room and log in from there.  Use screen to keep yourself logged in, and to manage several ssh sessions at once.  As I said before, sometimes when a Linux computer starts to act up, it will be hard to log in.  But the existing ssh session will work great.  You can use that to investigate and fix the problem.

These various solutions can work together.  What if a disk starts to go bad?  Linux is surprisingly good at keeping itself going.  But it often has trouble rebooting with a bad disk.  If you can log in, you can see this problem.  You can order the new hardware immediately.  You can wait for a weekend or another slow time to fix the problem.  By investigating, you knew the problem was going to get worse before it got better.  As much as I like rebooting computers, that’s not the solution to every problem.
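Once you are logged in, one way to spot a failing disk is to scan the kernel log for I/O errors.  Here is a small sketch; the exact wording of kernel disk-error messages varies between kernel versions, so treat the patterns (and the sample line in the comment) as assumptions to adapt for your own machines.

```python
import re

# Patterns the Linux kernel commonly logs when a disk is in trouble.
# The exact message text varies by kernel version; adjust as needed.
DISK_ERROR_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"ata\d+.*(failed|error)", re.IGNORECASE),
]

def find_disk_errors(log_lines):
    """Return the log lines that look like disk trouble."""
    return [line for line in log_lines
            if any(p.search(line) for p in DISK_ERROR_PATTERNS)]

# In practice, feed this the output of dmesg, e.g.:
#   import subprocess
#   lines = subprocess.run(["dmesg"], capture_output=True,
#                          text=True).stdout.splitlines()
#   for hit in find_disk_errors(lines):
#       print(hit)
```

A check like this, run from the same once-a-minute script, is how you find out the disk is dying while the machine is still up, so you can order hardware and schedule the swap instead of scrambling after a failed reboot.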