Networking: A Beginner's Guide
Maintaining and Troubleshooting Servers
To maintain and troubleshoot servers well, you need to do two things: decrease the
chance of failure, and improve your chance of rapidly resolving any failures that do
occur. Problems are inevitable, but preparing before they occur greatly reduces both
how often they happen and how long they take to resolve.
To decrease the chance of failure, make sure to follow all the advice previously
given: use reliable, tested servers and components. You should also take these additional
steps:
• Whenever possible, try to reduce the number of jobs that a server must do.
  Although building a single server that will be a file and print server, a database
  server, an e-mail server, and a web server is certainly possible, you're much
  better off (from an overall reliability standpoint) segregating these duties onto
  smaller, separate servers.
• Set up a practice of frequently viewing the server's error logs. If the server NOS
  supports notification of errors (to a pager, for example), consider implementing
  this feature. Many failures start with error messages that might precede the
  actual failure by a few hours, so getting an early heads-up might help you keep
  the server running or at least enable you to resolve the problem at the best
  possible time.
• If a server supports management software that monitors the server's condition,
  make sure to install the software.
• Most RAID arrays that support hot-swap of failed drives also require that the
  NOS have special software installed to support this feature fully. Make sure
  that you install this software before any failures occur.
• NOS software is among the most bug-free available, but there is no such thing
  as completely bug-free software. Given enough time, any NOS will fail. While
  many servers run for up to a year without requiring a restart, you're better off
  establishing a practice of periodically shutting down the server and bringing it
  back up again. This practice clears small transient errors, such as memory leaks
  in the NOS, that might be accumulating and could eventually lead to a server
  crash. The best frequency for such restarts is monthly.
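
The log-review practice in the list above can be partly automated. The following is a minimal Python sketch, not a feature of any particular NOS. It assumes the error log is plain text with one entry per line; the keyword list and the function names `find_error_lines` and `build_alert` are hypothetical. Delivering the alert (by e-mail or a pager gateway) is left out because it depends on the site.

```python
import re
from typing import Iterable, List

# Assumed severity keywords; real NOS logs use their own formats and levels.
ERROR_PATTERN = re.compile(
    r"\b(ERROR|CRITICAL|FATAL|FAIL(?:ED|URE)?)\b", re.IGNORECASE
)


def find_error_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain an error-level keyword."""
    return [line.rstrip("\n") for line in lines if ERROR_PATTERN.search(line)]


def build_alert(hostname: str, error_lines: List[str], max_lines: int = 5) -> str:
    """Format a short alert message; return an empty string if nothing to report."""
    if not error_lines:
        return ""
    shown = error_lines[:max_lines]
    body = "\n".join(shown)
    extra = len(error_lines) - len(shown)
    if extra > 0:
        body += f"\n(+{extra} more)"
    return f"[{hostname}] {len(error_lines)} error(s) in log:\n{body}"
```

Run on a schedule against the server's log file, a script like this gives the early warning the list describes. The cap on quoted lines keeps the message short enough for a pager-style notification.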
CAUTION
Make sure that you do a backup before shutting down the server and restarting it. The
greatest chance of hardware failure occurs when the system is powered back up again.
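
The restart practice and the caution above combine into a simple rule: restart about once a month, but only when a recent backup exists. A minimal Python sketch of that rule follows. The 30-day interval approximates "monthly"; the 24-hour backup freshness limit and all function names are assumptions for illustration, not values from the text.

```python
from datetime import datetime, timedelta

RESTART_INTERVAL = timedelta(days=30)  # "monthly", approximated as 30 days
MAX_BACKUP_AGE = timedelta(hours=24)   # assumed threshold, not from the text


def restart_due(last_restart: datetime, now: datetime) -> bool:
    """True when at least one restart interval has passed since the last restart."""
    return now - last_restart >= RESTART_INTERVAL


def backup_is_fresh(last_backup: datetime, now: datetime) -> bool:
    """True when the last backup finished within the allowed age."""
    age = now - last_backup
    return timedelta(0) <= age <= MAX_BACKUP_AGE


def safe_to_restart(last_restart: datetime, last_backup: datetime, now: datetime) -> bool:
    """Allow a scheduled restart only if it is due and a fresh backup exists."""
    return restart_due(last_restart, now) and backup_is_fresh(last_backup, now)
```

A check like this does not verify that the backup can actually be restored; it only enforces the order of operations the caution describes.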
It's a good idea to make three good backups and test restorations prior to putting
a server into use. It might seem redundant, but you never know when you might need