The Linux Letter for
November 22, 2000
Hello again! I'm sure that
this week's installment of The Letter comes as a huge surprise,
especially since there hasn't been one for months. But I finally
decided to drag myself over to the keyboard and do something
useful for a change.
Plenty is new since the last
Letter, more than I can talk about in one column. So hopefully
I'll be able to pull myself together and string a few weeks'
worth of columns into a series for your reading enjoyment.
This week's topic is high
availability computing. It really doesn't have that much of an
application to the average Linux user, but it's an interesting
subject and one for which Linux is well suited.
A while back, eBay, the
Internet auction service, was down for almost a day. The company
estimated that it lost about $40 million in business and the
stock price plummeted in response. High availability computing
is an option that companies can use to fight against
catastrophic computer failures like that.
In order to understand why
high availability systems are important, it's necessary to have
an idea about how they work. Designers of the systems analyze
all of the possible ways that a computer system or network of
computers can fail, and then devise schemes to either prevent
the failures or work around them. Other methods of providing
high availability include clustering multiple servers together
and distributing the network load evenly amongst them, a concept
called load balancing.
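To make load balancing a little more concrete, here is a minimal
Python sketch of the simplest scheme, round robin, where each new
request goes to the next server in line. The server names are
made up for illustration, and a real balancer would also weigh
server load and health rather than just taking turns.

```python
from itertools import cycle


def round_robin(servers):
    """Return a function that hands out servers in rotation."""
    rotation = cycle(servers)
    return lambda: next(rotation)


# Hypothetical server names, for illustration only.
pick = round_robin(["web1", "web2", "web3"])

# Six incoming requests are spread evenly over the three servers.
assignments = [pick() for _ in range(6)]
print(assignments)
```

Each server ends up with the same share of requests, which is the
"evenly" part of the description above.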
System monitoring is critical
to high availability computing because if a node in a network
goes down, the rest of the network needs to know immediately so
that data does not become corrupted. Another potential problem
in clustered servers is for a node to fail, then come back to
life without going through a proper startup. If the rest of the
network does not know the status of the node, data corruption is
almost certain.
Functionally, a high
availability system is composed of several nodes, with each node
usually being a separate computer. Special software monitors the
nodes and removes any failed nodes from service, seamlessly
allowing the other nodes to continue to function without
interruption.
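As a rough illustration of that monitoring idea, the Python
sketch below tracks a "heartbeat" from each node and drops any
node that has been silent for too long. The class name, node
names and timeout are assumptions chosen for the example; this is
not how any particular high availability package does it.

```python
import time


class ClusterMonitor:
    """Track the last heartbeat from each node and drop silent ones."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence allowed
        self.last_seen = {}      # node name -> time of last heartbeat

    def heartbeat(self, node, now=None):
        """Record that a node has reported in."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def remove_failed(self, now=None):
        """Remove nodes that have been silent too long; return their names."""
        now = time.monotonic() if now is None else now
        failed = [n for n, t in self.last_seen.items()
                  if now - t > self.timeout]
        for n in failed:
            del self.last_seen[n]
        return failed

    def active_nodes(self):
        """Names of the nodes still considered in service."""
        return sorted(self.last_seen)


# Hypothetical two-node cluster; times are passed in by hand so
# the example does not depend on the real clock.
monitor = ClusterMonitor(timeout=5.0)
monitor.heartbeat("node-a", now=0.0)
monitor.heartbeat("node-b", now=0.0)
monitor.heartbeat("node-a", now=4.0)   # node-b stays silent
failed = monitor.remove_failed(now=6.0)
print(failed, monitor.active_nodes())
```

Here node-b misses its heartbeat and is taken out of service,
while node-a carries on. A real system would also have to keep
the failed node from rejoining without a proper startup.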
The measurement of a high
availability system is traditionally expressed in terms of
uptime. 99.9% uptime is the accepted beginning of high
availability and translates into about 9 hours of downtime per
year. As the number of "nines" increases, so does the
complexity of the system. Of course, increased complexity and
reliability translate directly into increased cost, so true
high availability systems tend to be quite expensive. But, as
demonstrated by the eBay case, the cost can be easy to swallow
compared with the price of an outage.
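The arithmetic behind the "nines" is easy to check. This short
Python sketch converts an uptime percentage into hours of
downtime per year, assuming a 365-day year (the function name is
just for the example).

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent):
    """Hours per year a system is down at the given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


for pct in (99.9, 99.99, 99.999):
    print(pct, round(downtime_hours_per_year(pct), 2))
```

Three nines works out to about 8.76 hours a year, four nines to
roughly 53 minutes, and five nines to a little over 5 minutes.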
Linux is a great candidate for
high availability because the operating system's source code is
freely available. The High Availability Linux Project is an open
source attempt to develop such a system by creating software
tools to monitor a cluster of Linux-based computers.
Mission Critical Linux ships a
distribution based on some of the components of the High
Availability Linux Project. SuSE is porting SGI's FailSafe high
availability system to Linux. A high availability HOWTO provides
suggestions for hardware and software to limit downtime.
High availability computing
for Linux is still very much in the development stage, but with
the explosive growth of online services such as banking and
e-commerce, you can expect to see quick development of the
technology. While high availability computing may not have
direct applications to your situation as a home or small
business user, the usual trickle-down effects of new technology
development are sure to add more performance and reliability to
our everyday desktop environment.
High-Availability Linux Project: http://linux-ha.org
Mission Critical Linux: http://www.missioncriticallinux.com/
SuSE Linux: http://www.suse.com
SGI: http://www.sgi.com
You boot your system.
Everything goes great until you get to the LILO prompt. There's
no LILO…or maybe just an L…or LI…you get the picture. You
know that something's wrong, but what is it? The
"amount" of LILO displayed can give you a rough clue
about what's going on in your system:
"L" - /boot/boot.b
could not be loaded. Almost any kind of disk error can cause
this.
"LI" - /boot/boot.b
has been moved, but LILO wasn't reinstalled. Also, some kind of
disk error may have happened.
"LIL" - LILO can't
load the descriptor table from the map file. This is probably
some kind of disk error.
"LIL?" - /boot/boot.b
has been moved, but LILO wasn't reinstalled. Also, some kind of
disk error may have happened.
"LIL-" - The map
file data is invalid or /boot/map has been moved, but (you
guessed it) LILO wasn't reinstalled. Also, some kind of disk
error may have happened.
"LILO" -
Congratulations…LILO loaded successfully with no errors!
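If you'd rather not memorize the list, it fits neatly into a
lookup table. The Python sketch below is only a convenience: the
summaries are condensed paraphrases of the entries above, and the
function name is invented for the example.

```python
# Summaries condensed from the list above; wording is paraphrased.
LILO_DIAGNOSES = {
    "L": "/boot/boot.b could not be loaded; suspect a disk error",
    "LI": "/boot/boot.b moved without reinstalling LILO, or a disk error",
    "LIL": "problem with the map file; probably a disk error",
    "LIL?": "/boot/boot.b moved without reinstalling LILO, or a disk error",
    "LIL-": "map file data is invalid, or a disk error",
    "LILO": "loaded successfully with no errors",
}


def diagnose(prompt_text):
    """Return a short diagnosis for the partial LILO prompt seen at boot."""
    key = prompt_text.strip()
    return LILO_DIAGNOSES.get(key, "not covered by this list")


print(diagnose("LI"))
```

Anything the table doesn't recognize falls through to a "not
covered" message instead of a guess.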
Happy computing!
Drew Dunn