DISTRIBUTED OPERATING SYSTEMS
INTRODUCTION
An operating system is a program that controls the resources of a
computer and provides its users with an interface or virtual machine that is
more convenient to use than the bare machine. Examples of well-known centralized
(i.e., not distributed) operating systems are MS-DOS and UNIX.
A distributed operating system is one that looks to its users like an
ordinary centralized operating system but runs on multiple, independent central
processing units (CPUs). The key concept here is transparency. In other words,
the use of multiple processors should be invisible (transparent) to the user.
According to Tanenbaum, if one can tell which computer one is using, one is not
using a distributed system. The users of a true distributed system should not
know (or care) on which machine their programs are running, where their files
are stored, and so on.
He describes a network operating system as a collection of personal computers
along with a common printer server and file server for archival storage, all
tied together by a local network.
Sape J. Mullender distinguishes the network operating system from the
distributed operating system as follows:
A network operating system is essentially a centralized operating system whose
components have been distributed over multiple nodes, while a distributed
system is one in which this distribution is combined with replication to
achieve fault tolerance as well as performance.
Goals and Problems
According to Tanenbaum, the following are the main goals (advantages) of
distributed systems:
1. The relative simplicity of the software: each processor has a dedicated
function.
2. Incremental growth is another plus; if we need 10 percent more computing
power, we just add 10 percent more processors.
3. Reliability and availability can also be a big advantage; a few parts of the
system can be down without disturbing people using the other parts.
4. With a distributed system, a high degree of fault tolerance is often, at
least, an implicit goal.
Sape J. Mullender says:
A major (probably the major) motivation for distributed systems research used
to be the quest for dependable systems, systems that would tolerate failures in
order to become more reliable than their parts.
The problems with distributed operating systems are:
Unless one is very careful, it is easy for the communication protocol overhead
to become a major source of inefficiency.
A more fundamental problem in distributed systems is the lack of global state
information. It is generally a bad idea to even try to collect complete
information about any aspect of the system in one table. Lack of up-to-date
information makes many things much harder. It is hard to schedule the processors
optimally if we are not sure how many are up at the moment.
Computer hardware is now very reliable. Disk manufacturers claim mean times
between failures of a million hours and more, so that very few disks ever fail
during their operational lifetime. Because of this, in most situations there is
little need for replicated data storage. The extra complexity of the software
might actually reduce the reliability of a system due to the presence of more
bugs.
Another reason is that the world is currently burdened with a few
operating-system standards that cannot easily be extended with fault-tolerance
mechanisms without major change. There is such an investment in existing
software that any short-term changes are unlikely. In any case, the world’s
most widely used operating systems have many more urgent problems that need
solving before increased tolerance of hardware failures will become noticeable.
The key issue that distinguishes a network operating system from a distributed
one is how aware the users are of the fact that multiple machines are being
used. This visibility occurs in three primary areas:
File System
In a distributed operating system, the approach is to have a single global
file system visible from all machines. When this method is used, there is one
“bin” directory for binary programs, one password file, and so on. When a
program wants to read the password file it does something like
open("/etc/passwd", READ-ONLY) without reference to where the file is. It is up
to the operating system to locate the file and arrange for transport of the
data as they are needed. LOCUS is an example of a system using this approach.
Thus in a network operating system the users must control file placement
manually, whereas in a distributed operating system it can be done
automatically by the system itself.
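The idea of opening a file by name alone, with the system finding its location,
can be sketched as follows. This is only an illustration, not how LOCUS works:
the tables FILE_LOCATIONS and MACHINE_STORAGE, the machine names, and the
function read_global are all hypothetical, and in-memory dictionaries stand in
for remote disks and network transport.

```python
# Hypothetical global name table: path -> machine that stores the file.
# In a distributed operating system this is kept by the system itself;
# user programs never consult it directly.
FILE_LOCATIONS = {
    "/etc/passwd": "machine-a",
    "/bin/ls": "machine-b",
}

# Hypothetical per-machine storage, standing in for remote disks.
MACHINE_STORAGE = {
    "machine-a": {"/etc/passwd": "root:0\nalice:1001\n"},
    "machine-b": {"/bin/ls": "<binary>"},
}


def read_global(path):
    """Read a file by path alone; the machine lookup is hidden from the caller."""
    machine = FILE_LOCATIONS.get(path)
    if machine is None:
        raise FileNotFoundError(path)
    # A real system would transport the data over the network here.
    return MACHINE_STORAGE[machine][path]
```

A caller writes read_global("/etc/passwd") and never mentions machine-a, which
is the transparency property described above.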
Protection
In a true distributed system there should be a unique UID for every user,
and that UID should be valid on all machines without any mapping. In this way
no protection problems arise on remote accesses to files; as far as protection
goes, a remote access can be treated like a local access with the same UID. The
protection issue makes the difference between a network operating system and a
distributed one clear: In one case there are various machines, each with its own
user-to-UID mapping, and in the other there is a single, system-wide mapping
that is valid everywhere.
Execution Location
In the most distributed case, the system chooses a CPU by looking at the
load, location of files to be used, etc. In the least distributed case, the
system always runs the process on one specific machine (usually the machine on
which the user is logged in).
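The two extremes can be contrasted in a small sketch. The function
choose_machine, the shape of the machines table, and the policy of preferring
machines that already hold the needed files and then breaking ties by load are
assumptions made for illustration; the text does not prescribe a particular
policy.

```python
def choose_machine(machines, needed_files, login_machine=None):
    """Pick a machine on which to run a new process.

    machines: dict mapping machine name -> {"load": float, "files": set}
    needed_files: set of file paths the process will use
    login_machine: if given, mimic the least distributed case and always
        run the process there.
    """
    if login_machine is not None:
        return login_machine

    def score(name):
        info = machines[name]
        local_files = len(needed_files & info["files"])
        # More local files is better, then lower load; the name is a
        # final tie-breaker so the choice is deterministic.
        return (-local_files, info["load"], name)

    return min(machines, key=score)
```

With this assumed policy, a heavily loaded machine that holds the data can
still win over an idle one, which reflects the trade-off between load and file
location mentioned above.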
DESIGN ISSUES
Designers of distributed systems face five issues:
1. Communication primitives
2. Naming and protection
3. Resource management
4. Fault tolerance
5. Services to provide
Communication Primitives
The computers forming a distributed system normally do not share primary
memory, and so communication via shared memory techniques such as semaphores and
monitors is generally not applicable. Instead, message passing in one form or
another is used.
Remote Procedure Call (RPC)
The next step forward in message-passing systems is
the realization that the model of “client sends request and blocks until
server sends reply” looks very similar to a traditional procedure call from
the client to the server. This model has become known in the literature as
“remote procedure call”.
Error Handling
In a distributed system, matters are more complex. If a client has
initiated a remote procedure call with a server that has crashed, the client may
just be left hanging forever unless a time-out is built in.
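A minimal sketch of a client-side time-out is shown below. It simulates the
remote call with a local thread, so the names call_with_timeout, RpcTimeout,
healthy_server, and crashed_server are hypothetical and no real network or RPC
library is involved.

```python
import queue
import threading


class RpcTimeout(Exception):
    """Raised when no reply arrives within the time-out."""


def call_with_timeout(server, request, timeout):
    """Send request to server (a callable) and wait at most timeout seconds."""
    reply_box = queue.Queue(maxsize=1)

    def run():
        reply_box.put(server(request))

    # daemon=True so a server that never replies cannot keep the program alive.
    threading.Thread(target=run, daemon=True).start()
    try:
        return reply_box.get(timeout=timeout)
    except queue.Empty:
        raise RpcTimeout(f"no reply within {timeout} s") from None


def healthy_server(request):
    return request.upper()


def crashed_server(request):
    threading.Event().wait()  # never returns, simulating a crashed server
```

Without the timeout argument to reply_box.get, a call to crashed_server would
block the client forever, which is exactly the hanging behaviour described
above. Note that a time-out only tells the client that no reply arrived; it
cannot tell whether the server crashed before or after doing the work.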
Naming and Protection
Naming and Mapping
Naming can best be seen as a problem of mapping between two domains.
In a distributed system a separate name server is sometimes used to map
user-chosen names (ASCII strings) onto objects in an analogous way.
Name Servers
In centralized systems, the problems of naming can be
effectively handled in a straightforward way. The system maintains a table or
database providing the necessary name-to-object mappings. The most
straightforward generalization of this approach to distributed systems is the
single name server model.
This model is often acceptable in a small distributed system located at a
single site.
For a large distributed system, the approach is to partition the system into
domains, each with its own name server.
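The single name server model and its partitioning into domains can be sketched
as below. The class names, the "domain/name" string convention, and the example
object identifiers are assumptions for illustration only.

```python
class NameServer:
    """Single name server: maps user-chosen names (strings) onto objects."""

    def __init__(self):
        self._table = {}

    def register(self, name, obj):
        self._table[name] = obj

    def lookup(self, name):
        return self._table[name]  # raises KeyError if the name is unknown


class DomainNameService:
    """Partitions names of the form 'domain/name' across name servers."""

    def __init__(self):
        self._servers = {}

    def add_domain(self, domain):
        self._servers[domain] = NameServer()

    def register(self, full_name, obj):
        domain, name = full_name.split("/", 1)
        self._servers[domain].register(name, obj)

    def lookup(self, full_name):
        # Only the name server of the named domain is consulted.
        domain, name = full_name.split("/", 1)
        return self._servers[domain].lookup(name)
```

Because each domain has its own table, the same local name (for example
"printer") can refer to different objects in different domains, and no single
server has to hold every mapping.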
Resource Management
Resource management in a distributed system differs from that in a centralized
system in
a fundamental way. Centralized systems always have tables that give complete and
up-to-date status information about all the resources being managed; distributed
systems do not. The problem of managing resources without having accurate global
state information is very difficult.
Processor Allocation
One of the key resources to be managed in a
distributed system is the set of available processors. One approach that has been
proposed for keeping tabs on a collection of processors is to organize them in a
logical hierarchy independent of the physical structure of the network. This
approach organizes the machines like people in corporate, military, academic,
and other real-world hierarchies. Some of the machines are workers and others
are managers.
Scheduling
The hierarchical model provides a general model for resource control but
does not provide any specific guidance on how to do scheduling. If each process
uses an entire processor (i.e., no multiprogramming) and the processes are
independent of one another, each process can be assigned to any processor at
random. However, if it is common that several processes are working together
and must communicate frequently with each other, as in UNIX pipelines or in
cascaded (nested) remote procedure calls, then it is desirable to make sure
that the whole group runs at once.
Distributed Deadlock Detection
Two kinds of potential deadlocks are resource deadlocks and communication
deadlocks. Resource deadlocks are traditional deadlocks, in which all of some
set of processes are blocked waiting for resources held by other blocked
processes.
In a communication deadlock, processes are blocked waiting for one another
rather than for resources. For example, if A is waiting for C and C is waiting
for A, then we have a deadlock.
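Both kinds of deadlock can be viewed as a cycle in a "waits-for" graph. The
sketch below finds such a cycle with a depth-first search. It assumes that the
whole graph has already been collected in one place, which is exactly the
global state that is hard to obtain in a real distributed system, so it
illustrates the condition being detected rather than a distributed algorithm.
The function name find_deadlock and the dictionary format are hypothetical.

```python
def find_deadlock(waits_for):
    """Return a list of processes forming a cycle, or None if there is none.

    waits_for: dict mapping each process to the processes it is waiting on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {}
    path = []

    def visit(p):
        colour[p] = GREY
        path.append(p)
        for q in waits_for.get(p, ()):
            state = colour.get(q, WHITE)
            if state == GREY:
                # q is already on the current path: the path from q
                # onwards is a cycle of mutually waiting processes.
                return path[path.index(q):]
            if state == WHITE:
                cycle = visit(q)
                if cycle:
                    return cycle
        path.pop()
        colour[p] = BLACK
        return None

    for p in list(waits_for):
        if colour.get(p, WHITE) == WHITE:
            cycle = visit(p)
            if cycle:
                return cycle
    return None
```

For the example in the text, find_deadlock({"A": ["C"], "C": ["A"]}) reports
the cycle containing A and C.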
Fault Tolerance
An important distinction should be made between systems that are fault
tolerant and those that are fault intolerant. A fault-tolerant system is one
that can continue functioning (perhaps in a degraded form) even if something
goes wrong. A fault-intolerant system collapses as soon as any error occurs.
One of the key advantages of distributed systems is that there are enough
resources
to achieve fault tolerance, at least with respect to expected errors. The system
can be made to tolerate both hardware and software errors, although it should be
emphasized that in both cases it is the software, not the hardware, that cleans
up the mess when an error occurs.
Redundancy Techniques
All the redundancy techniques that have emerged take advantage of the existence
of multiple processors by duplicating critical processes on two or more
machines. A particularly simple, but effective, technique is to provide every
process with a backup process on a different processor. All processes
communicate by message passing. Whenever anyone sends a message to a process,
it also sends the same message to the backup process. The system ensures that
neither the primary nor the backup can continue running until it has been
verified that both have correctly received the message.
Thus, if one process crashes because of any hardware fault, the other one
can continue. Furthermore, the remaining process can then clone itself, making a
new backup to maintain the fault tolerance in the future.
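The primary/backup scheme can be sketched in a single program as follows. The
classes Process and ReplicatedProcess are hypothetical, the two copies live in
one address space rather than on different processors, and a crash is simulated
by clearing an alive flag, so this only illustrates the message duplication and
the cloning step.

```python
class Process:
    """A process that records every message it has received."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.received = []

    def deliver(self, message):
        if not self.alive:
            return False  # a crashed process sends no acknowledgement
        self.received.append(message)
        return True


class ReplicatedProcess:
    """Pairs a primary with a backup (conceptually on another processor)."""

    def __init__(self, name):
        self.primary = Process(name + "-primary")
        self.backup = Process(name + "-backup")

    def send(self, message):
        """Deliver to both copies; True only if both acknowledged receipt."""
        ok_primary = self.primary.deliver(message)
        ok_backup = self.backup.deliver(message)
        return ok_primary and ok_backup

    def recover(self):
        """If one copy has crashed, clone the survivor as the new backup."""
        if self.primary.alive and self.backup.alive:
            return
        survivor = self.primary if self.primary.alive else self.backup
        if not survivor.alive:
            raise RuntimeError("both copies have crashed")
        clone = Process(survivor.name + "-clone")
        clone.received = list(survivor.received)
        self.primary, self.backup = survivor, clone
```

A False result from send is the signal that one copy failed to acknowledge;
calling recover then restores the pair so that a later second failure can also
be tolerated.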
Atomic Transactions
The property of run-to-completion or do nothing is called an atomic update. The
property of not interleaving two jobs is called serializability.
The goal of people working on the atomic transaction approach to fault
tolerance has been to regain the advantages of the old tape system, without
giving up the convenience of databases on disk that can be modified in place,
and to be able to do everything in a distributed way.
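The "run to completion or do nothing" property can be illustrated on a single
local file. The sketch below is not a distributed transaction mechanism; it
uses the common technique of writing the new version to a temporary file and
then renaming it over the original. The function name atomic_update is
hypothetical.

```python
import os
import tempfile


def atomic_update(path, transform):
    """Replace the contents of path with transform(old contents).

    Either the complete new version becomes visible, or, if anything
    fails first, the original file is left untouched.
    """
    with open(path, "r", encoding="utf-8") as f:
        old = f.read()
    new = transform(old)  # may raise; the original is still intact

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(new)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic rename on POSIX file systems
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This gives atomicity for one file only. It does not provide serializability:
two concurrent callers could still interleave their read and write steps, which
is why transaction systems add locking or similar mechanisms on top.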
Services
Server Structure
The simplest way to implement a service is to have one server that has a
single, sequential thread of control. This approach is simple and easy to
understand, but has the
disadvantage that if the server must block while carrying out the request, no
other requests from other users can be started, even if they could have been
satisfied immediately. One way of achieving both good performance and clean
structure is to program the server as a collection of miniprocesses, which we
call a cluster of tasks. Tasks share
the same code and global data, but each task has its own stack for local
variables and registers and, most important, its own program counter. In other
words, each task has its own thread of control.
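The difference between a single sequential server and a cluster of tasks can be
sketched with Python threads standing in for tasks: they share the module's
code and global data, but each has its own stack and point of execution. The
names task, serve, and completed, and the use of time.sleep to stand in for
blocking on a disk, are assumptions for illustration.

```python
import queue
import threading
import time

completed = []                      # global data shared by all tasks
completed_lock = threading.Lock()   # protects the shared list


def task(requests):
    """One task: own stack and program counter, shared code and globals."""
    while True:
        request = requests.get()
        if request is None:         # sentinel: shut this task down
            return
        name, blocking_time = request
        time.sleep(blocking_time)   # stands in for blocking on disk I/O
        with completed_lock:
            completed.append(name)


def serve(all_requests, num_tasks):
    """Run the requests on num_tasks tasks; return the completion order."""
    requests = queue.Queue()
    for r in all_requests:
        requests.put(r)
    tasks = [threading.Thread(target=task, args=(requests,))
             for _ in range(num_tasks)]
    for _ in tasks:
        requests.put(None)          # one shutdown sentinel per task
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    with completed_lock:
        order = list(completed)
        completed.clear()
    return order
```

With one task, a slow request submitted first delays a fast request behind it;
with two tasks, the fast request is handled while the slow one is still
blocked, which is the benefit described above.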
File Service
The most important service in any distributed system is the file service.
File services can be roughly classified into two kinds, “traditional” and
“robust”.
Traditional file service is the kind offered by centralized operating systems.
Robust file service, on the other hand, is aimed at those applications that
require extremely high reliability and whose users are prepared to pay a
significant penalty in performance to achieve it.
These file services generally offer atomic updates and similar features lacking
in the traditional file service.
Conceptually, there are three components that a traditional file service
normally has:
1. Disk service
2. Flat file service
3. Directory service
Print service
Nearly all distributed systems have some kind of
print service to which clients can send files, file names, or capabilities for
files with instructions to print them on one of the available printers, possibly
with some text justification or other formatting beforehand.
Process service
Every distributed operating system needs some
mechanism for creating new processes. At a higher level it is frequently useful
to have a process server that one can ask whether there is a Pascal, TROFF, or
some other service in the system.
Time service
There are two ways to organize a time service. In the
simplest way, clients can just ask the service what time it is. In the other
way, the time service can broadcast the correct time periodically, to keep all
the clocks in the other machines in sync. The time server can be equipped with
a radio receiver tuned to WWV or some other transmitter that provides the exact
time down to the microsecond.
Boot Service
The boot service has two functions: bringing up the
system from scratch when the power is turned on and helping important services
survive crashes.
Gateway service
If the distributed system in question needs to communicate with other systems
at remote sites, it may need a gateway server to convert messages and protocols
from the internal format to the format demanded by the wide area network
carrier.
Examples of distributed operating systems
1. The Cambridge Distributed Computing System
2. Amoeba
3. The V kernel
4. The Eden Project