An operating system is a program that controls the resources of a computer and provides its users with an interface or virtual machine that is more convenient to use than the bare machine. Examples of well-known centralized (i.e., not distributed) operating systems are MS-DOS and UNIX.

A distributed operating system is one that looks to its users like an ordinary centralized operating system but runs on multiple, independent central processing units (CPUs). The key concept here is transparency. In other words, the use of multiple processors should be invisible (transparent) to the user.


According to Tanenbaum, if one can tell which computer one is using, one is not using a distributed system. The users of a true distributed system should not know (or care) on which machine their programs are running, where their files are stored, and so on.

He defines a network operating system as a collection of personal computers along with a common printer server and file server for archival storage, all tied together by a local network.


Sape J. Mullender distinguishes the network operating system and the distributed operating system as follows: a network operating system is essentially a centralized operating system whose components have been distributed over multiple nodes, while a distributed system is one in which this distribution is combined with replication to achieve fault tolerance as well as performance.


Goals and Problems

According to Tanenbaum, the following are the main goals (advantages) of distributed systems:


1. The relative simplicity of the software: each processor has a dedicated function.

2. Incremental growth is another plus; if we need 10 percent more computing power, we just add 10 percent more processors.

3. Reliability and availability can also be a big advantage; a few parts of the system can be down without disturbing people using the other parts.

4. With a distributed system, a high degree of fault tolerance is often, at least, an implicit goal.


Sape J. Mullender says

A major motivation (probably the major motivation) for distributed systems research used to be the quest for dependable systems: systems that would tolerate failures in order to become more reliable than their parts.

The problems with distributed operating systems are the following:

Unless one is very careful, it is easy for the communication protocol overhead to become a major source of inefficiency.

A more fundamental problem in distributed systems is the lack of global state information. It is generally a bad idea to even try to collect complete information about any aspect of the system in one table. Lack of up-to-date information makes many things much harder. It is hard to schedule the processors optimally if we are not sure how many are up at the moment.

Computer hardware is now very reliable. Disk manufacturers claim mean times between failure of a million hours and more, so that very few disks ever fail during their operational lifetime. Because of this, in most situations there is little need for replicated data storage. The extra complexity of the software might actually reduce the reliability of a system due to the presence of more bugs.

  The other reason is that the world is currently burdened with a few operating-system standards that cannot easily be extended with fault-tolerance mechanisms without major change. There is such an investment in existing software that any short-term changes are unlikely. In any case, the world’s most widely used operating systems have many more urgent problems that need solving before increased tolerance of hardware failures will become noticeable.

The key issue that distinguishes a network operating system from a distributed one is how aware the users are of the fact that multiple machines are being used. This visibility occurs in three primary areas:

File System

The approach taken in distributed operating systems is to have a single global file system visible from all machines. When this method is used, there is one “bin” directory for binary programs, one password file, and so on. When a program wants to read the password file, it does something like open(“/etc/passwd”, READ-ONLY) without reference to where the file is. It is up to the operating system to locate the file and arrange for transport of the data as it is needed. LOCUS is an example of a system using this approach.

Thus, in a network operating system the users must control file placement manually, whereas in a distributed operating system it can be done automatically by the system itself.


Protection

In a true distributed system there should be a unique UID for every user, and that UID should be valid on all machines without any mapping. In this way no protection problems arise on remote accesses to files; as far as protection goes, a remote access can be treated like a local access with the same UID. The protection issue makes the difference between a network operating system and a distributed one clear: in one case there are various machines, each with its own user-to-UID mapping, and in the other there is a single, system-wide mapping that is valid everywhere.

 Execution Location

              In the most distributed case, the system chooses a CPU by looking at the load, location of files to be used, etc. In the least distributed case, the system always runs the process on one specific machine (usually the machine on which the user is logged in).


Five issues that distributed-system designers are faced with:

1. Communication primitives,

2. Naming and protection,

3. Resource management,

4. Fault tolerance,

5. Services to provide.


   Communication Primitives

          The computers forming a distributed system normally do not share primary memory, and so communication via shared memory techniques such as semaphores and monitors is generally not applicable. Instead, message passing in one form or another is used.

 Remote Procedure Call (RPC)

                The next step forward in message-passing systems is the realization that the model of “client sends request and blocks until server sends reply” looks very similar to a traditional procedure call from the client to the server. This model has become known in the literature as “remote procedure call”.

Error Handling

               In a distributed system, matters are more complex. If a client has initiated a remote procedure call with a server that has crashed, the client may just be left hanging forever unless a time-out is built in.
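The “client sends request and blocks until server sends reply” model, together with the time-out that guards against a crashed server, can be sketched as follows. This is a toy illustration using Python threads and queues in place of a real network; all names are invented.

```python
import queue
import threading

def rpc_call(server_inbox, request, timeout=2.0):
    """Client stub: send a request and block until the reply arrives.

    The timeout prevents the client from hanging forever if the
    server has crashed before sending its reply.
    """
    reply_box = queue.Queue(maxsize=1)          # channel for the reply
    server_inbox.put((request, reply_box))      # "send request"
    try:
        return reply_box.get(timeout=timeout)   # "block until reply"
    except queue.Empty:
        raise TimeoutError("server did not reply; presumed crashed")

def echo_server(inbox):
    """A trivial server: replies to each request with the same payload."""
    while True:
        request, reply_box = inbox.get()
        if request is None:                     # shutdown sentinel
            break
        reply_box.put(("ok", request))

inbox = queue.Queue()
threading.Thread(target=echo_server, args=(inbox,), daemon=True).start()
print(rpc_call(inbox, "read /etc/passwd"))      # reply arrives normally

dead_inbox = queue.Queue()                      # no server drains this queue
try:
    rpc_call(dead_inbox, "ping", timeout=0.1)   # "server has crashed"
except TimeoutError as e:
    print("call failed:", e)
```

The second call models the crashed-server case: without the timeout, the client would block on the reply queue forever.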

Naming and Protection

 Naming and Mapping

              Naming can best be seen as a problem of mapping between two domains. In a distributed system a separate name server is sometimes used to map user-chosen names (ASCII strings) onto objects in an analogous way.


 Name Servers

     In centralized systems, the problems of naming can be effectively handled in a straightforward way. The system maintains a table or database providing the necessary name-to-object mappings. The most straightforward generalization of this approach to distributed systems is the single name server model.


This model is often acceptable in a small distributed system located at a single site.

For large distributed systems, the approach is to partition the system into domains, each with its own name server.


Resource Management


Resource management in a distributed system differs from that in a centralized system in a fundamental way. Centralized systems always have tables that give complete and up-to-date status information about all the resources being managed; distributed systems do not. The problem of managing resources without having accurate global state information is very difficult.


       Processor Allocation

One of the key resources to be managed in a distributed system is the set of available processors. One approach that has been proposed for keeping tabs on a collection of processors is to organize them in a logical hierarchy independent of the physical structure of the network. This approach organizes the machines like people in corporate, military, academic, and other real-world hierarchies. Some of the machines are workers and others are managers.



The hierarchical model provides a general model for resource control but does not provide any specific guidance on how to do scheduling. If each process uses an entire processor (i.e., no multiprogramming) and is independent of all the others, then each process can be assigned to any processor at random. However, if it is common that several processes are working together and must communicate frequently with each other, as in UNIX pipelines or in cascaded (nested) remote procedure calls, then it is desirable to make sure that the whole group runs at once.
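The “run the whole group at once” requirement can be sketched as an all-or-nothing placement rule: a manager assigns a group of communicating processes only when enough workers are free for every member simultaneously. The worker and process names below are invented.

```python
def assign_group(free_workers, group):
    """Return a process -> worker mapping, or None if the whole
    group cannot run simultaneously (no partial placements)."""
    if len(group) > len(free_workers):
        return None                              # refuse partial placement
    placement = dict(zip(group, sorted(free_workers)))
    for w in placement.values():
        free_workers.remove(w)                   # those workers are now busy
    return placement

free = {"w1", "w2", "w3"}
print(assign_group(free, ["pipe-a", "pipe-b"]))  # both placed together
print(assign_group(free, ["x", "y", "z"]))       # None: only one worker left
```

Refusing partial placements avoids the situation where half a pipeline runs while the other half waits, communicating with nobody.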


      Distributed Deadlock Detection

                Two kinds of potential deadlocks are resource deadlocks and communication deadlocks. Resource deadlocks are traditional deadlocks, in which all of some set of processes are blocked waiting for resources held by other blocked processes.

In a communication deadlock, processes are blocked waiting for messages rather than for resources: suppose A is waiting for a message from C while C is waiting for a message from A. Then we have a deadlock.
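Both kinds of deadlock can be detected by building a wait-for graph (each process points at the process it is blocked on) and searching it for a cycle. A minimal single-site sketch; real distributed detection must assemble this graph from information scattered across machines.

```python
def find_deadlock(waits_for):
    """waits_for maps each blocked process to the process it waits on.
    Return a cycle of mutually waiting processes, or None."""
    for start in waits_for:
        seen = []
        p = start
        while p in waits_for and p not in seen:  # follow the wait chain
            seen.append(p)
            p = waits_for[p]
        if p in seen:                            # chain came back around
            return seen[seen.index(p):]          # just the cycle itself
    return None

# A waits for C, C waits for A: the communication deadlock above.
print(find_deadlock({"A": "C", "C": "A"}))       # ['A', 'C']
# B waits for A, but A is running: no deadlock.
print(find_deadlock({"B": "A"}))                 # None
```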


Fault Tolerance


             An important distinction should be made between systems that are fault tolerant and those that are fault intolerant. A fault-tolerant system is one that can continue functioning (perhaps in a degraded form) even if something goes wrong. A fault-intolerant system collapses as soon as any error occurs.

   One of the key advantages of distributed systems is that there are enough resources to achieve fault tolerance, at least with respect to expected errors. The system can be made to tolerate both hardware and software errors, although it should be emphasized that in both cases it is the software, not the hardware, that cleans up the mess when an error occurs.


  Redundancy Techniques


All the redundancy techniques that have emerged take advantage of the existence of multiple processors by duplicating critical processes on two or more machines. A particularly simple, but effective, technique is to provide every process with a backup process on a different processor. All processes communicate by message passing. Whenever anyone sends a message to a process, it also sends the same message to the backup process. The system ensures that neither the primary nor the backup can continue running until it has been verified that both have correctly received the message.

       Thus, if one process crashes because of any hardware fault, the other one can continue. Furthermore, the remaining process can then clone itself, making a new backup to maintain the fault tolerance in the future.
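A toy model of the primary/backup technique, tracking only the delivered message history; in a real system the two copies would run on separate machines and the backup would clone itself after a crash. Everything here is illustrative.

```python
class ReplicatedProcess:
    """A primary process paired with a backup on another machine."""
    def __init__(self):
        self.primary_log = []
        self.backup_log = []
        self.primary_alive = True

    def send(self, msg):
        """Every message goes to both the primary and its backup."""
        if self.primary_alive:
            self.primary_log.append(msg)
        self.backup_log.append(msg)

    def crash_primary(self):
        """Simulate a hardware fault taking down the primary."""
        self.primary_alive = False

    def state(self):
        """The surviving copy's view of the message history."""
        return self.primary_log if self.primary_alive else self.backup_log

p = ReplicatedProcess()
p.send("update x=1")
p.crash_primary()
p.send("update y=2")
print(p.state())     # backup carries on: ['update x=1', 'update y=2']
```

Because the backup saw every message the primary saw, no delivered update is lost when the primary fails.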


 Atomic Transactions


The property of run-to-completion or do-nothing is called an atomic update. The property of not interleaving two jobs is called serializability. The goal of people working on the atomic transaction approach to fault tolerance has been to regain the advantages of the old tape system, without giving up the convenience of databases on disk that can be modified in place, and to be able to do everything in a distributed way.
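The all-or-nothing property can be illustrated with a shadow-copy update: changes are staged on a copy and installed only at commit, so a failure mid-transaction leaves the original data untouched. This is a deliberate simplification of real transaction mechanisms such as write-ahead logging.

```python
class AtomicStore:
    """A key-value store whose updates run to completion or do nothing."""
    def __init__(self):
        self.data = {}

    def transaction(self, updates):
        shadow = dict(self.data)        # stage all changes on a copy
        for key, value in updates:
            if value is None:
                raise ValueError("bad update")   # simulated mid-transaction failure
            shadow[key] = value
        self.data = shadow              # commit: install the copy atomically

s = AtomicStore()
s.transaction([("balance", 100), ("owner", "ann")])
try:
    s.transaction([("balance", 0), ("owner", None)])  # fails halfway through
except ValueError:
    pass
print(s.data)    # {'balance': 100, 'owner': 'ann'} -- the failed one left no trace
```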




Server Structure


The simplest way to implement a service is to have one server with a single, sequential thread of control. This approach is simple and easy to understand, but has the disadvantage that if the server must block while carrying out a request, no other requests from other users can be started, even if they could have been satisfied immediately. One way of achieving both good performance and clean structure is to program the server as a collection of miniprocesses, which we call a cluster of tasks. Tasks share the same code and global data, but each task has its own stack for local variables and registers and, most important, its own program counter. In other words, each task has its own thread of control.
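The task-cluster structure can be sketched with Python threads standing in for tasks: a small fixed cluster shares the code and the request queue, but each task has its own thread of control, so one blocked task does not stop the others. The workload here is invented.

```python
import queue
import threading

def serve(requests, results, n_tasks=3):
    """Run a server as a cluster of n_tasks identical tasks."""
    def task():
        # All tasks share this code and the two queues (the "global data"),
        # but each runs on its own thread of control.
        while True:
            req = requests.get()
            if req is None:              # shutdown sentinel, one per task
                break
            results.put(req.upper())     # "carry out the request"
    threads = [threading.Thread(target=task) for _ in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

requests = queue.Queue()
for r in ["read", "write", "stat"]:
    requests.put(r)
for _ in range(3):
    requests.put(None)
results = queue.Queue()
serve(requests, results)
print(sorted(results.queue))             # ['READ', 'STAT', 'WRITE']
```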


File Service


The most important service in any distributed system is the file service.

File services can be roughly classified into two kinds, “traditional” and “robust”. Traditional file service is the kind offered by centralized operating systems. Robust file service, on the other hand, is aimed at those applications that require extremely high reliability and whose users are prepared to pay a significant penalty in performance to achieve it. Robust file services generally offer atomic updates and similar features lacking in the traditional file service.

Conceptually, there are three components that a traditional file service normally has:

1. Disk service

2. Flat file service

3. Directory service


Print service


Nearly all distributed systems have some kind of print service to which clients can send files, file names, or capabilities for files with instructions to print them on one of the available printers, possibly with some text justification or other formatting beforehand.


Process service


Every distributed operating system needs some mechanism for creating new processes. At a higher level, it is frequently useful to have a process server that one can ask whether there is a Pascal, TROFF, or some other service in the system.


Time service


There are two ways to organize a time service. In the simplest way, clients can just ask the service what time it is. In the other way, the time service can broadcast the correct time periodically, to keep the clocks in all the machines in sync. The time server can be equipped with a radio receiver tuned to WWV or some other transmitter that provides the exact time down to the microsecond.
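In the first style, the client's estimate must account for the time the reply spent in transit; a common rule (the basis of Cristian's algorithm) is to assume the reply spent half the measured round trip on the wire. A sketch, with invented numbers:

```python
def adjusted_time(server_time, round_trip):
    """Estimate the current time from a time-server reply.

    server_time: the timestamp in the server's reply.
    round_trip:  the client's measured request/reply round-trip time.
    The reply is assumed to be half a round trip old on arrival.
    """
    return server_time + round_trip / 2

# Server replied "1000.0 s"; the round trip took 8 ms, so the reply
# is assumed to be 4 ms stale by the time the client reads it.
print(adjusted_time(1000.0, 0.008))   # about 1000.004
```

The broadcast style avoids the per-client request but needs a similar correction for propagation delay.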


Boot Service


The boot service has two functions: bringing up the system from scratch when the power is turned on and helping important services survive crashes.


Gateway service


If the distributed system in question needs to communicate with other systems at remote sites, it may need a gateway server to convert messages and protocols from the internal format to the format demanded by the wide-area network carrier.



 Examples of distributed operating systems

1. The Cambridge Distributed Computing System

2. The V kernel

3. The Eden Project