

Monitoring and Evaluation of Parallel and Distributed Systems

Richard Hofmann

University Erlangen, IMMD VII

Martensstr. 3, D-91058 Erlangen

phone: ++49-9131-85-7026

email: rhofmann@informatik.uni-erlangen.de

Abstract

Due to the complex interactions between activities in parallel processes, the dynamic behavior of the system cannot be quantified a priori. However, profound knowledge about what is going on in the system is the basis for balancing the load in order to optimally utilize the potential power of such a parallel system. Monitoring is a valuable aid in getting the necessary insight into this dynamic behavior of interacting processes.

In the first part of the tutorial, the principles of measurement-based performance analysis in parallel and distributed systems are discussed. General topics concerning hardware, software, and hybrid monitoring are presented with examples, and rules are given for choosing the appropriate monitoring technique. As an example, ZM4, a universal distributed monitor system, is introduced.

The second part of the tutorial deals with all tasks related to the process of presenting the meaning of trace data to human beings. Trace evaluation can be performed with statistics-oriented tools that compute common trace statistics, find activities, and validate assertions on system behavior, as well as with interactive graphics-oriented tools that present state-time diagrams or draw causality diagrams between process traces. All these tools are introduced with examples from measurements on real parallel and distributed systems.

1 Introduction

The raw computing power of modern computers grows rapidly with time. One would expect the power at the user's disposal to grow at the same rate. However, experience shows a sometimes rising but sometimes falling amount of power that can really be used. The reasons for this phenomenon are manifold: users expect a more comfortable environment, which costs computing power, and security mechanisms also consume a significant part of the raw processor power. It is probably not possible to get rid of these effects.

Another source of wasted processor power can be remedied by careful design of software systems on the one hand and thorough analysis of the runtime behavior on the other. While performance analysis is an important issue in monoprocessor systems, it is an indispensable task in parallel and distributed systems. This is caused by the complex interactions between the different program parts, all cooperating in order to solve a common task.

The necessary use of shared resources causes problems with process synchronization, waiting times, deadlocks and the like. Beyond the merely functional problems of managing parallel tasks, there is a high probability of wasting processor power, i.e. of not exploiting the processor power at a sufficiently high level.

This tutorial paper first deals with the basic problems in parallel and distributed systems in order to establish a common understanding of the reasons for such a power loss. In general, this topic can be treated by regarding causal relationships between events on different cooperating processors. In the second part, an introduction to monitoring of parallel and distributed systems is presented. It is shown how different monitoring approaches can be designed by systems programmers as well as by users of a parallel and distributed system.

Parallel and distributed systems require a monitoring facility that is able to cope with a large number of processors as well as with spatial distribution. For this reason, ZM4, a monitor system that is being used in many projects, is introduced as an example of how to structure and use a universal distributed monitor system.

Event-based monitoring typically yields large traces, even if the events are chosen carefully. In order to concentrate the work on promising parts of the event trace, it is necessary to pinpoint the location of the problem.

Therefore, statistical methods are used for a quick overview and a coarse analysis. With that insight, more detailed methods can be applied. Their common goal is not only to provide a measure of the performance of these systems, in part or in total, but also to give insight into the dynamic behavior of the interacting processes.

The most important methods for visualizing the dynamic behavior of parallel processes are time-state diagrams (Gantt charts) and causality diagrams (Hasse diagrams). Each of these methods is discussed with an example in the remainder of this paper. Based on this information, the system can be reprogrammed in order to improve its performance.


2 Performance Problems in P&D Systems

Tuning programs for single-processor machines is fairly easy: use profiling to find those parts of the program that are executed predominantly. Typically, this is only a small fraction of the whole code. Rewriting these parts of the program yields higher performance.

This simple recipe does not hold for programs running on parallel and distributed systems. For example, a part of a program that has to wait for an intermediate result from another process will not profit from tuning — it simply has to wait for a longer time.

In order to determine why a program behaves the way it does, the reason for this behavior must be sought. This leads to considering causality in computer systems. As a later section shows, analyzing parallel and distributed systems from a causality point of view can lead to interesting results.

2.1 Causality and Computer Systems

Generally, the term causality denotes a law where a specific action always leads to the same specific result. Adapted to computer systems, causality means that the behavior of processes is ruled by the laws expressed in the program. Here, the future of each process depends on the current location in the program, the current environment with respect to other cooperating processes, and the next program instruction.

There are a few topics specifically related to causality in computer systems. Regarding a stand-alone process, its program statements are executed in the same sequence as in the program text, except when a flow control statement is encountered. In this case, the next statement is chosen depending on a specified condition. Regarding all statements as potential events, each non-control statement causally affects its successor, whereas a control statement has the ability to causally affect one out of a set of possible statements.

While causality in a single process is obvious, because one statement is the prerequisite for the next, this is not true for cooperating processes, which exchange information in order to solve their common task. The information exchanged consists of, for example, partial results or synchronization primitives such as barriers. In parallel and distributed systems, each process still has its own flow of control, but additionally, the future of the process depends on information coming from other processes. Transferring this information via the system's communication facilities induces causal relationships between processes. As a result, there are event pairs with an inter-process causal relationship and event pairs that are independent of each other.

The causal relationship between events is due to two reasons:

1. All events belonging to one process are causally related, as discussed above.

2. Events which belong to communication are causally related, i.e. the sending and the receipt of the same message are causally related, as is writing to a variable in a shared memory and the subsequent reading of that variable by another process.
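Closed under transitivity, these two rules define a partial order on events, known in the distributed-systems literature as the happened-before relation. A compact statement of the relation (the arrow notation is ours; events related in neither direction are causally independent):

    \[
      e \rightarrow e' \iff
      \begin{cases}
        e,\ e' \text{ belong to the same process and } e \text{ locally precedes } e'
          & \text{(rule 1)}\\[2pt]
        e = \mathit{send}(m),\; e' = \mathit{receive}(m) \text{ for some message } m
          & \text{(rule 2)}\\[2pt]
        \exists\, e'' :\; e \rightarrow e'' \text{ and } e'' \rightarrow e'
          & \text{(transitivity)}
      \end{cases}
    \]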

2.2 Analyzing Causal Relationships

Fig. 1 depicts a scenario where three processes A, B, and C are working together on the same task. Horizontal lines mark the individual process traces, the bullets mark events associated with important statements, and the arrows mark causal relations. As causality automatically implies a temporal sequence, with the caused event occurring after the causing event, the arrows always point from left to right, the starting point lying at an earlier instant in time than the ending point.

Arrows between process traces start at a causing event, e.g. the sending event a₂ on process A, and end at a caused event, e.g. the receiving event c₂ on process C. These arrows reveal the causality structure of an observed system, with the starting and ending event together comprising a causal event pair (this notation follows [13] and [15]).

Following the arrows from event to event, eventually stepping from one process trace to another in case of a communication, always leads to events which are causally dependent on the event at the starting point. In contrast to a stand-alone process, not all events in the future can be reached by this procedure, i.e. not all events are causally related.

Let b₂ be the starting point of the following considerations. All events in its light-shaded post-area are causally affected by it, but c₄ and c₅ cannot be reached from b₂, so they are causally independent of it. In the same manner, all events in its dark-shaded pre-area causally affected b₂, i.e. all statements associated with the events in this area have influence on b₂, but all events following a₂ and c₃ can be disregarded when searching for the reason of b₂.

Fig. 1: Structure of causally related events (process traces A, B, and C)

Some prerequisites are necessary in order to analyze process traces in terms of causal relationships:

1. The location of each event with respect to its process, and its local ordering, has to be known.

2. It must be possible to identify the causal event pairs, i.e. each starting event in one trace must be paired with its ending event in the same or another trace.

These prerequisites can be met if the monitoring of the system under investigation delivers suitable information about the relevant process events, which are stored in event traces [19]. When evaluating these event traces, a sufficient criterion for pairing the events is the equality of the sequence of starting events, e.g. send(message n) in one trace, with the sequence of their ending events, e.g. receive(message n) in another trace (sequence equality criterion).

In other cases, e.g. where messages can overtake one another, additional information has to be monitored which can be used as a criterion for pairing the causally related events. Packet numbers in communication protocols are an example of this kind of information.
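As a sketch of how an evaluation tool might pair causal events under the sequence equality criterion, consider the following; the record layout and names are illustrative, not those of an actual trace format:

    #include <stdio.h>

    /* Hypothetical event record; field names are ours. */
    typedef struct {
        double time;    /* acquisition time of the event       */
        int    token;   /* event type, e.g. EV_SEND or EV_RECV */
    } event;

    enum { EV_SEND = 1, EV_RECV = 2 };

    /*
     * Sequence equality criterion: the i-th send in one trace is
     * paired with the i-th receive in the other trace.  This is
     * sufficient as long as messages cannot overtake one another;
     * otherwise a monitored key such as a packet number would be
     * matched instead.
     */
    static void pair_by_sequence(const event *ta, int na,
                                 const event *tb, int nb)
    {
        int i = 0, j = 0;
        while (i < na && j < nb) {
            while (i < na && ta[i].token != EV_SEND) i++;  /* next send    */
            while (j < nb && tb[j].token != EV_RECV) j++;  /* next receive */
            if (i < na && j < nb) {
                printf("causal pair: send@%f -> recv@%f\n",
                       ta[i].time, tb[j].time);
                i++; j++;
            }
        }
    }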

Analyzing causal relationships in computer systems means following causally related events starting from a particular point which indicates an interesting, e.g. undesired, behavior. All statements in the program responsible for activities causally preceding this point are candidates for improvement. In parallel and distributed systems with a large number of processes, this restriction to causally related parts of the process history enables the analyst to concentrate on essentials and leads to a significant reduction of the analysis effort.

3 Monitoring

Considering the complexity of real computer systems, it is not possible to predict the temporal behavior of a program by theoretical means. That is the reason why most of the tools accessible to programmers lack a facility for "performance planning". Those tools that have built-in performance analysis use profiling at best. Profiling is an important aid in programming single-processor machines — and it uses a measurement-based approach: (1) instrument the entry and exit points of all subprograms; (2) while running the program, (2a) count the number of calls for each subprogram and (2b) sample the program counter in order to get a rough estimate of the relative amount of time a specific piece of code was executed.

Monitoring goes one step beyond this basic measurement: event markers are installed in the program code at specific points of interest and can be used as an indication that the program has passed the point with the marker. By this means, a program can be analyzed at different levels of abstraction, as defined by the instrumentation. In the rest of this section, the basics of monitoring are discussed by means of three different views on the topic:

1. the time at which information is collected,

2. the location where the different parts of monitoring are performed,

3. the hook where the monitor is attached to the object system.

3.1 Time of Monitoring

There are two different approaches to determining the moment when information about the object system has to be recorded by the monitor. Traditional methods use a time-driven approach, where data is either continuously written to a protocol, as in analog medical equipment (e.g. EEG, EKG), or sampled at a fixed rate (accounting, profiling, logic analyzers). This approach has its merits for processes with low-frequency behavior or for getting a rough overview of what is going on in the system. Getting insight into the dynamic behavior of computer systems, however, requires a very high time resolution on the one hand and a fairly long recording interval on the other. Using time-driven monitoring, this would result in a high data rate and a huge amount of data for longer measurements.

The amount of data gathered by monitoring can be reduced by several orders of magnitude by event-driven monitoring. Here, important parts of the program under investigation are equipped with more events than those of low importance. This defines a specific user-defined abstraction of the system's dynamic behavior, provided the events are chosen appropriately. In this case, the measurement of the running program produces a compact view of the running system, neglecting details not important for the current investigation.

3.2 Location of Monitoring

As can be seen in Fig. 2, three different monitoring techniques can be distinguished, depending on where information is gathered, recorded, and processed. Pure hardware monitoring relies on the fact that everything that is going on in the program of interest finds its correspondence in signal changes at the hardware level. If all necessary signals were available — as was the case with computers up to the 70s — this method would give all necessary insight without disturbing the behavior at the software level. Everything concerned with monitoring is provided by the external hardware monitor. This monitor is connected to the system under test with a set of electronic probes.

Nowadays, highly integrated processors with built-in caches and parallel execution units dominate, and most of the important signals are hidden within the processor chip. Another problem is the difficulty of generating a source reference, i.e. of inferring from the monitored signal changes back to the corresponding constructs in the source program.

The second approach is pure software monitoring. Here, the software to be monitored is augmented with additional functions that perform the monitoring. These parts are:

1. a function for initializing the monitoring, e.g. allocating space for the event buffer,

2. instrumentation procedures for marking the events and writing appropriate information about the event and a time stamp into the event buffer,

3. utilities for processing the content of the event buffer.
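A minimal sketch of these three parts in C follows (POSIX clock_gettime supplies the local time stamps; all names are ours, not those of an existing instrumentation library):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    typedef struct {
        unsigned        token;   /* user-chosen event marker */
        struct timespec ts;      /* local time stamp         */
    } event_record;

    static event_record *buf;
    static size_t        n_events, capacity;

    /* (1) initialize monitoring: allocate the event buffer */
    void mon_init(size_t max_events)
    {
        buf = malloc(max_events * sizeof *buf);
        capacity = buf ? max_events : 0;
        n_events = 0;
    }

    /* (2) instrumentation procedure: mark an event, time stamp it */
    void mon_event(unsigned token)
    {
        if (n_events < capacity) {
            buf[n_events].token = token;
            clock_gettime(CLOCK_MONOTONIC, &buf[n_events].ts);
            n_events++;
        }
    }

    /* (3) utility: write the buffer out for later evaluation */
    void mon_dump(FILE *out)
    {
        for (size_t i = 0; i < n_events; i++)
            fprintf(out, "%u %lld.%09ld\n", buf[i].token,
                    (long long)buf[i].ts.tv_sec, buf[i].ts.tv_nsec);
    }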

Obviously, this extra work that has to be carried out during the program's runtime leads to a distortion of the behavior compared to the same program without monitoring. Besides this severe disadvantage, the approach has significant advantages:

• no dedicated hardware is necessary;

• easy source reference, due to the instrumentation that adds event markers to the program.

The third approach is hybrid monitoring, which combines the advantages of pure software and pure hardware monitoring. Here, the instrumentation is carried out in a similar way as in software monitoring, with one important exception: when an event occurs, the event marker procedure does not write information into a buffer on the same computer. Instead, information about the event is written to a dedicated hardware port to which the external hardware monitor's probes can easily be attached. Such hardware ports for interfacing to a monitor will be discussed in sec. 4.3. Hybrid monitoring avoids most of the problems of both other approaches and reduces the distortion to a minimum, but it requires an external monitor.
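In code, the difference from software monitoring is that the instrumentation procedure shrinks to a single store to a monitor port; a sketch with a hypothetical port address:

    #include <stdint.h>

    /* Hybrid instrumentation: instead of buffering locally, write the
     * event token to a hardware port watched by the external monitor.
     * The address is a placeholder; time stamping, buffering and
     * storage happen entirely in the monitor hardware. */

    #define MON_PORT ((volatile uint16_t *)0x00A00000u)  /* hypothetical */

    static inline void mon_event(uint16_t token)
    {
        *MON_PORT = token;  /* one store: minimal perturbation */
    }

The cost of signaling an event thus drops to a single memory or I/O write; compare the roughly 200 ns reported for the Transputer interface in sec. 4.3.2.3.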

Each of these approaches has its weaknesses and strengths; which one to use depends on the application. Most of the time, software monitoring will suffice. However, in cases where strict time limits are an important issue (e.g. in hard real-time systems), hybrid monitoring or even hardware monitoring may be necessary.

3.3 Monitor Hook

Consider an arbitrary parallel or distributed computing system as depicted in Fig. 3. The computing nodes Nν are connected by a connection network of arbitrary structure that allows the nodes to exchange information. Monitoring the overall behavior of that system is possible with two different types of monitors:

1. node monitors NMη that record traces Tη of the format {eᵢ tᵢ}*, where eᵢ denotes the event that has occurred at time tᵢ, and the braces followed by the asterisk denote an arbitrary repetition of such event records;

2. connection monitors that observe the information exchanged over the connection network.

A node monitor has access to detailed local information; even levels of recursion can easily be determined. However, there are practical restrictions that prevent a node monitor from accessing important information. A monitor running at the user level is not able to gather information about the behavior of the system functions it calls. This includes access to communication facilities, which run in kernel mode. Even if the node monitor is run in kernel mode, it typically has no access to lower levels of communication protocols because they are handled on separate hardware with its own proprietary firmware.

Connection monitoring is not able to access any information local to the nodes in the system. Despite this severe restriction in scope, it has become an important aid in system development and maintenance, as the large number of network analyzers available on the market shows.

The reason for this lies in the ability to retrieve global information on the communication behavior of at least two nodes in the system — and the discussion of causal relationships revealed that communication is a key issue for understanding the system's behavior. If the system to be analyzed is connected by a single cable segment, as is the case within most LANs, the information is global for the whole system to be measured.

Obviously, neither of these methods is sufficient for getting insight into the system as a whole. Combining both methods accomplishes this global view. The combination can be reached by building a monitor that performs both network monitoring and node monitoring simultaneously. A more practical solution is to use a node monitor at the nodes of the system and an off-the-shelf network analyzer for the connection(s). In the latter case, global time stamps must be estimated from the local traces, because there is typically no means to synchronize the clocks of different monitor systems.

4 Example: The Distributed HW-monitor ZM4

4.1 Demands and Conceptual Issues for a Universal Monitor System

A monitor system, universally adaptable to arbitrary parallel and distributed computer systems, must fulfill several architectural demands. It must be able to

(i) deal with a large number of processors (nodes in the object system),

(ii) cope with spatial distribution of the object nodes,

(iii) be adaptable to different node architectures,

(iv) support a global view on all interesting events in the object system for showing causal relationships between events in different nodes,

(v) provide a problem-oriented (source-related) view.

The universal distributed monitor system ZM4 fulfills demands (i)-(iv). Its concepts for meeting these challenges are:

(i) In order to deal with a large number of object nodes, ZM4 has a distributed architecture, scalable by allowing an arbitrary number of monitor agents.

(ii) ZM4 interconnects the monitor agents by a local area network. Therefore, monitor agents need not be spatially concentrated and can also monitor spatially distributed object systems.

(iii) ZM4 can record events from arbitrary object systems with arbitrary physical event representation. Different object systems can be monitored simultaneously.

(iv) ZM4 has a global clock with an accuracy of 100 ns. This provides sufficient precision for establishing a global view in any of today's parallel and distributed systems.

(v) A problem-oriented view can be achieved by representing measured events and activities by the identifiers familiar to the programmer.

Issues (i)-(iv) are dealt with in this section, which describes the architecture of ZM4 and its major component, the DPU. Issue (v) is a problem of the object system's instrumentation and of an appropriate evaluation of the measured traces rather than of the monitoring hardware. However, the ZM4 hardware monitor supports issue (v) by accepting a wide variety of physical event formats.

4.2 Architecture of the Hardware Monitor System ZM4

The ZM4 monitor system is structured as a master/slave system with a control and evaluation computer (CEC) as the master and an arbitrary number of monitor agents (MAs) as slaves (see [8]). The distance between these MAs can be up to 1,000 meters. Conceptually, the CEC is the host of the whole monitor system. It controls the measurement activities of the MAs, stores the measured data and provides the user with a powerful and universal toolset for evaluating the measured data [19].

The MAs are standard PC/AT-compatible machines equipped with up to 4 dedicated probe units (DPUs). Their expandability is used for configuring ZM4 appropriately for the various object systems. Each MA provides processing power, memory resources, a hard disk and a network interface for access to the data channel. The MAs control the DPUs and buffer the measured event traces on their local disks. The DPUs are expansion boards for ISA PCs that link the MA and the nodes of the object system. These boards are responsible for event recognition, time stamping, event recording and for high-speed buffering of event traces.

A clock with a global resolution of 100 ns and a time stamping mechanism is integrated into each DPU. This makes it possible to correctly order all causally related events in arbitrary parallel and distributed systems. The clock of a DPU gets all information for preparing precise and globally valid time stamps via the tick channel from the measure tick generator (MTG). The synchronization scheme is based on a distributed frequency synthesis technique [6,9]. While the tick channel together with the synchronization mechanism was a dedicated development, commercially available parts were used for the data channel, i.e. ETHERNET with TCP/IP. The data channel forms the communication subsystem of ZM4, and it is used to distribute control information and measured data.

Fig. 4: Distributed architecture of ZM4

The architectural flexibility of ZM4 has been achieved by two properties: a versatile interfacing concept and a scalable architecture. The DPU can easily be adapted to different object systems (see section 4.3). ZM4 is fully scalable in terms of MAs and DPUs. The smallest configuration consists of one MA with one DPU (see Fig. 4, left) and can monitor up to four object nodes. Larger object systems are matched by more DPUs and MAs, respectively. In the following, the DPU architecture and its main component, the event recorder, are discussed in a top-down fashion.

4.2.1 DPU Architecture

The DPU (dedicated probe unit) implements a functional separation into the three tasks of event processing: interfacing, event detection and event recording (see the general DPU in Fig. 5, left).

The interface has a tight connection to the object system, so it must be dedicated to the object system. The event detector investigates the rapidly changing information supplied by the interface in order to recognize the events of interest and to supply the event recorder with appropriate information about each event.

The complexity of the event detector mainly depends on the type of measurement: to recognize predefined statements in a program running on a processor without instruction cache and memory management unit, a set of comparators or a memory-mapped comparison scheme suffices. If the object system uses a processor with a hardware cache, or if predefined sequences of statements are intended to trigger an event, much more complex recognition circuits will be necessary [12].

In hybrid monitoring, the object system itself presents the event description in a form suitable for the external monitor. In this case no event detector is needed, and the interface only has to adapt the object system to the event recorder electrically and mechanically. The event recorder thus directly captures the event description, which is prepared by the object system (see the simple DPU in Fig. 5, right).

4.2.2 Universal Event Recorder

The event recorder has to fulfill two tasks: assign globally valid time stamps to the incoming event descriptions, thereby building event records, and supply a first level of high-speed buffering. The interface between the event detector (or hybrid interface) and the event recorder is a data path transferring the event description itself, and a control path signaling the occurrence of events. The control path mainly consists of four request lines (req1 to req4) and four grant lines (gnt1 to gnt4), each req/gnt pair servicing an asynchronous and independent event stream. That means up to four object nodes can be monitored with only one event recorder.

Each of the four event streams can be furnished with an arbitrary fraction of the data field, which in total supplies 48 bits. If at least one of the request lines signals an event, the event recorder's capture logic latches the information on the 48 data lines into a 48-bit data buffer in order to establish a stable signal condition for further processing. The output of this data buffer, together with the flag register (8 bits) and the clock's display register (40 bits), defines a 96-bit physical event record. This record is written into the FIFO memory within one 100 ns cycle of the globally synchronized clock.
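As an illustration, the 96-bit physical event record described above could be modeled as follows; the C layout is ours and merely illustrative, since the real record is a packed hardware FIFO word:

    #include <stdint.h>

    typedef struct {
        uint8_t data[6];   /* 48-bit data field, shared by up to 4 streams  */
        uint8_t flags;     /* flag register: one valid bit per stream, plus
                              one bit for the internal sync event stream    */
        uint8_t clock[5];  /* 40-bit display register, 100 ns units         */
    } zm4_event_record;    /* 96 bits in total */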

Each event stream is associated with a bit in the flag register, which indicates that its event stream contributed to a valid event. This mechanism allows the relevant part(s) of the data field to be recognized and the rest of it to be ignored.

Coincidence of events in different streams is possible. Then more than one bit in the flag register is set, meaning that their corresponding parts in the data field are valid event descriptions. There is an additional bit which indicates that a fifth event stream — internal to the monitor system — has generated a synchronization event from decoding the information transmitted via the tick channel. The transmitted synchronization information supports a fault-tolerant protocol that makes it possible to prove the correctness of all time stamps at the synchronization events.

4.3 Interfaces

The interfaces developed for ZM4¹ can be divided into three classes, starting from basic interfaces for already existing parallel output ports of computers, over interfaces for the direct adaptation of microprocessor buses, to intricate special purpose interfaces. All of them can arbitrarily be combined when carrying out a measurement. A PC may be used for monitoring traces with a low event rate; in this case, the interface can be connected to the printer port.

¹ Note that the concepts presented for the ZM4 interfaces can be used as well for arbitrary monitor systems, provided the monitor system is equipped with a parallel port. The range of possible monitor systems spans from logic analyzers to personal computers.

² The pieces of software responsible for recognizing events and informing the monitor about them.

4.3.2 Interfaces for Microprocessor Buses

There are systems where 8 bits are not sufficient for coding all information associated with an event, and a parallel port that can be used for monitoring purposes is not always available. Typically, such parallel ports are missing in multiprocessor systems, e.g. transputer systems, whose I/O activities are normally carried out by a dedicated host computer. An interface directly connected to the processor pins (or the backplane bus) also allows the ZM4 to be adapted to a particular system with moderate effort. Such an interface has to be organized in such a way that the object system regards it as a peripheral device and that it obeys the simple protocol of the ZM4's event recorder.

Technically speaking, such an interface to a microprocessor bus is a parallel port installed in the object system's hardware. The general structure of this kind of interface is shown in Fig. 7. The data path width of the parallel interface will typically be fixed at a value which utilizes the whole data path width of the event recorder. This allows 16 bits to be allocated for coding the event itself (i.e. 65536 different events can be distinguished) and 32 bits for event attributes; this meets the requirements of today's very popular off-the-shelf microprocessors. If this wide data path is not necessary for a particular monitoring application, it is of course possible to connect more than one interface to an event recorder by simply using a free event stream and allocating the necessary number of bits in the data path. A combination of arbitrary interfaces can easily be achieved by forking the cables between the event recorder and the interfaces, and connecting the corresponding parts of the cable to the connectors of the interfaces.

Since the ZM4 project was launched, the interfaces that are described in Sections 4.3.2.1 to 4.3.3.4 have been developed and successfully used.

4.3.2.1 Interface for SMP-Bus

This interface is only 8 bits wide, the natural width of the SMP-bus, because every access of the processor to memory or I/O transfers one byte of data. It was used in combination with the single signal interface, described in section 4.3.3.2. As SMP-bus devices typically contain low-end eight-bit processors, software access to this interface is simple and fast: an event is signaled by writing a byte to the I/O port dedicated to the interface. The time consumed for signaling an event depends on the speed of the processor and is on the scale of a few microseconds.

4.3.2.2 Interface for SUN4/390

This interface adapts the ZM4 to the backplane bus of the SUN server; it is 16 bits wide and can be accessed by arbitrary processes running on the server in system and user mode in order to output event data to be recorded by ZM4 or any other monitor device connected to it. In order to allow a user process to access a hardware resource like this, a dedicated driver has been incorporated into the UNIX operating system. Outputting information via an operating system driver is rather time consuming in UNIX systems: measurements have shown a time of about 200 microseconds per event.

4.3.2.3 Interface to Transputers

The Transputer bus interface adapts the ZM4 (more precisely: the event recorder) to the 32-bit family of Transputers, i.e. the T414, T425 and T800, by directly accessing the signals at the socket of the respective microprocessor. For this purpose a dedicated fixed/flexible printed circuit board was developed, which grabs all relevant signals of the Transputer via an intermediate socket and connects them to the interface board on the other side.

An event is signaled to the external monitor via this interface by an assignment to a variable that is located in a memory area dedicated to monitoring. The hardware signals associated with this assignment are then recognized by the interface and transformed into an event description and the request signal. This mechanism allows all 48 bits to be transferred within a single instruction of the Transputer: 16 bits are transferred within the address and 32 bits as the value assigned, yielding a very low overhead for signaling an event to the ZM4. The time consumed for signaling an event is about 200 nanoseconds.

A special feature of this interface is the hardware event filter, which allows the inclusion or exclusion of each possible event separately: the 16-bit portion of the event data from the object system is compared with the set of all events to be included, and only if the event matches will a request to the event recorder be issued. The information as to which events are relevant or irrelevant is specified by the user of the monitor. Specifying these events is possible by defining ranges of events and in terms of binary patterns. This user-supplied information is parsed, transformed and transferred to the interface via the event recorder using a serial protocol.

Fig. 7: General structure of a microprocessor bus interface

4.3.3 Special Purpose Interfaces

Sometimes neither the processor bus can be accessed nor is a parallel port available that can be used for outputting event data. In such cases a special purpose interface has to be designed in order to perform the monitoring task. The following four interfaces give an impression that constructing such special purpose interfaces is an interesting challenge for the designer.

4.3.3.1 SUPRENUM Interface

The processor boards of the German supercomputer SUPRENUM have no parallel port which could be used for monitoring purposes. Interfacing to SUPRENUM is made even harder by the fact that the backplane bus, which interconnects the SUPRENUM boards, is a shared resource with a message-oriented protocol. This bus cannot be used for monitoring because of the high setup time for transferring a message.

Nevertheless, a 48-bit wide event description is output to the ZM4 from each SUPRENUM board without disturbing any resource of that board: for diagnosis purposes, the front cover of every SUPRENUM board contains a number of LEDs and a seven-segment display capable of showing 16 different patterns. This restriction to 16 of the 128 possible patterns is imposed by the SUPRENUM board's hardware.

As not all of the 16 possible bit patterns were used by the diagnosis device, it was possible to also use this display for monitoring, by outputting the event description piecewise. To carry this out, two prerequisites are necessary: 1) a software driver that successively transfers the 48-bit wide event description given as its calling parameter to the seven-segment display on the front cover, in 16 pieces of 3 bits each; 2) a hardware interface that successively decodes the 16 displayed patterns back into the original 3-bit values and assembles these portions into the complete 48-bit event description as issued to the software driver. This interface could be realized within a small area of 5 cm by 10 cm thanks to the programmable logic devices available nowadays [20].
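The driver's inner loop thus reduces to shifting out 3 bits at a time; a sketch, with a placeholder for the board-specific display access:

    #include <stdint.h>

    extern void write_display(uint8_t pattern);  /* hypothetical driver call */

    /* Emit a 48-bit event description as 16 successive 3-bit patterns
     * on the diagnosis display; the external interface reassembles them. */
    void mon_event48(uint64_t descr)             /* lower 48 bits are used */
    {
        for (int i = 0; i < 16; i++) {
            write_display((uint8_t)(descr & 0x7));  /* next 3 bits */
            descr >>= 3;
        }
    }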

The time for outputting an event is composed of the time necessary to actually shift out the event data and the overhead necessary for changing from user to system state in the operating system. Shifting out the event data was accomplished in 40 microseconds, while the operating system overhead took double that time. So it can be stated that the rather complicated procedure for outputting the event data has only a marginal influence on the time consumed for signaling an event. The main overhead stems from the operating system itself, and this cannot be avoided.

4.3.3.2 Single Signal Interface

Quite different requirements on an interface arise if, in addition to the dynamic behavior of the software, the resulting hardware activities have to be analyzed, too. This is especially the case when software for controlling machines, or driver programs for peripheral devices in computers, are the point of interest. In our experience, the effects of software activities on the hardware controlled by it can often be uniquely identified by changes in the logical value of certain lines; e.g. a motor can be switched on or off, for which exactly one line is necessary.

This type of monitoring can be called hybrid in a double sense: one aspect is hybrid monitoring by recognizing events in software and recording and processing them by an external hardware monitor. The other aspect is combining hybrid monitoring with pure hardware monitoring.

With ZM4 the latter method can be applied by means of the single signal interface. It monitors its input lines and triggers the event recording mechanism every time the specified trigger condition is valid for at least one of the signal lines. Trigger conditions can be 1) a rising edge, 2) a falling edge, and 3) an arbitrary edge in the signal slope. The line (or lines) causing the recording of an event record can be identified, and the signal shape can be reconstructed from the event trace recorded by the ZM4 by means of the trace evaluation software SIMPLE [19].

4.3.3.3 Video Interface

The purpose of the video interface differs significantly from all other interfaces described so far. The other interfaces deal with binary data, extracting from changes on certain input lines the information as to when to request the recording of an event, and directly transmitting the rest of the lines as the event description. In contrast, the video interface issues a request for event recording every time a video screen is completely written, and the event description associated with this event is a measure of the brightness of the picture previously written on the screen. This interface is used for measuring the time it takes to transfer multimedia data from, e.g., a video recorder as input to the screen of a workstation as output via a communication network. Together with monitoring the network software, this interface allows a full end-to-end analysis of the hardware and software parts of a multimedia communication testbed.

4.3.3.4 V24-Handshake Interface

Although a large proportion of computer systems incorporates a parallel port that can be used for monitoring, there are also computers without a parallel interface, or computers where this interface is dedicated to other purposes. A serial interface is typically available on every computer, mainly for historical reasons: terminal equipment was always connected to the computer via a serial link. It is not difficult to use the standard serial controller for outputting event information via this serial interface. However, this imposes a severe restriction on the speed of signaling an event. Except for serial interfaces with very special controllers, the bit rate is limited to 19200 bit/s. This results in a time of half a millisecond for signaling one event. As many devices are even slower, this interface cannot efficiently be used for monitoring.

A serial interface that conforms to the V24 standard does not only contain the serial send and receive signals but also signals for handshaking. As these signals can be directly accessed by the processor, two of them can be used for a simple serial protocol. One line serves as the data line while the other one is used for clocking the serial data into a shift register on the interface. By this means, data rates of 500 kbit/s can be achieved, yielding only 16 μs for an 8-bit event. This reduces the time for signaling an event by a factor of up to 1000.
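A sketch of such a two-line protocol; the line-setting calls stand for whatever register or ioctl access the serial controller offers and are placeholders:

    #include <stdint.h>

    extern void set_dtr(int level);   /* data line  (placeholder) */
    extern void set_rts(int level);   /* clock line (placeholder) */

    /* Bit-bang an 8-bit event over the two handshake lines: put each
     * data bit on one line and toggle the other line, which clocks the
     * bit into a shift register on the interface. */
    void mon_event8(uint8_t token)
    {
        for (int i = 7; i >= 0; i--) {
            set_dtr((token >> i) & 1);  /* data bit on the line        */
            set_rts(1);                 /* rising clock edge: shift in */
            set_rts(0);
        }
    }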

5 Abstract View of Measured Data

5.1 The Concept for a General Logical Structure of Measured Data — the Basis for Independence of Measurement and Evaluation

The design and implementation of an evaluation system for measured data is too complex and expensive a task to be done for one special object system or one monitor system only. The three following requirements are essential to make the evaluation system capable of efficiently handling measured data produced by event-driven monitoring of parallel and distributed computer systems:

• Object system independence: there are many differences in structure and function of the systems to be monitored, e.g. in the node architecture and in the configuration of the interconnection network, and there is a variety of operating systems and applications. An evaluation system should be applicable to the measured data coming from all these differently configured computer systems, offering a wide variety of functions.

• Monitor independence: the measured data, recorded by different monitor devices, should be accessible in a uniform way, even if it is differently structured, formatted and represented.

• Source reference: data recorded by monitor systems is usually encoded and compressed. But during analysis and presentation of data, users want to work with the problem-oriented identifiers of the hardware and software objects of the monitored system.

Our approach to fulfilling these requirements is to consider the fundamental structure of the measured data, because this is what the evaluation system sees of the monitored system. All requirements mentioned are related to the structure, format, representation and semantics of the measured data. In order to abstract from these properties, a general logical structure for all the different types of measured data is necessary. This logical structure can then be used to define a standardized access method for the measured data. With event-driven monitoring, the proposed general logical structure can focus on a general logical event trace structure as depicted in Fig. 8. It relies on the fact that measurements store the physical event records sequentially in a file (event trace file), resulting in a sequence of event records sorted according to increasing time.

A section of the event trace which has been continuously recorded is called a trace segment. A trace segment describes the dynamic behavior of the monitored system during a time interval in which none of the detected events was lost. Knowledge of the segment borders is important, especially for validation tools based on event traces. Usually each trace segment begins with a special data record, the so-called segment header, which contains some useful information about the following segment, or is simply used to mark the beginning of a new trace segment. The segment header is followed by an arbitrary number of event records, each consisting of record fields, one of which represents the acquisition time of the event record. With the hierarchy event trace / trace segment / event record / record field, there is a general logical structure that allows abstraction from the physical structure and representation of the measured data.

Fig. 8: General Event Trace Structure
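Modeled as data types, the hierarchy might look as follows; the concrete fields are illustrative, since with TDL/POET they are described per measurement rather than hard-coded:

    /* General logical event trace structure, bottom-up. */

    typedef struct {
        const char *name;          /* problem-oriented field name       */
        long        value;         /* decoded value of the record field */
    } record_field;

    typedef struct {               /* one event record                  */
        double        time;        /* the acquisition-time record field */
        record_field *fields;      /* the remaining record fields       */
        int           n_fields;
    } trace_event;

    typedef struct {               /* continuously recorded interval,
                                      i.e. no detected event was lost   */
        trace_event *records;      /* preceded by a segment header      */
        int          n_records;
    } trace_segment;

    typedef struct {
        trace_segment *segments;
        int            n_segments;
    } event_trace;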

5.2 TDL/POET — a Basic Tool for Accessing Measured Data

Using this general logical event trace structure, the access tool TDL/POET meets the requirements mentioned above. The basic idea is to consider the measured data as a generic abstract data structure. The evaluation system can access the measured data only via a uniform and standardized set of generic procedures. Using these procedures, an evaluation system is able to abstract from different data formats and representations and thus becomes independent of the monitor device(s) and of the monitored object systems. The tool consists of the three components POET, TDL and FDL [18], see Fig. 9.

• TDL (Trace Description Language): the language TDL is designed for a problem-oriented description of event traces. Compiling a TDL description into a corresponding binary access key file has the advantage that syntactic and semantic correctness is checked once, before evaluation. The development of TDL had two principal aims: the first was to make a language available which clearly and naturally reflects the fundamental structure of an event trace; the second was to enable even users not familiar with all details of the language to read and understand a given TDL description. In addition, a TDL description provides documentation of the performed measurement.

• POET (Problem Oriented Event Trace Interface): the POET library is a monitor-independent function interface which enables the user to access measured data stored in event trace files in a problem-oriented manner. In order to be able to access and decode the differently structured measured data, the POET functions use the access key file, which contains a complete description of the formats and properties of the measured data. For efficiency, the key file is in a binary and compact format. In addition, the access key file includes the user-defined (problem-oriented) identifiers for the recorded values, enabling the required source reference.

• FDL (Filter Description Language): the third component extends the capabilities of TDL/POET by allowing user-defined views on the measured data. FDL is an approach similar to TDL; it is used for specifying rules for filtering event records depending on the values of their record fields. The problem-oriented identifiers of the TDL file are also used for filtering.

The tools can be used to analyze event traces recorded by ZM4 or by other monitor systems such as network analyzers, logic analyzers and software monitors, or even traces generated by simulation tools. POET is an open interface. This means that the user can build his own customized evaluation tools using the POET function library.

5.3 Approaches to Transparent Trace Access

Configuration files or some sort of data description language are often used in order to make a system independent of the format of its input data. Our work on TDL was inspired by the ISO standard ASN.1 (Abstract Syntax Notation One), which is used in some protocol analyzers to describe the format of data packets. A similar approach to describing and filtering monitoring data was used by Miller et al. in the DPM project (Distributed Program Monitor) [16]. Their language allows the description of the name, number and size of the components in an event record.

The description of trace structures such as segments, and of the physical representation of data values, is not supported; its main targets are distributed systems with send/receive communication. In our opinion, the most important work on describing events was the definition of the event trace description language EDL by Bates and Wileden [1], who also introduced the term behavioral abstraction. The main purpose of EDL is the definition of complex events out of primitive events. In EDL, attributes of the primitive events can be defined, but not their format or representation [2].

Fig. 9: Event Trace Access with TDL/POET/FILTER

5.4 The Performance Evaluation Tools of SIMPLE

SIMPLE (Source-related and Integrated Multiprocessor and -computer Performance evaluation, modeLing and visualization Environment) is a tool environment designed and implemented for the analysis of arbitrarily formatted event traces. It runs on UNIX and MS-DOS systems. SIMPLE has a modular structure and standardized interfaces. Therefore it is easily extensible, and tools which were developed and implemented by others can be integrated into SIMPLE with little effort. This section gives a short overview of the main components and the flow of data within the SIMPLE environment (see Fig. 10). For a more complete discussion and an application example see [18].

5.4.1 Global View.

Using a distributed monitor system results in several independent event traces. The first evaluation step is to generate a global event trace (tool MERGE) in order to get a global view of the whole object system. MERGE takes the local event trace files and the corresponding access key files as input and generates a global event trace and the corresponding access key file. In the global trace, the event records of the local event traces are sorted according to their time stamps.

5.4.2 Trace Validation.

The next step is often forgotten but nevertheless necessary: before doing any trace analysis, it should be tested whether the measurement was performed correctly. The tool CHECKTRACE performs some simple standard tests which can be applied to all event traces. The tool VARUS (VAlidating RUles checking System) enables the user to specify rules (assertions) in a formal language for validating the event trace.

5.4.3 Trace Analysis.

There are two standard tools for trace analysis: the tool LIST for the generation of readable trace protocols, and the tool TRCSTAT for the computation of frequency, duration and other performance indices. Typical results are histograms or pie charts. Time-state diagrams (Gantt charts) are generated with the tool GANTT.

5.4.4 Trace Animation.

The dynamic visualization of an event trace presents the monitored dynamic behavior at a speed which can be followed by the user, exposing properties of the program or system that might otherwise be difficult to understand or might even remain unnoticed. The tools SMART (Slow Motion Animated Review of Traces), which can be used on any character-oriented terminal, and VISIMON, which offers enhanced graphics capabilities and is based on X-Windows, support the layout of the animation in an animation description language.

6 Statistics-based Trace Analysis

Even if the level of abstraction is carefully chosen, event-based measurements produce a large number of events that cannot all be analyzed in detail. In order to cope with this large amount of data, statistical methods have to be applied that make it possible to find those regions of the monitored data that show strange behavior, e.g. long waiting times at a barrier or for receiving a message. Once such a location is found, more detailed analyses may be performed, as depicted in the following sections. This section deals with elementary statistical methods and tools for finding these locations.

The first investigation of a trace is counting the occurrences of the events instrumented in the program. Often there is a property to be obeyed by some of the events, e.g. the number of send events must be equal to the number of receive events (provided that only successful send and receive actions are counted). The relative number of events can also be an interesting parameter. Beyond this simple counting of events, inter-arrival times between events (i.e. jobs to compute, packets to transfer, …) may be of particular interest. Such properties can be regarded as stochastic variables, and the whole arsenal of statistical methods and tools can be applied to them. Often, simple evaluations like histograms, scatter plots and box plots give valuable hints for further analysis.
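As a sketch of such an elementary evaluation, the following computes inter-arrival statistics from a sorted list of time stamps; the representation is ours, not a SIMPLE data structure:

    #include <math.h>
    #include <stdio.h>

    /* Inter-arrival times of events of one type, in the spirit of a
     * DISTANCE evaluation: minimum, maximum and mean of t[i] - t[i-1]. */
    void interarrival_stats(const double *t, int n)  /* sorted stamps */
    {
        double sum = 0.0, min = INFINITY, max = -INFINITY;
        for (int i = 1; i < n; i++) {
            double d = t[i] - t[i - 1];
            sum += d;
            if (d < min) min = d;
            if (d > max) max = d;
        }
        if (n > 1)
            printf("n=%d min=%g max=%g mean=%g\n",
                   n - 1, min, max, sum / (n - 1));
    }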

Besides statistical tools like the commercial S package from AT&T or spreadsheet tools like Excel from Microsoft, there are public domain tools like xmgr or xgraph that can be used for basic statistical data analysis. In the SIMPLE environment there are mainly three tools for extracting statistical data from event traces: list (list event traces), trcstat (trace statistics) and fact (find activities).

• list presents information about every event in the event trace. Depending on a configuration file, all or part of the information is presented, in a format that can also be configured.

• trcstat computes basic trace statistics as specified in the configuration file by the keywords:

  – FREQUENCY: count events, for frequency analysis, usage of services, etc. in the program run;

  – DISTANCE: evaluate the distance between events of the same type, e.g. inter-arrival times or processing times;

  – DURATION: evaluate the duration of segments, e.g. runtimes of functions or message delivery times.

• fact finds event sequences that define activities. By means of a configuration file, activities can be specified in the same way as regular expressions.

Fig. 10: SIMPLE Environment

The latter two tools can be instructed to compute the following statistical properties of the trace: the number of occurrences of a specified condition, minimum, maximum, mean value, median, 25% quantile and 75% quantile.
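A simple sort-based way to obtain the median and the quartiles; whether trcstat and fact use exactly this quantile definition is not stated here, so the rounding rule below is an assumption:

    #include <stdlib.h>

    static int cmp_dbl(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* q = 0.25, 0.5 or 0.75 for the quartiles and the median;
     * sorts v in place and picks the nearest-index element. */
    double quantile(double *v, int n, double q)
    {
        qsort(v, n, sizeof *v, cmp_dbl);
        int idx = (int)(q * (n - 1) + 0.5);   /* round to nearest index */
        return v[idx];
    }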

7 State Diagrams

State diagrams show the system's behavior abstracted to a sequence of states the system is currently in. State changes only occur when the event causing the state change occurs. In practice, these so-called Gantt charts prove to be a very important aid for understanding the reasons for unexpected system behavior.

7.1 Example: Monitoring Program Behavior on a Multiprocessor System

This example shows how Gantt charts can be used to understand the behavior of a parallel program on a multiprocessor. In the SUPRENUM [21] multiprocessor, up to 256 nodes are grouped in clusters of size 16, and the interconnection network is a two-level bus system: processors within the same cluster communicate via a fast parallel cluster bus, whereas inter-cluster communication is done via the torus-shaped serial SUPRENUM bus system.

The parallel program under study is a ray tracer. Ray tracing is a computer graphics method for generating high-quality images from formal descriptions of a scene. For the scope of this paper, the reader need not be familiar with ray tracing. In the implementation considered here, the algorithm is parallelized in the following way: there is one master who does administrative work, distributes jobs to an arbitrary number of servants, receives the results and writes them to an output file. The master communicates with all of the servants by message passing, but there is no communication between any two servants. Servants receive a job, work on the job (this mainly involves geometric intersection computations) and return the results to the master.

One job constitutes only a small fraction of the total work to be done. The basic structure of the master and servant processes is shown in Fig. 11. A window flow control scheme is employed to ensure that each servant always has work to do: initially, the master has a fixed number of credits from each servant. The master keeps sending jobs to a servant (thereby decrementing the servant's credit count) as long as there are credits from that servant available. With every result, a servant returns one credit to the master. Good load balancing is achieved by this dynamic job assignment and by choosing a small job size.
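A sketch of the master's side of this credit scheme; all communication calls are placeholders, not SUPRENUM primitives, and termination handling is simplified:

    #define NSERV  15
    #define WINDOW  4          /* initial credits per servant */

    extern int  next_job(void);            /* -1 when no jobs remain */
    extern void send_job(int servant, int job);
    extern void recv_result(int *servant); /* blocks for one result  */

    void master(void)
    {
        int credits[NSERV];
        for (int s = 0; s < NSERV; s++) credits[s] = WINDOW;

        for (;;) {
            /* distribute: spend all available credits */
            for (int s = 0; s < NSERV; s++) {
                while (credits[s] > 0) {
                    int job = next_job();
                    if (job < 0) return;   /* draining of outstanding
                                              results omitted here    */
                    send_job(s, job);
                    credits[s]--;          /* one credit spent        */
                }
            }
            /* wait for a result; each result returns one credit */
            int s;
            recv_result(&s);
            credits[s]++;
            /* results are buffered and written to the output file
             * whenever a continuous stretch of pixels is complete */
        }
    }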

7.2 Analyzing the Parallel Program

Monitoring was first applied to a basic version of the ray tracing program in which SUPRENUM's mailbox mechanism was used for the communication between the master and the servants. Mailboxes were used in order to avoid blocking the sender while the receiver is not ready to receive the message. The sender of a message does not hand over the message to the receiver during a rendezvous as in synchronous communication, but puts the message into the receiver's mailbox, from where the receiver can pick it up at a later time. Monitoring a run of this version of the ray tracing program on 16 processors (i.e. with 15 servants) revealed that the servant processor utilization was only about 15%, which means that the servant processors spent about 85% of the total time waiting for a job to arrive. This was the starting point for analyzing the program's behavior in more detail.

Fig. 11: Structure of Master and Servant Processes

Fig. 12: Behavior of the Mailbox Communication (Ray Tracer on two Processors) — Gantt chart showing the master's activities (Distribute Jobs, Send Jobs, Wait for Results, Receive Results, Write Pixels) and the servant's activities (Work, Wait for Job) over a common time axis (approx. 130.00 s to 130.08 s)

Fig. 12 shows a Gantt chart obtained from evaluating a measurement of the ray tracing program during which the ray tracer was running on two processors only, the master and one servant. In the chart, the activities of the master and the servant are shown over a common time axis. One can observe in the chart that the master goes through the following activities in a cyclic fashion: the activities Distribute Jobs (1) and Send Jobs (2) are followed by a Wait for Results activity (3). Then results computed by the servant are received (Receive Results, 4), which is followed by the next Distribute Jobs activity (5).

Since a window flow control scheme is employed to control the number of outstanding jobs, the results received are not the results for the job just sent but for a previous job. Some of the master's cycles also contain a Write Pixels activity (6), during which results are written to the output file. Writing to the output file is not done in each of the master's cycles. This is because pixels have to be written to the output file in correct order.

Results may not be received in the order in which the corresponding jobs have been sent, because the time to process a job varies considerably. Writing to the output file takes place whenever a contiguous stretch of pixels has been processed, as the sketch below illustrates.
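A minimal sketch of this buffering logic, assuming results arrive per image row; all names and the image height are hypothetical:

/* In-order pixel output: results may arrive out of order, but rows are
 * written only when the contiguous prefix starting at next_row is
 * complete. All names are assumptions for illustration. */
extern void write_row(int row);  /* hypothetical file output routine */

#define NROWS 512                /* image height: value assumed */

static int done[NROWS];          /* 1 once the row's result has arrived */
static int next_row = 0;         /* first row not yet written to file   */

void on_result(int row)
{
    done[row] = 1;
    /* Write Pixels: flush the contiguous prefix, if any */
    while (next_row < NROWS && done[next_row]) {
        write_row(next_row);
        next_row++;
    }
}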

The chart in Fig. 12 also reveals a major drawback: obviously, the transition from Send Jobs to Wait for Results (2, 3) on the master processor can only take place synchronously with the transition from Work to Wait for Job (7, 8) on the servant processor. Contrary to what was expected, the master becomes blocked during the Send Jobs activity, which is exactly what was to be avoided by using mailbox communication.

A lot of time is wasted during the Send Jobs activity. The reason for this behavior is as follows: with SUPRENUM, a mailbox is a light-weight process owned by the receiving process and running together with the receiving process on one processor in a time-sharing manner. The scheduling strategy used is round robin.

However, instead of using time slices, each process is allowed to run until it becomes blocked. The master cannot finish his Send Jobs activity because he can put a message in the servant's mailbox only if this mailbox process is actually running. This is not the case until the servant relinquishes the processor because he is waiting for a message. Thus, (asynchronous) mailbox communication behaves like synchronous communication.

As a result for the ray tracing application, the master cannot keep 15 servants busy because he is spending too much time being blocked while sending jobs to the servants. With the knowledge gained by this evaluation, the servants' utilization could be improved from 15% to 60%, resulting in four times the performance of the original version.

8 Causality Analysis

Causal relationships are due to the communication that occurs in parallel and distributed systems working on a common task. Every communication action causally connects the corresponding processes. In total, all cooperating processes are woven into a network of causality. That is the reason why analyzing causal relationships should be a basic method in computer systems performance evaluation.

8.1 Description Language for Causal Relationships

Causal relationships between different processes are associated with communication between these processes.

Fig. 13 represents common communication structures that are used in parallel and distributed systems.

Footnote 3: Keywords in the examples are printed in bold letters.

[Fig. 13: Common communication structures between processes A, B, and C]

Each dependence has to be associated with a unique name dependence_name, which can be found in the diagram created by the hasse tool as a reference. The causing event causing_event[{range}] and the dependent event dependent_event[{range}] can be specified in the same way, i.e. with an event name and an optional range specifier in curly braces. This range specifier allows the user to define requirements on the number of occurrences of one or both of the events, which have to be found in the event trace in order to satisfy the dependence.

Next, the processes in which causing event and dependent event occur can be specified by the keyword sequence ON [QUALIFIER] PROCESS [OF], where QUALIFIER and OF are optional. For example, simply sending a message from process B to process A is known as a one-to-one communication (Fig. 13, left). In the description language, this kind of causal dependence does not need further qualification, and thus, it is formulated as:

DEPENDENCE one-to-one IS
    x ON PROCESS B -> y ON PROCESS A
END

In this simple case, the event x on process B has to occur before the event y on process A in the given event trace.

Broadcast messages and multicast messages are examples of one-to-many communications (Fig. 13). Obviously, the constructs used for specifying a one-to-one communication do not allow for a formulation of a one-to-many communication. Therefore, the hasse description language provides a set of keywords for qualifiers, as listed in Table 1.

A comma-separated list of process names, enclosed in curly braces and called list_of_processes, specifies which processes belong to the set of processes for causing and dependent events. This feature is necessary for providing the keywords ALL, SOME, and number with the capability of specifying which processes have to be searched for the causing or dependent events specified.
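Since Table 1 itself is not reproduced in this extract, the following formulation of a one-to-many dependence is only a sketch: it combines the keyword sequence ON [QUALIFIER] PROCESS [OF] with the ALL qualifier and a process list, but the exact syntax given in Table 1 may differ:

DEPENDENCE one-to-many IS
    x ON PROCESS B -> y ON ALL PROCESS OF { A, C }
END

With SOME or a number in place of ALL, the dependence would presumably be satisfied as soon as the required subset of the listed processes shows the dependent event.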

For protocols with error recovery in the case of corrupted or lost packet transmissions, the specification of causal relationships necessitates an extension of the event name part of the event specification. The causing event may occur several times before the dependent event happens. Therefore, the language construct range in the event name specification part can be used:

DEPENDENCE fault tolerant protocol IS
    send{1,10} ON PROCESS A -> receive ON PROCESS B
END

Here, the sending process A may try to send data to process B up to 10 times. If process B receives the data within the 10 attempts, the given dependence is fulfilled with the first receive on process B. Otherwise the given dependence is violated.

Fig. 14 shows the layout used to represent the causal dependencies of two processes named Initiator and Responder over time (see section 8.2). The pre-area of an event of the process Initiator is marked by dark shading.

In addition, a pop-up window presents information about the causal dependence drawn with the dotted arrow. This information consists of various attributes of the events involved. In particular, the given event interpretation offers a direct relation to the source code.

Table 1: Qualifiers in the hasse description language

Fig. 14: Screen dump of the hasse tool


A hasse configuration file is necessary, whose contents are shown in Fig. 16. Words in upper-case letters are keywords of the description language.

The hasse configuration file contains a causal dependence for each PDU exchanged between the Initiator and Responder (CR, CC, DT, AK, DR) plus one dependence describing the causal relationship between the timer events. The specification of dependence Connection Request says that sending the PDU CR (out_CR) on process Initiator must occur at least once, but may be repeated several times ({1,}) until the PDU is received by process Responder, indicated by in_CR. The caused event in_CR belongs to the oldest out_CR occurring on process Initiator between the two last in_CR events. Numerous out_CR events are allowed because the sending of a CR PDU is supervised by a timer set just after the PDU is sent.
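Since Fig. 16 is not reproduced in this extract, the Connection Request dependence can only be sketched here from the description above; its exact layout in the configuration file is an assumption:

DEPENDENCE Connection Request IS
    out_CR{1,} ON PROCESS Initiator -> in_CR ON PROCESS Responder
END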

In an analogous way, the specification of dependence Data Transfer has to be interpreted. The dependencies Connection Confirmation, Acknowledgment, and Disconnection Request represent the causal relationships for the exchange of the PDUs CC, AK, and DR, respectively. Here, the specified causing event must be followed by exactly one corresponding dependent event, i.e. causing and dependent event alternate. Finally, the Timer dependence has to be specified, which is fulfilled if the causing event set_T is followed either by a timeout_T event, indicating that the timer expired, or by a reset_T event, indicating that the supervised PDU is acknowledged in time.
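Following the same pattern, the Data Transfer dependence and the alternating Acknowledgment dependence might be written as below; the event names out_DT, in_DT, out_AK, and in_AK are assumptions based on the naming scheme used for the CR PDU:

DEPENDENCE Data Transfer IS
    out_DT{1,} ON PROCESS Initiator -> in_DT ON PROCESS Responder
END

DEPENDENCE Acknowledgment IS
    out_AK ON PROCESS Responder -> in_AK ON PROCESS Initiator
END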

The hasse diagram in Fig. 17 reveals that the timer's expiration time is set too short. Despite the late arrival of the first CC PDU, the Initiator consumes it and notifies the user of a successful connection establishment.

In the connected state, data from the Initiator's user causes the Initiator to send a DT PDU to the Responder, set a timer, and wait for an AK PDU from the Responder. The timer again expires before the expected AK PDU of the Responder arrives. This causes the Initiator to repeat sending the first DT PDU. In contrast to the first receipt of the DT PDU, where the Responder notifies the user of the data receipt, the Responder ignores the duplicated data (see footnote 6) and replies to the Initiator by sending a repeated AK PDU.

Fig. 18 shows the shaded part of Fig. 17 in more detail. It proves that the time-out (timeout_T) occurs before the CC PDU is received by the Initiator (in_CC). Additionally, the timer is set (set_T) shortly after the CR PDU is sent (out_CR), and this timer is reset after the CC PDU is received (reset_T).

[Fig. 18: Zoomed part of the hasse diagram, showing the processes Initiator and Responder over time in ms]

The visualization of the causal dependencies between the Initiator and the Responder presented in Fig. 17 and Fig. 18 reveals a delayed communication behavior of the INRES protocol. The obvious recommendation derived from the diagrams is to enlarge the timer value. This saves the bandwidth consumed by the superfluous PDUs and thus speeds up the communication between Initiator and Responder by avoiding the unnecessary time-out. Note that without a tool like hasse, the INRES protocol would probably still work, but in an inefficient and functionally unexpected way: each CR and DT PDU would have been transmitted twice.

Footnote 6: The Responder recognizes duplicated data by means of the sequence number transmitted in each DT and AK PDU.

Table 3: Instrumentation points for INRES protocol

9 Conclusions

This tutorial paper started by introducing the notion of causality. Causality is an important aid for analyzing computer systems in all those cases where a statistical overview of performance is insufficient. Investigating causal dependencies is indispensable for finding out the reason for a particular monitored behavior.

Monitoring was covered in general by presenting the different types of monitoring, the time of monitoring, and the monitoring hooks. Instead of an abstract overview of monitor systems, a universal distributed hardware monitor was discussed in depth. Examples were presented for structuring and implementing such a monitor system, stressing an important feature of ZM4, the ability to adapt it to arbitrary object systems via interfaces. The concept of interfacing can be applied to other monitor systems, like logic analyzers, too.

Monitoring is only one side of the coin: the monitored data have to be evaluated by different methods. Starting from high-level summarizing statistical methods, the real reason for the monitored behavior can be revealed. As learning is easier from examples than from abstract discussions, the topic of detailed trace evaluation was treated with the help of examples that show the usefulness of the methods.

Finally, some words on how to obtain further information. This paper as well as the slides of the tutorial can be obtained in machine-readable form from ftp.informatik.uni-erlangen.de (131.188.47.79) via anonymous ftp. The directory is /pub/doc. Log in as user ftp and give your email address as the password. See the files README and README+abstracts for an overview of all available material. This paper is in tutorialp496.ps.gz; the slides are in the files tutorial496_1.ps.gz and tutorial496_2.ps.gz.

References

[1] Bates, P.C. and Wileden, J.C. (Eds.): A basis for distributed system debugging tools. In Hawaii Int. Conf. on Syst. Sci. 15, Hawaii, 1982.

[2] Bates, P.: Debugging heterogeneous distributed systems using event-based models of behavior. In ACM Sigplan Notices, Workshop on Parallel and Distributed Debugging, vol. 24, no. 1, pp. 11-22, Jan. 1989.

[3] Bemmerl, T.; Lindhof, R. and Treml, T.: The Distributed Monitor System of TOPSYS. In H. Burkhart (Ed.): CONPAR 90-VAPP IV, Joint International Conference on Vector and Parallel Processing. Proceedings, pages 756-764, Zürich, Switzerland, September 1990. Springer, Berlin, LNCS 457.

[4] Dauphin, P.: Combining Functional and Performance Debugging of Parallel and Distributed Systems based on Model-driven Monitoring. In 2nd EUROMICRO Workshop on "Parallel and Distributed Processing", University of Malaga, Spain, pages 463-470, Jan. 26-28, 1994.

[5] Dussa-Zieger, K., Ettl, M., Hofmann, R., and Preißler, O.: Monitoring and Modelling of a Distributed ISDN Test System. In Proceedings Euromicro Workshop on Parallel and Distributed Processing, pages 201-208, San Remo, Italy, January 25-27, 1995.

[6] Gardner, F.M.: Phaselock Techniques. John Wiley & Sons, New York, 2nd edition, 1979.

[7] Heath, M.T. and Etheridge, J.A.: Visualizing the Performance of Parallel Programs. IEEE Software, pages 29-39, September 1991.

[8] Hofmann, R., Klar, R., Mohr, B., Quick, A., and Siegle, M.: Distributed Performance Monitoring: Methods, Tools, and Applications. IEEE Transactions on Parallel and Distributed Systems, 5(6):585-598, June 1994.

[9] Hofmann, R.: Secure Temporal Relationships for Performance Analysis in Parallel and Distributed Systems (in German). Arbeitsberichte des IMMD (Informatik VII), University Erlangen-Nürnberg, 26(3), 1993.

[10] Hogrefe, D.: Estelle, LOTOS und SDL. Springer, Berlin, 1989.

[11] Kern, W.: Concept and Implementation of a Tool Representing Causal Relationships of Event Traces (in German). Master's thesis, Universität Erlangen-Nürnberg, IMMD VII, September 1993.

[12] Klar, R. and Luttenberger, N.: VLSI-based Monitoring of the Inter-Process-Communication of Multi-Microcomputer Systems with Shared Memory. In Proceedings EUROMICRO '86, Microprocessing and Microprogramming, vol. 18, no. 1-5, pages 195-204, Venice, Italy, December 1986.

[13] Lamport, L.: Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7):558-565, July 1978.

[14] Malony, A.D.; Hammerslag, D.H. and Jablonowski, D.J.: Traceview: A Trace Visualization Tool. IEEE Software, September 1991.

[15] Mattern, F.: Verteilte Basisalgorithmen. Springer Verlag, IFB 226, Berlin, 1989.

[16] Miller, B.P.; Macrander, C. and Sechrest, M.: A distributed programs monitor for Berkeley UNIX. Software - Practice and Experience, vol. 16, no. 2, pp. 183-200, Feb. 1986.

[17] Mills, D.L.: Improved algorithms for synchronizing computer network clocks. Computer Communication Review, 24(4):317-327, October 1994.

[18] Mohr, B.: SIMPLE: A performance evaluation tool environment for parallel and distributed systems. In Bode, A. (Ed.): Distributed Memory Computing, 2nd European Conference, EDMCC2, Munich, Germany. Springer LNCS 487, Berlin et al., April 1991, pp. 80-89.

[19] SIMPLE - User's Guide Version 5.3. Internal Report No. 3/92, University Erlangen-Nürnberg, IMMD VII, March 1992.

[20] Siegle, M. and Hofmann, R.: Monitoring Program Behaviour on SUPRENUM. Computer Architecture News, 20(2):332-341, May 1992.

[21] Solchenbach, K. and Trottenberg, U.: SUPRENUM: System essentials and grid applications. Parallel Computing, Amsterdam, North Holland, 1988, vol. 7, pp. 265-281.
