Building Worthy Parallel Applications Using Networked Computers -- A Tutorial

For Synergy V3.0

Yuan Shi

shi@https://www.wendangku.net/doc/003451449.html,

(215) 204-6437 (Voice)

(215) 204-5082 (Fax)

January 1995

© 1995 Temple University

Table of Contents

1. Introduction
   1.1 Synergy Philosophy
2. Synergy V3.0 Architecture
4. Setup Synergy Runtime Environment
5. Parallel Application Development
   5.1 Coarse Grain Parallelism
   5.2 Adaptive Parallel Application Design (CTF)
   5.3 POFP Program Development
      5.3.1 Tuple Space Object Programming
      5.3.2 Pipe Object Programming
      5.3.3 File Object Programming
      5.3.4 Compilation
   5.4 Parallel Program Configuration
      5.4.1 Configuration in CSL
   5.5 Parallel Execution
      5.5.1 Parallel Debugging
      5.5.2 Fault Tolerance
   5.6 Graphic User Interface (License required)
6. System Maintenance
   6.1 Utility Daemon Tests
   6.2 Object Daemon Tests
   6.3 Integrated Object Tests
   6.4 Example Synergy Applications and Sub-directories
8. Concluding Remarks
9. Acknowledgments
10. References

ABSTRACT

Since the downfalls of the leading supercomputer giants, the parallel processing field has been in a bearish atmosphere. We are forced to learn from our failures, not only about the high performance computing market place but also about the fundamental problems in parallel processing. One of these problems is how to make parallel processors effective for all applications.

We propose a coarse grain parallel processing method called Passive Object-Flow Programming (POFP). POFP is a dataflow method. It advocates decoupling parallel programs from processors, and separating functional programming from parallel processing, resource management and process management. This method is based on the observation that the inter-processor communication requirements of multiple processors will outpace any inter-networking designs.

Decoupled parallel programs can be written in any language and processed using any combination of operating systems and communication protocols. Assuming the use of multiple high performance uni-processors and an automatic client/server software generation method, Synergy V3.0 offers automatic processor assignments, automatic load balancing and some fault tolerance benefits.

Information disclosed in this documentation is part of an approved patent (U.S. Patent #5,381,534). Commercial use of the information and the system without written consent from the Technology Transfer Office of Temple University may constitute an infringement of the patent.

1. Introduction

The emergence of low cost, high performance uni-processors forces the enlargement of processing grains in all multi-processor systems. Consequently, individual parallel programs have increased in length and complexity. However, like reliability, parallel processing of multiple communicating sequential programs is not really a functional requirement.

Separating pure functional programming concerns from parallel processing and resource management concerns can greatly simplify the conventional “parallel programming” tasks. For example, the use of dataflow principles can facilitate automatic task scheduling. Smart tools can automate resource management. As long as the application dependent parallel structure is uncovered properly, we can even automatically assign processors to parallel programs in all cases.

Synergy V3.0 is an implementation of the above ideas. It supports parallel processing using multiple “Unix computers” mounted on multiple file systems (or clusters) using TCP/IP. It allows parallel processing of any application using mixed languages, including parallel programming languages. Synergy may be thought of as a successor to Linda¹, PVM² and Express³.

1.1 Synergy Philosophy

Facilitating the best use of computing and networking resources for each application is the key philosophy in Synergy. We advocate competitive resource sharing as opposed to “cycle stealing.” The tactic is to reduce processing time for each application. Multiple running applications would fully exploit system resources. The realization of the objectives, however, requires both quantitative analysis and highly efficient tools.

It is inevitable that parallel programming and debugging will be more time consuming than single thread processing, regardless of how well the application programming interface (API) is designed. The elusive parallel processing results taught us that we must have quantitatively convincing reasons to process an application in parallel before committing to the potential expenses (programming, debugging and future maintenance).

1 Linda is a tuple space parallel programming system led by Dr. David Gelernter, Yale University. Its commercial version is distributed by Scientific Computing Associates, New Haven, CT. (See [2] for details)

2 PVM is a message-passing parallel programming system by Oak Ridge National Laboratory, the University of Tennessee and Emory University. (See [4] for details)

3 Express is a commercial message-passing parallel programming system by ParaSoft, CA. (See [4] for details)

We use Timing Models to evaluate the potential speedups of a parallel program using different processors and networking devices [13]. Timing models capture the orders of timing costs for computing, communication, disk I/O and synchronization requirements. We can quantitatively examine an application’s speedup potential under various processor and networking assumptions. The analysis results delineate the limits of what can be hoped for. When applied in practice, timing models provide guidelines for processing grain selection and experiment design.

Efficiency analysis showed that effective parallel processing should follow an incremental coarse-to-fine grain refinement method. Processors can be added only if there is unexplored parallelism, processors are available and the network is capable of carrying the anticipated load. Hard-wiring programs to processors will only be efficient for a few special applications with restricted input, at the expense of programming difficulties. To improve performance, we took an application-oriented approach in the tool design. Unlike conventional compiler and operating system projects, we build tools to customize a given processing environment for a given application. This customization defines a new infrastructure among the pertinent compilers, operating systems and the application for effective resource exploitation. Simultaneous execution of multiple parallel applications permits exploiting available resources for all users. This makes the networked processors a practical “virtual supercomputer.”

An important advantage of the Synergy compiler-operating system-application infrastructure is its higher level of portability over existing systems. It allows written parallel programs to adapt to any programming, processor and networking technologies without compromising performance.

An important lesson we learned was that mixing parallel processing, resource management and functional programming tools in one language made tool automation and parallel programming unnecessarily difficult. This is especially true for parallel processors employing high performance uni-processors.

2. Synergy V3.0 Architecture

Technically, the Synergy system is an automatic client/server software generation system that can form an effective parallel processor for each application using multiple distributed "Unix-computers." This parallel processor is specifically engineered to process programs inter-connected in an application dependent IPC (Inter-Program Communication/ Synchronization) graph using industry standard compilers, operating systems and communication protocols. This IPC graph exhibits application dependent coarse grain SIMD (Single Instruction Multiple Data), MIMD (Multiple Instruction Multiple Data) and pipeline parallelisms.

Synergy V3.0 supports three passive data objects for program-to-program communication and synchronization:

• Tuple space (a FIFO ordered tuple data manager)

• Pipe (a generic location independent indirect message queue)

• File (a location transparent sequential file)

A passive object is any structured data repository permitting no object creation functions. All commonly known large data objects, such as databases, knowledge bases, hashed files, and ISAM files, can be passive objects provided the object creating operators are absent. Passive objects confine dynamic dataflows into a static IPC graph for any parallel application. This is the basis for automatic customization.

POFP uses a simple open-manipulate-close sequence for each passive object. A one-dimensional Coarse-To-Fine (CTF) decomposition method (see the Adaptive Parallel Application Design section for details) can produce designs of modular parallel programs using passive objects. A global view of the connected parallel programs reveals application dependent coarse grain SIMD, MIMD and pipeline potentials. Processing grain adjustments are done via the work distribution programs (usually called Masters). These adjustments can be made without changing code. All parallel programs can be developed and compiled independently.

Synergy V3.0 kernel includes the following elements:

a) A language injection library (LIL)

b) An application launching program (prun)

c) An application monitor and control program (pcheck)

d) *A parallel program configuration processor (CONF)

e) *A distributed application controller (DAC)

f) **Three utility daemons (pmd, fdd and cid)

g) *Two object daemons (tsh and fsh)

h) A Graphics User Interface (mp)

* User transparent elements.

** Pmd and fdd are user transparent. Cid is not user transparent.

A parallel programmer must use the passive objects for communication and synchronization purposes. These operations are provided via the language injection library (LIL). LIL is linked to source programs at compilation time to generate hostless binaries that can run on any binary compatible platforms.

After making the parallel binaries, the interconnection of the parallel programs (the IPC graph) should be specified in CSL (Configuration Specification Language). Program “prun” starts a parallel application. Prun calls CONF to process the IPC graph and to complete the program/object-to-processor assignments automatically or as specified. It then activates DAC to start the appropriate object daemons and remote processes (via remote cid’s). It preserves the process dependencies until all processes are terminated.

Program “pcheck” functions analogously to the “ps” command in Unix. It monitors parallel applications and keeps track of the parallel processes of each application. Pcheck also allows killing running processes or applications if necessary.

To make remote processors listen to personal commands, there are two lightweight utility daemons:

• Command Interpreter Daemon (cid)

• Port Mapper Daemon (pmd)

Cid interprets a limited set of process control commands from the network for each user account. In other words, parallel users on the same processor need different cid’s. Pmd (the peer leader) provides a "yellow page" service for locating local cid’s. Pmd is automatically started by any cid and is transparent to all users.

Fdd is a fault detection daemon. It is activated by an option in the prun command to detect worker process failures at runtime.

Synergy V3.0 requires no root privileged processes. All parallel processes assume respective user security and resource restrictions defined at account creation. Parallel use of multiple computers imposes no additional security threat to the existing systems.

Theoretically, there should be one object daemon for each supported object type. For the three supported types: tuple space, pipe and files, we saved the pipe daemon by implementing it directly in LIL. Thus, Synergy V3.0 has only two object daemons:

• Tuple space handler (tsh)

• File access handler (fah)

The object daemons, when activated, talk to parallel programs via the LIL operators under the user defined identity (via CSL). They are potentially resource hungry. However, they only "live" on the computers where they are needed and permitted.

To summarize, building parallel applications using Synergy requires the following steps:

a) Parallel program definitions. This requires, preferably, establishing timing models for a given application. Timing model analysis provides decomposition guidelines. Parallel programs and passive objects are defined using these guidelines.

b) Individual program composition using passive objects.

c) Individual program compilation. This makes hostless binaries by compiling the source programs with the Synergy object library (LIL). It may also include moving the binaries to the $HOME/bin directory when appropriate.

d) Application synthesis. This requires a specification of the program-to-program communication and synchronization graph (in CSL). When needed, user preferred program-to-processor bindings are specified as well.

e) Run (prun). At this time the program synthesis information is mapped onto a selected processor pool. Dynamic IPC patterns are generated (by CONF) to guide the behavior of the remote processes; object daemons are started and remote processes are activated (via DAC and remote cid’s).

f) Monitor and control (pcheck).

The Synergy Graphics User Interface (mp) integrates steps d-e-f into one package making the running host the console of a virtual parallel processor.

4. Setup Synergy Runtime Environment

In addition to installing Synergy V3.0 on each computer cluster, there are four requirements for each “parallel” account:

a) An active SNG_PATH symbol definition pointing to the directory where Synergy V3.0 is installed. It is usually /usr/local/synergy.

b) An active command search path ($SNG_PATH/bin) pointing to the directory holding the Synergy binaries.

c) A local host file ($HOME/.sng_hosts). Note that this file is only necessary for a host to be used as an application submission console.

d) An active personal command interpreter (cid) running in the background. Note that the destination of future parallel processes’ graphic display should be defined before starting cid.

Since the local host file is used each time an application is started, it needs to reflect a) all accessible processors; and b) selected hosts for the current application.

The following commands are for creating and maintaining a list of accessible hosts:

• %shosts [ default | ip_addr ]

This command creates a list of host names either using the prefix of the current host IP address or a given IP address by searching the /etc/hosts file.

• %addhost host_name [-f]

This command adds a host entry in the local host file. The -f option forces entry even if the host is not Synergy enabled.

• %delhost host_name [-f]

This command removes an entry from the local host file. The -f option forces the removal even if the host is Synergy enabled.

• %dhosts

This command allows entry removal in batches.

The local host file has a simple format: each line contains an IP address, a host name, a protocol, an operating system type, a login name and a file system identifier. A typical file has the following lines:

129.32.32.102 https://www.wendangku.net/doc/003451449.html, tcp unix shi fsys1

129.32.1.100 https://www.wendangku.net/doc/003451449.html, tcp unix shi fsys2

155.247.182.16 https://www.wendangku.net/doc/003451449.html, tcp unix shi fsys3

Manual modification to this file may be necessary if one uses different login names.

The “chosts” command is for selecting and de-selecting a subset of processors from the above pool for running a parallel application.

The “cds” command is for viewing processor availability for all selected hosts.

5. Parallel Application Development

5.1 Coarse Grain Parallelism

As single processor speeds race into the 100 MIPS range, exploiting parallelism at the instruction level should be deemed obsolete. The execution time differences of assigning different processors to a non-compute-intensive program are diminishing for all practical purposes. The processing grain, however, is a function of the average uni-processor speed and the interconnection network speed [13]. For building scaleable parallel applications, a systematic coarse-to-fine application partitioning method is necessary.

There are three program-to-program dependency topologies (Flynn’s taxonomy) that can facilitate simultaneous execution of programs: SIMD, MIMD and pipeline. Figure 1 illustrates their spatial arrangements and respective speedup factors.

[Figure 1. Parallel Topologies and Speedups. The figure shows three arrangements of three programs each: SIMD (speedup = 3), MIMD (speedup = 3) and pipeline (speedup = (3 x 3)/(3 + 1 + 1) = 9/5).]

In the pipeline case, the speedup approaches 3 (the number of pipe stages) when the number of input data items becomes large.
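More generally (a reading of Figure 1 rather than a formula stated in the text), a pipeline of k stages processing n data items finishes in k + n - 1 stage-times, so its speedup is nk/(k + n - 1); with k = n = 3 this gives 9/5 as in Figure 1, and the speedup approaches k as n grows.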

An application will be effectively processed in parallel using multiple high performance uni-processors only if its potential coarse grain topologies are uncovered. Efficiency analysis showed that finer grain partitions necessarily cause larger inter-processor traffic volume leading to lower computing efficiency [13]. To build scaleable parallel applications, however, we need a coarse-to-fine grain development method.

5.2 Adaptive Parallel Application Design (CTF)

The basic assumption here is that the user has identified the compute-intensive program segments (or pseudo codes). Assuming high performance uni-processors, these program segments must be repetitive, either iteratively or recursively.

CTF has the following steps:

a. Linear cut. This pass cuts a program into segments according to time consuming loops and recursive function calls. It produces segments of three types: Sequential (S), Loop (L) and Recursive (R).

b. Inter-segment data dependency analysis. This pass analyzes the data dependencies between decomposed (S, L and R) segments. This results in an acyclic dataflow graph.

c. L-segment analysis. The objective of this analysis is to identify one independent and computation intensive core amongst all nested loops. In addition to conventional loop dependency analysis, one may have to perform some program transformations to remove "resolvable dependencies." The depth of the selected loop depends on the number of parallel processors and the interconnection network quality. A deeper independent loop should be selected if a large number of processors are connected via high speed networks.

The selected loop will be cut into three pieces: the loop header (containing pre-processing statements and the parallel loop header), the loop body (containing the computing intensive statements) and the loop tail (containing the post-processing statements). The loop master is formed using a loop header and a loop tail. The loop body is all that is needed for a loop worker.

d. R-segment analysis. The objective is to identify whether the recursive function is dependent on the previous activation of the same function. If no dependency is found, the result of the analysis is a two-piece partition of the R-segment: a master that recursively generates K*P independent partial states, where K is a constant and P is the number of pertinent SIMD processors, packages them into working tuples and waits for results; and a worker that fetches any working tuple, unwraps it, recursively investigates all potential paths and returns results appropriately.

For applications using various branch-and-bound methods, there should be a tuple recording the partial optimal results. This tuple should be updated at most at a frequency of a constant times the initial number of independent states (working tuples). The smaller the constant, the more effective the scheme is at preventing data bleeding. A larger constant, however, would improve convergence speed (less redundant work done elsewhere).

e. Merge. If pipelines are to be suppressed, this pass merges linearly dependent (or not parallelizable) segments to a single segment. This is to save communication costs between segments.

f. Selecting IPC objects. We can use tuple space objects to implement parallelizable L and R segments (coarse grain SIMDs); pipes to connect all other segments. File objects are used in all file accesses.

g. Coding (see Section 5.3).

Note that the CTF method only partitions one dimension of nested loops. This simplifies program object interface designs, making CTF a candidate for future automation. Computational experiments also suggested that multi-dimensional (domain) decomposition performs poorly in practice due to synchronization difficulties.

Adhering to the CTF method one can produce a network of modular parallel programs exhibiting application dependent coarse grain SIMD, MIMD and pipeline potentials.

5.3 POFP Program Development

A Synergy program can be coded in any convenient programming language. To communicate and synchronize with other processes (or programs), however, the program must use the following passive object operations:

a) General operations:

obj_id = cnf_open(“name”, 0);

This function creates a thread to the named object. For non-file objects, the second argument is not used. For file objects, the second argument can be “r”, “r+”, “w”, “w+” or “a”. This function returns an integer for subsequent operations.

cnf_close(obj_id);

This function removes a thread to an object.

var = cnf_getf();

This function returns the non-negative value defined in the “factor=” clause in the CSL specification.

var = cnf_getP();

This function returns the number of parallel SIMD processors at runtime.

var = cnf_gett();

This function returns the non-negative value defined in the “threshold=” clause as specified in the CSL file.

cnf_term();

This function removes all threads to all objects in the caller and exits the caller.

b) Tuple Space objects:

length = cnf_tsread(obj_id, tpname_var, buffer, switch);

It reads a tuple with a name matching "tpname_var" into "buffer". The length of the read tuple is returned in "length" and the actual tuple name is stored in "tpname_var". When "switch" = 0 it performs a blocking read. A -1 value specifies a non-blocking read.

length = cnf_tsget(obj_id, tpname_var, buffer, switch);

It reads a tuple with a name matching "tpname_var" into "buffer" and deletes the tuple from the object. The length of the read tuple is returned in "length" and the name of the read tuple is stored in "tpname_var". When "switch" = 0 it performs a blocking get. A -1 switch value instructs a non-blocking get.

status = cnf_tsput(obj_id, "tuple_name", buffer, buffer_size);

It inserts a tuple of "buffer_size" bytes from "buffer" into "obj_id". The tuple name is defined in "tuple_name". The execution status is returned in "status" (status < 0 indicates a failure and status = 0 indicates that an overwrite has occurred).

c) Pipe objects:

status = cnf_read(obj_id, buffer, size);

It reads "size" bytes from the pipe into "buffer". Status < 0 reports a failure. It blocks if the pipe is empty.

status = cnf_write(obj_id, buffer, size);

It writes "size" bytes from "buffer" to "obj_id". Status < 0 is a failure.

d) File objects:

cnf_fgets(obj_id, buffer)

cnf_fputs(obj_id, buffer)

cnf_fgetc(obj_id, buffer)

cnf_fputc(obj_id, buffer)

cnf_fseek(obj_id, pos, offset)

cnf_fflush(obj_id)

cnf_fread(obj_id, buffer, size, nitems)

cnf_fwrite(obj_id, buffer, size, nitems)

These functions assume the same semantics as the ordinary Unix file functions, except that the location and name of the file are transparent to the programmer.

Note that these Synergy functions are effective only after at least one cnf_open call. This is because the first cnf_open initializes the caller's IPC pattern and inherits any information specified in the application configuration. This information is used in all subsequent operations throughout the caller's lifetime.
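As a minimal illustration of the open-manipulate-close discipline, the following sketch (not taken from the Synergy distribution) sends a message through a pipe object and appends it to a file object using only the operators listed above. The object names "status_pipe" and "log_file" are hypothetical and would have to match the names declared in the CSL configuration; like all Synergy programs, the sketch must be linked with LIL.

============================================================

/* A sketch of the open-manipulate-close discipline (hypothetical). */
char msg[80];

main()
{
    int pd, fd, status;

    pd = cnf_open("status_pipe", 0);    /* Open a pipe object                */
    fd = cnf_open("log_file", "a");     /* Open a file object in append mode */

    strcpy(msg, "stage one finished\n");
    status = cnf_write(pd, msg, strlen(msg) + 1);  /* Send it to the pipe    */
    if (status < 0)
        printf("cnf_write failed\n");
    cnf_fputs(fd, msg);                 /* Also append it to the log file    */

    cnf_close(pd);                      /* Remove the threads to the objects */
    cnf_close(fd);
    cnf_term();                         /* Detach from all objects and exit  */
}

============================================================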

5.3.1 Tuple Space Object Programming

In addition to the tuple binding method differences, the Synergy tuple space object differs from Linda's [2] (and Piranha's [3]) tuple space in the following:

• Unique tuple names. A Synergy tuple space object holds only unique tuples.

• FIFO ordering. Synergy tuples are stored and retrieved in an implicit First-In-First-Out order.

• Typeless. Synergy tuples do not contain type information. The user programs are responsible for data packing and unpacking.

Note that the cnf_tsread and cnf_tsget calls do not require a buffer length but a tuple name variable. The actual length of a retrieved tuple is returned after the call. The tuple name variable is for recording the name filter and is over-written by the retrieved tuple's name.

A coarse grain SIMD normally consists of one master and one worker. The worker program will be running on many hosts (SI) communicating with two tuple space objects (MD). Figure 2 illustrates its typical configuration.

Figure 2. Coarse Grain SIMD Using Tuple Space Objects

Following is a sample tuple “get” program:

============================================================

tsd = cnf_open("TS1", 0); /* Open a tuple space object */ . . .

strcpy(tpname,"*");/* Define a tuple name */

length = cnf_tsget( tsd, tpname, buffer, 0 );

/* Get any tuple from TS1. Block, if non-existent. Actual tuple name

returned in "tpname". Retrieved tuple size is recorded in "length".*/

/* Unpacking */ . . .

============================================================

A typical tuple “put” program is as follows.

============================================================

double column[XRES];

. . .

main()

{

. . .

tsd = cnf_open("TS1", 0);

/* Packing */

for (i=0; i<XRES; i++)

column[i] = ...;

/* Send to object */

sprintf(tpname,"result%d",x);/* Making a unique tuple name */

tplength = 8*XRES;

if ( ( status = cnf_tsput( tsd, tpname, column, tplength ) ) < 0 ) {

printf(" Cnf_tsput ERROR at: %s ", tpname );

exit(2);

}

. . .

}

============================================================

SIMD Termination Control

In an SIMD cluster, the master normally distributes the working tuples to one tuple space object (TS1) then collects results from another tuple space object (TS2, see Figure 2). The workers are replicated to many computers to speed up the processing. They receive "anonymous" working tuples from TS1 and return the uniquely named results to TS2.

They must be informed of the end of the working tuples to avoid an endless wait. Using the FIFO property, we can simply place a sentinel to signal the end. Each worker returns the end tuple to the space object upon recognition of the sentinel. It then exits itself (see program frwrk in the Examples section).

SIMD Load Balancing

Graceful termination of parallel programs does not guarantee a synchronized ending. An application is completed if and only if all of its sub-components are completed. Thus, we must minimize the waiting by synchronizing the termination points of all components. This is the classic "load balancing problem." A large wait penalty is caused not only by heterogeneous computing and communication components but also by the varying amounts of work embedded in the distributed working tuples.

In all our experiments, we found that the SIMD synchronization penalty is more significant than the commonly feared communication overhead. In an Ethernet (1 megabyte per second) environment, the communication cost per transaction is normally in seconds, while the synchronization points can differ by much greater amounts.

The "load balancing hypothesis" is that we can minimize the wait if we can somehow calculate and place the "right" amount work in each working tuple. The working tuple calculation for parallel computers and loop scheduling for vector processors are technically the same problem.

Theoretically, ignoring the communication overhead, we can minimize the waiting time if we distribute the work in the smallest possible unit (say one loop per tuple). Assuming that the actual work load is uniformly distributed, "load balancing" should be achieved because the fast computers will automatically fetch more tuples than the slower ones. Considering the communication overhead, however, finding the optimal granularity becomes a non-linear optimization problem [10]. We looked for heuristics. Here we present the following algorithms (a small sketch computing the resulting tuple sizes follows the list):

• Fixed chunking. Each working tuple contains the same number of data items. Analysis and experiments showed that the optimal performance, assuming uniform work distribution, can be achieved if the work size is N divided by the sum of the Pi (i = 1, ..., P), where N is the total number of data items to be computed, Pi is the estimated processing power of worker i measured in relative index values [10], and P is the number of parallel workers. For example, if N=1000, Pi=1 (all processors of equal power) and P=10, the optimal tuple size is 1000/10 = 100.

• Guided Self-Scheduling [7]. Assuming there are N data items to be computed and P parallel workers, GSS tuple sizes are calculated according to the following algorithm:

R0 = N /* Ri: ith remaining size */
Gi = Ri/P ( = (1-1/P)^i * N/P ) /* Gi: ith tuple size */
Ri+1 = Ri - Gi
until Ri = 1.

For example, if N=1000 and P=2, we have tuples of the following sizes:

500, 250, 125, 63, 32, 16, 8, 4, 2, 1

GSS puts too much work in the beginning. It performs poorly even when processors are of the same power and the work distribution is relatively uniform [8,9].

• Factoring [8]. Assuming there are N data items, P parallel workers, a threshold t > 0 and a real factor value f (0 < f < 1), tuple sizes are calculated according to the following algorithm (each wave produces P tuples of size Gi):

R0 = N
Gi = Ri * f / P
Ri+1 = Ri - (P*Gi)
until Ri < t.

For example, if N=1000, P=2, f=0.5 and t=1, we have the following tuple sizes:

250, 250, 125, 125, 63, 63, 32, 32, 16, 16, 8, 8, 4, 4, 2, 2, 1, 1

The presence of a "knob" (0 < f < 1) makes factoring adaptable to various work distribution patterns [12].
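To make the arithmetic above concrete, the following stand-alone sketch (not part of Synergy) prints the tuple sizes produced by the three strategies for the examples quoted in the text. It assumes processors of equal power for fixed chunking, and its integer arithmetic truncates, so a few sizes may differ by one from the sequences shown above.

============================================================

#include <stdio.h>

/* Fixed chunking with equal processing powers: size = N / P. */
static void fixed_chunking(int N, int P)
{
    printf("fixed chunking: %d tuples of size %d\n", P, N / P);
}

/* Guided Self-Scheduling: Gi = ceil(Ri/P), Ri+1 = Ri - Gi. */
static void gss(int N, int P)
{
    int R = N, G;
    printf("GSS sizes:");
    while (R > 0) {
        G = (R + P - 1) / P;
        printf(" %d", G);
        R -= G;
    }
    printf("\n");
}

/* Factoring: each wave emits P tuples of size Gi = Ri*f/P. */
static void factoring(int N, int P, double f, int t)
{
    int R = N, G, i;
    printf("factoring sizes:");
    while (R >= t && R > 0) {
        G = (int)(R * f / P);
        if (G < 1) G = 1;
        for (i = 0; i < P && R > 0; i++) {
            printf(" %d", G > R ? R : G);
            R -= (G > R ? R : G);
        }
    }
    printf("\n");
}

int main(void)
{
    fixed_chunking(1000, 10);   /* -> 10 tuples of size 100 */
    gss(1000, 2);               /* -> 500 250 125 63 ...    */
    factoring(1000, 2, 0.5, 1); /* -> 250 250 125 125 ...   */
    return 0;
}

============================================================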

Theoretically, variable size scheduling algorithms (factoring and GSS) can out-perform fixed chunking when all processors are of the same power and the work distribution is uniform. The savings come from the reduced connection costs due to heavily packed tuples in the beginning. In reality, however, both conditions seldom hold true at the same time.

For heterogeneous environments, since all working tuples are “anonymous”, a weak processor can grab a heavy tuple in the beginning. It may not be able to finish long after all others are done. A similar problem occurs when the work distribution is not uniform (such as the Mandelbrot example, next section) even if all processors are of the same power. The situation becomes worse if the processors are shared by many users.

Fixed chunking can avoid this problem by generating working assignments of “equal size”; even though the amount of work implied in all tuples may not be identical. The chunk size can effectively control the synchronization time losses. This is analogous to the “dollar cost averaging” method used in the financial market for reducing risks.

Moreover, if automatic load balancing is desired, fixed chunking is the only choice. This is because its optimal size can be analytically computed after a calibration phase. Experiments show that in many cases when the work load distribution is relatively uniform, the extra overhead of the load calibration phase is well compensated by the savings in reduced synchronization time (see $SNG_PATH/ssc/albm for an example).

Synergy V3.0 supports the specification of factor (f), threshold (t) and the chunk size values at runtime. The number of parallel workers is calculated automatically.

Examples

Coarse grain SIMD parallel processing can be best illustrated using the Mandelbrot display program. Here we must "paint" an image of X by Y pixels. Each pixel (x,y)'s color is defined by the number of iterations resulting from the convergence (or divergence) of the corresponding complex point (ix,y) in the Mandelbrot equation for a given complex domain. Each pixel's calculation is independent of all others. Thus this is usually called an "embarrassingly parallel" application.
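For reference, the per-pixel computation (elided as "calculate the fractal here" in the worker sketch below) is the classic escape-time iteration. A minimal stand-alone version might look like the following; the escape radius of 2 and the pixel-to-plane mapping follow the usual convention and are not taken from the tutorial's code.

============================================================

#include <stdio.h>

/* Escape-time iteration count for the complex point (cx, cy);
 * a standard formulation, not the tutorial's exact code.      */
static int mandel_iterations(double cx, double cy, int iterat)
{
    double tr = cx, ti = cy, t2;
    int i;

    for (i = 0; i < iterat && (tr*tr + ti*ti) <= 4.0; i++) {
        t2 = tr*tr - ti*ti + cx;   /* z := z*z + c, real part      */
        ti = 2.0*tr*ti + cy;       /* z := z*z + c, imaginary part */
        tr = t2;
    }
    return i;                      /* used as the color index      */
}

int main(void)
{
    /* One point of the challenge domain (-2,-2,2,2) at 256 iterations. */
    printf("%d\n", mandel_iterations(-0.5, 0.0, 256));
    return 0;
}

============================================================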

A coarse grain SIMD Mandelbrot has a master (frclnt) that distributes screen coordinates (x,y) to a tuple space object (coords), waits for the color indices from another tuple space object (colors) and paints the screen with the returned colors. Parallel workers (frwrk) receive coordinates from "coords", calculate the colors and return the color indices to object "colors".

Figure 3. Mandelbrot SIMD Topology

There are only two actual programs: frclnt (or the master) and frwrk (or the worker). The master program must package the coordinates into tuples of proper sizes to achieve the load balancing effects. The worker program must contain a termination control protocol to ensure a graceful exit.

Sketches of these two programs are provided below. Actual programs can be found in directory "$(SNG_PATH)/apps/ssc/fractal". The reader is challenged to find the best factor value for a parallel computing environment using the given programs for a fixed complex domain: (-2,-2,2,2). A more time consuming challenge is to beat the best factored performance using any other algorithms, such as fixed chunking, GSS or anything else.

====================================================

/* frclnt -- This is a SIMD master. */
main()
{
    tsd = cnf_open("coords", 0);            /* Open objects */
    res = cnf_open("colors", 0);
    f = (float) cnf_getf() / 100.0;         /* Fetch factor value defined in CSL */
    P = cnf_getP();
    t = cnf_gett();                         /* Fetch threshold value definition */
    tplength = 9 * 8;                       /* Define working tuple length */
    R = XRES;                               /* Initialize R0 */
    ix = 0;                                 /* Define starting column */
    G = 1;                                  /* Dummy assignment to enter the loop */
    while ((R > t) && (G > 0)) {
        G = (int) R * f / P;                /* Calculate the grain size G for each wave */
        R = R - G * P;                      /* Pass along the remainder to the next wave */
        if (G > 0) {
            for (i = 0; i < P; i++) {       /* One working tuple per worker in this wave */
                ituple[0] = (double) G;     /* ith wave grain size */
                ituple[1] = (double) ix;    /* Starting column index */
                ituple[2] = xmin;           /* Complex domain def. */
                ituple[3] = xmax;
                ituple[4] = ymin;
                ituple[5] = ymax;
                ituple[6] = (double) XRES;  /* Screen resolution def. */
                ituple[7] = (double) YRES;
                ituple[8] = (double) iterat;/* Loop limit def. */
                sprintf(tpname, "i%d", ix); /* Define a tuple name */
                status = cnf_tsput(tsd, tpname, ituple, tplength);
                ix += G;                    /* Increment column index by G */
            }
        }
    }
    if (R > 0) {                            /* Complete the left-over */
        ituple[0] = (double) R;
        ituple[1] = (double) ix;
        ituple[2] = xmin;
        ituple[3] = xmax;
        ituple[4] = ymin;
        ituple[5] = ymax;
        ituple[6] = (double) XRES;
        ituple[7] = (double) YRES;
        ituple[8] = (double) iterat;
        sprintf(tpname, "i%d", ix);
        status = cnf_tsput(tsd, tpname, ituple, tplength);  /* Send the left-over */
    }
    /* Now receive the results */
    received = 0;
    while (received < XRES) {
        strcpy(tpname, "*");
        len = cnf_tsget(res, tpname, otuple, 0);
        ix = (int) otuple[1];               /* Unpacking */
        G = (int) otuple[0];
        iy = 2;
        for (i = 0; i < G; i++) {           /* One screen column per iteration */
            received++;
            for (j = 0; j < YRES; j++) {
                ip = (int) otuple[iy];
                if (DISPLAY) {              /* Paint the screen */
                    XSetForeground(dpy, gc, ip);
                    XDrawPoint(dpy, win, gc, ix, j);
                }
                iy++;
            }
            ix++;
        }
    }
    /* Insert the sentinel */
    ituple[0] = (double) 0;
    sprintf(tpname, "i%d", XRES + 1);
    status = cnf_tsput(tsd, tpname, ituple, tplength);
    cnf_term();
}

====================================================

====================================================

/* frwrk -- This is a SIMD worker. */
main()
{
    tsd = cnf_open("coords", 0);
    res = cnf_open("colors", 0);
    while (1) {
        strcpy(tpname, "*");
        len = cnf_tsget(tsd, tpname, ituple, 0);
        if (len > 0) {                      /* Normal receive */
            G = (int) ituple[0];            /* Grain size; zero marks the sentinel */
            if (G == 0) {                   /* Found the sentinel */
                /* Put sentinel back to inform others */
                status = cnf_tsput(tsd, tpname, ituple, len);
                printf("Worker found last tuple. Bye.\n");
                cnf_term();
            }
            /* Unpacking a working tuple */
            . . .
            /* Packing the results while computing them */
            for (g = 0; g < G; g++) {
                x = (ix + g) * xdis / xres + xmin;
                for (iy = yres; iy-- > 0; ) {
                    y = (iy * ydis) / yres + ymin;
                    tr = x;
                    ti = y;
                    for (i = 0; i < iterat; i++) {
                        /* Calculate the fractal here. */
                        . . .
                    }
                    otuple[2 + g*YRES + iy] = (double) htonl(i);
                }
            }
            tplength = (2 + G*YRES) * 8;    /* Define result tuple size */
            status = cnf_tsput(res, tpname, otuple, tplength);
        } else {
            printf(" Bad Transmission. Worker terminated\n");
            cnf_term();
        }
    }
}

====================================================

It should be noted that a FIFO ordered tuple space is important not only for easier termination control but also for load balancing using variable partitioning methods.
