
CAPI: A Coherent Accelerator Processor Interface

J. Stuecheli, B. Blaner, C. R. Johns, M. S. Siegel

Heterogeneous computing systems combine different types of compute elements that share memory. A specific class of heterogeneous systems discussed in this paper pairs traditional general-purpose processing cores and accelerator units. While this arrangement enables significant gains in application performance, device driver overheads and operating system code path overheads can become prohibitive. The I/O interface of a processor chip is a well-suited attachment point from a system design perspective, in that standard server models can be augmented with application-specific accelerators. However, traditional I/O attachment protocols introduce significant device driver and operating system software latencies. With the Coherent Accelerator Processor Interface (CAPI), we enable attaching an accelerator as a coherent CPU peer over the I/O physical interface. The CPU peer features consist of a homogeneous virtual address space across the CPU and accelerator, and hardware-managed caching of this shared data on the I/O device. This attachment method greatly increases the opportunities for acceleration due to the much shorter software path length required to enable its use compared to a traditional I/O model.

Introduction

Modern general-purpose cores are built to execute sequentially defined sequences of instructions at higher throughputs by extracting parallel work. This extraction of parallel work from sequential code requires speculative execution in an attempt to start processing before prior dependent work has completed. While this methodology does enable faster serial code execution, efficiency is lost in both the incorrectly speculated work and the tracking structures required to enable such execution [1]. Thus, while serial cores are important, they come at a cost. While serial cores extract parallel work from sequentially defined program sequences, greater efficiency is possible when the algorithm itself can be made parallel. With parallel algorithms, the hardware can be made much simpler. Taking this concept further, the most efficient execution is possible when the hardware directly implements the algorithm. This level of customization is possible through a custom-designed circuit (called an ASIC, or Application-Specific Integrated Circuit), but development expense and inflexibility limit the applicability of ASICs. This can be mitigated with the use of Field-Programmable Gate Array (FPGA) devices. FPGAs emulate custom logic through arrays of directly programmable logic devices. This flexibility comes at the expense of logic speed and density compared to ASIC devices, though modern FPGAs provide very large logic arrays, on the order of one million logic elements [2, 3], at very reasonable clock speeds.

As applications combine serial code (best suited for general-purpose cores) with highly parallel components (best suited for parallel engines), the requirement arises for efficient communication between heterogeneous computation elements. In current systems, communication between general-purpose cores and external accelerators requires the use of I/O-based software stacks. These software components impose a higher overhead and a more cumbersome communication model than the shared memory model used between multiple CPUs. As such, the overhead and complexity of interfacing to external engines decreases the potential speedup of these heterogeneous systems.

© Copyright 2015 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

0018-8646/15 © 2015 IBM

Digital Object Identifier: 10.1147/JRD.2014.2380198

The emergence of "Big Data" problem spaces, which require analysis of exabytes (10^18 bytes) of data, has further forced a rethinking of computational system design [4]. These data-centric problems differ from traditional problems in several aspects. First, the volume of data is much greater, potentially thousands of times greater than traditional workloads. Beyond the size of the data, inputs are unstructured, as the vast amounts of data to be mined are inherently disorganized [5, 6]. These features drive the need to scan and restructure large volumes of data, which are fed into more manageable data structures for further processing.

If the processing of this large amount of data is visualized as a computational tree, many parallel tasks may process subsets, or "leaves," of the full dataset at once. As the leaves are processed, intermediate results are formed, passed up the tree, and used in the next iterative step in processing the data until the root node, or the final result, of this computational tree is reached. The combination of computation on vast unstructured data at the leaf operations and the serially structured final result generation motivates the use of heterogeneous computation systems. The initial stages are inherently parallel, as the massive volume of data is scanned. These raw data typically consist of irregular data types and are poorly suited for traditional processor register types. In contrast to the initial stages or leaf processing steps, once the data has been filtered and formed into structured data, parallelism is greatly reduced [7].

To address these inefficiencies and the needs of emerging big data workloads, the POWER8* platform introduces the Coherent Accelerator Processor Interface (CAPI). This new interface, which will be described in more detail herein, provides the capability for off-chip accelerators to be plugged into PCIe** (Peripheral Component Interconnect Express**) [8] slots and participate in the system memory coherence protocol as a peer of other caches in the system. Additionally, CAPI enables the use of effective addresses to reference data structures in the same manner as applications running on the cores. These PCIe-card based accelerators can be implemented in FPGAs for development flexibility or hardened in ASIC chips, depending on user requirements.

The PCIe attachment point provides for simple integration of a range of easily developed PCIe-based designs; however, the requirements of a scalable and robust attachment point introduce several challenges. The PCIe protocol does not follow the highly optimized coherent and resilient protocol used amongst the POWER8 modules in the system. This protocol mismatch motivated the creation of a proxy unit resident on the POWER8 chip to isolate the two divergent protocols, enable coherent traffic, and provide failure isolation. These are mandatory requirements for an attached accelerator to act as a peer of other CPUs in the system.

CAPI system description

A block diagram of the CAPI hardware is shown in Figure 1. Each POWER8 processor chip contains a symmetric multi-processor (SMP) bus interconnection fabric which enables the various units to communicate and coherently share system memory. These units are twelve general-purpose cores, two memory controller (MC) blocks, and units to bridge multiple chips in an SMP system. On the POWER8 processor chip, the PCIe Host Bridge (PHB) provides connectivity to PCIe Gen3 I/O links. The Coherent Accelerator Processor Proxy (CAPP) unit, in conjunction with the PHB, acts as the memory coherence, data transfer, interrupt, and address translation agent on the SMP interconnect [9] fabric for PCIe-attached accelerators. These accelerators comprise a POWER Service Layer (PSL) and Accelerator Function Units (AFUs) that reside in an FPGA or ASIC connected to the processor chip by the PCIe Gen3 link. Up to sixteen PCIe lanes per direction are supported. The combination of PSL, PCIe link, PHB, and CAPP provides AFUs with several capabilities. AFUs may operate on data in memory, coherently, as peers of other caches in the system. AFUs further use effective addresses to reference memory, with address translation provided by a memory management unit (MMU) in the PSL. The PSL may also generate interrupts on behalf of AFUs to signal AFU completion, or to signal a system service when a translation fault occurs.
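The roles just described can be summarized in a small data-model sketch. The C fragment below is purely illustrative: the type names and directory size are assumptions for exposition, and only the machine counts (32 master read FSMs, 32 master write FSMs, 16 snoop FSMs) are taken from the text; the real hardware organization is not expressible as simple C structures.

/* Illustrative data model of the CAPI attachment; not the real hardware. */
#include <stdint.h>

typedef enum { LINE_INVALID, LINE_SHARED, LINE_MODIFIED } line_state_t;

typedef struct {                  /* one coherence-directory entry            */
    uint64_t     tag;             /* address tag of a line used by the AFUs   */
    line_state_t state;
} dir_entry_t;

typedef struct {                  /* CAPP: on-chip proxy on the SMP fabric    */
    dir_entry_t dir[4096];        /* directory of lines cached by the PSL     */
    int n_master_read_fsm;        /* 32, per the text                         */
    int n_master_write_fsm;       /* 32                                       */
    int n_snoop_fsm;              /* 16                                       */
} capp_t;

typedef struct {                  /* PSL: service layer in the FPGA or ASIC   */
    dir_entry_t dir[4096];        /* its own directory; data arrays live here */
    /* plus MMU table-walk machines and a translation cache                   */
} psl_t;

typedef struct {                  /* AFU: the user-written accelerator logic  */
    uint64_t next_effective_addr; /* AFUs reference memory by effective addr  */
} afu_t;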

Coherence

In order to provide coherent access to system memory, the CAPP and the PSL each contain a directory of cache lines used by the AFUs. The CAPP snoops the fabric on behalf of the PSL, accesses its local directory, and responds to the fabric with latency that is the same as other caches on the chip. In this way, the insertion of an off-chip coherent accelerator does not affect critical system performance parameters such as cache snoop latency. Snoops that hit in the CAPP directory may generate messages that are sent to the PSL by means of the PHB and PCIe link. The PSL may then respond to the message in a variety of ways depending on the contents of the message.

The PSL may master operations on the SMP interconnect fabric using the combination of the PCIe link, the PHB, and master read and write finite state machines (FSMs) in the CAPP. For example, to store into a line on behalf of an AFU, the PSL must first have ownership of the line. The PSL first checks for presence of the line in its cache directory. If the line is present (directory hit) and in the modified state, the PSL allows the store from the AFU to proceed. However, if the access misses in the PSL directory, then the PSL initiates a fabric master operation to gain ownership of the line and may further request the cache line data. This is accomplished by sending a command to a CAPP master read FSM. The CAPP master FSM performs the access on the fabric, ultimately gains ownership of the line, and sends a message that it has obtained such to the PSL. If the data was also requested, it will be directly returned by the source, which could be a memory controller or another cache in the system, to the PHB, where it is transferred across the PCIe link to the PSL and installed in its cache. The store from the AFU is then allowed to complete.
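As a rough sketch of this store flow, the toy C program below models the PSL-side decision between a directory hit in the modified state and a miss that must be mastered through a CAPP read FSM. All names are invented, and the real logic is implemented as hardware state machines rather than sequential software.

/* Sketch of the PSL store path described above (illustrative only). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } state_t;

static state_t psl_dir[16];                 /* toy PSL directory              */
#define LINE_INDEX(a) (((a) >> 7) & 0xF)    /* 128-byte lines, 16 entries     */

/* Stand-in for a CAPP master read FSM winning ownership on the fabric. */
static void capp_master_read_for_ownership(uint64_t addr, bool want_data)
{
    printf("CAPP: gain ownership of 0x%llx (data %s)\n",
           (unsigned long long)addr, want_data ? "requested" : "not needed");
}

/* AFU store to a line: a modified hit proceeds; otherwise gain ownership. */
static void psl_handle_afu_store(uint64_t addr)
{
    state_t *st = &psl_dir[LINE_INDEX(addr)];

    if (*st != MODIFIED) {
        /* Miss (or insufficient ownership): ask CAPP to master the fabric
         * operation, requesting the data only if no copy is held locally.  */
        capp_master_read_for_ownership(addr, *st == INVALID);
        *st = MODIFIED;       /* installed once ownership (and data) return  */
    }
    printf("PSL: store to 0x%llx completes\n", (unsigned long long)addr);
}

int main(void)
{
    psl_handle_afu_store(0x1000);   /* miss: ownership won via CAPP          */
    psl_handle_afu_store(0x1000);   /* hit in the modified state: immediate  */
    return 0;
}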

To push a line from the PSL cache to memory, which may occur, for example, when a line owned by the PSL needs to be evicted to make space for another line in the cache, the PSL issues a write command to a CAPP master write FSM. The PSL also pushes the modified data to the PHB for write-back to memory and updates the state for the line in its directory to indicate that it no longer owns the line. The master write FSM obtains routing information for the destination of the write data and passes it to the PHB via sideband signals. The PHB then pushes the data onto the fabric to the destination. Additionally, the master write FSM updates the CAPP directory to reflect that the line is now invalid.

In the previous examples, the combination of evicting a line to make room for a new line and reading the new line, with or without intent to modify the line, was illustrated as separate operations. This common combination between the PSL and CAPP is optimized by providing a single compound operation that both evicts a directory entry, possibly with a data push to memory, and loads a new entry into the CAPP directory, possibly with read data provided back to the PSL. A compound command concurrently activates both write and read FSMs in the CAPP to perform the operation. This saves two crossings of the PCIe link compared to the discrete operations.
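A toy model makes the saving concrete: handled as discrete operations, the eviction and the new-line read each cost a command and a completion crossing of the PCIe link, while the compound command folds them into one command and one combined completion. The functions and message accounting below are illustrative assumptions, not the actual transaction layout.

/* Toy comparison of discrete vs. compound evict-and-read handling. */
#include <stdint.h>
#include <stdio.h>

static int pcie_crossings;                  /* PSL <-> CAPP messages sent    */

static void pcie_send(const char *what)     /* one crossing of the PCIe link */
{
    pcie_crossings++;
    printf("  PCIe: %s\n", what);
}

static void discrete_evict_then_read(void)
{
    pcie_send("PSL -> CAPP: write command (evict victim line)");
    pcie_send("CAPP -> PSL: eviction complete message");
    pcie_send("PSL -> CAPP: read command (fetch new line)");
    pcie_send("CAPP -> PSL: ownership/data message");
}

static void compound_evict_and_read(void)
{
    /* One command activates a write FSM and a read FSM concurrently. */
    pcie_send("PSL -> CAPP: compound evict+read command");
    pcie_send("CAPP -> PSL: combined completion message");
}

int main(void)
{
    pcie_crossings = 0;
    discrete_evict_then_read();
    printf("discrete: %d crossings\n", pcie_crossings);

    pcie_crossings = 0;
    compound_evict_and_read();
    printf("compound: %d crossings (two fewer)\n", pcie_crossings);
    return 0;
}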

The PSL is further provisioned with the capability to master reads and writes on the fabric to copy lines to outside of the coherence domain, as would be the case for an I/O device operating with a checkout model of memory. This provision allows AFUs, with no need to maintain coherent copies of lines, to entirely bypass the PSL and CAPP caches.

Address translation

To enable AFUs to reference memory with effective addresses, as would an application running on a core, the PSL contains an MMU comprising table-walk machines to perform address translations and caches of recent translations, thereby frequently avoiding table walks. Table-walk machines use the mechanisms described above to read and update tables in memory during the translation process. Since the PSL contains a translation cache, it must participate in translation invalidation (tlbi) [10] operations on the fabric. The CAPP snoops tlbi operations on behalf of the PSL and sends them in messages to the PSL, either one at a time or bundled into groups. The PSL looks up the address presented by the tlbi in its caches. If the address misses, it responds immediately back to the CAPP tlbi snooper that the operation is complete. If the tlbi hits, the PSL follows a protocol to ensure all storage operations associated with that translation cache entry are completed before sending a completion message to the CAPP tlbi snooper.

Figure 1: CAPI system block diagram.
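A simplified sketch of the PSL side of this tlbi protocol follows: a miss in the translation cache is acknowledged immediately, while a hit drains the storage operations tied to the entry before the completion message is returned. The structures and function names are invented for illustration.

/* Sketch of the PSL response to a snooped tlbi forwarded by the CAPP. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t ea_page;            /* effective-address page of the entry       */
    int      outstanding_ops;    /* storage operations using this translation */
    bool     valid;
} xlate_entry_t;                 /* toy translation-cache entry               */

static xlate_entry_t xlate_cache[8];

static void send_tlbi_complete(uint64_t ea_page)
{
    printf("PSL -> CAPP tlbi snooper: complete for page 0x%llx\n",
           (unsigned long long)ea_page);
}

static void psl_handle_tlbi(uint64_t ea_page)
{
    for (int i = 0; i < 8; i++) {
        xlate_entry_t *e = &xlate_cache[i];
        if (e->valid && e->ea_page == ea_page) {
            /* Hit: all storage operations that used this translation must
             * finish before the completion message may be sent.            */
            while (e->outstanding_ops > 0)
                e->outstanding_ops--;          /* stand-in for draining      */
            e->valid = false;                  /* invalidate the entry       */
            send_tlbi_complete(ea_page);
            return;
        }
    }
    send_tlbi_complete(ea_page);               /* miss: complete immediately */
}

int main(void)
{
    xlate_cache[0] = (xlate_entry_t){ .ea_page = 0x10,
                                      .outstanding_ops = 3, .valid = true };
    psl_handle_tlbi(0x10);   /* hit: drain, invalidate, then complete        */
    psl_handle_tlbi(0x99);   /* miss: complete immediately                   */
    return 0;
}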

Address translations may generate faults requiring notification of system software to repair the fault. For this and other needs, the PSL provides a means to signal interrupts to software. This is accomplished by using the message signaled interrupt (MSI) mechanism provided by the PHB [8]. The PSL sends a command to the PHB using a particular address and data value indicative of the particular interrupt being asserted. The PHB responds as it would to an MSI from any I/O device; the details may be found in [8].

CAPP hardware description

This section considers the hardware structures internal to the CAPP that are required to enable attached accelerators to participate in the distributed cache coherence protocol provided by the SMP interconnect fabric as peers of other caches in the system. The CAPP structures and machines parallel those of the L2 cache directory described in [9], while the data portion of the cache is maintained by the PSL. Figure 2 shows the CAPP hardware in greater detail. The CAPP is divided into three areas: machines and transport, snoop pipeline, and SMP interconnect fabric interface. The SMP interconnect fabric interface provides snooper, master, and data interfaces to the fabric. The snooper interface comprises the reflected command (rcmd) bus and the partial response (presp) buses.

A command issued by a master is broadcast to the fabric on a command/address (cmd/addr) bus and enters the CAPP snoop pipeline on its rcmd bus. The snooped reflected command is decoded, and if it is not one supported by the CAPI, it proceeds no further down the pipeline. If the snooped reflected command is supported, has an address, and requires a CAPP directory lookup, arbitration for read access to the directory occurs in the next pipeline phase. Master FSMs, snoop FSMs, and snooped reflected commands arbitrate for read access to the directory (arb block shown in Figure 2). Having won arbitration, the snooped reflected command reads the directory, and the result may be a cache hit or miss. The address is also compared to addresses held by master and snoop FSMs to see if any are already performing an action on the address. Depending on the outcome, the snoop control logic determines the next action the hardware will take. This may include dispatching to one of the 16 snoop FSMs when, for example, the CAPI owns the line in a modified state and another master is requesting ownership of the line. In this case, the PSL must provide the line as described earlier. A snoop FSM is required to change the CAPP directory state, in which case it must arbitrate for write access to the directory as shown in the figure.
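The path of one snooped reflected command through this pipeline can be sketched roughly as follows. The command encodings, the directory model, and the dispatch decision are simplified assumptions rather than the actual CAPP logic.

/* Rough sketch of the CAPP snoop pipeline for one reflected command. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { CMD_UNSUPPORTED, CMD_READ, CMD_RWITM } rcmd_t;
typedef enum { DIR_MISS, DIR_SHARED, DIR_MODIFIED } dir_state_t;

static dir_state_t capp_dir_lookup(uint64_t addr)      /* toy directory read  */
{
    return (addr & 0x100) ? DIR_MODIFIED : DIR_MISS;
}

static bool fsm_already_working_on(uint64_t addr)      /* address cross-check */
{
    (void)addr;
    return false;    /* no master or snoop FSM currently owns this address   */
}

static void snoop_pipeline(rcmd_t rcmd, uint64_t addr)
{
    if (rcmd == CMD_UNSUPPORTED)
        return;                       /* drops out of the pipeline early     */

    /* (Arbitration for the directory read port happens here.)               */
    dir_state_t st = capp_dir_lookup(addr);

    if (fsm_already_working_on(addr)) {
        printf("presp: retry (address collision)\n");
        return;
    }

    if (rcmd == CMD_RWITM && st == DIR_MODIFIED) {
        /* Another master wants a line held modified: dispatch one of the 16
         * snoop FSMs; it will message the PSL to supply the line and later
         * arbitrate for directory write access to change the state.         */
        printf("dispatch snoop FSM; presp reflects modified ownership\n");
    } else {
        printf("presp: %s\n", st == DIR_MISS ? "null (miss)" : "shared hit");
    }
}

int main(void)
{
    snoop_pipeline(CMD_RWITM, 0x100);   /* modified hit -> snoop FSM         */
    snoop_pipeline(CMD_READ,  0x080);   /* miss -> null response             */
    return 0;
}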

Generally, a snooped reflected command that proceeds to this point requires a partial response (presp) on the SMP bus fabric to indicate the state of affairs in the CAPP back to the fabric controller. A presp appropriate to the reflected command and the state of the cache line in the CAPP is formed by the presp logic and issued on the presp bus. The fabric controller combines all presps and returns a combined response (cresp) to all agents on the bus so they may see the final results of the operation and act accordingly.

Figure 2: CAPP unit components.
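The presp/cresp exchange can be illustrated with a small model in which the fabric controller reduces all partial responses to a single combined response; the response encodings below are invented for the example.

/* Toy model of partial responses combined into a combined response. */
#include <stdio.h>

typedef enum { P_NULL, P_SHARED, P_OWNED_MODIFIED, P_RETRY } presp_t;
typedef enum { C_GO, C_GO_DATA_FROM_CACHE, C_RETRY } cresp_t;

/* Fabric controller: combine every snooper's presp into one cresp. */
static cresp_t combine(const presp_t presps[], int n)
{
    cresp_t c = C_GO;
    for (int i = 0; i < n; i++) {
        if (presps[i] == P_RETRY)
            return C_RETRY;                   /* any retry forces a retry     */
        if (presps[i] == P_OWNED_MODIFIED)
            c = C_GO_DATA_FROM_CACHE;         /* a cache will supply the data */
    }
    return c;
}

int main(void)
{
    /* e.g. a memory controller, an L2, and the CAPP holding the line. */
    presp_t presps[] = { P_NULL, P_SHARED, P_OWNED_MODIFIED };
    printf("cresp = %d (broadcast to all agents)\n", (int)combine(presps, 3));
    return 0;
}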

The action may also include sending a message to the PSL that is descriptive of the snooped reflected command, the CAPP state, and any actions the CAPP took on behalf of the PSL. The PSL may then take further actions in response to the message, as in the line push example where data needs to be written back to memory. Messages to the PSL from both master and snoop FSMs are queued and packed into fabric data packets by the command/message transport block and pushed onto the fabric data_out bus to the PHB. The PHB performs a PCIe write to transmit the message packet to the PSL.

To master a command on the fabric cmd/addr bus, the PSL selects one of 32 master read FSMs or 32 master write FSMs, or a pair of FSMs in the case of compound operations, to master the command. It forms a command packet containing details of the operation for the FSM to perform. Multiple commands to multiple FSMs may be packed into a single command packet. The PSL issues a PCIe write packet to transmit the command packet to the PHB. The PHB decodes address bits in the packet to learn that it is a command packet to be pushed toward the CAPP on its fabric data_out bus. The packet arrives on the CAPP fabric data_in bus, is received and unpacked by the command/message transport logic, and distributed to the appropriate master FSMs.
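This command transport can be sketched as below: several commands, each addressed to a specific master FSM (or to an FSM pair for compound operations), are packed into one packet, carried in a single PCIe write, and unpacked by the CAPP transport logic. The packet layout shown is an assumption for illustration.

/* Sketch of a PSL command packet carrying work for several CAPP FSMs. */
#include <stdint.h>
#include <stdio.h>

#define MAX_CMDS 8

typedef struct {
    uint8_t  fsm_id;        /* which master FSM should run this command      */
    uint8_t  is_write;      /* master write FSM vs. master read FSM          */
    uint8_t  compound;      /* pair a write FSM with a read FSM              */
    uint64_t addr;          /* target cache-line address                     */
} capp_cmd_t;

typedef struct {
    int        count;
    capp_cmd_t cmd[MAX_CMDS];
} cmd_packet_t;

/* CAPP command/message transport: unpack and hand each entry to its FSM. */
static void capp_unpack_and_dispatch(const cmd_packet_t *pkt)
{
    for (int i = 0; i < pkt->count; i++) {
        const capp_cmd_t *c = &pkt->cmd[i];
        printf("dispatch to %s FSM %u%s, addr 0x%llx\n",
               c->is_write ? "write" : "read", (unsigned)c->fsm_id,
               c->compound ? " (+ paired FSM)" : "",
               (unsigned long long)c->addr);
    }
}

int main(void)
{
    /* PSL side: pack two commands and send them as one PCIe write via the
     * PHB; they arrive on the CAPP fabric data_in bus and are unpacked.     */
    cmd_packet_t pkt = { .count = 2, .cmd = {
        { .fsm_id = 0, .is_write = 1, .compound = 1, .addr = 0x2000 },
        { .fsm_id = 5, .is_write = 0, .compound = 0, .addr = 0x3000 },
    }};
    capp_unpack_and_dispatch(&pkt);
    return 0;
}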

Upon receiving a command, a master machine then sequences through steps that may include a CAPP directory look-up, cross-checking an address against snoop FSMs, issuing the command on the fabric cmd/addr bus, receiving and acting on a cresp, updating the directory state, and sending a message to the PSL. Consider the line push example described previously. The line is held in the PSL and CAPP directories in the modified state. The PSL issues a command to CAPP master write FSM0 to evict the line from the directory, i.e., move the line from the modified to the invalid state. Master write FSM0 activates, arbitrates for the snoop pipeline, looks the line up in the CAPP directory, obtains the memory address of the line from the directory entry, and enters a line protection state where any snoops that hit the line will be retried, i.e., a retry response is issued on the presp bus. The master machine issues a "push" command and address on the cmd/addr bus and waits for the cresp. Assume a particular memory controller responds as owning the memory address of the line. The cresp contains information for routing the data to the memory controller [9]. Master FSM0 sends this routing information to the PHB via the PHB sideband interface so that when the data packet containing the modified cache line arrives from the PSL, the PHB may push the line on its data_out bus directly to that particular memory controller. Master FSM0 also arbitrates to update the CAPP directory entry state to invalid, and finally sends a message to the PSL containing the requisite information so that the PSL may update its directory properly and push out the modified data.
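The sequence the master write FSM steps through in this example can be sketched as a small state machine; the state names, the routing tag, and the interfaces are invented for illustration.

/* Sketch of the CAPP master write FSM sequence for the line-push example. */
#include <stdint.h>
#include <stdio.h>

typedef enum {
    FSM_DIR_LOOKUP, FSM_PROTECT, FSM_ISSUE_PUSH, FSM_WAIT_CRESP,
    FSM_SEND_ROUTE, FSM_DIR_INVALIDATE, FSM_MSG_PSL, FSM_DONE
} wr_state_t;

static void master_write_fsm0_push(uint64_t line_addr)
{
    wr_state_t s = FSM_DIR_LOOKUP;
    unsigned   route = 0;

    while (s != FSM_DONE) {
        switch (s) {
        case FSM_DIR_LOOKUP:      /* arbitrate for pipeline, read directory  */
            printf("lookup 0x%llx: modified\n", (unsigned long long)line_addr);
            s = FSM_PROTECT;        break;
        case FSM_PROTECT:         /* retry any snoop that hits this line     */
            s = FSM_ISSUE_PUSH;     break;
        case FSM_ISSUE_PUSH:      /* "push" command on the cmd/addr bus      */
            s = FSM_WAIT_CRESP;     break;
        case FSM_WAIT_CRESP:      /* cresp identifies the owning memory ctrl */
            route = 7;              /* routing tag from the cresp (invented) */
            s = FSM_SEND_ROUTE;     break;
        case FSM_SEND_ROUTE:      /* routing info to the PHB over sideband   */
            printf("PHB sideband: route push data to MC %u\n", route);
            s = FSM_DIR_INVALIDATE; break;
        case FSM_DIR_INVALIDATE:  /* CAPP directory entry -> invalid         */
            s = FSM_MSG_PSL;        break;
        case FSM_MSG_PSL:         /* PSL updates its directory, pushes data  */
            printf("message PSL: push modified data, invalidate entry\n");
            s = FSM_DONE;           break;
        default:
            s = FSM_DONE;           break;
        }
    }
}

int main(void)
{
    master_write_fsm0_push(0x2000);
    return 0;
}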

Master read operations proceed similarly, but in the case of reads, data from a source (a memory controller or another cache in the system) is to be returned to the PSL. The CAPP master read FSM selected for the operation provides routing information so that the data may be returned directly from the source to the PHB and on to the PSL over the PCIe link.

The tlbi operations discussed previously are another form of reflected commands that the CAPP snoops. A snooped tlbi generates a message to be sent to the PSL, and after performing the actions described previously, the PSL returns a response to the CAPP. The command/message transport logic sends tlbi responses to the tlbi snoop logic, where appropriate action is taken.

Reliability, availability, serviceability

POWER processors have a long-standing tradition of providing world-leading reliability, availability, and serviceability (RAS) [11]. The addition of an off-chip device that participates in cache coherence protocols and address translations must fulfill expectations with respect to that high standard, and the CAPI system incorporates a variety of measures to achieve this. Single-bit error correction and double-bit error detection error correction codes (ECC) are used on all memory arrays in the CAPP, the PHB, and the PSL. All temporal operations between the CAPP and the PSL are timed, as, for example, a directory state that temporarily protects an entry from other snoopers. This provides protection against errors on the FPGA that manifest themselves as the PSL ceasing communications with the CAPP. FSMs in the CAPP and the PSL use parity to protect against invalid state errors. Configuration registers are parity protected. The most common errors, correctable errors on memory arrays, are handled (corrected) with minimal disruption of on-going CAPI activity. For most other more severe errors, for example when a timer expires on a temporal operation because the PCIe link went down, the CAPI system is able to gracefully go off-line. The CAPP accomplishes this by severing the connection to the PSL, quiescing its various FSMs, and walking its copy of the directory and sending poison data to the address of any lines held in the various forms of modified state. ("Poison" data contains a special error checking code detectable by all data consumers in the system that marks the data as unusable.) When all this is accomplished, the CAPP enters a quiescent state from which it is ready to be restarted when the error condition is cleared by appropriate system actions. Only when an error threatens to cause a data integrity problem is a more severe error signaled and the CAPI halted, as is the case with the other caches in the system.
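The graceful off-lining walk might be sketched as follows: sever the PSL connection, quiesce the FSMs, then walk the CAPP's copy of the directory and poison any line held in a modified state, since its only up-to-date copy is stranded in the PSL. The data structures and names are, again, illustrative.

/* Sketch of the CAPP error-recovery walk (e.g. after a PCIe link-down). */
#include <stdint.h>
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } state_t;
typedef struct { uint64_t addr; state_t st; } dir_entry_t;

static dir_entry_t capp_dir[4] = {
    { 0x1000, MODIFIED }, { 0x2000, SHARED },
    { 0x3000, MODIFIED }, { 0x4000, INVALID },
};

/* Write "poison" data: a special error code that any consumer detects. */
static void write_poison(uint64_t addr)
{
    printf("poison line 0x%llx\n", (unsigned long long)addr);
}

static void capp_go_offline(void)
{
    printf("sever PSL connection; quiesce master and snoop FSMs\n");

    /* Walk the directory copy: lines held modified have their only current
     * copy in the unreachable PSL, so mark them unusable in memory.        */
    for (int i = 0; i < 4; i++) {
        if (capp_dir[i].st == MODIFIED)
            write_poison(capp_dir[i].addr);
        capp_dir[i].st = INVALID;
    }
    printf("CAPP quiescent; restartable once the error is cleared\n");
}

int main(void)
{
    capp_go_offline();
    return 0;
}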

User-visible PSL interface

The interface provided to user-created AFUs is designed to isolate the complexities of cache coherence and address translation. User-designed accelerators access system memory through load and store requests to user-space effective addresses. AFUs can select between cacheable and write-through requests. Write-through requests are for data manipulated outside the coherence domain and reduce PCIe bus overhead because fewer coherence messages are required. Coherent operations are typically used for control information, where multiple processes must communicate to make data transfer decisions, while write-through provides for large block transfers into and out of the coherence domain. Address translation is generally hidden from the accelerator, with the exception of page faults. In these cases, the AFU is notified of the fault, giving the AFU the opportunity to reschedule operations to hide the latency of the fault.
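The programming model this implies for an AFU designer can be sketched as below: coherent (cacheable) requests for shared control state, write-through requests for bulk data, and a fault notification that lets the AFU reschedule work. The request API shown is invented for illustration and is not the actual PSL interface signal set.

/* Sketch of the AFU-visible request model described above. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { REQ_CACHEABLE, REQ_WRITE_THROUGH } req_kind_t;
typedef enum { REQ_OK, REQ_PAGE_FAULT } req_status_t;

/* Stand-in for the PSL: translate the effective address and move data. */
static req_status_t psl_access(req_kind_t kind, uint64_t ea, bool is_store)
{
    if ((ea >> 16) == 0xBAD)          /* pretend this page is not mapped     */
        return REQ_PAGE_FAULT;
    printf("%s %s at EA 0x%llx\n",
           kind == REQ_CACHEABLE ? "cacheable" : "write-through",
           is_store ? "store" : "load", (unsigned long long)ea);
    return REQ_OK;
}

/* AFU side: coherent accesses for shared control state, write-through for
 * bulk data, and rescheduling of work around translation faults.           */
static void afu_do_work(uint64_t ctrl_ea, uint64_t bulk_ea)
{
    if (psl_access(REQ_CACHEABLE, ctrl_ea, true) == REQ_PAGE_FAULT)
        printf("fault on control block: defer it and run other work\n");

    if (psl_access(REQ_WRITE_THROUGH, bulk_ea, false) == REQ_PAGE_FAULT)
        printf("fault on bulk data: reschedule this block\n");
}

int main(void)
{
    afu_do_work(0x7000, 0xBAD0000);
    return 0;
}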

Conclusion

Modern microprocessors contain inefficiencies when executing workloads that exhibit little instruction-level or data-level parallelism. Emerging big data workloads have exacerbated the problem. Direct hardware implementations of algorithms in FPGAs and ASICs can be far more efficient, but integrating them into an SMP leads to different inefficiencies, such as the software overhead required to share data with software threads running on CPUs in the SMP. The CAPI interface addresses these inefficiencies by providing a coherent, user-address-based interface to enable low-overhead integration of PCIe-based accelerators into the POWER8 ecosystem. This efficient combination of customized parallel accelerators and faster serial processors enables applications to target heterogeneous systems in ways previously not possible with I/O-based attachments.

The CAPI interface achieves this while maintaining the high standards for reliability, availability, and serviceability of POWER systems.

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of PCI-SIG or Sony Computer Entertainment Corporation in the United States, other countries, or both.

References

1. M. Ferdman, A. Adileh, Y. O. Koçberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, "Quantifying the mismatch between emerging scale-out applications and modern processors," ACM Trans. Comput. Syst., vol. 30, no. 4, Nov. 2012, Art. ID 15.
2. Stratix V Device Overview (SV51001), Altera, San Jose, CA, USA, Jan. 2014. [Online]. Available: http://www.altera.com/literature/hb/stratix-v/stx5_51001.pdf
3. 7 Series FPGAs Overview (DS180 v1.15), Xilinx, San Jose, CA, USA, Feb. 2014. [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf
4. M. Adrian, "Big data," Teradata Magazine. [Online]. Available: http://www.teradatamagazine.com/v11n01/Features/Big-Data/
5. D. A. Ferrucci, "Introduction to 'This is Watson'," IBM J. Res. & Dev., vol. 56, no. 3, pp. 1:1–1:15, May 2012.
6. H. P. Hofstee, G.-C. Chen, F. H. Gebara, K. Hall, J. Herring, D. A. Jamsek, J. Li, Y. Li, J. Shi, and P. W. Y. Wong, "Understanding system design for big data workloads," IBM J. Res. & Dev., vol. 57, no. 3/4, pp. 3:1–3:10, May/Jul. 2013.
7. R. Polig, K. Atasu, L. Chiticariu, C. Hagleitner, H. P. Hofstee, F. R. Reiss, E. Sitaridi, and H. Zhu, "Giving text analytics a boost," IEEE Micro, vol. 34, no. 4, pp. 6–14, Jul./Aug. 2014.
8. PCI Express Base Specification, Revision 3.0, PCI-SIG, Beaverton, OR, USA, Nov. 2010. [Online]. Available: http://www.pcisig.com/specifications/pciexpress/base3/
9. W. J. Starke, J. Stuecheli, D. Daly, J. S. Dodson, F. Auernhammer, P. Sagmeister, G. Guthrie, C. F. Marino, M. Siegel, and B. Blaner, "The cache and memory subsystems of the IBM POWER8 processor," IBM J. Res. & Dev., vol. 59, no. 1, Paper 3, pp. 3:1–3:13, 2015.
10. Power ISA Version 2.06 Revision B, IBM, Armonk, NY, USA, Jul. 23, 2010. [Online]. Available: https://www.power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
11. D. Henderson and J. Mitchell, "POWER7 System RAS: Key aspects of Power Systems Reliability, Availability, Serviceability," IBM Syst. Technol. Group, Somers, NY, USA, Dec. 9, 2012. [Online]. Available: http://www.ibm.com/systems/power/hardware/whitepapers/ras7.html

Received March 24, 2014; accepted for publication April 17, 2014

Jeffrey Stuecheli IBM Systems and Technology Group, Austin, TX 78758 USA (jeffas@us.ibm.com). Dr. Stuecheli is a Senior Technical Staff Member in the Systems and Technology Group. He works in the area of server hardware architecture. His most recent work includes advanced memory architectures, cache coherence, and accelerator design. He has contributed to the development of numerous IBM products in the POWER* architecture family, most recently the POWER8 design. He has been appointed an IBM Master Inventor, authoring about 100 patents. He received B.S., M.S., and Ph.D. degrees in Electrical Engineering from The University of Texas at Austin.

Bart Blaner IBM Systems and Technology Group, Essex Junction, VT 05452 USA (blaner@us.ibm.com). Mr. Blaner earned a B.S.E.E. degree from Clarkson University. He is a Senior Technical Staff Member in the POWER development team of the Systems and Technology Group. He joined IBM in 1984 and has held a variety of design and leadership positions in processor and ASIC development. Recently, he has led accelerator designs for POWER7+* and POWER8 technologies, including the Coherent Accelerator Processor Proxy design. He is presently focused on the architecture and implementation of hardware acceleration technologies spanning a variety of applications for future POWER processors. He is an IBM Master Inventor, a Senior Member of the IEEE, and holds more than 30 patents.

Charles R. Johns IBM Systems and Technology Group, Austin, TX 78758 USA (crjohns@us.ibm.com). Mr. Johns is a Senior Technical Staff Member (STSM) in the IBM Systems and Technology Group. He received his B.S. degree in electrical engineering from the University of Texas at Austin in 1984. After joining IBM Austin in 1984, Mr. Johns worked on various disk, memory, voice communications, and graphics adapters for the IBM Personal Computer. From 1988 until 2000, he was part of the Graphics Organization and was responsible for the architecture and development of entry and midrange 3D graphics adapters and GPUs (graphics processing units). From 2000 to 2010, Mr. Johns was part of the STI (Sony, Toshiba, IBM) Project responsible for the Cell Broadband Engine Architecture** (CBEA) and participated in the development of the Cell Broadband Engine** (the first implementation of the CBEA). Currently Mr. Johns is working on hybrid computing solutions for the POWER processors. He is directly responsible for the Coherent Accelerator Interface Architecture (CAIA) and is Chief Engineer of FPGA acceleration using the Coherent Accelerator Processor Interface (CAPI). Mr. Johns is an IBM Master Inventor with over 100 patents.

Michael S. Siegel IBM Systems and Technology Group, Research Triangle Park, NC 27709 USA (siegelm@us.ibm.com). Mr. Siegel is a Senior Technical Staff Member in the IBM Systems and Technology Group (STG). He currently works as the hardware architect of coherent bus architectures (PowerBus) developed for IBM System p* server applications. In 2003, Mr. Siegel joined the PowerPC* development team to support the high-performance processor roadmap and create standard products for both internal and external customers. Mr. Siegel's roles included memory controller design lead, and coherency bus design lead and architect. These roles led to the development of the PowerBus architecture in use in System p processor chips, starting with POWER7* chips. Mr. Siegel supported multiple projects, including those involving IBM POWER7 and POWER8 server processors, and assisted in customer discussions for future game and joint chip development activity by incorporating new functions into the architecture as system requirements evolved. While working on the POWER8 PowerBus architecture, Mr. Siegel worked as a hardware architect of the processor side of the CAPI, working with development teams spanning the processor chip and first-generation coherently attached external coprocessor by specifying hardware behavior and the microcode architecture. Prior to his work in STG, Mr. Siegel worked in the Network Hardware Division (NHD) developing the Rainier network processor, the IEEE 802.5 DTR (dedicated token ring) standard, and the IBM token ring switch products based on the standard. Mr. Siegel started working for IBM in Poughkeepsie, New York, on the 3081 I/O subsystem and the ES/9000 Vector Facility. Mr. Siegel has been an IBM Master Inventor since 1996 and is an inventor on over 70 patents issued by the U.S. Patent Office.
