

ORACLE EXTENDS SPARC COMMITMENT
New Sparc T4 Processor Boosts Single-Thread Performance

By Bob Wheeler {9/19/11-01}


Squashing any lingering concerns about its commitment to the SPARC architecture, Oracle has begun beta testing servers using its next-generation Sparc T4 processors. At Hot Chips, the company disclosed details of the new CPU core used in this processor as well as future Oracle processors. Formerly code-named Yosemite Falls, Sparc T4 is the successor to Sun’s Rainbow Falls, which Oracle branded Sparc T3. When Oracle acquired Sun in early 2010, there were concerns about its commitment to former-Sun products and technologies including SPARC (see MPR 2/22/10, “Sun Fades Into Oracle’s Orbit”). In September 2010, Oracle released systems based on Sparc T3, but this processor had largely been completed before the acquisition.

Sparc T4 is the first processor designed under the Oracle regime. To better meet the needs of its customers, the company focused the design on improving single-thread performance. This approach reversed Sun’s prior direction of increasing the number of threads per processor to increase throughput. The result is a processor with eight CPUs operating at 3.0GHz, compared with the T3’s 16 CPUs operating at a lowly 1.65GHz. Dubbed the S3, Oracle’s new CPU core is far more sophisticated than the relatively simple T1/T2/T3 CPUs from the Sun Niagara lineage. Designed from scratch, the S3 CPU is a dual-issue out-of-order design with extensive branch prediction. Like the T3 CPU, the S3 supports eight threads, yielding 64 threads in the eight-core T4 implementation.

Oracle’s internal benchmark results show Sparc T4 delivers about the same transaction-processing throughput as Sparc T3 despite using half as many CPUs/threads. But single-thread integer performance is another story, with the T4 delivering performance about 5× that of the T3 as measured by SPECint2006. For floating-point workloads, the T4’s SPECfp2006 score is about 7× that of the T3. Unfortunately, Oracle disclosed only relative performance rather than actual scores, making it impossible to compare Sparc T4 with other vendors’ processors. In any case, the company appears to have preserved the throughput performance of its processors while making a major improvement in what had been poor single-thread performance.
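The throughput claim implies a sizable per-thread gain, which a quick calculation makes explicit. The figures below are the disclosed ratios, not actual benchmark scores:

```python
# Back-of-the-envelope check of Oracle's relative numbers: matching chip-level
# throughput with half the threads implies roughly double per-thread throughput.
def per_thread_gain(throughput_ratio, thread_ratio):
    """Implied per-thread gain, given the chip-level throughput ratio
    (T4 vs. T3) and the ratio of thread counts (T4 vs. T3)."""
    return throughput_ratio / thread_ratio

# T4 matches T3 throughput (ratio ~1.0) with 64 threads vs. 128.
print(per_thread_gain(1.0, 64 / 128))  # 2.0
```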

The new Sparc T4 design illustrates the ongoing struggle of processor designers to balance single-thread performance versus throughput on highly threaded code. Not all workloads scale well with cores and threads, and response times can suffer if individual threads run too slowly. For Oracle, the T4 is a relatively low-risk implementation that should yield more application-agnostic server designs. It also sets the stage for the 16-core Sparc T5 design, which is already under development in 28nm.

Figure 1. Oracle S3 CPU block diagram. Eight of these new cores are in the Sparc T4 processor. FGU=floating-point/graphics unit; BRU=branch unit; LSU=load/store unit.

Going Deep for Speed

The simple path to increased single-thread performance is boosting clock speed using deeper pipelines. The Sparc T3 CPU, which Oracle refers to as the S2, used an eight-stage integer pipeline that carried over from the UltraSparc T2 (Niagara 2) design (see MPR 11/6/06,“Niagara 2 Opens the Floodgates”). Compared with the S2, the new S3 core used in Sparc T4 doubles the integer-pipeline depth to 16 stages, matching that of Intel’s Sandy Bridge CPU. As a result, the 40nm T4 processor operates at 3.0GHz or more. Oracle has not disclosed exact clock speeds, but the T4 appears to be the first processor to achieve 3GHz in TSMC’s 40nm process.

Figure 1 shows a block diagram of the S3 CPU with its more complex front-end pipeline. The execution units remain similar to those of the S2 design, with two ALUs, a single load-store unit (LSU), one floating-point and graphics unit (FGU), and a local crypto unit. The older design, however, can execute two instructions per cycle only if the instructions are from different threads, limiting single-thread performance. The S3 can dual-issue instructions from a single thread.

The branch unit (BRU) is new to the S3 and determines branch predictions. The branch predictor includes a return stack to predict return addresses and a branch target cache (BTC) to reduce the taken-branch penalty. The S2 has no branch prediction at all, relying on multithreading to hide branch latency. Thus, the new branch unit further enhances single-thread performance on the S3.
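The two predictor structures named above can be modeled as follows. This is a generic textbook sketch, not Oracle's disclosed design; the stack depth and replacement policy are assumptions:

```python
# Toy models of a return stack and a branch target cache (BTC).
class ReturnStack:
    """Predicts return addresses: calls push, returns pop."""
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def push(self, return_addr):          # on a predicted call
        if len(self.stack) == self.depth:
            self.stack.pop(0)             # overwrite the oldest entry
        self.stack.append(return_addr)

    def predict_return(self):             # on a predicted return
        return self.stack.pop() if self.stack else None

class BranchTargetCache:
    """Maps a branch's fetch address to its last taken target, letting the
    front end redirect fetch immediately instead of paying the full
    taken-branch penalty."""
    def __init__(self):
        self.entries = {}

    def update(self, branch_pc, target):
        self.entries[branch_pc] = target

    def predict(self, branch_pc):
        return self.entries.get(branch_pc)
```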

To further optimize single-thread execution, the S3 implements out-of-order execution with register renaming. The CPU includes a reorder buffer (ROB) with 128 entries. To reduce cache misses, the new design adds a stride-based data-cache hardware prefetcher. Another hardware prefetcher buffers sequential lines for the instruction cache. Other changes from the S2 include a new private 128KB level-two cache; level-one caches remain relatively small at 16KB each for instructions and data.
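A stride-based prefetcher of the kind mentioned above tracks, per load instruction, the distance between successive miss addresses and prefetches ahead once the stride repeats. The confirmation policy and table organization below are assumptions for illustration, not Oracle's design:

```python
# Minimal sketch of a stride-based data prefetcher.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = {}   # per-PC: address of the previous access
        self.stride = {}      # per-PC: last observed stride

    def on_access(self, pc, addr):
        """Record an access; return an address to prefetch, or None."""
        prefetch = None
        if pc in self.last_addr:
            stride = addr - self.last_addr[pc]
            if stride != 0 and self.stride.get(pc) == stride:
                # Same stride seen twice in a row: predict the next line.
                prefetch = addr + stride
            self.stride[pc] = stride
        self.last_addr[pc] = addr
        return prefetch
```

A load walking an array at 64-byte steps triggers a prefetch on its third access, once the stride is confirmed.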

The S3 also implements a number of new instructions. The PAUSE instruction pauses a thread for a fixed number of cycles—a useful tool for spin-lock backoff or a lightweight sleep function. Pausing a thread frees all associated resources for use by other threads. A new fused compare-branch instruction reduces code-path length by more than 10% for some applications. Finally, the crypto unit is now accessed through user-level instructions, which are easily threaded. In the T3, the crypto unit operated as a coprocessor that was accessed through a hypervisor. The new crypto architecture yields a 2.5× or greater improvement in AES bulk-encryption performance.
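The spin-lock backoff idiom that PAUSE targets has roughly the following shape. This is a single-threaded Python illustration of the backoff structure only: this toy lock is not actually atomic, and real code would execute the PAUSE instruction (freeing the thread's pipeline resources) where the stand-in delay appears:

```python
import time

# Sketch of spin-lock acquire with exponential backoff.
class SpinLock:
    def __init__(self):
        self.locked = False   # a real lock would use an atomic flag

    def try_acquire(self):
        if not self.locked:
            self.locked = True
            return True
        return False

    def acquire(self, max_backoff=64):
        backoff = 1
        while not self.try_acquire():
            for _ in range(backoff):
                time.sleep(0)                        # stand-in for PAUSE
            backoff = min(backoff * 2, max_backoff)  # exponential backoff

    def release(self):
        self.locked = False
```

The point of PAUSE in this loop is that a waiting thread consumes almost no shared core resources, so the other seven threads on the core keep running at full speed.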

The contrast between the S2 and S3 microarchitectures is stark. The S2 is optimized for multithreaded execution, using a simple CPU design with a short pipeline and essentially scalar execution. When the pipeline stalls, as it frequently does, it just switches to another thread. This approach works poorly when executing a single thread, however. The S3 design nearly doubles the clock speed, doubles the number of instructions per cycle for a single thread, and adds further optimizations such as branch prediction, instruction reordering, and prefetching that minimize stalls. These changes enable the aforementioned 5× increase in single-thread integer performance, but they also significantly increase the die area and complexity of the CPU design.

Flexible Threading Is Key

Figure 2 provides a threaded view of the S3 core’s pipeline. At the front of the pipeline, four instructions (16 bytes) from a single thread are fetched from the instruction cache. Each of the eight threads has a dedicated instruction buffer (IB). The thread-select stage (S) uses a least-recently-fetched algorithm for thread selection. Two instructions from a single thread are then decoded by the decode unit, which handles slot selection. Slot 0 handles load/store operations, whereas slot 1 handles branch, floating-point, and crypto operations. Both slots handle ALU operations, but the FGU performs integer multiply/divide operations.

Figure 2. Threaded view of the S3 core pipeline. The 16-stage integer pipeline appears at the top for reference. The pick queue is the critical resource in thread allocation. Eight threads keep two issue slots busy.
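The slot-selection rule just described can be sketched as a simple classifier. The instruction-class names are illustrative, not actual SPARC decode categories:

```python
# Sketch of decode-time slot selection on the S3.
SLOT0_ONLY = {"load", "store"}
SLOT1_ONLY = {"branch", "fp", "crypto", "imul", "idiv"}  # FGU does int mul/div

def select_slot(op):
    """Return the required issue slot for an instruction class,
    or None if either slot can execute it (plain ALU ops)."""
    if op in SLOT0_ONLY:
        return 0
    if op in SLOT1_ONLY:
        return 1
    return None
```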

Following the decode stages (D1–D2), the CPU performs register renaming (R1–R3) and loads the two instructions into the pick queue. The 40-entry unified pick queue is the key shared resource in Oracle’s dynamic threading approach. Whereas the S2 core used static per-thread resource allocation, the S3 design allows threads to use more than one-eighth of the available resources. Oracle did not detail how the pick queue is allocated, but it is apparently under programmatic control. Oracle’s Solaris operating system will support a new “critical thread” API, which enables the OS to recognize critical threads in applications and assign them to a single processor core. Furthermore, Solaris can dedicate that core to the critical thread and will presumably allocate resources—such as the pick queue—accordingly.

The potential downside of dynamic resource allocation is that threads can use too much of the available resources and starve other threads. The S3 design mitigates the impact of such “thread hogs” through multiple techniques. For shared resources, the CPU implements high and low watermarks. When the high watermark is reached, the core stalls thread-resource allocation until the low watermark is reached. In the pick queue, the CPU monitors the rate of entry deallocation by each thread. If a thread is not releasing entries at an acceptable rate, the CPU reduces the number of entries available to that thread. This technique frees entries associated with stalled threads for use by other threads. Several events will cause the CPU to flush a thread, releasing all associated resources. These events include an L3 cache miss, a load/store timeout, and I/O access.
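The high/low watermark mechanism provides hysteresis: once allocation stalls, it stays stalled until occupancy drains well below the trigger point, avoiding rapid on/off thrashing. A sketch of that policy, with made-up thresholds (the actual hardware values were not disclosed):

```python
# Sketch of high/low-watermark throttling for a shared resource.
class WatermarkThrottle:
    def __init__(self, capacity=40, high=36, low=20):
        self.used, self.capacity = 0, capacity
        self.high, self.low = high, low
        self.stalled = False

    def allocate(self):
        """Return True if an entry may be allocated this cycle."""
        if self.stalled:
            if self.used > self.low:
                return False          # stay stalled until the low watermark
            self.stalled = False      # drained below low: resume allocation
        if self.used >= self.high:
            self.stalled = True       # hit the high watermark: stall
            return False
        self.used += 1
        return True

    def deallocate(self):
        self.used = max(0, self.used - 1)
```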

When issuing instructions (stages I1–I4), the pick queue is thread agnostic and simply picks the oldest ready-to-execute instruction per slot. Thus, it can issue two instructions from two different threads, fitting the definition of simultaneous multithreading (SMT), or from a single thread. Instructions are executed (E), and if necessary, they access the data cache (C1–C3). The commit stage (C) commits two instructions from the same thread per cycle using a least-recently-committed algorithm.

As in the S2, the S3 design enables high resource utilization when executing multiple threads. When a thread stalls, the pick queue simply issues instructions from alternate threads. If several threads are stalled, the pick queue can issue subsequent instructions from the stalled threads, as long as they don’t depend on the result of the stalled instruction. Even if only a single thread is available, the CPU can still operate at its peak rate of two instructions per cycle, avoiding stalls by using the pick queue to reorder instructions.
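The thread-agnostic, age-ordered pick policy can be sketched as a toy model (not the actual hardware arbitration): per slot, scan oldest-first and take the first ready instruction, whatever thread it belongs to.

```python
# Toy model of age-ordered, thread-agnostic issue from the pick queue.
def pick(queue, slot):
    """queue: list of entries ordered oldest-first, each a dict with
    'thread', 'slot', and 'ready' fields. Returns the oldest ready
    instruction for the given slot, or None."""
    for entry in queue:                 # oldest-first scan
        if entry["slot"] == slot and entry["ready"]:
            return entry
    return None

queue = [
    {"thread": 3, "slot": 0, "ready": False},  # oldest, but stalled
    {"thread": 5, "slot": 0, "ready": True},
    {"thread": 3, "slot": 1, "ready": True},
]
# One cycle's issue: slot 0 skips the stalled instruction from thread 3,
# so instructions from two different threads issue together (SMT).
issued = [pick(queue, 0), pick(queue, 1)]
```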

The Evolving System-on-a-Chip

Although the S3 CPU is all new, much of the Sparc T4’s high-level design carries over from Sparc T3. In fact, Oracle reused the basic T3 floor plan, and we suspect the T4 is pin compatible with the T3. In the T4 design, one S3 core replaces what was a pair of S2 cores in the T3, indicating the larger size of the more complex CPU. The shared L2 cache of the T3 becomes a shared L3 cache in the T4. An 8×9 crossbar, called the CCX, sits at the center of the die and connects the eight cores with the eight banks of L3 cache; in the T3, two CPUs shared one crossbar port as did two banks of L2 cache.

As Table 1 shows, Sparc T4’s external interfaces are identical to those of Sparc T3. Because the T3 and T4 use the same 40nm TSMC process, Oracle was able to reuse existing I/O blocks. Like the T3, the T4 includes two memory controllers, two PCI Express v2.0 ports, and two 10G Ethernet ports, as well as coherent-bus interfaces to support system designs with up to four sockets (4P). Like Intel’s Xeon E7 chips, both Oracle processors use a fully buffered memory architecture, which Oracle calls “buffer-on-board” (BoB). The T3/T4 processors’ 6.4GHz serial channels connect to an external BoB ASIC that bridges to ECC-protected DDR3-1066 SDRAM. The dual integrated PCI Express controllers each support eight Gen2 lanes, and the pair of 10G Ethernet ports support XAUI. These integrated network ports leave the PCIe interfaces available for storage and other I/O such as InfiniBand.

The Sparc T3 processor has a die size of 371mm², and Oracle rates the 16-core chip at 139W. Oracle’s Hot Chips presentation was silent on the issue of Sparc T4’s power dissipation and die size. Given the T4’s similar floor plan, the new processor appears to have essentially the same die size. Assuming the company designed the T4 to be socket compatible with the T3, power dissipation is likely to fall in the 140–150W range. Oracle announced the beta program for Sparc T4 servers in June, and these systems should reach production by the end of 2011.

Table 1. Key parameters for Oracle Sparc T3 and T4 processors. With the exception of the new S3 core, much of the T4 design carries over from its predecessor. (Source: Oracle)

                      Sparc T3                  Sparc T4
# of CPUs, Threads    16 CPUs, 128 threads      8 CPUs, 64 threads
CPU Speed             1.65GHz                   ≥3.0GHz
L1 Cache (I/D)        16KB/8KB                  16KB/16KB
L2 Cache              Shared 6MB                128KB per CPU
L3 Cache              None                      4MB
Memory Interfaces     4× serial, buffer ASIC to DDR3-1066 w/ECC
Memory Bandwidth      34GB/s
Coherent Bus          Six 9.6GHz links, up to 4 sockets
I/O Bus               2× PCI Express Gen2 ×8
Ethernet Ports        2× 10G Ethernet
Power (typ)           139W                      Not disclosed
IC Process            TSMC 40nm                 TSMC 40nm
Die Size              371mm²                    Not disclosed
Server Production     3Q10                      4Q11 (est)

With Sparc T4 nearing production, Oracle is developing a new 16-core T5 processor using 28nm technology. The company referred to the S3 as a “foundation” core for use in multiple processor generations, so we expect the T5 to use the same basic CPU design while improving the processor’s memory and I/O bandwidth to support the larger number of cores. Oracle’s roadmap indicates T5-based servers should debut in 2013 supporting up to eight sockets and three times the throughput of Sparc T4 servers, which scale to only four sockets. This implies that the T5 will deliver only 50% more performance per socket compared with the T4, suggesting clock speed will fall to fit the extra cores into a similar power envelope.
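The per-socket arithmetic behind that statement is simple, using only the ratios from Oracle's roadmap:

```python
# T5 servers: 3x the throughput of T4 servers, at 8 sockets vs. 4.
t5_server_throughput = 3.0   # relative to a 4-socket T4 server
socket_ratio = 8 / 4         # T5 scales to twice as many sockets
per_socket_gain = t5_server_throughput / socket_ratio
print(per_socket_gain)       # 1.5, i.e. only 50% more per socket
```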

Lighting a SPARC Under Unix

Because Oracle SPARC servers run Solaris, they compete most directly with other Unix platforms including IBM POWER servers as well as HP Integrity servers based on Intel Itanium processors. Like Sparc T4, IBM’s Power7 processor has eight CPUs and supports simultaneous multithreading. Each Power7 core supports four threads, for a total of 32 threads per processor—half that of Oracle’s Sparc T4. But IBM’s design delivers greater single-thread performance and supports systems with up to 32 sockets. The Power7 CPU can dispatch up to six instructions per cycle versus just two for the T4.

Backing up these high-performance cores, Power7 includes a massive 32MB of L3 cache, dwarfing that of Sparc T4. In addition, each Power7 CPU has twice the L1 and L2 cache of Oracle’s S3 CPU. The IBM chip also has eight memory channels delivering much greater memory bandwidth. With all cores active, Power7 supports a maximum clock frequency of 3.86GHz, but we estimate the chip burns about 200 watts at this speed—approximately 40% more than Sparc T4 dissipates. With a die size of 567mm2, Power7 is also about 50% larger than the T4. Thus, the T4 should be more competitive in performance per watt and performance per dollar.

With only four cores, eight threads, and a 1.73GHz clock speed, Intel’s shipping Itanium 9350 falls far short of both the Oracle and IBM chips. Using Intel’s chipset, Itanium supports up to eight sockets. In 2012, Intel will boost Itanium performance with the new eight-core 32nm Poulson design. The company has withheld key details of this chip, however, such as target clock speed and multithreading enhancements. Poulson is socket compatible with the Itanium 9300 series, which should enable HP to quickly upgrade its Integrity line.

Overall, Oracle’s T4 should help close the performance gap relative to the heftier Power7 processor while maintaining a lead over Itanium. IBM is developing a 32nm processor (dubbed Power7+), however, which could also reach production by the end of 2011.

Why Larry Likes Chips

A year ago, CEO Larry Ellison caused a stir by saying Oracle would be buying chip companies. Although SPARC technology is already in house, Sparc T4 helps illustrate why chips are important to the company. Relative to Sparc T3, the T4 should dramatically improve the performance of Oracle applications and middleware that benefit from greater single-thread performance. For its vertical integration to add value, Oracle must offer hardware systems optimized to work with the company’s application software. The Sparc T4 processor will make Oracle’s T-Series servers more competitive for a variety of workloads, whereas Sun had perhaps focused too sharply on web applications.

Oracle’s processor upgrade also provides concrete evidence of the company’s commitment to the T-Series roadmap. With Sparc T4 systems already in beta testing and the T5 under development, former Sun customers should feel more comfortable making an ongoing investment in SPARC. Sparc T4 provides Oracle a relatively low-risk approach to validating the new, more complex S3 core in a proven 40nm process. The company’s processor designers can then tweak the S3 design as they shrink it to 28nm technology for the T5. Power efficiency will likely be a primary focus as the team squeezes 16 of these high-speed cores into a power envelope of around 150W.

In stark contrast to Intel and AMD, which principally sell processors, Oracle now develops processors and sells servers, operating systems, and application software. Matched by only IBM, this level of vertical integration provides opportunities for faster architectural innovation. For example, Oracle can quickly roll out Sparc T4 servers, a new Solaris release, and new versions of applications that take advantage of its new critical-thread API. Intel, to its chagrin, has at times waited years for Microsoft to take advantage of new features or instructions in its Xeon processors. Whether or not Oracle can benefit from its new level of integration will depend greatly on product execution—synchronizing schedules across multiple hardware and software development groups is never easy.

Price and Availability

Oracle Sparc T4 processors are available only in system-level products. Sparc T4 servers are currently in beta testing and are expected to reach general availability in 4Q11. Registered Hot Chips attendees can use their passwords to download the Oracle T4 presentation at www.hotchips.org/archives/hc23. Hot Chips plans to make the presentation available for public download in mid-December.
