Mars: A 64-core ARMv8 Processor

Charles Zhang

Phytium Technology Co., Ltd


Statements

The following slides introduce the general features of one of our products and do not constitute any commitment about it. They are for information purposes only and may not be incorporated into any contract. Purchasing decisions should not be based on them. The development, release, and timing of any features or functionality described here remain at the sole discretion of Phytium.


A Brief Introduction of Phytium

● China corporation, founded in 2012
  ● Guangzhou
  ● Tianjin
● Vision
  ● Leading edge CPU and ASIC provider in China
● Market focuses on chips for
  ● Internet & Cloud Computing infrastructure
  ● Traditional workload mainframe servers


China is a Fast-growing Server Market

WW

Company   1Q15 Revenue     1Q15 Market Share (%)   1Q14 Revenue     1Q14 Market Share (%)   1Q15-1Q14 Growth (%)
HP        3,191,694,948    23.8                    2,890,992,229    25.5                    10.4
Dell      2,296,473,026    17.1                    2,006,639,006    17.7                    14.4
IBM       1,887,939,141    14.1                    2,244,631,789    19.8                    -15.9
Lenovo    970,254,659      7.2                     127,973,470      1.1                     658.2
Cisco     890,179,930      6.6                     616,620,000      5.4                     44.4
Others    4,157,871,704    31.0                    3,469,383,444    30.6                    19.8
Total     13,394,413,409   100.0                   11,356,239,939   100.0                   17.9

China

Company   1Q15 Revenue     1Q15 Market Share (%)   1Q14 Revenue     1Q14 Market Share (%)   1Q15-1Q14 Growth (%)
Inspur    332,613,480      21                      227,328,256      17                      46
Dell      322,063,140      20                      246,281,271      19                      31
Lenovo    295,914,571      18                      80,084,826       6                       270
HP        217,487,450      14                      167,775,923      13                      30
Huawei    197,490,419      12                      189,963,266      14                      4
Sugon     140,377,091      9                       70,705,366       5                       99
Others    104,566,737      6                       329,549,621      25                      -68
Total     1,610,512,888    100.0                   1,311,688,529    100.0                   23

Source: Gartner (May 2015)


What is Mars for?

Mars
● High performance
● High volume of memory
● High bandwidth memory access
● High bandwidth I/O access
● Large scale cache coherency maintained

Earth
● Moderate performance
● High power efficiency
● High density computing
● High bandwidth memory access
● Low cost


Mars Overview

● Architecture Features
  ● 64 Xiaomi cores, ARMv8 compatible
  ● Hardware-maintained global cache coherency
  ● Panel-based data affinity architecture
  ● Mesh topology on chip network
  ● 32MB L2 cache
● 8 Cache & Memory Chips (CMC)
  ● 128MB L3 cache
  ● 16 DDR3-1600 channels
● Two 16-lane PCIE3.0 i/f
● ECC and parity protection on all caches, tags and TLBs
● Physical
  ● ~180M instances
  ● 2.0GHz@28nm
  ● 120W
● Performance
  ● Peak: 512GFLOPS
  ● Mem BW: 204GB/s
  ● I/O BW: 32GB/s

[Figure: Mars block diagram with eight panels (panel0 to panel7), two PCIe interfaces, and eight CMCs, each with two DDR3 channels]
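The headline performance figures on this slide can be cross-checked against the listed resources. The sketch below is a back-of-the-envelope check, not Phytium's own derivation. It assumes 4 double-precision FLOPs per core per cycle (one 128-bit FMA), a 64-bit data bus per DDR3 channel, and roughly 1 GB/s per PCIe 3.0 lane per direction; none of these three values is stated on the slide.

```python
# Back-of-the-envelope check of the Mars headline numbers.
# Values marked "assumed" are not stated on the slide.

CORES = 64
CLOCK_GHZ = 2.0
FLOPS_PER_CORE_PER_CYCLE = 4   # assumed: one 128-bit FMA = 2 DP lanes x 2 ops

DDR_CHANNELS = 16
DDR_MEGATRANSFERS = 1600       # DDR3-1600: 1600 mega-transfers per second
DDR_BYTES_PER_TRANSFER = 8     # assumed: 64-bit data bus per channel

PCIE_PORTS = 2
PCIE_LANES_PER_PORT = 16
PCIE_GBS_PER_LANE = 1.0        # assumed: ~1 GB/s per PCIe 3.0 lane, one direction


def peak_gflops() -> float:
    """Peak floating-point rate in GFLOPS."""
    return CORES * CLOCK_GHZ * FLOPS_PER_CORE_PER_CYCLE


def memory_bandwidth_gbs() -> float:
    """Aggregate DDR3 bandwidth in GB/s."""
    return DDR_CHANNELS * DDR_MEGATRANSFERS * DDR_BYTES_PER_TRANSFER / 1000


def io_bandwidth_gbs() -> float:
    """Aggregate PCIe bandwidth in GB/s (one direction)."""
    return PCIE_PORTS * PCIE_LANES_PER_PORT * PCIE_GBS_PER_LANE


print(f"Peak:   {peak_gflops():.0f} GFLOPS")
print(f"Mem BW: {memory_bandwidth_gbs():.1f} GB/s")
print(f"I/O BW: {io_bandwidth_gbs():.0f} GB/s")
```

Under these assumptions the three results line up with the 512GFLOPS, 204GB/s, and 32GB/s quoted on the slide.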

Panel Architecture

● Eight Xiaomi Cores
  ● Compatible design with ARMv8 arch license
  ● Both AArch32 and AArch64 modes
  ● EL0~EL3 supported
  ● ASIMD-128 supported
  ● Adv. hybrid Branch Prediction
  ● 4 fetch/4 decode/4 dispatch Out-of-Order superscalar pipeline
● Cache Hierarchy
  ● Separated L1 ICache and L1 DCache
  ● Shared L2 cache, 4MB in total
● Directory-based cache coherency maintenance
  ● Directory Control Unit (DCU)
● Routing Cell

[Figure: panel floorplan, 10600μm × 6000μm, with eight Xiaomi cores, two L2 cache blocks, two DCUs, and one Routing Cell]

Xiaomi Core

[Figure: Xiaomi core block diagram with ITLB, I Cache, BTB, DirPre, IndPre, SRS, Loop Detect, Instruction Buffer, four decoders, Rename Logic, Arch. Reg file, Phy. Reg file, Dispatch Logic, Reorder Buffer, Int/Bran Queue, three Integer Queues, FP/VT Queue, LD/ST Queue, ALU/BR, three ALU/SHF, two FP/SIMD, DTLB, D Cache, STB & Prefetch, L2 Cache, Prefetch, and Debug/Trace/Interrupt/Timer]

Xiaomi Core Front End

● 32KB L1 instr. Cache
  ● Next line prefetch
● Hybrid Branch Predictor
  ● 2048-entry BTB
  ● Direction predict with TAGE predictor
  ● 512-entry indirect predictor
  ● 48-entry Speculative Return Stack
● Four instructions fetched per cycle
● 32-entry instruction buffer
● Loop detect and Instr. Cache bypass

Xiaomi Core Decode, Rename & Dispatch

● Up to four instructions decoded per cycle
● 192 physical registers
● Up to four instructions renamed per cycle
● Up to four instructions dispatched per cycle
● Reorder buffer can hold 160 instructions, and about 210+ instructions can be in-flight in the whole pipeline.
● Dispatch in-order, execution out-of-order, retirement in-order.
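Register renaming with a pool of physical registers can be sketched as follows. This is a generic textbook-style renamer, not the Xiaomi core's actual logic: the class `RenameTable`, its methods, and the choice of 32 architectural registers are illustrative assumptions; only the 192 physical registers come from the slide.

```python
from __future__ import annotations

from collections import deque


class RenameTable:
    """Toy register renamer mapping architectural to physical registers."""

    def __init__(self, num_arch: int = 32, num_phys: int = 192):
        # Initially architectural register i lives in physical register i.
        self.map = list(range(num_arch))
        # All remaining physical registers start on the free list.
        self.free = deque(range(num_arch, num_phys))

    def rename(self, dst: int, srcs: list[int]) -> tuple[int, list[int], int]:
        """Rename one instruction.

        Returns (new_dst, phys_srcs, old_dst). old_dst is the previous
        mapping of dst and can be freed once the instruction retires.
        """
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        # Sources are looked up before the destination is remapped.
        phys_srcs = [self.map[s] for s in srcs]
        old_dst = self.map[dst]
        new_dst = self.free.popleft()
        self.map[dst] = new_dst
        return new_dst, phys_srcs, old_dst

    def retire(self, old_dst: int) -> None:
        """In-order retirement releases the overwritten mapping."""
        self.free.append(old_dst)


rt = RenameTable()
# x1 = x2 + x3, then x1 = x1 + x4: both write x1 but get distinct registers.
first = rt.rename(1, [2, 3])
second = rt.rename(1, [1, 4])
print(first, second)
```

The second instruction reads the physical register allocated by the first, so the true dependency is kept while the write-after-write hazard on x1 disappears.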

Xiaomi Core Function Units

● Four separate 16-entry integer queues
● One integer unit can process both multi-cycle integer instructions and branch instructions
● The other three integer units can only process single-cycle integer instructions
● One shared 16-entry floating point and ASIMD queue
● Two FP/ASIMD units equipped, which can be combined into one lockstep ASIMD unit
● FMA supported in both units
● FMUL: 3 cycles, FADD: 3 cycles, FMA: 6 cycles
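The latencies above have a practical consequence for software: a chain of dependent FMAs issues only one operation every 6 cycles, so several independent operations must be in flight to keep the units busy. The sketch below works through that arithmetic. It assumes each FP/ASIMD unit is fully pipelined (one new operation per cycle), which the slide does not state; the function names are illustrative.

```python
# Latencies from the slide, in cycles.
FMUL_LATENCY = 3
FADD_LATENCY = 3
FMA_LATENCY = 6
FP_UNITS = 2
ISSUE_PER_UNIT_PER_CYCLE = 1   # assumed: each unit is fully pipelined


def independent_ops_needed(latency: int, units: int) -> int:
    """Independent operations needed in flight to keep all units busy."""
    return latency * units * ISSUE_PER_UNIT_PER_CYCLE


def dependent_chain_cycles(n_ops: int, latency: int) -> int:
    """Cycles for n_ops operations where each depends on the previous one."""
    return n_ops * latency


print(independent_ops_needed(FMA_LATENCY, FP_UNITS))   # FMAs in flight
print(dependent_chain_cycles(10, FMA_LATENCY))         # 10 chained FMAs
print(FMUL_LATENCY + FADD_LATENCY == FMA_LATENCY)      # same latency as mul+add
```

Since FMUL followed by FADD has the same total latency as one FMA, the benefit of FMA under these assumptions is in issue bandwidth rather than in the length of a dependent chain.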

Xiaomi Core Function Units

● One 24-entry load/store queue
● 32KB L1 data cache
● 6 outstanding loads
● 4 cycles latency from load to use
● Next line and stride detected data prefetch
● Streamlined pattern auto detected
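Next-line and stride-based prefetching can be illustrated with a small model. This is a generic sketch of the two techniques, not a description of the Xiaomi core's prefetcher: the per-PC table, the two-matching-strides confidence rule, and the 64-byte line size are all assumptions made for the example.

```python
from __future__ import annotations

LINE_BYTES = 64  # assumed cache line size; not given on the slide


def next_line(addr: int, line_bytes: int = LINE_BYTES) -> int:
    """Address of the cache line following the one that contains addr."""
    return (addr // line_bytes + 1) * line_bytes


class StridePrefetcher:
    """Toy stride detector keyed by the program counter of the load."""

    def __init__(self) -> None:
        # pc -> (last address, last observed stride)
        self.table: dict[int, tuple[int, int]] = {}

    def access(self, pc: int, addr: int) -> int | None:
        """Record a load; return an address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0)
            return None
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        # Prefetch only after the same non-zero stride is seen twice in a row.
        if stride != 0 and stride == last_stride:
            return addr + stride
        return None


pf = StridePrefetcher()
for a in (0x1000, 0x1100, 0x1200):
    target = pf.access(pc=0x400, addr=a)
print(hex(target), hex(next_line(0x1008)))
```

The third access confirms a stride of 0x100, so the model requests the address one stride ahead of it.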


Cache coherence protocol

● Hawk cache coherence protocol
  ● Distributed directory-based global cache coherency
  ● MOESI-like packet-based coherence protocol
● A home node DCU (directory control unit) supports
  ● Affinitive pairing of L2Cs and CMCs
  ● “Infinite” capacity for non-conflicting Reads & Writes
  ● Optimized transaction flow for exclusive atomic accesses
  ● Reduced latency by cacheline forwarding

[Figure: Hawk protocol diagram with L2Cs, L3C & Memory, I/O, Interconnects, Global Exclusive Monitor, MEM, and Panel 0 to Panel N, each with Core0 to Core7, Coherence Logic, and Local Monitor]
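For readers unfamiliar with MOESI, the state transitions of a single cache line can be sketched as below. This is the textbook MOESI state machine as seen from one cache, offered only as background; the slides describe Hawk as "MOESI-like" and give no transition details, so nothing here should be read as Hawk's actual behaviour. The `State`, `Event`, and `next_state` names are illustrative.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    O = "Owned"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Event(Enum):
    LOCAL_READ = 1
    LOCAL_WRITE = 2
    REMOTE_READ = 3
    REMOTE_WRITE = 4


def next_state(state: State, event: Event, other_sharers: bool = False) -> State:
    """Textbook MOESI transition for one line in one cache."""
    if event is Event.LOCAL_WRITE:
        # Other copies are invalidated first; this cache holds the only copy.
        return State.M
    if event is Event.REMOTE_WRITE:
        # Another cache takes ownership for writing.
        return State.I
    if event is Event.LOCAL_READ:
        if state is State.I:
            return State.S if other_sharers else State.E
        return state
    # REMOTE_READ
    if state in (State.M, State.O):
        # Dirty data stays here; this cache supplies the line to the reader.
        return State.O
    if state is State.E:
        return State.S
    return state  # S stays S, I stays I


print(next_state(State.M, Event.REMOTE_READ).value)
```

The Owned state is what lets a cache with dirty data answer a remote read directly instead of writing back to memory first.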

● Uniform packet format for each port
● A port can be configured to be 4 physical channels for CC and 1 channel for debug
● DOR Y-X routing

[Table: Lat. (cycles), with Avg.]
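Dimension-order routing (DOR) in Y-X order means a packet first travels along the Y dimension until it reaches the destination row, then along X. A minimal sketch follows; the `(x, y)` coordinate scheme and the function name `yx_route` are assumptions for illustration, since the slides do not give the mesh dimensions or node numbering.

```python
from __future__ import annotations


def yx_route(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Return the (x, y) nodes visited from src to dst, Y first, then X."""
    x, y = src
    path = [(x, y)]
    # Phase 1: move along Y until the destination row is reached.
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))
    # Phase 2: move along X until the destination column is reached.
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))
    return path


route = yx_route((0, 0), (2, 1))
print(route, "hops:", len(route) - 1)
```

Because every packet resolves the two dimensions in the same fixed order, the route between two nodes is deterministic and the hop count equals the Manhattan distance.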


Cache & Memory Chip

● L3 cache
  ● 16MB Data Array
  ● 2MB Data ECC
● DDR bandwidth
  ● 2 x DDR3-800: 25.6GB/s
● Proprietary interface between Mars & CMC
  ● Parallel interface
  ● Needs more pins, but lower latency than serdes
  ● Separate write/cmd and read data channel
  ● Effective read channel bandwidth: 12.8GB/s
  ● Effective write/cmd channel bandwidth: 6.4GB/s

[Figure: CMC block diagram with Mars Interface, L3 Bank0 to L3 Bank3, Mem Ctrl0, Mem Ctrl1, and two DDR channels]
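The per-CMC figures can be scaled up to the whole system with simple arithmetic. The sketch below assumes all eight CMCs (the count given on the Mars Overview slide) are identical and fully populated; the dictionary keys are illustrative names.

```python
CMC_COUNT = 8                 # from the Mars Overview slide
L3_DATA_MB_PER_CMC = 16
L3_ECC_MB_PER_CMC = 2
DDR_GBS_PER_CMC = 25.6
READ_LINK_GBS = 12.8          # effective, per CMC
WRITE_CMD_LINK_GBS = 6.4      # effective, per CMC


def system_totals() -> dict[str, float]:
    """Aggregate the per-CMC figures over all CMCs."""
    return {
        "l3_data_mb": CMC_COUNT * L3_DATA_MB_PER_CMC,
        "ecc_overhead": L3_ECC_MB_PER_CMC / L3_DATA_MB_PER_CMC,
        "ddr_gbs": CMC_COUNT * DDR_GBS_PER_CMC,
        "read_link_gbs": CMC_COUNT * READ_LINK_GBS,
        "write_cmd_link_gbs": CMC_COUNT * WRITE_CMD_LINK_GBS,
    }


for name, value in system_totals().items():
    print(f"{name}: {value:g}")
```

The totals agree with the 128MB L3 and 204GB/s memory bandwidth quoted for the full chip. The 12.5% ECC overhead is the ratio one would get from 8 check bits per 64 data bits, although the slide does not name the code used. Note also that the combined effective link bandwidth per CMC (12.8 + 6.4 = 19.2GB/s) is lower than the 25.6GB/s DDR peak behind it.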


Latency of affinitive access

Memory access              Latency (ns)
Local L1 cache hit         ~2
Local L2 cache hit         ~8
Affinitive L2 cache hit    ~20
Affinitive L3 cache hit    ~36
Affinitive DDR access      ~70

● Panel: 2.0GHz
● NoC: 2.0GHz
● CMC: 1.5GHz

* PCB latency not considered

[Figure: two panels, each with eight Xiaomi cores, two L2 cache blocks, two DCUs, and a Routing Cell, connected to one CMC]
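The latencies in the table are given in nanoseconds; converting them to core clock cycles makes them easier to compare with pipeline figures. The sketch below uses the 2.0GHz panel clock for every level, which is a simplification because the CMC runs at 1.5GHz, and the source values are approximate to begin with.

```python
PANEL_GHZ = 2.0

# Approximate latencies from the slide, in nanoseconds.
LATENCY_NS = {
    "Local L1 cache hit": 2,
    "Local L2 cache hit": 8,
    "Affinitive L2 cache hit": 20,
    "Affinitive L3 cache hit": 36,
    "Affinitive DDR access": 70,
}


def to_core_cycles(ns: float, ghz: float = PANEL_GHZ) -> int:
    """Convert a latency in ns to core clock cycles (cycles = ns x GHz)."""
    return round(ns * ghz)


for access, ns in LATENCY_NS.items():
    print(f"{access:26s} ~{ns:>2} ns  ~{to_core_cycles(ns):>3} cycles")
```

The ~2ns local L1 hit corresponds to about 4 cycles at 2.0GHz, the same as the 4-cycle load-to-use latency listed for the Xiaomi core.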


Memory Tune (mTune)

● Rich Data Collection
  ● Number of cache hits/misses for L1/L2/L3
  ● Workload of cache pipelines
  ● Busyness of the NoC
  ● ECC corrections of the memory system
● Support Multiple Metrics
  ● Average Miss rate/Hit rate
  ● Minimal/Maximal/Average Access Latency
  ● Bandwidth Analysis
  ● Concurrent Average Memory Access Time (CAMAT)
● Support MPI/OpenMP Applications
  ● Thread behavior analysis
  ● Inter-process behavior analysis
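To show how hit/miss counters turn into the kind of metrics listed above, the sketch below computes a hit rate and the classic average memory access time (AMAT). It does not use any mTune interface, which the slides do not describe; the function names and the example numbers are hypothetical. CAMAT additionally accounts for overlapping accesses and is not modelled here.

```python
from __future__ import annotations


def hit_rate(hits: int, misses: int) -> float:
    """Fraction of accesses that hit; 0.0 when there were no accesses."""
    total = hits + misses
    return hits / total if total else 0.0


def amat(levels: list[tuple[float, float]], memory_latency: float) -> float:
    """Classic AMAT for a cache hierarchy.

    levels lists (hit_latency, miss_rate) from the innermost cache outward.
    AMAT = hit_1 + miss_1 * (hit_2 + miss_2 * (... + memory_latency)).
    """
    result = memory_latency
    for hit_latency, miss_rate in reversed(levels):
        result = hit_latency + miss_rate * result
    return result


# Hypothetical counters and latencies, for illustration only.
l1_miss = 1.0 - hit_rate(hits=900, misses=100)
l2_miss = 1.0 - hit_rate(hits=50, misses=50)
print(f"AMAT = {amat([(2, l1_miss), (8, l2_miss)], memory_latency=70):.2f}")
```

With a 10% L1 miss rate and a 50% L2 miss rate, the example evaluates 2 + 0.1 x (8 + 0.5 x 70).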


Physical Design

● 28nm process
● 0.9V core/1.8V IO
● 10 metal layers
● ~180M instances
● 2.0GHz
● 120W
● 640mm² die size
● FCBGA
● ~3000 pins

[Figure: die layout, 25.38mm × 25.2mm]


Performance Evaluation

SpecCPU2006

Configuration                        Metric               INT     FP
Single copy of SPEC CPU benchmark    SPEC_CPU2006_base    19.2    17.8
64 copies of SPEC CPU benchmark      SPEC_CPU2006_rate    672     585
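One rough way to read these results is to compare the 64-copy score with the single-copy score. The sketch below does that division. Treat the outcome as an indicator only: the single-copy and rate results are different SPEC metrics and are not strictly comparable, and the `scaling` helper is an illustrative name.

```python
COPIES = 64
SINGLE_COPY = {"INT": 19.2, "FP": 17.8}   # SPEC_CPU2006_base, from the slide
RATE = {"INT": 672, "FP": 585}            # SPEC_CPU2006_rate, from the slide


def scaling(suite: str) -> tuple[float, float]:
    """Return (rate / single-copy score, that ratio divided by COPIES)."""
    ratio = RATE[suite] / SINGLE_COPY[suite]
    return ratio, ratio / COPIES


for suite in ("INT", "FP"):
    ratio, per_copy = scaling(suite)
    print(f"{suite}: {ratio:.1f}x over one copy, {per_copy:.0%} per-copy efficiency")
```

By this measure, 64 copies deliver roughly 35 times the single-copy INT score and roughly 33 times the single-copy FP score.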
