Mars: A 64-core ARMv8 Processor
Charles Zhang
Phytium Technology Co., Ltd
Statements
The following slides introduce the general features of one of our products and do not constitute any commitment about it. They are for information purposes only and may not be incorporated into any contract. Purchasing decisions should not be based on them. The development, release, and timing of any features or functionality described here remain at the sole discretion of Phytium.
A Brief Introduction of Phytium
● China corporation, founded in 2012
  ● Guangzhou
  ● Tianjin
● Vision
  ● Leading-edge CPU and ASIC provider in China
● Market focuses on chips for
  ● Internet & Cloud Computing infrastructure
  ● Traditional workload mainframe servers
China is a Fast-growing Server Market
Worldwide (WW)
Company   1Q15 Revenue     1Q15 Market Share (%)   1Q14 Revenue     1Q14 Market Share (%)   1Q15-1Q14 Growth (%)
HP         3,191,694,948    23.8                    2,890,992,229    25.5                     10.4
Dell       2,296,473,026    17.1                    2,006,639,006    17.7                     14.4
IBM        1,887,939,141    14.1                    2,244,631,789    19.8                    -15.9
Lenovo       970,254,659     7.2                      127,973,470     1.1                    658.2
Cisco        890,179,930     6.6                      616,620,000     5.4                     44.4
Others     4,157,871,704    31.0                    3,469,383,444    30.6                     19.8
Total     13,394,413,409   100.0                   11,356,239,939   100.0                     17.9

China
Company   1Q15 Revenue     1Q15 Market Share (%)   1Q14 Revenue     1Q14 Market Share (%)   1Q15-1Q14 Growth (%)
Inspur       332,613,480    21                        227,328,256    17                       46
Dell         322,063,140    20                        246,281,271    19                       31
Lenovo       295,914,571    18                         80,084,826     6                      270
HP           217,487,450    14                        167,775,923    13                       30
Huawei       197,490,419    12                        189,963,266    14                        4
Sugon        140,377,091     9                         70,705,366     5                       99
Others       104,566,737     6                        329,549,621    25                      -68
Total      1,610,512,888   100.0                    1,311,688,529   100.0                     23

Source: Gartner (May 2015)
What is Mars for?
Mars
● High performance
● High volume of memory
● High bandwidth memory access
● High bandwidth I/O access
● Large scale cache coherency maintained
Earth
● Moderate performance
● High power efficiency
● High density computing
● High bandwidth memory access
● Low cost
Mars Overview
Architecture Features
● 64 Xiaomi cores, ARMv8 compatible
● Hardware-maintained global cache coherency
● Panel-based data affinity architecture
● Mesh topology on-chip network
● 32MB L2 cache
● 8 Cache & Memory Chips (CMC)
  ● 128MB L3 cache
  ● 16 DDR3-1600 channels
● Two 16-lane PCIe 3.0 interfaces
● ECC and parity protection on all caches, tags and TLBs
Physical
● ~180M instances
● 2.0GHz @ 28nm
● 120W
Performance
● Peak: 512 GFLOPS
● Mem BW: 204GB/s
● I/O BW: 32GB/s
[Figure: Mars block diagram showing panel0 to panel7, eight CMCs each attached to two DDR3 channels, and two PCIe interfaces]
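The headline figures on this slide can be cross-checked with simple arithmetic. The sketch below is not from the slides. It assumes 4 FLOPs per core per cycle (the value the 512 GFLOPS peak implies at 64 cores and 2.0GHz), an 8-byte-wide DDR3-1600 channel, and roughly 1 GB/s per PCIe 3.0 lane per direction. The per-panel L2 size and per-CMC L3 size are taken from later slides.

```python
# Back-of-the-envelope check of the Mars headline numbers.
# Assumptions (not stated on the slide): 4 FLOPs per core per cycle,
# 8-byte-wide DDR3 channels, ~1 GB/s per PCIe 3.0 lane per direction.

CORES = 64
FREQ_GHZ = 2.0
FLOPS_PER_CYCLE = 4            # assumed

PANELS = 8
L2_PER_PANEL_MB = 4            # "Shared L2 cache, totally 4MB" per panel
CMCS = 8
L3_PER_CMC_MB = 16             # "16MB Data Array" per CMC

DDR_CHANNELS = 16
DDR_MTPS = 1600                # DDR3-1600: 1600 mega-transfers per second
DDR_BYTES_PER_TRANSFER = 8     # assumed 64-bit channel

PCIE_PORTS = 2
PCIE_LANES = 16
PCIE_GBS_PER_LANE = 1.0        # assumed, approximate

peak_gflops = CORES * FREQ_GHZ * FLOPS_PER_CYCLE
l2_total_mb = PANELS * L2_PER_PANEL_MB
l3_total_mb = CMCS * L3_PER_CMC_MB
mem_bw_gbs = DDR_CHANNELS * DDR_MTPS * DDR_BYTES_PER_TRANSFER / 1000
io_bw_gbs = PCIE_PORTS * PCIE_LANES * PCIE_GBS_PER_LANE

print(f"peak   : {peak_gflops:.0f} GFLOPS")
print(f"L2     : {l2_total_mb} MB, L3: {l3_total_mb} MB")
print(f"mem BW : {mem_bw_gbs:.1f} GB/s")
print(f"I/O BW : {io_bw_gbs:.0f} GB/s")
```

Under these assumptions the totals line up with the slide: 512 GFLOPS, 32MB L2, 128MB L3, about 204GB/s of memory bandwidth and 32GB/s of I/O bandwidth.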
Panel Architecture
● Eight Xiaomi cores
  ● Compatible design with ARMv8 architecture license
  ● Both AArch32 and AArch64 modes
  ● EL0~EL3 supported
  ● ASIMD-128 supported
  ● Advanced hybrid branch prediction
  ● 4 fetch / 4 decode / 4 dispatch out-of-order superscalar pipeline
● Cache hierarchy
  ● Separate L1 ICache and L1 DCache
  ● Shared L2 cache, 4MB in total
● Directory-based cache coherency maintenance
  ● Directory Control Unit (DCU)
● Routing Cell
[Figure: panel floorplan with eight Xiaomi cores, two L2 cache blocks, two DCUs and a Routing Cell; dimensions 10600μm and 6000μm]
Xiaomi Core
[Figure: core block diagram. Front end: ITLB, I Cache, BTB, DirPre, IndPre, SRS, Loop Detect, Instruction Buffer, four decoders. Middle: Rename Logic, architectural and physical register files, Dispatch Logic, Reorder Buffer. Back end: one Int/Branch queue and three Integer queues feeding ALU/BR and three ALU/SHF units, an FP/VT queue feeding two FP/SIMD units, an LD/ST queue with DTLB, D Cache, STB & Prefetch, L2 Cache. Also Prefetch and Debug/Trace/Interrupt/Timer blocks.]
Xiaomi Core Front End
[Figure: core block diagram with the front end highlighted: ITLB, I Cache, BTB, DirPre, IndPre, SRS, Loop Detect, Instruction Buffer, Prefetch]
● 32KB L1 instruction cache
  ● Next-line prefetch
● Hybrid branch predictor
  ● 2048-entry BTB
  ● Direction prediction with TAGE predictor
  ● 512-entry indirect predictor
  ● 48-entry Speculative Return Stack
● Four instructions fetched per cycle
● 32-entry instruction buffer
● Loop detect and instruction cache bypass
Xiaomi Core Decode, Rename & Dispatch
[Figure: core block diagram with the decoders, rename logic, register files, dispatch logic and reorder buffer highlighted]
● Up to four instructions decoded per cycle
● 192 physical registers
● Up to four instructions renamed per cycle
● Up to four instructions dispatched per cycle
● Reorder buffer can hold 160 instructions, and about 210+ instructions can be in flight in the whole pipeline
● Dispatch in order, execute out of order, retire in order
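Register renaming with a pool of physical registers can be illustrated with a map table and a free list. The following is a minimal sketch, not the Xiaomi implementation: the class name `RenameTable`, the 32 architectural registers and the single unified register pool are all assumptions made for illustration. Only the 192 physical registers come from the slide.

```python
from collections import deque


class RenameTable:
    """Toy register renamer: map table plus free list (illustrative only)."""

    def __init__(self, num_arch=32, num_phys=192):
        # Initially architectural register i lives in physical register i.
        self.map = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))

    def rename(self, dst, srcs):
        """Rename one instruction; returns (new_dst, renamed_srcs, old_dst).

        Sources are looked up before the destination mapping is updated,
        so an instruction such as r1 = r1 + r2 reads the previous r1.
        """
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        phys_srcs = [self.map[s] for s in srcs]
        old = self.map[dst]
        new = self.free.popleft()
        self.map[dst] = new
        return new, phys_srcs, old

    def retire(self, old_dst):
        """At in-order retirement the previous mapping can be recycled."""
        self.free.append(old_dst)


rt = RenameTable()
# r1 = r2 + r3, then r1 = r1 + r4: the second write gets a fresh register
# and reads the physical register produced by the first.
first = rt.rename(1, [2, 3])
second = rt.rename(1, [1, 4])
print(first, second)
```

Because every write gets a fresh physical register, write-after-write and write-after-read hazards disappear, which is what lets instructions dispatched in order execute out of order.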
Xiaomi Core Function Units
[Figure: core block diagram with the integer queues, FP/VT queue, ALU/BR, ALU/SHF and FP/SIMD units highlighted]
● Four separate 16-entry integer queues
● One integer unit can process both multi-cycle integer instructions and branch instructions
● The other three integer units can only process single-cycle integer instructions
● One shared 16-entry floating point and ASIMD queue
● Two FP/ASIMD units, which can be combined into one lockstep ASIMD unit
● FMA supported in both units
● FMUL: 3 cycles, FADD: 3 cycles, FMA: 6 cycles
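The latencies above determine how much independent work a loop needs in order to keep both FP units busy. The sketch below applies Little's law (operations in flight = throughput x latency). It assumes each unit is fully pipelined and can start one operation per cycle, which the slide does not state. The function names are made up for this example.

```python
def independent_ops_needed(units, latency_cycles, issue_per_unit_per_cycle=1):
    """Little's law: ops in flight = throughput (ops/cycle) x latency (cycles)."""
    return units * issue_per_unit_per_cycle * latency_cycles


def chain_cycles(num_ops, latency_cycles):
    """Cycles for a fully dependent chain: each op waits for the previous one."""
    return num_ops * latency_cycles


FP_UNITS = 2
LATENCY = {"FMUL": 3, "FADD": 3, "FMA": 6}   # cycles, from the slide

needed = {op: independent_ops_needed(FP_UNITS, lat) for op, lat in LATENCY.items()}
print(needed)

# 100 dependent FMAs (for example a naive dot-product accumulation)
# versus the same 100 FMAs spread over enough independent accumulators.
dependent = chain_cycles(100, LATENCY["FMA"])
ideal = 100 // FP_UNITS
print(dependent, ideal)
```

Under these assumptions a reduction needs about 12 independent FMA chains to approach peak throughput, while a single dependent chain runs roughly 12 times slower than the ideal.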
Xiaomi Core Function Units (Load/Store)
[Figure: core block diagram with the LD/ST queue, DTLB, D Cache and STB & Prefetch highlighted]
● One 24-entry load/store queue
● 32KB L1 data cache
● 6 outstanding loads
● 4 cycles latency from load to use
● Next-line and stride-detected data prefetch
● Streaming patterns detected automatically
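Stride-detected prefetching can be sketched as follows. This is a generic single-stream detector written for illustration, not a description of the Mars prefetcher: the class name `StrideDetector`, the two-in-a-row confirmation rule and the `degree` parameter are assumptions.

```python
class StrideDetector:
    """Toy single-stream stride prefetcher (illustrative, not the Mars design)."""

    def __init__(self, degree=1):
        self.last_addr = None
        self.last_stride = None
        self.degree = degree  # how many strides ahead to prefetch

    def access(self, addr):
        """Record a load address; return the list of addresses to prefetch."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Confirm the stride only after seeing it twice in a row.
            if stride != 0 and stride == self.last_stride:
                prefetches = [addr + stride * i for i in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches


det = StrideDetector()
trace = [0x1000, 0x1040, 0x1080, 0x10C0, 0x2000]
issued = [det.access(a) for a in trace]
print([[hex(p) for p in group] for group in issued])
```

In this trace the 0x40 stride is confirmed on the third access, so prefetches are issued for 0x10C0 and 0x1100; the jump to 0x2000 breaks the pattern and nothing is prefetched.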
Cache coherence protocol
● Hawk cache coherence protocol
  ● Distributed directory-based global cache coherency
  ● MOESI-like packet-based coherence protocol
● A home node DCU (directory control unit) supports
  ● Affinitive pairing of L2Cs and CMCs
  ● “Infinite” capacity for non-conflicting reads and writes
  ● Optimized transaction flow for exclusive atomic accesses
  ● Reduced latency by cacheline forwarding
[Figure: Hawk protocol diagram. Labels: L2C (x3), Hawk, L3C & Memory, I/O, Interconnects, Global Exclusive Monitor, MEM, and Panel 0 to Panel N, each with Core0 to Core7, Coherence Logic and a Local Monitor]
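The slides describe Hawk only as "MOESI-like", so its exact states and messages are not known. As a reference point, the sketch below models textbook MOESI state changes for a single cache line shared by several caches, with the list `states` standing in for the directory's view. The function names `read` and `write` are hypothetical, and nothing here should be read as the actual Hawk transaction flow.

```python
def read(states, c):
    """Cache c reads the line. Returns where the data comes from."""
    if states[c] != "I":
        return "hit"
    source = "memory"
    for k, s in enumerate(states):
        if s in ("M", "O"):
            states[k] = "O"        # dirty owner keeps ownership, forwards data
            source = f"cache{k}"
        elif s == "E":
            states[k] = "S"        # clean exclusive copy is downgraded
    others_have_copy = any(s != "I" for s in states)
    states[c] = "S" if others_have_copy else "E"
    return source


def write(states, c):
    """Cache c writes the line: every other copy is invalidated."""
    for k in range(len(states)):
        if k != c:
            states[k] = "I"
    states[c] = "M"


states = ["I", "I", "I"]
log = []
log.append((read(states, 0), list(states)))   # first reader gets Exclusive
write(states, 0)
log.append(("write0", list(states)))          # E -> M without other traffic
log.append((read(states, 1), list(states)))   # owner forwards, M -> O
write(states, 2)
log.append(("write2", list(states)))          # all other copies invalidated
for entry in log:
    print(entry)
```

The third step shows the kind of cacheline forwarding the slide mentions: the dirty owner supplies the data directly and moves to Owned instead of first writing back to memory.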
● Uniform packet format for each port
● A port can be configured as 4 physical channels for CC and 1 channel for debug
● DOR Y-X routing
[Table: latency (cycles), including an average; the values are not recoverable from the source]
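DOR (dimension-order routing) in Y-X order means a packet first travels along the Y dimension until it reaches the destination row, then along X. A minimal sketch on a 2D mesh follows. The coordinate convention `(x, y)` and the function name `yx_route` are assumptions, since the slides do not give the mesh dimensions or node numbering.

```python
def yx_route(src, dst):
    """Return the list of mesh nodes visited from src to dst, Y dimension first."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while y != dy:                 # resolve the Y offset completely first
        y += 1 if dy > y else -1
        path.append((x, y))
    while x != dx:                 # then resolve the X offset
        x += 1 if dx > x else -1
        path.append((x, y))
    return path


route = yx_route((0, 0), (3, 2))
hops = len(route) - 1
print(route, hops)
```

The route is fully determined by source and destination, and packets never turn from X back to Y, which is the property that makes dimension-order routing deadlock-free on a mesh.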
Cache & Memory Chip
● L3 cache
  ● 16MB data array
  ● 2MB data ECC
● DDR bandwidth
  ● 2 x DDR3-800: 25.6GB/s
● Proprietary interface between Mars & CMC
  ● Parallel interface
  ● Needs more pins, but lower latency than SerDes
  ● Separate write/cmd and read data channels
  ● Effective read channel bandwidth: 12.8GB/s
  ● Effective write/cmd channel bandwidth: 6.4GB/s
[Figure: CMC block diagram with the Mars Interface, L3 Bank0 to L3 Bank3, and Mem Ctrl0 and Mem Ctrl1, each driving a DDR channel]
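The per-CMC figures can be scaled to the whole system with a few lines of arithmetic. The sketch below uses only numbers from the slides (eight CMCs come from the overview slide). The interpretation in the comments, such as the ECC ratio matching 8 check bits per 64 data bits, is an observation and not something the slides state.

```python
CMCS = 8                       # from the Mars overview slide
DDR_PER_CMC_GBS = 25.6         # two DDR3 channels per CMC
READ_LINK_GBS = 12.8           # effective Mars <-> CMC read channel
WRITE_CMD_LINK_GBS = 6.4       # effective Mars <-> CMC write/cmd channel
L3_DATA_MB = 16
L3_ECC_MB = 2

total_ddr = CMCS * DDR_PER_CMC_GBS
total_read = CMCS * READ_LINK_GBS
total_write = CMCS * WRITE_CMD_LINK_GBS
read_to_ddr_ratio = READ_LINK_GBS / DDR_PER_CMC_GBS
ecc_ratio = L3_ECC_MB / L3_DATA_MB   # 0.125, i.e. 8 check bits per 64 data bits

print(f"DDR side   : {total_ddr:.1f} GB/s")
print(f"read links : {total_read:.1f} GB/s")
print(f"write links: {total_write:.1f} GB/s")
print(f"read link / DDR per CMC: {read_to_ddr_ratio:.2f}, ECC overhead: {ecc_ratio:.3f}")
```

The DDR-side total of 204.8GB/s matches the 204GB/s memory bandwidth quoted on the overview slide. The 25.6GB/s per CMC also corresponds to two 8-byte channels at 1600 MT/s, so "DDR3-800" here presumably refers to the 800MHz clock of the DDR3-1600 channels listed in the overview.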
Latency of affinitive access
Memory access               Latency (ns)
Local L1 cache hit          ~2
Local L2 cache hit          ~8
Affinitive L2 cache hit     ~20
Affinitive L3 cache hit     ~36
Affinitive DDR access       ~70

● Panel: 2.0GHz
● NoC: 2.0GHz
● CMC: 1.5GHz
* PCB latency not considered
[Figure: two panels, each with eight Xiaomi cores, two L2 cache blocks, two DCUs and a Routing Cell, together with a CMC]
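The latencies in the table can be converted to core cycles and combined into an average access latency. In the sketch below the latencies and the 2.0GHz clock come from the slide, while the access mix is purely hypothetical and chosen only to show the calculation.

```python
CORE_GHZ = 2.0
LATENCY_NS = {
    "local L1 hit": 2,
    "local L2 hit": 8,
    "affinitive L2 hit": 20,
    "affinitive L3 hit": 36,
    "affinitive DDR access": 70,
}

# ns x GHz = core clock cycles
cycles = {level: ns * CORE_GHZ for level, ns in LATENCY_NS.items()}


def average_latency_ns(fractions):
    """Weighted average latency; fractions say where each access is served."""
    assert abs(sum(fractions.values()) - 1.0) < 1e-9
    return sum(fractions[level] * LATENCY_NS[level] for level in fractions)


# Hypothetical mix, purely for illustration (not measured data).
mix = {
    "local L1 hit": 0.90,
    "local L2 hit": 0.06,
    "affinitive L2 hit": 0.02,
    "affinitive L3 hit": 0.01,
    "affinitive DDR access": 0.01,
}
avg_ns = average_latency_ns(mix)
print(cycles)
print(f"average latency: {avg_ns:.2f} ns")
```

The ~2ns L1 hit corresponds to 4 cycles at 2.0GHz, which agrees with the 4-cycle load-to-use latency quoted for the core.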
Memory Tune (mTune)
● Rich data collection
  ● Number of cache hits/misses for L1/L2/L3
  ● Workload of cache pipelines
  ● Busyness of the NoC
  ● ECC corrections of the memory system
● Support for multiple metrics
  ● Average miss rate/hit rate
  ● Minimal/maximal/average access latency
  ● Bandwidth analysis
  ● Concurrent Average Memory Access Time (CAMAT)
● Support for MPI/OpenMP applications
  ● Thread behavior analysis
  ● Inter-process behavior analysis
Physical Design
● 28nm process
● 0.9V core / 1.8V I/O
● 10 metal layers
● ~180M instances
● 2.0GHz
● 120W
● 640mm² die size
● FCBGA
● ~3000 pins
[Figure: die plot, 25.38mm x 25.2mm]
Performance Evaluation
SPEC CPU2006
                                    INT     FP
Single copy (SPEC_CPU2006_base)     19.2    17.8
64 copies (SPEC_CPU2006_rate)       672     585
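A rough indication of how throughput scales from one copy to 64 copies is the ratio of the two scores. The sketch below assumes the INT values are 19.2 (single copy) and 672 (64 copies) and the FP values are 17.8 and 585, as read from the charts. The single-copy base score and the rate score are different SPEC metrics, so the ratio is only an approximation of scaling efficiency.

```python
COPIES = 64
SINGLE = {"INT": 19.2, "FP": 17.8}    # single copy, SPEC_CPU2006_base
RATE = {"INT": 672.0, "FP": 585.0}    # 64 copies, SPEC_CPU2006_rate

scaling = {k: RATE[k] / SINGLE[k] for k in SINGLE}
efficiency = {k: scaling[k] / COPIES for k in SINGLE}

for k in SINGLE:
    print(f"{k}: {scaling[k]:.1f}x with {COPIES} copies, "
          f"{efficiency[k] * 100:.0f}% of linear")
```

Under these assumptions 64 copies deliver roughly 35x (INT) and 33x (FP) the single-copy score, that is a little over half of linear scaling, which is typical when many copies share caches and memory bandwidth.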