Hot Chips 2018: NEC Vector Processor Live Blog

by Ian Cutress on August 21, 2018 8:55 PM EST


08:59PM EDT - We saw this at Supercomputing last year: NEC's new Vector PCIe co-processor. I've wanted to write about it for a while, so I'm glad it's being presented here at Hot Chips. The talk is set to start at 6pm PT / 1am UTC.

09:00PM EDT - SX-Aurora TSUBASA

09:00PM EDT - NEC is a vector processor supercomputer-based company

09:01PM EDT - SX-ACE in 2013 was 28 nm, 1.0 GHz, 256 GFLOPs, 256GB/sec per processor

09:01PM EDT - Do a lot of compiler work

09:02PM EDT - Vector computing for HPC

09:02PM EDT - Previously focused on large-scale supercomputers, so NEC developed a new supercomputer with vector technology based on the Vector Engine

09:03PM EDT - Each node is x86 CPU + Vector Engine with high memory bandwidth and flexible configuration

09:03PM EDT - x86 and Linux, Fortran/C/C++

09:03PM EDT - Automatic vectorization and parallelization by proven vector compiler

09:04PM EDT - Vector Engine is TRANSPARENT to code

09:05PM EDT - Scales from Desktop Tower up to supercomputer

09:05PM EDT - Air cooled and liquid cooled cards

09:06PM EDT - Only one 8-pin connector

09:06PM EDT - PCIe 3.0 x16

09:06PM EDT - Dual slot, sub-300W

09:06PM EDT - only one 8-pin and 300W?

09:07PM EDT - 6 x HBM2

09:07PM EDT - TSMC and Broadcom assist

09:07PM EDT - 1.2 TB/s bandwidth

09:08PM EDT - 2.5D Interposer

09:08PM EDT - VE processor is 15x33mm

09:09PM EDT - 16MB LLC, 2D Mesh

09:09PM EDT - 1.6 GHz, 4.95 TFLOP single precision

09:09PM EDT - Up to 48GB HBM2

09:09PM EDT - 16nm FF

09:09PM EDT - 307GF in DP per core

09:09PM EDT - 8 vector cores inside

09:10PM EDT - Vector processing unit and scalar processing unit

09:10PM EDT - Scalar Processing Unit provides basic functionality - fetch, decode, branch, add, exceptions

09:10PM EDT - Controls the status of the core

09:11PM EDT - Address translation and data forwarding crossbar

09:11PM EDT - 16 elements/cycle vector address generation and translation, 17 requests/cycle issuing

09:11PM EDT - 409.6 GB/sec load and store data forwarding

09:12PM EDT - Four pipelines, each 32-way parallel

09:12PM EDT - Total 96 FMAs per core

09:12PM EDT - Doubled SP perf by 32-bit x 2 packed vector data
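
As a quick sanity check on the throughput numbers above, here is a minimal sketch of the peak-FLOPS arithmetic. It uses only figures quoted in the talk (1.6 GHz, 8 cores, 96 FMAs per core) and assumes each FMA counts as two floating-point operations per cycle and that packed 32-bit x 2 data exactly doubles the single-precision rate. The function and variable names are my own.

```python
# Peak-throughput arithmetic using figures quoted in the talk.
CLOCK_GHZ = 1.6        # core clock
CORES = 8              # vector cores per Vector Engine
FMAS_PER_CORE = 96     # total FMA units per core


def peak_gflops(clock_ghz, fma_units, packed_factor=1):
    """Peak GFLOPS, counting each FMA as two floating-point ops per cycle.

    packed_factor=2 models two 32-bit values packed into one 64-bit element
    (an assumption about how the SP doubling works out).
    """
    return clock_ghz * fma_units * 2 * packed_factor


dp_per_core = peak_gflops(CLOCK_GHZ, FMAS_PER_CORE)
dp_per_chip = dp_per_core * CORES
sp_per_chip = peak_gflops(CLOCK_GHZ, FMAS_PER_CORE, packed_factor=2) * CORES

print(f"DP per core: {dp_per_core:.1f} GFLOPS")
print(f"DP per chip: {dp_per_chip / 1000:.2f} TFLOPS")
print(f"SP per chip: {sp_per_chip / 1000:.2f} TFLOPS")
```

Under those assumptions the per-core figure comes out at about 307 GFLOPS in double precision, which matches the number quoted, and the chip lands at roughly 2.46 TFLOPS DP and 4.9 TFLOPS SP, a touch under the 4.95 TFLOP figure I noted down.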

09:12PM EDT - Vector register renaming with 256 physical VRs

09:13PM EDT - 7 VR banks per Vector Pipeline

09:13PM EDT - 32 VPPs per core

09:13PM EDT - OoO scheduling

09:13PM EDT - Dedicated complex operation pipeline

09:14PM EDT - Scalar part of core is traditional front end

09:14PM EDT - 4-instruction fetch and decode

09:14PM EDT - branch predictor

09:14PM EDT - 8-level speculative execution

09:15PM EDT - 32KB L1 cache and unified 256KB L2

09:15PM EDT - hardware prefetch

09:15PM EDT - Support for contiguous vector instruction pipes

09:16PM EDT - Memory subsystem to support 3 TB/s LLC bandwidth

09:16PM EDT - 1.2 TB/s HBM2 bandwidth
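
For context on the bandwidth figures, a rough sketch of where the HBM2 number could come from and what it means in bytes per FLOP. The six stacks, 1.6 GHz clock, 307 GF per core, 8 cores, 409.6 GB/sec per-core forwarding and the SX-ACE figures are from the talk. The 1024-bit interface per stack is standard HBM2, but the 1.6 Gbps per-pin rate is my assumption, not something NEC stated.

```python
# Memory-bandwidth arithmetic; values marked "assumed" were not given in the talk.
HBM2_STACKS = 6
HBM2_BUS_BITS = 1024       # interface width per HBM2 stack
HBM2_PIN_GBPS = 1.6        # assumed per-pin data rate


def stack_bandwidth_gbs(bus_bits, pin_gbps):
    """Bandwidth of one stack in GB/s: bits per transfer times rate, over 8."""
    return bus_bits * pin_gbps / 8


def bytes_per_flop(bandwidth_gbs, peak_gflops):
    """Memory bytes available per floating-point operation at peak."""
    return bandwidth_gbs / peak_gflops


hbm_total_gbs = HBM2_STACKS * stack_bandwidth_gbs(HBM2_BUS_BITS, HBM2_PIN_GBPS)

ve_dp_gflops = 307 * 8     # quoted DP per core times 8 cores
ve_ratio = bytes_per_flop(hbm_total_gbs, ve_dp_gflops)
sx_ace_ratio = bytes_per_flop(256, 256)   # SX-ACE: 256 GB/sec, 256 GFLOPs

# Per-core load/store forwarding expressed as bytes per clock at 1.6 GHz.
forwarding_bytes_per_cycle = round(409.6 / 1.6)

print(f"HBM2 total: {hbm_total_gbs:.1f} GB/s")
print(f"VE bytes/FLOP: {ve_ratio:.2f}, SX-ACE bytes/FLOP: {sx_ace_ratio:.2f}")
print(f"Forwarding: {forwarding_bytes_per_cycle} bytes/cycle per core")
```

With that assumed pin rate, six stacks come to about 1.23 TB/s, in line with the 1.2 TB/s quoted. The Vector Engine ends up at roughly half a byte per FLOP against SX-ACE's one byte per FLOP.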

09:16PM EDT - Scalar L1/L2 in each core

09:16PM EDT - 2 memory networks - 2D Mesh NoC for cores, also ring bus for DMA

09:16PM EDT - DMA engine used by vector cores and x86 node

09:17PM EDT - DMA engine can be virtualized

09:17PM EDT - Can access VE Memory, VE Registers, and x86 memory

09:17PM EDT - mapped through PCIe

09:17PM EDT - 2D mesh maximises bandwidth with minimal wiring

09:17PM EDT - 16 layer mesh

09:18PM EDT - Age based QoS control

09:18PM EDT - Dimension ordered routing
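
For anyone unfamiliar with the two terms above, here is a generic sketch of dimension-ordered routing on a 2D mesh plus an oldest-first arbiter. This is the textbook form, not NEC's implementation: the talk did not say which dimension is routed first, what the mesh coordinates look like, or how age is tracked, so the X-then-Y order, the `(x, y)` tuples and the `injected_at` field are all hypothetical.

```python
def xy_route(src, dst):
    """Return every mesh node visited from src to dst, X dimension first.

    Fully resolving one dimension before the other is what makes the
    routing "dimension ordered"; it gives one deterministic path per pair.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:                 # travel along X until the column matches
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:                 # then travel along Y to the destination
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


def pick_oldest(requests):
    """Age-based arbitration: grant the request injected earliest."""
    return min(requests, key=lambda r: r["injected_at"])


print(xy_route((0, 0), (2, 1)))
print(pick_oldest([
    {"id": "a", "injected_at": 5},
    {"id": "b", "injected_at": 2},
])["id"])
```

The hop count always equals the Manhattan distance between the two nodes, and favouring the oldest request keeps any single packet from being starved at a contended router.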

09:19PM EDT - L3 is 16MB write back

09:19PM EDT - Inclusive of L1 and L2

09:19PM EDT - 128 banks, auto data scrubbing

09:19PM EDT - assignable data buffer feature

09:20PM EDT - Vector Engine is much cheaper than V100

09:21PM EDT - Design target for VE was not the best perf/watt, but to beat Xeon. Not optimized yet

09:22PM EDT - Performance on machine learning up to 107x over Xeon

09:23PM EDT - Very competitive with GPGPU using standard programming paradigms

09:23PM EDT - Q&A

09:31PM EDT - Sorry, left to ask a question about power. That's a wrap, hope you enjoyed our Hot Chips coverage!


2 Comments

SarahKerrigan - Tuesday, August 21, 2018
So, went from effectively two FMA pipes (2xMUL and 2xADD, for chained ops) to three, plus doubling SP throughput and significantly increasing ADB size. Not too shabby!

Yojimbo - Wednesday, August 22, 2018
Maybe an American company can come up with something like this so that the DOE can have the second architecture they are looking for while Intel are mucking about.

As far as the Vector Engine being much cheaper than a V100. I'm betting they are comparing the cost to build a Vector Engine with the cost to buy a V100. If so, that's a bad comparison. They aren't making their money by selling Vector Engines they are making their money on the supercomputer. And the V100s that go into a supercomputer aren't bought at the price an enterprise customer would pay. The real comparison is the price/performance of a supercomputer that uses Vector Engines and a supercomputer that uses V100s. I doubt implementing the Vector Engines is much cheaper than implementing the V100s for comparable performance.
