Intel Unified Memory Architecture (Intel UMA) for Graph Analytics A Step towards Future AI and Scalable HPC DPG/DPEA: Pathfinding and Architecture Innovation Project

Intel® Programmable Integrated Unified Memory Architecture (PIUMA)
Hardware for Faster and Deeper Insights from Large Scale Graph
Nikhil Deshpande, Ph.D. (Nikhil.m.Deshpande@intel.com)
Product Director, AI and HPC Innovations, Intel Corporation

Copyright © 2021, Intel Corporation

Today

• Environment
• Graph Usages and Scalability Issues
• Graph and Traditional Compute
• Rethinking Hardware - Intel® PIUMA
• Programming Intel® PIUMA
• Summary

Environment

• Data Deluge
• What are we after: Data, Knowledge or Insights?
• Is Data still the oil? Refining perhaps?
• Graphs: Best Representation for Capturing Relationships/Insights
• Faster and Deeper

Source: Data Age 2025, Sponsored by Seagate with Data from IDC Global Datasphere, Nov. 2018

Usages and Scalability Wall

[Figure: usage and industry labels: Knowledge Graph, Intrusion Detection, Fraud Detection, Risk Mitigation, Genomics, Precision Medicine, Healthcare, Telecom, Finance, Anti Money Laundering, Drug Discovery, Deep Analytics, Platforms, Social Networks, Retail, Identity Graph, Industrial, Recommendation, Energy, Customer 360, Predictive Monitoring, Supply Chain Optimization, Real Time Analytics.]

[Figure: chart with axes Communications Intensity and Data Size. Regions labeled "Performance Scales" and "Performance does not scale", separated by a "Scalability Wall". Workload labels: Traditional IaaS; Search, Hadoop; Large Scale Graphs.]

New Workload Behavior Needs New Thinking

Branch Prediction
  Regular: Branches have regular pattern
  Graph: Branch outcome is data dependent ("pointer hopping")

Locality
  Regular: Same or neighboring data likely used
  Graph: Data is randomly scattered

Data Access
  Regular: Same operation on neighboring data
  Graph: Operations on scattered data

Graph Traversal is all about Pointer Chasing ... with lowest latency!

Today: Caching, Large message sizes, FLOPs...

Needs: Granular access, Small message sizes, +Traversed Edges Per Second!
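
To make "pointer chasing" and "Traversed Edges Per Second" concrete, here is a minimal sketch of a breadth-first search over a compressed sparse row (CSR) graph that counts every edge it follows. The function names (`build_csr`, `bfs_count_edges`, `measure_teps`) and the toy graph are illustrative assumptions, not part of the presentation or of any Intel PIUMA software.

```python
from collections import deque
import time


def build_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) from a directed edge list."""
    degree = [0] * num_vertices
    for src, _ in edges:
        degree[src] += 1
    row_ptr = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row_ptr[v + 1] = row_ptr[v] + degree[v]
    col_idx = [0] * len(edges)
    cursor = row_ptr[:-1].copy()  # next free slot for each vertex
    for src, dst in edges:
        col_idx[cursor[src]] = dst
        cursor[src] += 1
    return row_ptr, col_idx


def bfs_count_edges(row_ptr, col_idx, source):
    """BFS from source; returns (hop distances, number of edges traversed)."""
    dist = [-1] * (len(row_ptr) - 1)
    dist[source] = 0
    queue = deque([source])
    traversed = 0
    while queue:
        v = queue.popleft()
        # Each neighbor lookup is a data-dependent, scattered memory access.
        for i in range(row_ptr[v], row_ptr[v + 1]):
            w = col_idx[i]
            traversed += 1
            if dist[w] == -1:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist, traversed


def measure_teps(row_ptr, col_idx, source):
    """Traversed edges per second for one BFS run (wall-clock based)."""
    start = time.perf_counter()
    _, traversed = bfs_count_edges(row_ptr, col_idx, source)
    elapsed = time.perf_counter() - start
    return traversed / elapsed if elapsed > 0 else float("inf")


# Toy graph (hypothetical): 5 vertices, 5 directed edges.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
row_ptr, col_idx = build_csr(5, edges)
dist, traversed = bfs_count_edges(row_ptr, col_idx, 0)
```

The inner loop performs no arithmetic to speak of; its cost is dominated by the dependent loads of `col_idx[i]` and `dist[w]`, which is why the slide emphasizes latency and small accesses rather than FLOPs.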

Intel® PIUMA Technology
Balance I/O, Memory, Compute and Prioritize in that order

• 1 PIUMA core = multiple processor pipelines with multiple threads
• Novel 64-bit RISC instruction set, graph optimized instructions
• Global address space (GAS), accessible to all other cores
• Memory controller with 8B granularity
• Network interface with 8B packets
• Tiles → Nodes → Systems → Millions of threads...

TOPS: Tera-Operations Per Second
TTEPS: Tera-Traversed Edges Per Second
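
A short arithmetic sketch of why 8B granularity matters for scattered accesses: if a program needs one 8-byte word but the memory system moves a whole cache line, most of the transferred bytes are wasted. The 64-byte line size below is an assumption (a common cache-line size on x86 processors), not a figure from the presentation; the 8-byte word size is taken from the slide.

```python
def useful_fraction(payload_bytes, transfer_bytes):
    """Fraction of the transferred bytes that the program actually needed."""
    return payload_bytes / transfer_bytes


WORD_BYTES = 8         # from the slide: 8B memory and network granularity
CACHE_LINE_BYTES = 64  # assumption: typical cache-line size on x86

# One random 8-byte read served by a 64-byte line transfer.
coarse = useful_fraction(WORD_BYTES, CACHE_LINE_BYTES)

# The same read served at 8-byte granularity.
fine = useful_fraction(WORD_BYTES, WORD_BYTES)

# Bytes moved for one million scattered 8-byte reads under each scheme.
reads = 1_000_000
bytes_coarse = reads * CACHE_LINE_BYTES
bytes_fine = reads * WORD_BYTES
```

Under these assumptions only one eighth of the coarse-grained traffic is useful, so fine-grained access moves 8x fewer bytes for the same work when there is no spatial locality to exploit.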

Intel's Approach: Intel® PIUMA
A Programmable Integrated Unified Memory Architecture

Re-imagined Architecture
• CPU Support for Small, Irregular Memory Accesses
• Near-Memory Atomics

Fully Integrated
• Global Memory Model
• Packaging for High IO & Memory bandwidth

Architected to scale
• Network as 1st-class Citizen
• Flatten Latency Hierarchy
• Point-to-Point Messages

Intel® PIUMA - Scalable System

[Figure: system diagram. Labels: Sub-nodes 1, 2, ..., N; x86 Hosts 1 ... Q; ENet; Disk Pool; PIUMA Nodes 1, 2, ..., P-1, P; PIUMA Global Address Space Network.]

• Distributed global address space across the full system
• Hierarchical topology with all-to-all connections → low diameter
• High radix fabric between tiles and between nodes for high bandwidth and low latency

Credit: Fryman et al., HPEC 2020
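
The link between all-to-all connections and low diameter can be checked on small examples. The sketch below computes the diameter (the longest shortest path, in hops) of three toy topologies. The `two_level` construction is a hypothetical hierarchy chosen for illustration only; it is not claimed to be the actual PIUMA network.

```python
from collections import deque


def diameter(adj):
    """Longest shortest path, in hops, over all pairs of a connected graph."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best


def ring(n):
    """Each endpoint links only to its two neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def all_to_all(n):
    """Every endpoint links directly to every other endpoint."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def two_level(groups, size):
    """Hypothetical hierarchy: all-to-all inside each group, and member i of
    each group linked to member i of every other group."""
    adj = {}
    for g in range(groups):
        for i in range(size):
            local = [(g, j) for j in range(size) if j != i]
            remote = [(h, i) for h in range(groups) if h != g]
            adj[(g, i)] = local + remote
    return adj


ring_diameter = diameter(ring(16))
flat_diameter = diameter(all_to_all(16))
hier_diameter = diameter(two_level(4, 4))
```

For 16 endpoints the ring needs up to 8 hops, the flat all-to-all needs 1, and the two-level hierarchy needs at most 2 while using far fewer links per endpoint than the flat design. That trade-off is the usual motivation for hierarchical all-to-all fabrics.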

Speedup: Intel® PIUMA versus 1 Intel® Xeon® node

Application                  Intel® PIUMA 1 node   Intel® PIUMA 16 nodes
Application Classification   6.9x                  111x
Random Walks                 279x                  2,606x
Graph Search                 34x                   544x
Louvain Community            41x                   555x
TIES Sampler                 93x                   419x
Graph2Vec                    42x                   178x
GraphSAGE                    3.1x                  46x
Graph Wave                   8.0x                  125x
Parallel Decoding FST        6.8x                  109x
Geolocation                  15x                   243x
SpMV                         29x                   467x
SpMSpV                       111x                  1,387x
Breadth-First Search         7.5x                  117x

*Results have been estimated or simulated. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
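
One way to read the speedup figures is to ask how much of the ideal 16x gain each workload retains when going from 1 to 16 PIUMA nodes. The sketch below computes that scaling efficiency for a few rows, using the values as read from the slide in row order; the helper name `scaling_efficiency` is an illustrative assumption, and the inputs are the slide's estimated or simulated results.

```python
# (speedup on 1 PIUMA node, speedup on 16 PIUMA nodes), both relative to
# one Xeon node, as read from the slide.
SPEEDUPS = {
    "Application Classification": (6.9, 111),
    "Random Walks": (279, 2606),
    "TIES Sampler": (93, 419),
    "Graph Wave": (8.0, 125),
    "SpMSpV": (111, 1387),
    "Breadth-First Search": (7.5, 117),
}


def scaling_efficiency(one_node, many_nodes, nodes=16):
    """Fraction of ideal linear scaling achieved when adding nodes."""
    return (many_nodes / one_node) / nodes


efficiency = {
    name: scaling_efficiency(s1, s16) for name, (s1, s16) in SPEEDUPS.items()
}

for name, value in sorted(efficiency.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {value:6.1%}")
```

Values slightly above 100% (for example Application Classification) most likely reflect rounding in the published speedups rather than superlinear scaling. Rows such as TIES Sampler and Random Walks retain a smaller share of the ideal gain, even though their absolute speedups remain large.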

PIUMA Software Stack

• Performance, Performance, Performance!
• Current support: C++, pthreads, OpenMP
• Under development: graph libraries (GraphBLAS), other programming languages (OpenCL), Python frontend
• Customer Choice and Options

[Figure: software stack spanning Raw Performance to Ease of Use. Layers as listed: Applications; APIs; Frameworks (Caffe, TensorFlow, Neon nGraph, etc.); LLVM; Intel MKL, GraphBLAS, etc.; Hardware Low-level APIs; Native ISA; Intel® PIUMA.]
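
For readers unfamiliar with the GraphBLAS style mentioned above, the following sketch expresses breadth-first search as repeated sparse matrix times sparse vector products (SpMSpV) over a Boolean semiring, with visited vertices masked out. It is plain Python written for illustration; it does not use the GraphBLAS API or any PIUMA library, and the names `spmspv_bool` and `bfs_levels` are assumptions.

```python
def spmspv_bool(adj, frontier):
    """Sparse matrix x sparse vector over the Boolean (OR, AND) semiring.

    adj maps each vertex to the set of its out-neighbors (the nonzeros of
    one row of the adjacency matrix); frontier is the set of nonzero
    positions of the input vector.
    """
    result = set()
    for v in frontier:
        result |= adj.get(v, set())
    return result


def bfs_levels(adj, source):
    """BFS level of every vertex reachable from source."""
    levels = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        reached = spmspv_bool(adj, frontier)
        # Mask: keep only vertices that have not been assigned a level yet.
        frontier = {w for w in reached if w not in levels}
        for w in frontier:
            levels[w] = depth
    return levels


# Toy graph (hypothetical) as a sparse adjacency structure.
adj = {0: {1, 2}, 1: {3}, 2: {3}, 3: {4}, 4: set()}
levels = bfs_levels(adj, 0)
```

Writing the traversal this way separates the algorithm (a loop of masked products) from the kernel (`spmspv_bool`), so a library or hardware target only needs to supply a fast version of the kernel.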

How to Engage?

Simulation (up to a few tiles)

[Figure: simulation flow. Labels: Intel® PIUMA binary; Functional simulation; Intel® PIUMA configuration; Timing simulation; performance profile data.]

• Learn how your workload kernels would perform on Intel® PIUMA
• Establish software readiness with your choice of productivity suite
• Be ready to take advantage of hardware!

Credit: Eyerman et al., FODSEM 2020

Summary

• Real-time insights from large-scale data via graph analytics require new hardware thinking
• Intel® PIUMA is a programmable instruction set processor optimized for sparse graph applications
• Main features: high-bandwidth system-wide shared memory, small granularity memory accesses, massive multithreading, configurable caching, offload engines
• Programming Intel® PIUMA via C/C++, MPI, OpenMP, plus other productivity suites
• Collaborate on workloads and datasets; co-design with customer input. Join us.

Thank you!
