Intel Unified Memory Architecture (Intel UMA) for Graph Analytics A Step towards Future AI and Scalable HPC DPG/DPEA: Pathfinding and Architecture Innovation Project
Intel® Programmable Integrated Unified Memory Architecture (PIUMA)
Hardware for Faster and Deeper Insights from Large Scale Graph
Nikhil Deshpande Ph.D. Nikhil.m.Deshpande@intel.com
Product Director, AI and HPC Innovations Intel Corporation
Copyright © 2021, Intel Corporation
Today
• Environment
• Graph Usages and Scalability Issues
• Graph and Traditional Compute
• Rethinking Hardware - Intel® PIUMA
• Programming Intel® PIUMA
• Summary
Environment
Data Deluge
What are we after: Data, Knowledge or Insights?
Is Data still the oil? Refining perhaps?
Graphs: Best Representation for Capturing Relationships/Insights
Faster and Deeper
Source: Data Age 2025, Sponsored by Seagate with Data from IDC Global Datasphere, Nov. 2018
Usages and Scalability Wall
Knowledge Graph
Intrusion detection
Fraud Detection
Risk Mitigation
Genomics
Precision Medicine
Healthcare
Telecom
Finance
Anti Money Laundering
Drug Discovery
Deep Analytics
Platforms
Social networks
Retail
Identity Graph
Industrial
Recommendation
Energy
Customer 360
Predictive Monitoring
Supply Chain Optimization
Real Time Analytics
[Figure: Communications Intensity versus Data Size. Labels: Traditional IaaS; Search, Hadoop; Large Scale Graphs; Scalability Wall. Regions: "Performance Scales" and "Performance does not scale".]
New Workload Behavior Needs New Thinking
Branch Prediction
• Regular: Branches have a regular pattern
• Graph: Branch outcome is data dependent ("pointer hopping")
Locality
• Regular: Same or neighboring data likely used
• Graph: Data is randomly scattered
Data Access
• Regular: Same operation on neighboring data
• Graph: Operations on scattered data
Graph Traversal is all about Pointer Chasing ... with lowest latency!
Today: Caching, Large message sizes, FLOPs...
Needs: Granular access, Small message sizes, +Traversed Edges Per Second!
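To make the pointer-chasing pattern and the traversed-edges metric concrete, here is a minimal sketch of a breadth-first search over a graph in compressed sparse row (CSR) form. The graph, the function name `bfs_traversed_edges`, and the timing approach are illustrative assumptions, not taken from the presentation; the point is only that each neighbor lookup depends on a value that was just loaded, and that throughput can be reported as edges traversed per second.

```python
import time

def bfs_traversed_edges(row_ptr, col_idx, source):
    """Breadth-first search over a CSR graph; returns (levels, edges_traversed)."""
    num_vertices = len(row_ptr) - 1
    level = [-1] * num_vertices          # -1 marks an unvisited vertex
    level[source] = 0
    frontier = [source]
    edges_traversed = 0
    while frontier:
        next_frontier = []
        for u in frontier:
            for e in range(row_ptr[u], row_ptr[u + 1]):
                # Data-dependent access: the index into level[] is known
                # only after col_idx[e] has been loaded.
                v = col_idx[e]
                edges_traversed += 1
                if level[v] == -1:
                    level[v] = level[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return level, edges_traversed

# Hypothetical 5-vertex directed graph: 0->1, 0->2, 1->3, 2->3, 3->4
row_ptr = [0, 2, 3, 4, 5, 5]
col_idx = [1, 2, 3, 3, 4]

start = time.perf_counter()
levels, edges = bfs_traversed_edges(row_ptr, col_idx, source=0)
elapsed = time.perf_counter() - start

# Traversed edges per second for this run (machine dependent).
teps = edges / elapsed if elapsed > 0 else float("inf")
```

On a graph this small the timing is meaningless; the sketch only shows which quantity the metric counts.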
Intel® PIUMA Technology
Balance I/O, Memory, Compute and Prioritize in that order
• 1 PIUMA core = multiple processor pipelines with multiple threads
• Novel 64-bit RISC instruction set, graph optimized instructions
• Global address space (GAS), accessible for all other cores
• Memory controller with 8B granularity
• Network interface with 8B packets
• Tiles → Nodes → Systems → Millions of threads...
TOPS: Tera-Operations Per Second
TTEPS: Tera-Traversed Edges Per Second
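A short piece of arithmetic shows why 8B access granularity matters for scattered data. The sketch below assumes a 64-byte cache line for a conventional CPU, which is a typical value and not a figure from the presentation; the helper name `useful_fraction` is also illustrative.

```python
def useful_fraction(useful_bytes, transfer_bytes):
    """Fraction of transferred bytes that the program actually uses."""
    return useful_bytes / transfer_bytes

WORD_BYTES = 8          # access granularity stated on the slide
CACHE_LINE_BYTES = 64   # assumption: typical cache-line size on a conventional CPU

# One random 8-byte read served by a full cache-line transfer.
line_efficiency = useful_fraction(WORD_BYTES, CACHE_LINE_BYTES)

# The same read served at 8-byte granularity.
word_efficiency = useful_fraction(WORD_BYTES, WORD_BYTES)

print(f"64B line: {line_efficiency:.1%} of moved bytes are useful")
print(f"8B word:  {word_efficiency:.1%} of moved bytes are useful")
```

Under that assumption, a workload with no spatial locality uses one eighth of the bytes it moves when every access pulls a full line.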
Intel's Approach: Intel® PIUMA
A Programmable Integrated Unified Memory Architecture
Re-imagined Architecture
Fully Integrated
Architected to scale
CPU Support for Small, Irregular Memory Accesses
Near-Memory Atomics
Global Memory Model
Packaging for High IO & Memory bandwidth
Network as 1st-class Citizen
Flatten Latency Hierarchy
Point-to-Point Messages
Intel® PIUMA - Scalable System
[Figure: Intel® PIUMA scalable system. Labels: Sub-nodes 1, 2, ..., N; PIUMA Node 1, PIUMA Node 2, ..., PIUMA Node P-1, PIUMA Node P; PIUMA Global Address Space Network; Host (x86) 1 through Q; ENet; Disk Pool; P1, P2, ..., Px.]
• Distributed global address space across the full system
• Hierarchical topology with all-to-all connections for low diameter
• High radix fabric between tiles and between nodes for high bandwidth and low latency
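One way to picture a distributed global address space is as a mapping from a single flat address to a (node, local offset) pair. The sketch below uses a simple block-interleaved mapping chosen purely for illustration; the presentation does not describe PIUMA's actual address mapping, and `Location`, `locate`, `to_global`, and the parameter values are hypothetical.

```python
from typing import NamedTuple

class Location(NamedTuple):
    node: int          # which node owns the address
    local_offset: int  # byte offset within that node's memory

def locate(global_addr: int, num_nodes: int, block_bytes: int) -> Location:
    """Map a flat global address to its owner (hypothetical block interleaving)."""
    block, within = divmod(global_addr, block_bytes)
    node = block % num_nodes            # consecutive blocks rotate across nodes
    local_block = block // num_nodes
    return Location(node, local_block * block_bytes + within)

def to_global(loc: Location, num_nodes: int, block_bytes: int) -> int:
    """Inverse of locate(): rebuild the flat global address."""
    local_block, within = divmod(loc.local_offset, block_bytes)
    return (local_block * num_nodes + loc.node) * block_bytes + within

# Assumed example: 4 nodes, 8-byte interleaving blocks.
first = locate(0, num_nodes=4, block_bytes=8)
second = locate(8, num_nodes=4, block_bytes=8)
wrapped = locate(36, num_nodes=4, block_bytes=8)
```

With any such mapping, a thread can name data anywhere in the system by address alone, and the hardware decides whether the access is local or crosses the network.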
Credit: Fryman et al., HPEC 2020
Speedup: Intel® PIUMA versus 1 Intel® Xeon® node
Application
Application Classification Random Walks Graph Search Louvain Community TIES Sampler
Graph2Vec GraphSAGE
Intel® PIUMA 1 node
6.9 x
Intel® PIUMA 16 nodes
111 x
Application Graph Wave
Intel® PIUMA 1 node
8.0 x
Intel® PIUMA 16 nodes
125 x
279 x 34 x 41 x 93 x 42 x
2,606 x 544 x 555 x 419 x 178 x
Parallel Decoding FST Geolocation SpMV SpMSpV
Breadth-First Search
6.8 x 15 x 29 x 111 x 7.5 x
109 x 243 x 467 x 1,387 x 117 x
3.1 x
46 x
*Results have been estimated or simulated. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
Raw Performance Ease of Use
PIUMA Software Stack
• Performance, Performance, Performance!
• Currently supported: C++, pthreads, OpenMP
• Under development: graph libraries (GraphBLAS), other programming languages (OpenCL), Python frontend
• Customer Choice and Options
Applications
APIs
Frameworks (Caffe, TensorFlow, Neon nGraph, etc.)
LLVM
Intel MKL, GraphBLAS, etc.
Hardware Low-level APIs
Native ISA
Intel® PIUMA
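GraphBLAS, named in the stack above, expresses graph algorithms as sparse linear algebra over semirings. The sketch below imitates that style in plain Python, without calling any GraphBLAS library: one breadth-first search step is a sparse vector times sparse matrix product over the Boolean (OR, AND) semiring, followed by a mask that drops vertices already visited. The names `vxm_boolean` and `bfs_levels` and the sample graph are illustrative assumptions.

```python
def vxm_boolean(frontier, adjacency):
    """Sparse vector times sparse matrix over the Boolean (OR, AND) semiring.

    frontier:  set of active vertex ids (the nonzeros of a Boolean vector)
    adjacency: dict mapping a vertex to its out-neighbors (sparse matrix rows)
    """
    reached = set()
    for u in frontier:
        # OR together the rows selected by the frontier's nonzeros.
        reached.update(adjacency.get(u, ()))
    return reached

def bfs_levels(adjacency, source):
    """Level-synchronous BFS built from repeated masked vector-matrix products."""
    levels = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        candidates = vxm_boolean(frontier, adjacency)
        # Mask: keep only vertices that have not been assigned a level yet.
        frontier = {v for v in candidates if v not in levels}
        for v in frontier:
            levels[v] = depth
    return levels

# Hypothetical directed graph: 0->1, 0->2, 1->3, 2->3, 3->4
adjacency = {0: [1, 2], 1: [3], 2: [3], 3: [4]}
one_step = vxm_boolean({0}, adjacency)
result = bfs_levels(adjacency, source=0)
```

Writing the traversal this way leaves the choice of how to parallelize and distribute the product to the library rather than to the application.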
How to Engage?
Simulation
(up to a few tiles)
Intel® PIUMA binary
Functional simulation
Intel® PIUMA configuration
Timing simulation
performance profile data
Learn how your workload kernels would perform on Intel® PIUMA
Establish software readiness with your choice of productivity suite
Be ready to take advantage of hardware!
Credit: Eyerman et al., FODSEM 2020
Summary
Real-time Insights from large scale Data via Graph Analytics require new hardware thinking
Intel® PIUMA is a programmable instruction set processor optimized for sparse graph applications
Main features: high-bandwidth system-wide shared memory, small granularity memory accesses, massive multithreading, configurable caching, offload engines
Programming Intel® PIUMA via C/C++, MPI, OpenMP, plus other productivity suites
Collaborate on workloads and datasets; Co-design with customer input. Join us.
Thank you!