Flexible Options for Inference Implementation at the Edge

Markus Levy, Head of AI and ML Technologies, NXP Semiconductors

Part 2: Computing at the Edge
Edge Computing 2018, Day 2
PUBLIC

Machine Learning Concepts

What is Machine Learning (ML)
· ML is a field of computer science (starting in the 1960s) that gives computers the ability to learn without being explicitly programmed.
· It is not a single algorithm! It is not only Neural Nets.
· The biggest field of ML is supervised learning (learning with a teacher).
· In supervised learning, an algorithm is provided with a set of examples: inputs and desired outputs.
· During training, an algorithm tries to minimize an error (on the output) by adjusting internal parameters.

First Stage Considerations for ML at the Edge
· IoT, Industrial, Automotive Application - Can I utilize machine learning?
· Training time, and the amount and type of data required for training
· Availability of labeled data (e.g. supervised versus unsupervised)
· Tolerated accuracy
· Number of features
· Computational resources available (e.g. RAM & CPU)
· Latency required/tolerated (cost versus performance)
· Ease of interpretation
· How will I deploy?

Edge Compute Enabler - Scalable Inference
Balancing cost vs. end-user experience

(Chart: inference time on a log scale, falling from step to step; the improvement factor is shown between successive steps)
· Gen. Purpose MCU (e.g. Cortex®-Mx)
- 5-10x improvement
· High Compute MCU (e.g. Cortex®-M7)
- 6-8x improvement
· Multi-core Applications Processor (GHz+) (e.g. Cortex®-Ax)
- 5x improvement
· GPU (OpenCL) / DSP complexes
- > 10x improvement
· ML Accelerators (incl. Neural Nets)

Improving Performance, Increasing Systems Cost

Processing unit comparison (Resnet-50)

Processing unit   Size             Frequency   Inference/s      Cost efficiency
1x M7             1 (normalized)   600 MHz     1 (normalized)   1 (normalized)
4x A53            5.9              1.8 GHz     5.4              0.95
4x A55            8.3              1.8 GHz     33               4.0
Mid-range GPU     8.3              800 MHz     11               1.3
Gen 1 ML IP       3.3              1 GHz       350              106
Google TPU        550              750 MHz     ~15000           27

Rule-of-Thumb ML Considerations
· Convolutional neural networks - object recognition, image and computer vision
· Recurrent neural networks - speech, handwriting recognition and time series
· Don't consider training a deep neural net unless you have LOTS of training data.
· Classical ML model types can be trained with smaller data sets.

What can machine learning do
· Regression (Calculation)
- Predict continuous values
- Example: X=a, y=?
· Classification (Choice)
- Recognition, object detection
- Example: It is a ( ) A: Dog B: Cat C: Cow D: Neither
· Anomaly detection (Judgement)
- Detect abnormal conditions
- Example: Heart is going to malfunction? Y/N
· Clustering
- Discover patterns / partitions
- Example: Find crowds; no labels needed
· Learn strategies
- Reinforcement Learning
- Example: How to play the game?

How To Speak ML and Sound Like an Expert:
The Neural Net inferred a label of 'Dancing Banana Man' with a confidence factor of 85%

Input (image) and inferred labels:
85% A dancing banana man
10% Eyeballs on a peach slice
2% A moon rising over an island
1% A taco with cauliflower
1% A banana

Neural Nets Infer/Predict a Label with a Confidence Factor. They Do Not Inherently 'Decide' What Something Is.
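One common way a classifier produces such confidence factors is a softmax over its raw output scores. The slide does not say which method was used, so the sketch below is an assumption; `softmax` and `top_label` are illustrative names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Convert raw network outputs (logits) into confidences that sum to 1.
std::vector<double> softmax(const std::vector<double>& logits) {
    std::vector<double> out(logits.size());
    if (logits.empty()) return out;
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp(logits[i] - max_logit);  // subtract max for numerical stability
        sum += out[i];
    }
    for (double& v : out) v /= sum;
    return out;
}

// Index of the highest-confidence label. The network only ranks labels;
// the application decides whether the confidence is high enough to act on.
std::size_t top_label(const std::vector<double>& confidences) {
    return static_cast<std::size_t>(std::distance(
        confidences.begin(),
        std::max_element(confidences.begin(), confidences.end())));
}
```

The application would typically compare the top confidence against its own threshold before treating the label as a decision.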

What is Classical ML?
· Every ML algorithm except neural nets: SVM with linear and RBF kernels, Decision trees, Random forest, K-Nearest neighbors, Boosting algorithm (AdaBoost), Logistic regression, k-means
· Usually a much smaller number of parameters, and no need for big training datasets
· Usually faster (both training and inference) compared to NNs
· Might be used in combination with NNs
· Most of the algorithms require careful feature selection

Figure sources:
· SVM: https://docs.opencv.org/2.4/_images/optimal-hyperplane.png
· Decision Tree: https://en.wikipedia.org/wiki/Decision_tree_learning
· https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d

Supervised vs Unsupervised
· Supervised - Given Xi and Yi, compute f where Yi = f(Xi)
· Unsupervised - Given only Xi, find the patterns

(Figure: three scatter plots over features x1 and x2)
· Raw Data
· Unsupervised: you can cluster, but not identify the cluster label
· Supervised (raw data + input labels, e.g. Known State 1, Known State 2): you can fully classify

https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/

Traditional Machine Learning Workflow

Data split (data must be labelled for supervised learning):
· 80% Training Data
· 10% Validation Data
· 10% Test Data

Workflow:
1. Sensor Data (a table)
2. PreProcess Data
3. Feature Extraction
4. Train Model (from the model hypothesis set)
5. Validate Model: acceptable error? If not, adjust hyperparameters and train again
6. Test Final Model, giving the model score

Notes:
· Training data is pseudo-randomly chosen
· Validation is used to tune hyperparameters
· Test is used to evaluate and predict generalization for new data
· Training can be considered an optimization problem which starts with labelled input data and an expression to be minimized
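The pseudo-random 80% / 10% / 10% split can be sketched as a shuffle of sample indices followed by three slices. This is a minimal illustration, not the tooling used for the slides; `DataSplit` and `split_indices` are hypothetical names, and a fixed seed is assumed so the split is reproducible.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Index sets for the three subsets of a dataset with n samples.
struct DataSplit {
    std::vector<std::size_t> train, validation, test;
};

// Shuffle indices 0..n-1 with a seeded generator, then cut 80/10/10.
// Any remainder from integer division goes to the test set.
DataSplit split_indices(std::size_t n, unsigned seed) {
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(idx.begin(), idx.end(), rng);

    const auto n_train = static_cast<std::ptrdiff_t>(n * 8 / 10);
    const auto n_val = static_cast<std::ptrdiff_t>(n / 10);

    DataSplit s;
    s.train.assign(idx.begin(), idx.begin() + n_train);
    s.validation.assign(idx.begin() + n_train, idx.begin() + n_train + n_val);
    s.test.assign(idx.begin() + n_train + n_val, idx.end());
    return s;
}
```

Keeping the test indices untouched until the final evaluation is what makes the model score a fair estimate of generalization.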

Trained Model Optimizations for Mapping to HW Capabilities
· Quantize parameters
- 32-bit floating point to 8-bit fixed-point -> 4x memory reduction
- Weights can be shared to improve compression
· Operation fusing
- Fuse or chain operations to avoid roundtrips to accelerators
- Next gen NN supports operations for: convolution, activation, pooling and normalization
· Pruning (sparsity)
- Remove weights and neurons w/small impact on accuracy, reducing memory size 4-10x
- Requires additional training
· Next gen IP supports decompression scheme to further reduce weights memory footprint
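The 4x memory reduction from quantization follows directly from storing each weight in 1 byte instead of 4. The sketch below uses symmetric per-tensor scaling, which is one common scheme and an assumption here; the slide only says "8-bit fixed-point", and a true fixed-point format would typically restrict the scale to a power of two. `QuantizedTensor`, `quantize_int8` and `dequantize` are illustrative names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// 8-bit weights plus the scale needed to map them back to real values.
struct QuantizedTensor {
    std::vector<std::int8_t> values;
    float scale;  // real_value is approximately scale * quantized_value
};

// Symmetric quantization: map [-max_abs, +max_abs] onto [-127, 127].
QuantizedTensor quantize_int8(const std::vector<float>& weights) {
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));

    QuantizedTensor q;
    q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    q.values.reserve(weights.size());
    for (float w : weights) {
        const long r = std::lround(w / q.scale);
        q.values.push_back(
            static_cast<std::int8_t>(std::min(127L, std::max(-127L, r))));
    }
    return q;
}

// Recover an approximate real value; the error is at most scale / 2.
float dequantize(const QuantizedTensor& q, std::size_t i) {
    return q.scale * static_cast<float>(q.values[i]);
}
```

Whether the resulting accuracy loss is acceptable has to be checked on validation data for each model.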

Move from the Cloud to the Edge

Cloud Access With Amazon & Google ML Services

Amazon:
1. AWS SageMaker
- Build, train & deploy ML models
2. AWS HyperParameter Optimization
- Optional, to achieve better accuracy
3. AWS Greengrass ML (IoT service)
- Train on Cloud, Deploy on Edge

Google:
1. Google ML Engine
- Training & prediction services
- TensorFlow support
2. Google AutoML
- Designed for beginners and users who want to obtain a model quickly
- Given a dataset, it can build a NN, train it and produce a model
- 2 flavors
  · Base model (free)
  · Advanced model ($550)

Copyright © 2018 Google, all rights reserved

Google Cloud Interoperability

Cloud cookbook details interoperability between Cloud and ML SDK w/OCV
- Train using Google Cloud
- Deploy on i.MX 8 (the edge device) using OpenCV DNN
Instructions teach the user how to
- train a neural network (written in TensorFlow) on Google Cloud
- use the ML service
- store the model on Google Cloud storage
- download it locally
- use the Cloud model to perform inference locally

Machine Learning Deployment Overview

NXP eIQ Machine Learning Software Development Environment

(Diagram: software stack, top to bottom)
· NXP Turnkey ML Solutions: Facial Recognition, Speech Recognition, Anomaly Detection
· Vision & Sensors Applications: Soft ISP, Sensors, StereoVision, Audio Front End
· ML Platform Direct Interface
· Inference Engine / Open Source Inference: OpenCV/Neon, OpenCV/GPU, TF Lite, Arm NN, CMSIS-NN, Tencent NCNN, Android NN, TensorFlow
· NN Compiler Technology: GLOW
· Arm® Compute Library
· Hardware Abstraction Layer: OpenCL, OpenVX, Custom API
· Compute engines: Cortex®-M, Cortex®-A, GPU, DSP, ML accelerators
· Devices: Kinetis and LPC MCUs, i.MX RT Crossover Processors, i.MX and Layerscape Apps Processors

Machine Learning Deployment
(The Easy Way)

Open Source Computer Vision Library: http://opencv.org
· Open-source BSD-licensed library
· Includes several hundred computer vision algorithms
- Image processing, image encoding/decoding
- Video encoding/decoding
- Video analysis
- Object detection
- Deep neural networks
- Machine learning
· Supports ARM NEON and OpenCL for acceleration

OpenCV introduction
· Can be used in combination with deep neural networks
- Example: facial recognition (New AppNote)
1. Face detection using OpenCV object detection
2. Feature extraction using deep neural network
3. Face classification using OpenCV machine learning

ML SDK with OpenCV 1.0

· OpenCV DNN Module
- Inputs Caffe/TensorFlow formats
- Provides NN inference engine
- Optimized for Neon
· OpenCV ML Module
- Classical ML algorithms
- Optimized for Neon
· Documentation provides scripts & detailed description to modify Caffe and TensorFlow models to run inference using OpenCV

(Diagram: software stack, top to bottom; Yocto Recipe Build Per BSP)
· Demos, Apps
· Bindings: Python, Java
· OpenCV (e.g. image processing, machine learning)
· OpenCV HAL (e.g. Neon)
· Linux
· i.MX 6, 7, 8

Why OpenCV for CML

(Table: classical ML feature support by library; x = supported, - = not supported)
· Features compared: SVM (linear), SVM (RBF), Decision Trees, Gradient Boosting, EM (GMM), Logistic Regression, AdaBoost (ml::Boost), Random Forests, KNN, k-means, NEON support
· Libraries compared: OpenCV, Dlib, mlpack, shark, shogun, H2O, Libsvm, liblinear, svm^perf, ThunderSVM

Training and Inference Performance on M7 (e.g. i.MX RT)
Notes:
1. For training, OpenCV (OCV) is almost 2 orders of magnitude slower than libsvm due to a problem with class separability. This could be solved by using an RBF kernel, but we haven't done measurements with that (refer to the benchmarking presentation).
2. OCV is faster on testing in all cases, and even 2 orders of magnitude faster on the smartphone data.

Training Can Be Done in a Few Function Calls

#include <iostream>
#include <opencv2/core/core.hpp>
#include <opencv2/ml/ml.hpp>
...
using namespace std;
using namespace cv;
using namespace cv::ml;
...
/* 150 samples with 4 features each, and one integer class label per sample */
Mat samples = Mat_<float>(150, 4, samplesData);
Mat labels = Mat_<int>(150, 1, labelsData);
Mat_<int> responses;

/* Prepare training data and labels */
Ptr<TrainData> trainData = TrainData::create(samples, ROW_SAMPLE, labels);

/* Create a model */
Ptr<NormalBayesClassifier> trainedModel = NormalBayesClassifier::create();

/* Train the model */
trainedModel->train(trainData);

/* Predict values (here on the training samples themselves) */
trainedModel->predict(samples, responses);
cout << "Classes predicted from trained model: " << responses.t();
cout << " / Accuracy: " << (countNonZero(responses == labels) / (float)labels.rows) * 100.0 << "%" << endl;

Super Boring Example Output
It's just a "Hello world". It demonstrates that it works: model training, loading, prediction.

Anomaly Detection as a Subset of Machine Learning

It's All About the Data
· Multi-class supervised learning requires representative data for all classes
· In machine condition monitoring applications, this can be impractical to get
- Hard to run machinery to failure, certainly not a statistically significant number of times
· Enter "Anomaly Detection", essentially a one-class learning problem
- Only needs "nominal" data for training!!!
· The Goal:
- Given a sample point X, compute the likelihood that X is a member of the population of all X's.
- Compare that to a specified threshold to determine if you have a nominal sample or not

Bearing Faults Have Specific Frequency Signatures

For ball defects:
$$\mathrm{BSF} = \frac{1}{2}\,\frac{P_d}{B_d}\, S \left[1 - \left(\frac{B_d}{P_d}\cos\theta\right)^2\right]$$
For outer race defects:
$$\mathrm{BPFO} = \frac{1}{2}\, N_b\, S \left[1 - \frac{B_d}{P_d}\cos\theta\right]$$
For inner race defects:
$$\mathrm{BPFI} = \frac{1}{2}\, N_b\, S \left[1 + \frac{B_d}{P_d}\cos\theta\right]$$

where:
Pd = pitch diameter
Bd = ball diameter
Nb = number of balls
S = speed (revolutions/sec)
θ = contact angle
BSF = Ball Spin Frequency
BPFO = Ball Pass Frequency of Outer Race
BPFI = Ball Pass Frequency of Inner Race

Defect signals may be swamped by other noise in the system, in which case additional filtering may be needed to extract the signature.
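The three defect frequencies can be computed directly from the bearing geometry and shaft speed. The sketch below implements the formulas on this slide; the `Bearing` struct and function names are illustrative, and the contact angle is assumed to be given in radians.

```cpp
#include <cmath>

// Bearing geometry; both diameters must use the same length unit.
struct Bearing {
    double pitch_diameter;  // Pd
    double ball_diameter;   // Bd
    int num_balls;          // Nb
    double contact_angle;   // theta, in radians (assumed unit)
};

// Shared term (Bd / Pd) * cos(theta).
static double diameter_ratio(const Bearing& b) {
    return b.ball_diameter / b.pitch_diameter * std::cos(b.contact_angle);
}

// Ball Spin Frequency in Hz, for shaft speed in revolutions per second.
double ball_spin_frequency(const Bearing& b, double speed) {
    const double r = diameter_ratio(b);
    return 0.5 * (b.pitch_diameter / b.ball_diameter) * speed * (1.0 - r * r);
}

// Ball Pass Frequency of the outer race, in Hz.
double ball_pass_frequency_outer(const Bearing& b, double speed) {
    return 0.5 * b.num_balls * speed * (1.0 - diameter_ratio(b));
}

// Ball Pass Frequency of the inner race, in Hz.
double ball_pass_frequency_inner(const Bearing& b, double speed) {
    return 0.5 * b.num_balls * speed * (1.0 + diameter_ratio(b));
}
```

As a sanity check on any implementation, BPFO and BPFI always sum to Nb x S, since the two bracketed terms add up to 2.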

One Class Support Vector Machines

· Used for anomaly detection
· The algorithm tells us if a sample is part of a known population or not
· Computes a likelihood score for the sample and compares it with a threshold value
- Each contour line corresponds to a different threshold

We are using a Gaussian Kernel:

$$\mathrm{pdf}(x) = \sum_{i=1}^{N} \alpha_i\, K(x, x_i), \qquad K(x, x_i) = \exp\!\left(-\frac{\lVert x - x_i \rVert^2}{2\sigma^2}\right)$$

where:
· x is a d-dimensional feature vector
· {x_i}, i = 1..N, are support vectors (SVs)
· {α_i}, i = 1..N, are coefficients for SVs
· σ is known as the "kernel size"

A sample is considered True (1) if pdf(x) > threshold or False (-1) otherwise.
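At inference time, a one-class SVM with a Gaussian kernel reduces to a weighted sum of kernel evaluations against the support vectors, followed by a threshold test. The sketch below assumes the common kernel form exp(-||x - x_i||^2 / (2 sigma^2)); the exact normalization used in the NXP implementation is not recoverable from the slide, and all names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Gaussian kernel between a sample and one support vector (same dimension).
double gaussian_kernel(const Vec& x, const Vec& sv, double sigma) {
    double sq_dist = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k) {
        const double d = x[k] - sv[k];
        sq_dist += d * d;
    }
    return std::exp(-sq_dist / (2.0 * sigma * sigma));
}

// Score: sum over support vectors of alpha_i * K(x, x_i).
double score(const Vec& x, const std::vector<Vec>& support_vectors,
             const Vec& alphas, double sigma) {
    double s = 0.0;
    for (std::size_t i = 0; i < support_vectors.size(); ++i) {
        s += alphas[i] * gaussian_kernel(x, support_vectors[i], sigma);
    }
    return s;
}

// Returns 1 for a nominal sample (score above threshold), -1 for an anomaly.
int classify(const Vec& x, const std::vector<Vec>& support_vectors,
             const Vec& alphas, double sigma, double threshold) {
    return score(x, support_vectors, alphas, sigma) > threshold ? 1 : -1;
}
```

The cost per sample grows with the number of support vectors, which is why a compact model matters on an MCU-class device.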

Product Life Cycle Intelligence - Monitoring/Tracking Use Cases
· Factory: key/certificate provisioning, datalogging, movement monitoring, manufacturing and self tests (BLE/Wi-Fi/NFC)
· Transport to store or warehouse
· Review transport data, display/demo, or storage
· Ship to consumer
· Continuous sensor logging (battery powered)
· Installer reviews transport data, runs self-tests, securely onboards device w/cloud, initiates run-time data collection
· Secure run-time data upload. OEM uses data to further tune ML models, add capabilities.
· Secure periodic ML model &/or general SW capability updates
· Continuous secure monitoring & cloud upload
· Data-initiated preventive maintenance request
· Preventive maintenance, self-test w/ data history sent to cloud
· Longer product life due to preventive maintenance and periodic capability upgrades. E.g. "Clean Clothes as a Service".
· End-of-life decommissioning and credentials recovery

Other Open Source Options

Deployment of Arm NN
1. Connect to Arm NN through high level frameworks
· Using framework parsers provided by Arm NN
2. Connect to existing inference engine
· With inference engine calling Arm NN API
· Or inference engine calling Arm Compute Library (ACL) directly
3. Connect to ACL directly

CMSIS-NN - Efficient NN Kernels for Cortex-M CPUs
· CNN library for Cortex-M by Arm, new in CMSIS 5.3
· Fixed-point inference
· High level API
· Low level API
· Makes use of some CMSIS math & DSP lib APIs
· 4.6x performance & 4.9x energy efficiency improvement over baseline CMSIS-DSP
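To illustrate what fixed-point inference means at the kernel level, the sketch below computes a dot product of two Q7 vectors (8-bit values interpreted as value / 128). It is plain C++ written for illustration and is not the CMSIS-NN API; `dot_q7` is a hypothetical name, and real kernels add bias, configurable shifts and SIMD.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Dot product of two Q7 vectors, returning a saturated Q7 result.
std::int8_t dot_q7(const std::int8_t* a, const std::int8_t* b, std::size_t n) {
    std::int32_t acc = 0;  // wide accumulator avoids overflow of the Q14 products
    for (std::size_t i = 0; i < n; ++i) {
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    }
    // Q14 -> Q7. Right shift of a negative value is arithmetic on mainstream
    // compilers (and guaranteed from C++20).
    acc >>= 7;
    // Saturate instead of wrapping when the result does not fit in 8 bits.
    acc = std::min<std::int32_t>(127, std::max<std::int32_t>(-128, acc));
    return static_cast<std::int8_t>(acc);
}
```

For example, two vectors of {0.5, 0.5} in Q7 (raw value 64) give a raw result of 64, i.e. 0.5, matching 0.25 + 0.25 in floating point.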

CIFAR-10 model
· CIFAR-10 classification: classify images into 10 different object classes
· 3 convolution layers, 3 pooling layers and 1 fully connected layer (~80% accuracy)
· https://www.cs.toronto.edu/~kriz/cifar.html

Inference engine based on CMSIS-NN

NN SDK responsibilities
· PC side: PC tool set
· Firmware side: inference engine

PC tools to generate model:
· Dataset input for training
· DL frameworks (Google TensorFlow & Keras, Caffe/Caffe2, Facebook PyTorch, Amazon MXNet, ONNX, etc.)
· Trainer, quantizer, converter

MCU firmware: inference engine
· Data input: video, audio/time-series, structured data
· Data feeder and preprocessor
· Model runner (load, parse, and inference)
· CMSIS-NN adapter: map model operations to CMSIS-NN APIs
· CMSIS-NN
· Silicon: Cortex-M parts

Loadable NN models: can be stored on an SD card, or be flattened to C arrays as a source file.

Benchmark Results using CMSIS-NN with CIFAR-10 Model
· Cortex-M4F (LPC54114): 212 ms
· Cortex-M33 (LPC55S69): 179 ms
IDE
· IAR 8.30.1
- High / Speed / No size constraints

Advanced Techniques

GLOW - Graph Lowering Compiler
· Facebook open-sourced Glow in March 2018
· Machine learning compiler to accelerate the performance of deep learning frameworks
· More rapidly design and optimize new silicon products for AI and ML by leveraging community-driven compiler software
· Glow accepts computation graphs from a variety of machine learning frameworks and works with a range of accelerators

GLOW - From Graph to Machine Code

IoT Solutions
Putting it All Together

ML-Related Functions That Can Be Done on i.MX 8M Mini

· OpenCV support accelerated on NEON
· TensorFlow and Caffe
· Classical machine learning algorithms
· Other open source options
- Arm NN w/NEON acceleration using Arm Compute Library (ACL)
- Android NN
- TensorFlow, TF Lite (direct deployment)
- EdgeScale deployment through Docker images
- ACL for image segmentation, feature detection/extraction, image processing, etc.
· Sensor integration (e.g. anomaly detection)
- M4 manages sensor reading/fusion, feature extraction
- Then use RPMsg to send data to the A53 for inferencing

(Block diagram: i.MX 8M Mini)
· Main CPU Platform: Quad/Dual Cortex-A53; per core 32KB I-cache, 32KB D-cache, NEON, FPU; 512KB L2 Cache
· Low Power, Security CPU: Cortex-M4 with 16KB I-cache, 16KB D-cache, 256KB TCM (SRAM)
· Graphics (OpenGL ES)
- 3D Graphics: GC NanoUltra
- 2D Graphics: GC328
- Can be used for color space conversion
· Multimedia
- 1080p60 VP8, VP9, H.264, H.265 decoder
- 1080p60 VP8, H.264 encoder
· Connectivity & I/O
· Security
· System Control
· External Memory

Horizontal Machine Learning Technologies at the IoT Edge
· Vision: Face and Object Recognition
· Voice Control: Local and Cloud Commands, Near and Far Field Support
· Anomaly Detection: Monitoring/Tracking of Vibration, Acoustic, and Pressure

IoT Edge Compute Enabling Technologies
· ML technologies: Vision, Voice Control, Anomaly Detection
· Secure IoT Capabilities: Manufacturing, Boot, Provisioning, Onboarding, OTA, Decommissioning
· Connectivity
· Processor and OS Platform

Combining Horizontal Capabilities to Build Vertical Solutions
· Horizontal capabilities: Vision, Voice Control, Anomaly Detection, Secure IoT Capabilities, Connectivity, Processor and OS Platform
· Vertical solutions: Smart Appliance, Smart Home, Smart Retail, Smart Industry

Example Customer Engagements Today
· Voice Control
· Anomaly Detection
· Secure Facial Recognition

Empowering Future Products
· Smart appliance / Smart Panel
- Embedded smart display, voice assistant, video calling
- Secure facial recognition
- Anomaly detection for predictive maintenance
- Technologies: Secure Facial Recognition, Anomaly Detection, Voice Control
· Smart lock
- Face unlocks
- Local voice commands
- Technologies: Face Recognition, Voice Control
· Smart appliance
- Anomaly detection for predictive maintenance
- Voice commands (after face rec.), e.g. "Wash cold, heavily soiled"
- Technologies: Anomaly Detection, Voice Control

Production Grade, Certified IoT Edge Machine Learning Solutions

· Implemented with best in class silicon, software and IP from NXP and 3rd parties
· Near production ready hardware
- Cost and form factor optimized
· Pre-integrated production ready software, fully tested & certified
· NXP provides a single point of contact for support, licensing and procurement
· Use case dependent solutions:
- Turnkey: for well defined use cases
- Customizable: can be modified, tuned and trained for specific use cases

Deliverables: OOB HW/SW, Software Source, Schematics, Layouts, Certifications, BOMs, Documentation

Example BOM:
Ref      Part number           Description
U1       MIMXRT1052DVL6B       i.MX RT1050 Crossover Processor
U3       W9812G6JB-6I          128Mbit SDRAM 3V 166MHz
U4       IS26KL256S-DABLI00    256Mb Hyperflash 3V 100MHz Flash
U5       A7101CHTK2            Secure IoT/Authentication Conn IC
U6       LBEE5KL1DX-883        IC WIFI BT/BLE B/G/N 3-4.8V LGA46
U7       MKW21Z512VHT4         KINETIS L 32-BIT MCU CORTEX-M0
U8,U9    NX3L2267GM,115        IC ANALOG SWITCH SPDT 10XQFN
U10      XCL214B333DR          DC/DC CONVERTER 3.3V 5W

Trademark and copyright statement The trademarks featured in this presentation are registered and/or unregistered trademarks of NXP Semiconductors in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
Copyright © 2018

Thank You!

NXP and the NXP logo are trademarks of NXP B.V. All other product or service names are the property of their respective owners. © 2018 NXP B.V.

