Part 2: Computing at the Edge | Flexible Options for Inference Implementation at the Edge | NXP Semiconductors
FLEXIBLE OPTIONS FOR INFERENCE IMPLEMENTATION AT THE EDGE
MARKUS LEVY
HEAD OF AI AND ML TECHNOLOGIES
NXP SEMICONDUCTORS
PUBLIC

Machine Learning Concepts

What is Machine Learning (ML)
· ML is a field of computer science (dating from the 1960s) that gives computers the ability to learn without being explicitly programmed.
· It is not a single algorithm, and it is not only neural nets.
· The biggest field of ML is supervised learning (learning with a teacher).
· In supervised learning, an algorithm is provided with a set of example inputs and desired outputs.
· During training, the algorithm tries to minimize an error (on the output) by adjusting internal parameters.

First Stage Considerations for ML at the Edge
· IoT, industrial, or automotive application: can I utilize machine learning?
· Training time, and the amount and type of data required for training
· Availability of labeled data (e.g. supervised versus unsupervised)
· Tolerated accuracy
· Number of features
· Computational resources available (e.g. RAM and CPU)
· Latency required/tolerated (cost versus performance)
· Ease of interpretation
· How will I deploy?

Edge Compute Enabler: Scalable Inference
Balancing cost vs. end-user experience
[Figure: inference time (log scale) for each processing option, ordered by improving performance and increasing system cost]
· General-purpose MCU (e.g. Cortex®-Mx)
  - 5-10x improvement
· High-compute MCU (e.g. Cortex®-M7)
  - 6-8x improvement
· Multi-core applications processor (GHz+) (e.g. Cortex®-Ax)
  - 5x improvement
· GPU (OpenCL) / DSP complexes
  - > 10x improvement
· ML accelerators (incl. neural nets)

Processing unit comparison (ResNet-50)

Metric          | 1x M7          | 4x A53  | 4x A55  | Mid-range GPU | Gen 1 ML IP | Google TPU
Size            | 1 (normalized) | 5.9     | 8.3     | 8.3           | 3.3         | 550
Frequency       | 600 MHz        | 1.8 GHz | 1.8 GHz | 800 MHz       | 1 GHz       | 750 MHz
Inference/s     | 1 (normalized) | 5.4     | 33      | 11            | 350         | ~15000
Cost efficiency | 1 (normalized) | 0.95    | 4.0     | 1.3           | 106         | 27

Rule-of-Thumb ML Considerations
· Convolutional neural networks: object recognition, image processing and computer vision
· Recurrent neural networks: speech, handwriting recognition and time series
· Don't consider training a deep neural net unless you have LOTS of training data.
· Classical ML model types can be trained with smaller data sets.

What can machine learning do
· Regression (calculation): predict continuous values. Example: X = a, y = ?
· Classification (choice): recognition, object detection. Example: It is a ( ). A: Dog, B: Cat, C: Cow, D: Neither
· Anomaly detection (judgement): detect abnormal conditions. Example: is the heart going to malfunction? Y/N
· Clustering: discover patterns / partitions. Example: find crowds; no labels needed
· Learn strategies: reinforcement learning. Example: how to play the game?

How To Speak ML and Sound Like an Expert
The neural net inferred a label of `Dancing Banana Man' with a confidence factor of 85%.
[Figure: input image and the inferred labels]
· 85% A dancing banana man
· 10% Eyeballs on a peach slice
· 2% A moon rising over an island
· 1% A taco with cauliflower
· 1% A banana
Neural nets infer/predict a label with a confidence factor. They do not inherently `decide' what something is.

What is Classical ML?
· Every ML algorithm except neural nets: SVM with linear and RBF kernels, decision trees, random forest, k-nearest neighbors, boosting algorithms (AdaBoost), logistic regression, k-means
· Usually has a much smaller number of parameters and does not need big training datasets
· Usually faster (both training and inference) compared to NNs
· Might be used in combination with NNs
· Most of the algorithms require careful feature selection
[Figures: SVM, decision tree]
Image sources:
https://docs.opencv.org/2.4/_images/optimal-hyperplane.png
https://en.wikipedia.org/wiki/Decision_tree_learning
https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d

Supervised vs Unsupervised
· Supervised: given Xi and Yi, compute f where Yi = f(Xi)
· Unsupervised: given only Xi, find the patterns
[Figure: raw data plotted on axes x1 and x2, shown with and without input labels (Known State 1, Known State 2)]
· Unsupervised: you can cluster, but not identify the cluster label
· Supervised: you can fully classify
Source: https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/

Traditional Machine Learning Workflow
[Workflow diagram: sensor data (a table) -> pre-process data -> feature extraction -> 80% training data -> train model (model hypothesis set); 10% validation data -> validate model -> acceptable error?]
· Data must be labelled for supervised learning.
· Training data is pseudo-randomly chosen.
· Validation data is used to tune hyperparameters.
· Test data is used to evaluate the model and predict how well it generalizes to new data.
· Training can be considered an optimization problem which starts with labelled input data and an expression to be minimized.
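The pseudo-random 80/10/10 partition named on the workflow slide can be sketched in C++ as follows. This is a minimal illustration using only the standard library, not NXP or OpenCV code; the names `DataSplit` and `splitIndices` and the seeded `std::mt19937` generator are assumptions made for the example. It shuffles the sample indices and cuts them into training, validation, and test index sets.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Index sets for the three partitions of a data set (hypothetical helper type).
struct DataSplit {
    std::vector<std::size_t> train;
    std::vector<std::size_t> validation;
    std::vector<std::size_t> test;
};

// Shuffle sample indices with a seeded PRNG, then cut them 80/10/10.
// A fixed seed keeps the partition reproducible between runs.
DataSplit splitIndices(std::size_t sampleCount, unsigned seed) {
    std::vector<std::size_t> order(sampleCount);
    std::iota(order.begin(), order.end(), std::size_t{0});

    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);

    // Integer arithmetic: the first 80% train, the next 10% validate, the rest test.
    const auto trainEnd = static_cast<std::ptrdiff_t>(sampleCount * 8 / 10);
    const auto validationEnd = static_cast<std::ptrdiff_t>(sampleCount * 9 / 10);

    DataSplit split;
    split.train.assign(order.begin(), order.begin() + trainEnd);
    split.validation.assign(order.begin() + trainEnd, order.begin() + validationEnd);
    split.test.assign(order.begin() + validationEnd, order.end());
    return split;
}
```

With 150 samples, `splitIndices(150, 42u)` yields 120 training, 15 validation, and 15 test indices. Splitting indices rather than copying rows keeps the sketch independent of how the sensor table is stored.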
If the validation error is not acceptable, adjust the hyperparameters and retrain. The final model is then tested on the remaining 10% test data, giving a model score and the trained model.

Optimizations for Mapping to HW Capabilities
· Quantize parameters
  - 32-bit floating point to 8-bit fixed-point -> 4x memory reduction
  - Weights can be shared to improve compression
· Operation fusing
  - Fuse or chain operations to avoid round trips to accelerators
  - Next-gen NN supports operations for convolution, activation, pooling and normalization
· Pruning (sparsity)
  - Remove weights and neurons with small impact on accuracy, reducing memory size 4-10x
  - Requires additional training
· Next-gen IP supports a decompression scheme to further reduce the weights' memory footprint

Move from the Cloud to the Edge

Cloud Access With Amazon & Google ML Services
Amazon
1. AWS SageMaker: build, train and deploy ML models
2. AWS HyperParameter Optimization: optional, to achieve better accuracy
3. AWS Greengrass: ML IoT service
Train on cloud, deploy on edge
Google
1. Google ML Engine
  - Training and prediction services
  - TensorFlow support
2. Google AutoML
  - Designed for beginners and users who want to obtain a model quickly
  - Given a dataset, it can build a NN, train it and produce a model
  - 2 flavors: basic model (free), advanced model ($550)
Copyright © 2018 Google, all rights reserved

Google Cloud Interoperability
The Cloud cookbook details interoperability between the Cloud and the ML SDK with OpenCV (OCV):
  - Train using Google Cloud
  - Deploy on i.MX 8 using OpenCV DNN
Instructions teach the user how to:
  - train a neural network (written in TensorFlow) on Google Cloud
  - use the ML service
  - store the model on Google Cloud storage
  - download it locally
  - use the Cloud model to perform inference locally

Edge Device Machine Learning Deployment Overview

NXP eIQ Machine Learning Software Development Environment
[Block diagram, top to bottom]
· ML platform direct interface
· NXP turnkey ML solutions: facial recognition, speech recognition, anomaly detection
· Vision and sensor applications: soft ISP, sensors, stereo vision, audio front end
· Inference engines (open source inference): OpenCV/Neon, OpenCV/GPU, TF Lite, Arm NN, CMSIS-NN, Tencent NCNN, Android NN, TensorFlow
· NN compiler technology: GLOW
· ML platform: CMSIS-NN, Arm® Compute Library
· Hardware abstraction layer: OpenCL, OpenVX, custom API
· Compute: Cortex®-M, Cortex®-A, GPU, DSP, ML accelerators
· Devices: Kinetis and LPC MCUs, i.MX RT crossover processors, i.MX and Layerscape apps processors

Machine Learning Deployment (The Easy Way)

Open Source Computer Vision Library: http://opencv.org
· Open-source, BSD-licensed library
· Includes several hundred computer vision algorithms
  - Image processing, image encoding/decoding
  - Video encoding/decoding
  - Video analysis
  - Object detection
  - Deep neural networks
  - Machine learning
· Supports Arm NEON and OpenCL for acceleration

OpenCV introduction
· Can be used in combination with deep neural networks
  - Example: facial recognition
    1. Face detection using OpenCV object detection
    2. Feature extraction using a deep neural network
    3. Face classification using OpenCV machine learning
New AppNote

ML SDK with OpenCV 1.0
· OpenCV DNN module
  - Inputs Caffe/TensorFlow formats
  - Provides NN inference engine
  - Optimized for Neon
· OpenCV ML module
  - Classical ML algorithms
  - Optimized for Neon
· Software stack: Yocto recipe (build per BSP); Linux; bindings (Python, Java); demos and apps; OpenCV (e.g. image processing, machine learning); OpenCV HAL (e.g. Neon); i.MX 6, 7, 8
· Documentation: provides scripts and a detailed description to modify Caffe and TensorFlow models to run inference using OpenCV

Why OpenCV for CML
[Comparison table: the per-cell support marks did not survive extraction]
· Algorithms compared: SVM (linear), SVM (RBF), decision trees, gradient boosting, EM (GMM), logistic regression, AdaBoost (ml::Boost), random forests, KNN, k-means, plus NEON support
· Libraries compared: OpenCV, Dlib, mlpack, shark, shogun, H2O, Libsvm, liblinear, svm^perf, ThunderSVM

Training and Inference Performance on M7 (e.g. i.MX RT)
Notes:
1. For training, OCV is almost 2 orders of magnitude slower than libsvm due to a problem with class separability; this could be solved by using an RBF kernel, but we have not measured that (refer to the benchmarking presentation).
2. OCV is faster on testing in all cases, and even 2 orders of magnitude faster on smartphone data.

Training Can Be Done in a Few Function Calls

    #include <iostream>
    #include <opencv2/core/core.hpp>
    #include <opencv2/ml/ml.hpp>
    ...
    using namespace cv;
    using namespace cv::ml;
    ...
    /* samplesData (150 x 4 floats) and labelsData (150 ints) are defined elsewhere */
    Mat samples = Mat_<float>(150, 4, samplesData);
    Mat labels = Mat_<int>(150, 1, labelsData);
    Mat_<int> responses;

    /* Prepare training data and labels: one sample per row */
    Ptr<TrainData> trainData = TrainData::create(samples, ROW_SAMPLE, labels);

    /* Create a model */
    Ptr<NormalBayesClassifier> trainedModel = NormalBayesClassifier::create();

    /* Train the model */
    trainedModel->train(trainData);

    /* Predict values for the training samples themselves */
    trainedModel->predict(samples, responses);

    /* Accuracy = share of predictions that match the labels */
    std::cout << "Classes predicted from trained model: " << responses.t();
    std::cout << " / Accuracy: "
              << (countNonZero(responses == labels) / (float)labels.rows) * 100.0
              << "%" << std::endl;

Super Boring Example Output
It's just a "Hello world": it demonstrates that model training, loading, and prediction work.

Anomaly Detection as a Subset of Machine Learning

It's All About the Data
· Multi-class supervised learning requires representative data for all classes
· In machine condition monitoring applications, this can be impractical to get
  - It is hard to run machinery to failure, and certainly not a statistically significant number of times
· Enter "Anomaly Detection", essentially a one-class learning problem
  - Only needs "nominal" data for training!
· The goal:
  - Given a sample point X, compute the likelihood that X is a member of the population all_X's.
  - Compare that to a specified threshold to determine whether the sample is nominal or not

Bearing Faults Have Specific Frequency Signatures
For ball defects:
\[ \mathrm{BSF} = \frac{1}{2}\,\frac{P_d}{B_d}\, S \left[ 1 - \left( \frac{B_d}{P_d} \cos\theta \right)^{2} \right] \]
For outer race defects:
\[ \mathrm{BPFO} = \frac{1}{2}\, N_b\, S \left[ 1 - \frac{B_d}{P_d} \cos\theta \right] \]
For inner race defects:
\[ \mathrm{BPFI} = \frac{1}{2}\, N_b\, S \left[ 1 + \frac{B_d}{P_d} \cos\theta \right] \]
Pd = pitch diameter
Bd = ball diameter
Defect signals may be swamped by other noise in the system, in which case additional filtering may be needed to extract the signature.
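The three bearing defect frequencies can be computed directly from the bearing geometry and shaft speed. The sketch below is a standard-library illustration of the slide's formulas, not code from the presentation; the `Bearing` and `FaultFrequencies` types and the `faultFrequencies` function are hypothetical names, and the contact angle is assumed to be given in radians.

```cpp
#include <cmath>

// Bearing geometry (hypothetical helper type; field names are illustrative).
struct Bearing {
    double pitchDiameter;    // Pd, in the same length unit as ballDiameter
    double ballDiameter;     // Bd
    int    numBalls;         // Nb
    double contactAngleRad;  // theta, in radians (assumption of this sketch)
};

struct FaultFrequencies {
    double bsf;   // ball spin frequency, Hz
    double bpfo;  // ball pass frequency of the outer race, Hz
    double bpfi;  // ball pass frequency of the inner race, Hz
};

// speedRevPerSec is S in the formulas: shaft speed in revolutions per second.
FaultFrequencies faultFrequencies(const Bearing& b, double speedRevPerSec) {
    // Common term (Bd / Pd) * cos(theta).
    const double ratio =
        (b.ballDiameter / b.pitchDiameter) * std::cos(b.contactAngleRad);

    FaultFrequencies f;
    f.bsf  = 0.5 * (b.pitchDiameter / b.ballDiameter) * speedRevPerSec
             * (1.0 - ratio * ratio);
    f.bpfo = 0.5 * b.numBalls * speedRevPerSec * (1.0 - ratio);
    f.bpfi = 0.5 * b.numBalls * speedRevPerSec * (1.0 + ratio);
    return f;
}
```

For a made-up bearing with Pd = 40, Bd = 8, 9 balls, a zero contact angle, and a shaft speed of 30 rev/s, this gives roughly 72 Hz (BSF), 108 Hz (BPFO), and 162 Hz (BPFI). These are the spectral lines an anomaly detector would look for in vibration data.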
Nb = number of balls
S = speed (revolutions/sec)
θ = contact angle
BSF = Ball Spin Frequency
BPFO = Ball Pass Frequency of Outer Race
BPFI = Ball Pass Frequency of Inner Race

One Class Support Vector Machines
· Used for anomaly detection
· The algorithm tells us if a sample is part of a known population or not
· A probability is computed and compared with a threshold value
  - Each contour line corresponds to a different threshold
We are using a Gaussian kernel [the kernel expression itself did not survive extraction; the surviving definitions imply a weighted sum over support vectors]:
\[ \mathrm{pdf}(x) = \sum_{i=1}^{N} \alpha_i \, K_\sigma(x, x_i) \]
where:
  - $x$ is a d-dimensional feature vector
  - $\{x_i\}_{i=1}^{N}$ are support vectors (SVs)
  - $\{\alpha_i\}_{i=1}^{N}$ are coefficients for the SVs
  - $\sigma$ is known as the "kernel size" of the Gaussian kernel $K_\sigma$
A sample is considered True (1) if pdf(x) > threshold, or False (-1) otherwise.

Product Life Cycle Intelligence: Monitoring/Tracking Use Cases
· Factory: key/certificate provisioning, datalogging, movement monitoring, manufacturing and self tests (BLE/Wi-Fi/NFC)
· Transport to store or warehouse: review transport data, display/demo, or storage
· Ship to consumer: continuous sensor logging (battery powered)
· Installation: installer reviews transport data, runs self-tests, securely onboards the device with the cloud, initiates run-time data collection
· Run time: secure run-time data upload. The OEM uses data to further tune ML models and add capabilities. Secure periodic ML model and/or general SW capability updates
· Continuous secure monitoring and cloud upload
· Data-initiated preventive maintenance request
· Preventive maintenance, self-test with data history sent to cloud
· Longer product life due to preventive maintenance and periodic capability upgrades, e.g. "Clean Clothes as a Service"
· End-of-life decommissioning and credentials recovery

Other Open Source Options

Deployment of Arm NN
1. Connect to Arm NN through high-level frameworks
  · Using framework parsers provided by Arm NN
2. Connect to an existing inference engine
  · With the inference engine calling the Arm NN API
  · Or the inference engine calling ACL directly
3. Connect to ACL directly

CMSIS-NN: Efficient NN Kernels for Cortex-M CPUs
· CNN library for Cortex-M by Arm, new in CMSIS 5.3
· Fixed-point inference
· High-level API
· Low-level API
· Makes use of some CMSIS math and DSP library APIs
· 4.6x performance and 4.9x energy efficiency compared with baseline CMSIS-DSP

CIFAR-10 model
· CIFAR-10 classification: classify images into 10 different object classes
· 3 convolution layers, 3 pooling layers and 1 fully connected layer (~80% accuracy)
· https://www.cs.toronto.edu/~kriz/cifar.html

NN SDK responsibilities
PC side: PC tool set
· Dataset input for training
· DL frameworks (Google TensorFlow & Keras, Caffe/Caffe2, Facebook PyTorch, Amazon MXNet, ONNX, etc.)
· Trainer, quantizer, converter
· PC tools to generate the model
Firmware side: inference engine (MCU firmware)
· Data input: video, audio/time-series, structured data
· Data feeder and preprocessor
· Model runner (load, parse, and inference)
· CMSIS-NN adapter: maps model operations to CMSIS-NN APIs
· CMSIS-NN
· Silicon: Cortex-M parts
Inference engine based on CMSIS-NN
· Loadable NN models: can be stored on an SD card, or be flattened to C arrays as a source file

Benchmark Results using CMSIS-NN with CIFAR-10 Model
· Cortex-M4F (LPC54114): 212 ms
· Cortex-M33 (LPC55S69): 179 ms
IDE
· IAR 8.30.1, optimization: High / Speed / No size constraints
[Video]

Advanced Techniques

GLOW: Graph Lowering Compiler
· Facebook open-sourced Glow in March 2018
· Machine learning compiler to accelerate the performance of deep learning frameworks
· Lets silicon vendors design and optimize new silicon products for AI and ML more rapidly by leveraging community-driven compiler software
· Glow accepts computation graphs from a variety of machine learning frameworks and works with a range of accelerators

GLOW: From Graph to Machine Code
[Figure]

IoT Solutions: Putting it All Together

ML-Related Functions That Can Be Done on i.MX 8M Mini
· OpenCV support accelerated on NEON
· TensorFlow and Caffe
· Classical machine learning algorithms
· Other open source options
  - Arm NN with NEON acceleration using Arm Compute Library (ACL)
  - Android NN
  - TensorFlow, TF Lite (direct deployment)
  - EdgeScale deployment through Docker images
  - ACL for image segmentation, feature detection/extraction, image processing, etc.
· Sensor integration (e.g. anomaly detection)
  - M4 manages sensor reading/fusion and feature extraction
  - Then use RPMsg to send data to the A53 for inferencing

i.MX 8M Mini block diagram
· Main CPU platform: quad/dual Cortex-A53; per core 32KB I-cache, 32KB D-cache, NEON, FPU; 512KB L2 cache
· Low power, security CPU: Cortex-M4; 16KB I-cache, 16KB D-cache, 256KB TCM (SRAM)
· Multimedia
  - Graphics (OpenGL ES)
  - 3D graphics: GC NanoUltra
  - Can be used for color space conversion
  - 2D graphics: GC328
  - 1080p60 VP8, VP9, H.264, H.265 decoder
  - 1080p60 VP8, H.264 encoder
· Connectivity & I/O
· Security
· System control
· External memory

Horizontal Machine Learning Technologies at the IoT Edge
· Vision: face and object recognition
· Voice control: local and cloud commands, near and far field support
· Anomaly detection: monitoring/tracking of vibration, acoustic, and pressure

IoT Edge Compute Enabling Technologies
· Vision: face and object recognition
· Voice control: local and cloud commands, near and far field support
· Anomaly detection: monitoring/tracking of vibration, acoustic, and pressure
· Secure IoT capabilities: manufacturing, boot, provisioning, onboarding, OTA, decommissioning
· Connectivity
· Processor and OS platform

Combining Horizontal Capabilities to Build Vertical Solutions
· Vision: face and object recognition
· Voice control: local and cloud commands, near and far field support
· Anomaly detection: monitoring/tracking of vibration, acoustic, and pressure
· Secure IoT capabilities: manufacturing, boot, provisioning, onboarding, OTA, decommissioning
· Connectivity
· Processor and OS platform
· Vertical solutions: smart appliance, smart home, smart retail, smart industry

Example Customer Engagements Today
· Voice control
· Anomaly detection
· Secure facial recognition

Empowering Future Products
Smart appliance / smart panel
· Embedded smart display
Smart lock
· Face unlocks
· Local voice commands
Smart appliance
· Anomaly detection for predictive maintenance
· Voice commands
· Voice assistant, video calling
· Secure facial recognition
· Anomaly detection for predictive maintenance (after face rec.)
· Example voice command: "Wash cold, heavily soiled"
Capability labels shown on the slide: secure facial recognition, face recognition, anomaly detection, voice control

Production Grade, Certified IoT Edge Machine Learning Solutions
· Implemented with best-in-class silicon, software and IP from NXP and 3rd parties
· Near production-ready hardware
  - Cost and form factor optimized
· Pre-integrated, production-ready software, fully tested and certified
· NXP provides a single point of contact for support, licensing and procurement
· Use case dependent solutions:
  - Turnkey for well-defined use cases
  - Customizable: can be modified, tuned and trained for specific use cases
· Deliverables: OOB HW/SW, software source, schematics, layouts, certifications, BOMs, documentation

Example BOM (reference designators and parts paired in slide order)

Ref    | Part number        | Description
U1     | MIMXRT1052DVL6B    | i.MX RT1050 Crossover Processor
U3     | W9812G6JB-6I       | 128Mbit SDRAM 3V 166MHz
U4     | IS26KL256S-DABLI00 | 256Mb HyperFlash 3V 100MHz Flash
U5     | A7101CHTK2         | Secure IoT/Authentication Conn IC
U6     | LBEE5KL1DX-883     | IC WIFI BT/BLE B/G/N 3-4.8V LGA46
U7     | MKW21Z512VHT4      | KINETIS L 32-BIT MCU CORTEX-M0
U8, U9 | NX3L2267GM,115     | IC ANALOG SWITCH SPDT 10XQFN
U10    | XCL214B333DR       | DC/DC CONVERTER 3.3V 5W

Trademark and copyright statement
The trademarks featured in this presentation are registered and/or unregistered trademarks of NXP Semiconductors in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright © 2018

Thank You!
NXP and the NXP logo are trademarks of NXP B.V. All other product or service names are the property of their respective owners. © 2018 NXP B.V.