# An Overview of the QorIQ Qonverge B4860 Base Station-on-a-Chip

EUF-NET-T0645

Roy Shor, B4 Design Manager

# **Agenda**

- The baseband market trends and requirements
- QorIQ Qonverge B4860 block diagram overview
- e6500 and SC3900 cores
- Memory system and interconnect
- CPRI
- MAPLE B3
- Data Path Architecture (DPAA)
- Power Management
- Q&A

# Need for Capacity, Speed and QoS

### **Higher Capacities and Data Rates**

#### **Smartphone Density**

- 32x increase per km² by 2015

#### **Internet over Mobile**

- 70% of mobile data by 2014 (Source: Bell Labs, Apr-11)

### **Wireless Standards & Topologies Evolution**

- Multi-standard support required (LTE, LTE-Advanced, 3G-HSPA, TD-SCDMA, GSM)
- Move to heterogeneous networks: Macro, RRH, Metrocell, Picocell & Cloud RAN architectures

#### **Rapidly Evolving Macro Infrastructure**

- Drive for increased processing & performance
- Need to leverage spectrum availability
- Desire for efficient energy consumption
- Continual striving for increased efficiency and lower costs

Global LTE macro base station deployments to reach 1.5 million by 2015 (In-Stat, Sep-11)

Global capital expenditures by wireless carriers for 4G LTE infrastructure gear will reach \$36.1 billion by 2015 (iSuppli Research, Jan-12)

## **Macro Base Station Challenges**

### Connectivity

- Coverage: Urban, highways and rural
- Spectral efficiency: Radio and network performance
- Multi-standard: Supports variety of users
- Reliability: Zero down time

### **Capacity**

- Users: Hundreds of active users
- Throughputs: Over 1 Gbps data rate
- Scalable/Modular: Sectors, antennas, users...
- Active Antenna, MIMO: Improved QoS

#### Cost

- **Space:** Miniaturization and consolidation of equipment
- Low Impact: Power & Cost
- Future Proof: Easy upgrades, SDR
- Complete solutions: Ease of development, faster time to market

# QorIQ Qonverge B4860 - Block Diagram & Benefits

- Next generation, e6500 Dual-Thread Power Architecture® cores offer highest CoreMark/Watt with AltiVec technology for dramatic L2 scheduling acceleration
- Next generation, SC3900 StarCore™ provides 2x DSP performance compared to competitive offerings
- 20 GHz of programmable performance
- Smart hardware acceleration for Layer 1, 2, Control and Transport allows for best in class performance, power and cost
- Large scale SoC integration allows for simpler programming models and easier load balancing
- Integrated, rich I/O including backhaul & antenna interfaces provides flexibility, interoperability and reduces overall system cost

[Block diagram: e6500 and SC3900 clusters; DPAA (parse, classify, distribute); debug (watchpoint cross trigger, performance monitor, CoreNet trace, Aurora); security monitor and security engine; peripherals (2x DUART, 4x I2C, SD/MMC, eSPI, GPIOs); MAPLE-B3 with 12x RISCs and processing elements 2x eTVPE, 8x eFTPE, 2x EQPE2, 1x TCPE, 1x CRPE, 3x CRCPE, 2x DEPE, 2x PUPE2, 2x PDPE2, 2x ULB2]

# **Benefit of Intelligent Integration**

3 sector, 20 MHz LTE with 5 major components

3 sector, 20 MHz LTE on a single SoC (B4860)

- Cost: 4x reduction
- Power: 3x reduction

# **Industry's Highest Capacity**

### **High Density Baseband Solution**

#### LTE/LTE-A SOLUTION

Base station on a chip: 20 MHz, 3-sector, 24x24 Ant., 1.4 Gbps aggregated throughput

#### **WCDMA SOLUTION**

Base station on a chip: 5 MHz, up to 6 cells, 12x12 Ant., 318 Mbps aggregated throughput

#### Multi-mode Support

- Supports large cells
- Supports multiple Radio Access Technologies
- Supports LTE, LTE-A, WCDMA, TD-SCDMA, GSM

#### LTE-Advanced SOLUTION

60 MHz sector on a chip, 16 Ant.,
1.8 Gbps aggregated throughput

#### **TD-SCDMA SOLUTION**

Base station on a chip: 32 carriers with a single device

- Industry leading performance solution for base stations
- Based on advanced Power Architecture, StarCore, CoreNet and MAPLE technologies
- First SoC in 28nm technology node for Wireless Infrastructure
- Supports the most advanced mobile wireless standards

## 'All New' e6500 Core & Clusters

#### **High Performance**

- 64-bit Power Architecture core
- Dual threads provide 1.7 times the performance of a single thread
- Clustered L2 cache allowing strict allocation or full sharing
- 128b AltiVec SIMD unit

#### **Large Memory Space**

- 40-bit real address
- Terabyte physical address

#### **Increase Productivity**

- Core Virtualization
- Hypervisor
- Logical to Real Address Translation

#### **Energy Efficiency**

- Drowsy: core, cluster, AltiVec

#### Core Performance: CoreMark™

[Chart: CoreMark core performance. \*Based on simulation]

## 'All New' SC3900 Core & Clusters

#### StarCore SC3900 Flexible Vector Processor

- High DSP performance without compromising flexibility
- Step function in performance over previous generation
  - 8 instructions per cycle
  - Up to 8 data lanes vector in a single instruction (SIMD8)
  - 38.4 GMACS per core @ 1.2 GHz & 1.2 Tbps memory bandwidth per core
- State-of-the-art support for control code with Branch Prediction
- Fully featured Memory Management Unit and Logical to Real Address Translation

#### StarCore SC3900 FVP Clusters

- Six SC3900 Cores
- Clustering two SC3900 under a 2MB, multi-banked L2 cache
- High bandwidth accelerator ports (up to 1 Tbps per cluster)
- Hardware support for memory coherency between L1, L2 caches and the main memory

[Chart: Freescale SC3900 at 1.2 GHz compared with Texas Instruments C66x at 1.5 GHz]
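The per-core peak figures quoted for the SC3900 (38.4 GMACS and 1.2 Tbps at 1.2 GHz) are consistent with the 32 MACs per cycle and the 1024-bit data bus listed in the SC3900 feature slides of this deck. The following is a quick arithmetic check, offered as a sketch; the constant and function names are illustrative and do not come from the slides.

```python
# Back-of-envelope check of the per-core SC3900 peak figures quoted in the slides.
CLOCK_HZ = 1.2e9        # core clock quoted in the slides (1.2 GHz)
MACS_PER_CYCLE = 32     # "32 MACs per cycle" from the SC3900 feature list
DATA_BUS_BITS = 1024    # "data bus up to 1024 bit/cycle" from the same list


def peak_gmacs(clock_hz: float, macs_per_cycle: int) -> float:
    """Peak multiply-accumulate rate in units of 1e9 MACs per second."""
    return clock_hz * macs_per_cycle / 1e9


def peak_tbps(clock_hz: float, bits_per_cycle: int) -> float:
    """Peak core-to-L1 data bandwidth in terabits per second."""
    return clock_hz * bits_per_cycle / 1e12


if __name__ == "__main__":
    print(f"{peak_gmacs(CLOCK_HZ, MACS_PER_CYCLE):.1f} GMACS per core")  # 38.4
    print(f"{peak_tbps(CLOCK_HZ, DATA_BUS_BITS):.2f} Tbps per core")     # 1.23
```

These are theoretical peaks (every MAC unit and the full bus width busy on every cycle); sustained rates depend on the kernel and on cache behaviour.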
## **Smart Acceleration for Optimal Performance**

| DPAA Control/Transport Hardware Accelerators | |
|---|---|
| FMAN (Frame Manager) | >20 Gbps aggregate throughput; Parse, Classify, Distribute |
| BMAN (Buffer Manager) | Manages buffer pools for accelerators and network interfaces |
| QMAN (Queue Manager) | Simplified sharing of network interfaces and hardware accelerators by multiple cores |
| RMAN (RapidIO Manager) | Seamless mapping of sRIO to DPAA |
| SEC (Security) | SNOW-3G, Kasumi, ZUC, IPSec, AES, DES, MD5, SHA-1/2 |

Saving CPU cycles for higher value work

| MAPLE-B Layer-1 Hardware Accelerators | |
|---|---|
| Standards support | LTE, WCDMA, WiMAX, GSM and LTE-Advanced |
| Throughputs | Very high throughputs enabling low processing latencies |
| Programming | Simple API |
| Multimode operations | LTE, LTE-A, WCDMA |
| Advanced MIMO | Innovative MIMO equalizer for improved spectral efficiency and reduced processing latencies compared to conventional techniques |
| Streaming | Direct streaming to/from antenna without core intervention |
| Internal Embedded Flows | Internal embedded flows for PUSCH/PDSCH uplink/downlink processing without core intervention |

Completely offloads extensive baseband algorithms

# **Advanced Interfaces for Macro Deployments**

# e6500 and SC3900 Clusters and Cores
# NP

- Each thread: superscalar, seven-issue, out-of-order execution/in-order completion; branch units with a 512-entry, 4-way set associative Branch Target/History
- Execution units: 1 Load/Store Unit per thread, 2 Simple integer per thread, 1 Complex for integer Multiply & Divide, 1 Floating-point Unit, AltiVec
- 64 TLB SuperPages, 1024-entry 4K Pages, 36-bit Physical Address

## e6500 Cluster

- 64-bit Power Architecture
- e5500 core features plus:
  - Two threads per core (SMT)
  - Dual load/store units, one per thread
- Shared L2 in cluster of 4 cores (8 threads per cluster)
  - 2048KB 16-way, 4 Banks
- High-performance eLink bus between core Ld/St and instruction fetch units
- Power
  - Drowsy core
  - Power Mgt Unit
  - Wait-on-reservation instruction
- Enhanced MP Performance
  - Accelerated Atomic Operations
  - Optimized Barrier Instructions
  - Fast intra-cluster sharing
- AltiVec SIMD Unit
- CoreNet BIU
  - 256-bit Din and Dout data busses
- 36-bit Real Address
  - 256 GByte physical addr. space
- Hardware Table Walk
- LRAT
  - Logical to Real Addr.
translation mechanism for improved hypervisor performance

## SC3900 DSP Cluster

- Cluster consists of 2x SC3900 under large and fast shared memory
  - Multibank cache of 2 MB
  - AXI based accelerators coupling port (45-90 GBps)
- Advanced bus architecture
  - Out of order transaction completion
  - Deep pipeline
  - Full MESI+L HW coherency
- Advanced L1 memory subsystem architecture
  - 36-bit address towards memory (64 GB space)
  - 32 Kbyte L1 instruction cache
  - Streaming data paths for read and write
  - 32 Kbyte read only data cache
  - 8 way, 128B line, streaming PLRU
  - 1 Kbyte Store/Gather Buffer
- Advanced debug and profiling support
  - Rich event monitoring
  - Multi-core tracing
- Improved core peripherals
  - All core peripherals are SoC accessible
  - Low latency interrupt support

# **SC3900 Memory Allocation Options**

- SC3900 has a fully cache based memory:
  - No constraints due to internal memory sizes and rigid allocations
  - No DMA management and scheduling overheads
  - Smaller internal memories required, as only used code/data is allocated
- In addition, the SC3900 supports:
  - Lock/Unlock of DDR space on the L2 cache (M2 behavior)
  - Partition of the L2 cache into several orthogonal L2 caches
  - Cache-equivalent DMA operations
  - Tightly coupled accelerator port: ultra high bandwidth and low latency
  - I/O stashing and intervention support
  - Cache Management Engine per core

# Direct MAPLE/CPRI Access to SC3900 Clusters' L2 Caches

- MAPLE/CPRI read/write coherent accesses to the clusters' L2 caches
  - Provides tight coupling of MAPLE/CPRI to the DSP cores
  - Provides high BW (>76 GBps per cluster) and low latency
  - Significantly reduces DDR bus load and coherency traffic
  - Parallel access to multiple SC3900 clusters
- MAPLE/CPRI accesses to DDRs directly via the CoreNet fabric
- Target selection is based on the MAPLE MMU

# SC3900 FVP Core Block Diagram

#### SC3900 Main Features:

- 4 symmetric DMUs in the DALU
  - 32x MACs per cycle
  - 16 FLOPs/cycle floating point support
  - 4x SIMD8/vector support
- High
memory bandwidth
  - Program bus 256 bit/cycle
  - Data bus up to 1024 bit/cycle
- Address & integer unit
  - 2 Load/Store units
  - General Integer Processing Unit (IPU)
    - Supports multiply and shift
- Large register files
  - For both AGU and DALU
- Enhanced compiler support
  - Improved predication
  - Enhanced prediction mechanisms
- Enriched instruction set
  - New binary & syntax
- Improved multi-core debug and trace features

# SC3900 Optimized for Baseband L1 Processing

- SC3900 is optimized to efficiently handle baseband PHY layer processing
- PHY layer processing can be divided into three categories:
  - Computation intensive DSP code (mainly MAC intensive)
  - Data manipulation and less intensive DSP code
  - Control code
- Each one of the categories is non-negligible in processing requirements
- There is no clear boundary between the categories
- SC3900 accelerates all types of baseband L1 processing

## **Computation Intensive DSP Code Acceleration**

- SC3900 provides vector processor capability by increasing the execution units and the whole data-path accordingly
  - Up to 32 MACs per cycle
  - 64 dedicated data registers of 40 bits each
  - Up to 1024-bit (128B) core-to-L1 data cache throughput per cycle
  - Strong and flexible cache based data streaming abilities
- The SC3900 optimized datapath leads to high MAC utilization
- Performance:
  - SC3900 is 3.5x-4x better than SC3850 in intensive DSP code

## L1 Processing - Data Manipulation Acceleration

- "Data manipulation" stands for many different functions existing in baseband Layer 1. For example:
  - Data preparation before/after intensive kernels
    - Ex: data re-ordering, matrix transpose, pack/unpack
  - Less regular kernels or serial/cyclic kernels with low parallelism
    - Ex: QR decomposition, IIR, interleaver, encoder.
- SC3900 architecture addresses "Data manipulation" by different means:
  - Data-path flexibility: this is the "Flexible Vector Processor" essence
    - Register file flexibility: each unit can read/write any register
    - Execution unit flexibility: each unit can run different and independent instructions
  - Rich and flexible instruction set
    - Efficient instruction set with broad support for different data types and sizes
    - New powerful data manipulation specific instructions
- Performance:
  - SC3900 is 2x-3x better than SC3850 in "Data Manipulation"

## L1 Processing - Control Code Efficiency

- SC3900 control code efficiency
  - L1 control functions are tightly integrated with the arithmetic intensive SW
  - Useful for running scheduling functions that are control intensive
- Control code performance is affected by two main aspects:
  - Core and compiler efficiency in typical control code constructs
  - Memory system efficiency
- Both have been addressed on the SC3900. A few examples:
  - Ability to flatten decision trees using multiple predicates
  - Third AGU unit with multiply and shift for address calculation, which also boosts control code performance
  - Full support for non-aligned memory access without penalty
  - Larger, clustered 2MB L2 cache to keep the program close to the core

## SC3900 Instruction Set

- Robust and flexible instruction set
  - Significant improvement over previous generations
- Instructions are highly flexible and fit a wide range of DSP operations
  - For example, the MAC instruction is defined to support:
    - Single precision 16bx16b, mixed-precision 16bx32b and double precision 32bx32b
    - Saturated and non-saturated arithmetic
    - SIMD and dot-product
    - Real x Real, Real x Complex, Complex x Complex, Complex x Conjugate
- Uniform and consistent instruction set
  - All instructions can use every register, support all data types, and all addressing modes
- Removed grouping restrictions
  - All instructions can be grouped together

## **Baseband Optimized Instruction Set**

- The Instruction set
definition is based on deep analysis of baseband requirements and MAPLE™ offloaded functionality
- Powerful application specific instructions are introduced in SC3900 (patents pending). A few examples:
  - Maximum/Peak search
    - Find maximum value and index between 4 words and previous results
    - Up to 4 Maxsearch (total of 20 elements) per cycle
  - Filter and correlation dedicated instructions
    - Support for both complex and real filters
  - FFT/DFT highly optimized kernel and instructions
  - Specialized load/store instructions
    - Support matrix transpose & manipulation (2x2, 2x4, 4x4, 2x8, 4x8, 8x8)
  - Bit manipulation
    - Dedicated instructions for scrambling, puncturing, interleaving
  - Reciprocal (1/x), 1/square root and log approximation instructions

# Memory System and Interconnect

# **CoreNet Platform Cache (CPC)**

- Instead of using a conventional L3 cache, the B4860 has a CoreNet Platform Cache (CPC): 512KB for each of the two DRAM controllers.
- The platform cache can function as:
  - L3 cache, or
  - Scratchpad memory, optimizing traffic to main memory and providing transient storage for sharing data among the DSPs and CPUs.
- Fully coherent with the other caches in the SoC

# Why Clustering?

- Data and code sharing between cores on the same cluster
  - Low L2-cache access latency
  - Increased core utilization
  - Reduced DDR traffic
  - Lower power
- Simplified SoC interconnect
  - Reduced area
  - Lower power
  - Higher performance
- Full hardware coherency
  - Within the cluster
  - SoC level

## **SoC Interconnect**

- B4860 has two main fabrics
  - CoreNet fabric
  - AXI fabric
- CoreNet fabric is the major fabric
  - Connects the CPU cluster, DSP clusters, OCeaN, FM and others to the CPC/DDR memories
  - 42.5 GB/s of raw bandwidth per cluster
- AXI fabric
  - Connects the MAPLE units and CPRI to the SC3900 clusters and to the CoreNet fabric
  - Allows for high throughput, low latency transfers in the Layer 1 sub-system

### **CoreNet Overview**

- Coherent fabric
  - Maintains coherence among all the CPU and DSP caches and memories
  - Reduces multicore software development effort
  - Easier software partitioning and upgrade
- High data bandwidth buses: 256-bit at 667 MHz
  - 42.5 GB/s of raw bandwidth per cluster
- Performance features
  - Parallel accesses
  - Deep pipeline
  - Out-of-order completion
- Inter-processor communication
- Stashing
- PAMU: Peripheral Access Management Unit

## **CoreNet Flow Example - Intervention**

- Step 1: Initiator (core/accelerator/IO) requests data from the CoreNet fabric
- Step 2: CoreNet broadcasts the request to the relevant initiators/targets that might have the data
- Step 3: An initiator/target which has the latest data responds and the data reaches the requestor (and optionally from memory) without a need for SW intervention

# **CoreNet Flow Example - Invalidation**

- Step 3: Initiators which hold an old version of the data invalidate it and the write data is written by the requestor

# **CoreNet Flow Example - Stashing**

- Step 2: CoreNet broadcasts the request to find the relevant caches that might hold old copies of the data
- Step 3: The data is written only to the designated target(s)

The B4860 instantiates two OCeaN DMAs:

- Transfer data between PCI/SRIO and the device memories
- Transfer data between two locations in memory
- Each DMA includes the following features
  - Eight channels
  - Advanced chaining and striding capabilities
  - Priority support
  - Can be activated using an external signal
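To make "chaining and strides" concrete, here is a minimal software model of a strided, chained DMA transfer. It is a sketch under stated assumptions: the `Descriptor` fields and the `run_chain` function are hypothetical names chosen for illustration, and they do not represent the OCeaN DMA's actual descriptor or register layout, which the slides do not describe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    """Hypothetical transfer descriptor (illustrative, not the OCeaN layout)."""
    src: int                 # source byte offset of the first block
    dst: int                 # destination byte offset of the first block
    block_len: int           # bytes copied per block
    blocks: int              # number of blocks in this transfer
    src_stride: int          # distance between block starts at the source
    dst_stride: int          # distance between block starts at the destination
    next_desc: Optional["Descriptor"] = None  # next link in the chain, if any


def run_chain(mem: bytearray, head: Optional[Descriptor]) -> int:
    """Execute every descriptor in the chain on `mem`; return bytes moved."""
    moved = 0
    desc = head
    while desc is not None:
        for i in range(desc.blocks):
            s = desc.src + i * desc.src_stride
            d = desc.dst + i * desc.dst_stride
            if s + desc.block_len > len(mem) or d + desc.block_len > len(mem):
                raise ValueError("transfer falls outside the modeled memory")
            mem[d:d + desc.block_len] = mem[s:s + desc.block_len]
            moved += desc.block_len
        desc = desc.next_desc  # chaining: continue without software involvement
    return moved


if __name__ == "__main__":
    # A 4x4 byte matrix at offset 0, followed by 8 bytes of scratch space.
    mem = bytearray(range(16)) + bytearray(8)
    # Second link: plain copy of the gathered column to another location.
    copy = Descriptor(src=16, dst=20, block_len=4, blocks=1,
                      src_stride=0, dst_stride=0)
    # First link: gather column 0 (one byte every 4 bytes) into offsets 16..19.
    gather = Descriptor(src=0, dst=16, block_len=1, blocks=4,
                        src_stride=4, dst_stride=1, next_desc=copy)
    print(run_chain(mem, gather), list(mem[16:24]))
```

In this model a stride lets one descriptor gather or scatter non-contiguous data (here, a matrix column), and a chain lets several such transfers run back to back from a single kick-off.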
## **CPRI Antenna Interface in B4860**

- The CPRI complex enables communication among radio devices over a CPRI bus.
- The CPRI complex is designed to support the CPRI V4.2 specification and can be configured to support several air interface standards, including WiMAX, LTE, and WCDMA.
- The complex supports up to 8 CPRI links (4 pairs) with each link configurable as a master or slave port.
  - Up to 9.8 Gbps per lane
- Each CPRI link supports three types of service access points (SAPs):
  - IQ samples for antenna transferred through the SAP IQ interface
  - CPRI frames synchronized by the SAP synchronization interface
  - CPRI link control and management (C&M) data transferred between SAPs in both CPRI master and slave ports.

# **CPRI - Block Diagram**

# **MAPLE-B3**

# MAPLE Modular Concept

#### **PSIF**

Central programmable control for:

- Task scheduling
- Efficient DMAing from/to SoC and internally
- Flexible processing flow allowing multiple standards support
- BD parsing and job configurations
- Interrupt handling
- Internal embedded data flows

#### PE-s

- Highly efficient HW implementation of computationally intensive baseband algorithms
- Lego-like concept allowing:
  - Fast solution derivation (Macro to Femto)
  - Use of algorithm commonality between technologies (LTE, WCDMA, CDMA, WiMAX)

# **B4860 Baseband Accelerator Platform - Highlights**

- LTE/LTE-Advanced, HSPA/WCDMA, WiMAX and multi-mode acceleration solution.
- LTE/LTE-A acceleration R.10/R.11 compliant:
  - PUSCH acceleration including Cancelation, Flexible MIMO Equalization, iDFT, De-Modulator, De-Scrambler, DINTLV, UCI decoding, HARQ, FEC decoding.
  - PDSCH acceleration including full PDSCH/PMCH from FEC to IFFT, internal RS generation, multiplexing of PBCH, PSS, SSS at RE mapping.
  - FFT/DFT and vector multiplication acceleration for PUSCH, PCFICH, PHICH, PDCCH, PRACH, Sounding and general purpose use
- WCDMA/HSPA acceleration R.10/R.11 compliant:
  - HSDPA FEC encoding
  - HSUPA, WCDMA FEC decoding
  - Downlink chip rate acceleration
  - Uplink chip rate acceleration with flexible scheme addressing:
    - Low latency control channels processing
    - Flexible interference cancelation, grouping, despreading data channels processing
    - Flexible path searcher and RACH correlations

# MAPLE-B3 Layer-1 Processing Accel. (LTE/LTE-A)

#### **PDSCH Processing**

3GPP TS 36.211/212/213

# Data Path Architecture (DPAA)

Freescale, the Freescale logo, AltiVec, C-5, CodeTEST, CodeWarrior, ColdFire, C-Ware, the Energy Efficient Solutions logo, Kinetis, mobileGT, PowerQUICC, Processor Expert, QorIQ, Qorivva, SafeAssure, the SafeAssure logo, StarCore, Symphony and VortiQa are trademarks of Freescale Semiconductor, Inc., Reg. U.S. Pat. & Tm. Off. Airfast, BeeKit, BeeStack, CoreNet, Flexis, Layerscape, MagniV, MXC, Platform in a Package, QorIQ Qonverge, QUICC Engine, Ready Play, SMARTMOS, Tower, TurboLink, Vybrid and Xtrinsic are trademarks of Freescale Semiconductor, Inc. All other product or service names are the property of their respective owners. © 2013 Freescale Semiconductor, Inc.
## **DPAA Purposes**

- Acceleration of frame/packet processing
  - Network protocols (Layer-1 to 4)
  - Standard algorithms (security and content processing)
- Classification and distribution of data flows among cores and software partitions
  - Load balancing through parse/classify/distribute
  - Load spreading through queues shared among multiple consumers
- Abstract and efficiently manage inter-core communications and the access to shared resources (NW interfaces, HW accelerators, buffers, queues)
  - More sophisticated approach compared to a basic BD/buffer list
- Scalability and portability
  - Across 'any' mix of cores, accelerators and device boundaries
  - Across device generations

## **Basic Hardware Infrastructure**

- Queuing
  - Class based prioritization
  - Large number of queues
  - Universal: between all blocks
  - Lock free, low software overhead queuing
- Hardware buffer management
  - Hardware blocks acquire and release buffers without software intervention
  - Lock free, low software overhead buffer pool management for hardware use

## Frame Manager

Frame Manager is responsible for preprocessing and moving Ethernet packets into and out of the datapath.

#### Parsing

- Packet parsing at wire speed
- Supports standard protocol parsing and identification by HW (VLAN/IP/UDP/TCP/SCTP/PPPoE/PPP/MPLS/GRE/IPSec ...)
- Supports non-standard UDF header parsing for custom protocols

#### Classification / Distribution

- Coarse classification based on key generation: hash and exact match lookup
- Supports aggregated speed of 20 Gbps, 30 Mpps @ 667 MHz
- Lookups configured by user, can be chained
- Classification result is frame queue ID, storage profile and policing profile.
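The classification step above (build a key from parsed header fields, try an exact-match lookup, otherwise spread flows by hash) can be sketched in software as follows. This is an illustration under stated assumptions, not the Frame Manager's implementation: `build_key` and `classify` are hypothetical names, the ordering of exact match before hash is a simplification, and `zlib.crc32` merely stands in for whatever hash the hardware uses, which the slides do not specify.

```python
import struct
import zlib
from ipaddress import ip_address
from typing import Dict


def build_key(src_ip: str, dst_ip: str, proto: int,
              src_port: int, dst_port: int) -> bytes:
    """Concatenate parsed 5-tuple fields into a lookup key (illustrative)."""
    return (ip_address(src_ip).packed + ip_address(dst_ip).packed
            + struct.pack("!BHH", proto, src_port, dst_port))


def classify(key: bytes, exact_match: Dict[bytes, int],
             base_fqid: int, num_queues: int) -> int:
    """Return a frame queue ID: exact-match hit first, hash spread otherwise."""
    hit = exact_match.get(key)
    if hit is not None:
        return hit
    # crc32 is a stand-in hash; the real hardware hash is an unknown here.
    return base_fqid + (zlib.crc32(key) % num_queues)


if __name__ == "__main__":
    # One flow pinned to a dedicated queue (values are made up for the demo).
    pinned = {build_key("10.0.0.1", "10.0.0.2", 132, 36412, 36412): 0x500}
    for port in (5000, 5001, 5002):
        key = build_key("192.168.1.10", "192.168.1.20", 17, port, 2152)
        fqid = classify(key, pinned, base_fqid=0x100, num_queues=8)
        print(port, hex(fqid))
```

Because the queue is chosen from a hash of the flow's own header fields, all frames of one flow land on the same frame queue (preserving per-flow order) while different flows spread across the queues that feed the cores.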
#### Ingress Policing

- Two rate three colour marking algorithm (RFC 2698 & 4115)
- Up to 256 internal profiles

#### General

- Supports offline PCD on frames extracted from QMan
- Supports "Independent" mode (no work with BMan & QMan, BD ring model)
- Per port egress rate limiting
- Statistics & multicast support
- Support for IEEE 1588 through HW timestamping

# **B4860QDS** Development Platform

- Dual-AMC form factor
  - Standalone operation or pluggable into open-top standard MicroTCA chassis
- 2x DDR3
  - 4GB dual rank 64b/72b, 1.866 GHz w/ ECC
  - 2GB 64b/72b, 1.866 GHz w/ ECC
- Ethernet
  - Up to 6x 1G/2.5G SGMII
  - Up to 2x 10G XFI/XAUI
- CPRI v4.2: up to 8 ports at 9.8G
- sRIO v2.1: up to 2 four-lane ports at 5G support
- PCIe v2.0: one four-lane port at 5G support
- AMC connector for HSSI expansions
- SFP+: two optical transceivers
- NOR, NAND, I2C & SPI flash memories
- USB, UART
- JTAG & Aurora interfaces
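For readers unfamiliar with the two rate three colour marker named under Ingress Policing, the colour-blind mode of RFC 2698 can be sketched as two token buckets. This is a minimal illustration, assuming rates in bytes per second and bursts in bytes; the class and method names are hypothetical, the RFC 4115 variant is not shown, and nothing here reflects how the Frame Manager implements its policer profiles.

```python
class TwoRateThreeColorMarker:
    """Colour-blind trTCM after RFC 2698 (illustrative sketch only)."""

    def __init__(self, cir: float, cbs: float, pir: float, pbs: float):
        self.cir, self.cbs = cir, cbs   # committed rate (B/s) and burst (B)
        self.pir, self.pbs = pir, pbs   # peak rate (B/s) and burst (B)
        self.tc = float(cbs)            # committed bucket starts full
        self.tp = float(pbs)            # peak bucket starts full
        self.last = 0.0                 # time of the previous update (s)

    def mark(self, now: float, size: int) -> str:
        """Refill both buckets up to `now`, then colour a packet of `size` bytes."""
        elapsed = now - self.last
        self.last = now
        self.tc = min(self.cbs, self.tc + self.cir * elapsed)
        self.tp = min(self.pbs, self.tp + self.pir * elapsed)
        if self.tp < size:              # exceeds the peak rate
            return "red"
        if self.tc < size:              # within peak, above committed
            self.tp -= size
            return "yellow"
        self.tp -= size                 # within the committed rate
        self.tc -= size
        return "green"


if __name__ == "__main__":
    marker = TwoRateThreeColorMarker(cir=1000, cbs=1500, pir=2000, pbs=3000)
    burst = [marker.mark(0.0, 1000) for _ in range(4)]
    print(burst)                   # ['green', 'yellow', 'yellow', 'red']
    print(marker.mark(1.0, 1000))  # green: both buckets refilled after 1 s
```

A downstream stage can then treat the colours differently, for example forwarding green, forwarding yellow at lower drop precedence, and discarding red.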