Guide To Reliable Internet Services And Applications (Computer Communications Networks)
Computer Communications and Networks

For other titles published in this series, go to www.springer.com/series/4198

The Computer Communications and Networks series is a range of textbooks, monographs and handbooks. It sets out to provide students, researchers and non-specialists alike with a sure grounding in current knowledge, together with comprehensible access to the latest developments in computer communications and networking. Emphasis is placed on clear and explanatory styles that support a tutorial approach, so that even the most complex of topics is presented in a lucid and intelligible manner.

Charles R. Kalmanek • Y. Richard Yang • Sudip Misra
Editors

Guide to Reliable Internet Services and Applications

Editors

Charles R. Kalmanek
AT&T Labs Research
180 Park Ave.
Florham Park NJ 07932
USA
crk@research.att.com

Y. Richard Yang
Yale University
Dept. of Computer Science
51 Prospect St.
New Haven CT 06511
USA
yry@cs.yale.edu

Sudip Misra
Indian Institute of Technology Kharagpur
School of Information Technology
Kharagpur-721302, India
smisra.editor@gmail.com

Series Editor

Professor A.J. Sammes, BSc, MPhil, PhD, FBCS, CEng
Centre for Forensic Computing
Cranfield University
DCMT, Shrivenham
Swindon SN6 8LA
UK

ISSN 1617-7975
ISBN 978-1-84882-827-8
e-ISBN 978-1-84882-828-5
DOI 10.1007/978-1-84882-828-5
Springer London Dordrecht Heidelberg New York

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Control Number: 2010921296

© Springer-Verlag London Limited 2010

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Cover design: SPi Publisher Services

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Foreword

An oft-repeated adage among telecommunication providers goes, “There are five things that matter: reliability, reliability, reliability, time to market, and cost. If you can’t do all five, at least do the first three.” Yet, designing and operating reliable networks and services is a Herculean task.
Building truly reliable components is unacceptably expensive, forcing us to construct reliable systems out of unreliable components. The resulting systems are inherently complex, consisting of many different kinds of components running a variety of different protocols that interact in subtle ways. Inter-networks such as the Internet span multiple regions of administrative control, from campus and corporate networks to Internet Service Providers, making good end-to-end performance a shared responsibility borne by sometimes uncooperative parties. Moreover, these networks consist not only of routers, but also lower-layer devices such as optical switches and higher-layer components such as firewalls and proxies. And, these components are highly configurable, leaving ample room for operator error and buggy software. As if that were not difficult enough, end users understandably care about the performance of their higher-level applications, which has a complicated relationship with the behavior of the underlying network.

Despite these challenges, researchers and practitioners alike have made tremendous strides in improving the reliability of modern networks and services. Their efforts have laid the groundwork for the Internet to evolve into a worldwide communications infrastructure – one of the most impressive engineering artifacts ever built. Yet, much of the amassed wisdom of how to design and run reliable networks has been spread across a variety of papers and presentations in a diverse array of venues, in tools and best common practices for managing networks, and sometimes only in the minds of the many engineers who design networking equipment and operate large networks.

This brings us to this book, which captures the state of the art in building reliable networks and services.
Like the topic of reliability itself, the book is broad, ranging from reliability modeling and planning, to network monitoring and network configuration, to disaster preparedness and reliable applications. A diverse collection of experts, from both industry and academia, have come together to distill the collective wisdom. The book is both grounded in practical challenges and forward looking to put the design and operation of reliable networks on a strong foundation. As such, the book can help us build more reliable networks and services today, and face the many challenges of achieving even greater reliability in the years ahead.

Jennifer Rexford
Princeton University

Preface

Overview

This book arose from a conversation at the Internet Network Management workshop (INM) in 2007. INM’07 was subtitled “The Five Nine’s Workshop” because it focused on raising the availability of Internet services to “Five Nine’s” or 99.999%, an availability metric traditionally associated with the telephone network. During our conversation, we vehemently agreed that there was a need for a comprehensive book on reliable Internet services and applications – a guide that would collect in one volume the accumulated wisdom of leading researchers and practitioners in the field.

Networks and networked application services using the Internet Protocol have become a critical part of society. Service disruptions can have significant impact on people’s lives and business. In fact, as the Internet has grown, application requirements have become more demanding. In the early days of the Internet, the typical applications were non-real-time applications, where packet retransmission and application layer retry would hide underlying transient network disruptions. Today, applications such as online stock trading, online gaming, Voice over IP (VoIP), and video are much more sensitive to small perturbations in the network.
For example, following one undersea cable failure in the Pacific, AT&T restored the service on an alternate route, which introduced 5 ms of additional packet delay. This seemingly small additional delay was sufficient to cause problems for an enterprise customer that operated an application between a call center in India and a data center in Canada. This problem led to subsequent re-engineering of the customer’s end-to-end connection.

In addition, networked application services have become an increasingly important part of people’s lives. The Internet and virtual private networks support many mission critical business services. Ten years ago, it would have been just an inconvenience if someone lost their IP service. Today, people and businesses depend on Internet applications. Online stock trading companies are not in business if people cannot implement their trades. The Department of Defense cannot operate their information-based programs if their information infrastructure is not operating. Call centers with VoIP services cannot serve their customers without their IP network.

Although we started work on this book with a focus on network reliability, it should be obvious from the preceding description that it is important to consider both reliability and performance, and to consider both networks and networked application services. Examples of networked applications include email, VoIP, search engines, e-commerce sites, news sites, or content delivery networks.

Features

This book has a number of features that make it a unique and valuable guide to reliable Internet services and applications.

Systematic, interdisciplinary approach: Building and operating reliable network services and applications requires a systematic approach.
This book provides comprehensive, systematic, and interdisciplinary coverage of the important technical topics, including areas such as networking; performance and reliability modeling; network measurement; configuration, fault, and security management; and software systems. The book provides an introduction to all of the topics, while at the same time going into enough depth for interested readers who already understand the basics. Specifically, the book is divided into seven parts. Part I provides an introduction to the challenges of building reliable networks and applications, and presents an overview of the structure of a large Internet Service Provider (ISP) network. Part II introduces reliability modeling and network capacity planning. Part III extends the discussion beyond a single network administrative domain, covering interdomain reliability and overlay networks. Part IV provides an introduction to an important aspect of reliability: configuration management. Part V introduces network measurements, which provide the underpinning of network management. Part VI covers network and security management, and disaster preparedness. Part VII describes techniques for building application services, and provides a comprehensive overview of capacity and performance engineering for these services. Taken in total, the book provides a comprehensive introduction to an important topic.

Coverage of pragmatic problems arising in real, operational deployments: Building and operating reliable networks and applications requires an understanding of the pragmatic challenges that arise in an operational setting. This book is written by leading practitioners and researchers, and provides a unique perspective on the subject matter arising from their experience. Several chapters provide valuable “best practices” to help readers translate ideas into practice.
Content and structure allow reference reading: Although the book can be read from cover to cover, each chapter is designed to be largely self-contained, allowing readers to jump to specific topics that they may be interested in. The necessary overlap across a few of the chapters is minimal.

Audience

The goal of this book is to present a comprehensive guide to reliable Internet services and applications in a form that will be of broad interest to educators and researchers. The material is covered in a level of detail that would be suitable for an advanced undergraduate or graduate course in computer science. It can be used as the basis or supplemental material for a one- or two-semester course, providing a solid grounding in both theory and practice. The book will also be valuable to researchers seeking to understand the challenges faced by service providers and to identify areas that are ripe for research. The book is also intended to be useful to practitioners who want to broaden their understanding of the field, and/or to deepen their knowledge of the fundamentals.

By focusing our attention on a large ISP network and associated application services, we consider a problem that is large enough to expose the real challenges and yet broad enough to expose guidelines and best practices that will be applicable in other domains. For example, though the book does not discuss access or wireless networks, we believe that the principles and approaches to reliability that are presented in this book apply to them and are, in fact, broadly applicable to any large network or networked application. We hope that you will find the book to be informative and useful.

Florham Park, NJ: Charles R. Kalmanek
India: Sudip Misra
New Haven, CT: Y. Richard Yang

Acknowledgments

The credit for this book goes first and foremost to the authors of the individual chapters.
It takes a great deal of effort to crystallize one’s understanding of a topic into an overview that is self-contained, technically deep, and interesting. The authors of this volume have done an outstanding job. The editors acknowledge the contributions of many reviewers, whose comments clearly improved the quality of the chapters. Simon Rees and Wayne Wheeler, our editors at Springer, have been helpful and supportive. The editors also acknowledge the support that they have been given by their families and loved ones during the long evenings and weekends spent developing this book.

Contents

Part I Introduction and Reliable Network Design

1 The Challenges of Building Reliable Networks and Networked Application Services
  Charles R. Kalmanek and Y. Richard Yang
2 Structural Overview of ISP Networks
  Robert D. Doverspike, K.K. Ramakrishnan, and Chris Chase

Part II Reliability Modeling and Network Planning

3 Reliability Metrics for Routers in IP Networks
  Yaakov Kogan
4 Network Performability Evaluation
  Kostas N. Oikonomou
5 Robust Network Planning
  Matthew Roughan

Part III Interdomain Reliability and Overlay Networks

6 Interdomain Routing and Reliability
  Feng Wang and Lixin Gao
7 Overlay Networking and Resiliency
  Bobby Bhattacharjee and Michael Rabinovich

Part IV Configuration Management

8 Network Configuration Management
  Brian D. Freeman
9 Network Configuration Validation
  Sanjai Narain, Rajesh Talpade, and Gary Levin

Part V Network Measurement

10 Measurements of Data Plane Reliability and Performance
  Nick Duffield and Al Morton
11 Measurements of Control Plane Reliability and Performance
  Lee Breslau and Aman Shaikh

Part VI Network and Security Management, and Disaster Preparedness

12 Network Management: Fault Management, Performance Management, and Planned Maintenance
  Jennifer M. Yates and Zihui Ge
13 Network Security – A Service Provider View
  Brian Rexroad and Jacobus Van der Merwe
14 Disaster Preparedness and Resiliency
  Susan R. Bailey

Part VII Reliable Application Services

15 Building Large-Scale, Reliable Network Services
  Alan L. Glasser
16 Capacity and Performance Engineering for Networked Application Servers: A Case Study in E-mail Platform Planning
  Paul Reeser

Index

Part I Introduction and Reliable Network Design

Chapter 1 The Challenges of Building Reliable Networks and Networked Application Services

Charles R. Kalmanek and Y.
Richard Yang

C.R. Kalmanek, AT&T Labs, 180 Park Ave., 07932, Florham Park, NJ, USA, e-mail: crk@research.att.com
Y.R. Yang, Yale University, 51 Prospect Street, New Haven, CT, USA, e-mail: yry@cs.yale.edu
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_1, © Springer-Verlag London Limited 2010

1.1 Introduction

In the decades since the ARPANET interconnected four research labs in 1969 [1], computer networks have become a critical infrastructure supporting our information-based society. Our dependence on this infrastructure is similar to our dependence on other basic infrastructures such as the world’s power grids and the global transportation systems. Failures of the network infrastructure or major applications running on top of it can have an enormous financial and social cost with serious consequences to the organizations and consumers that depend on these services.

Given the importance of this communications and applications infrastructure to the economy and society as a whole, reliability is a major concern of network and service providers. After a survey of major network carriers including AT&T, BT, and NTT, Telemark [7] concludes that, “The three elements which carriers are most concerned about when deploying communication services are network reliability, network usability, and network fault processing capabilities. The top three elements all belong to the reliability category.”

Unfortunately, the challenges associated with running reliable, large-scale networks are not well documented in the research literature. Moreover, while networking and software curricula provide a good theoretical foundation, there is little training in the techniques used by experienced practitioners to address reliability challenges. Another issue is that while traditional telecommunications vendors gained extensive experience in building reliable software, the pace of change has accelerated as the Internet has grown and Internet system vendors do not meet the level of reliability traditionally associated with “carrier grade” systems. Newer vendors accustomed to building consumer software are entering the service provider market, but they do not have a culture that focuses on the higher level of required reliability. This places a greater burden on the service providers that integrate these vendors’ software: to offer reliable services, they must help the vendors “raise the bar” on reliability.

Although we emphasize network reliability in the foregoing discussion, it is important to consider both reliability and performance and to consider both networks and networked application services. Users are interested in the performance of an end-to-end service. When a user is unable to access his e-mail, he does not particularly care whether the network or the application is at fault. Examples of network applications include e-mail, Voice over IP, search engines, e-commerce sites, news sites, or content delivery networks.

1.2 Why Is Reliability Hard?

Supporting reliable networks and networked application services involves some of the most complex engineering and operational challenges that are dealt with in any industry. Much of this complexity is intentionally transparent to the end users, who expect things to “just work.” Moreover, the end users are typically not exposed to the root causes of network or service problems when their service is degraded or interrupted. As a result, it is natural for end users to assume that network and service reliability are not hard. In part, users get this impression because most service providers and Internet-facing web services operate at very high levels of reliability.
Though it may look easy, this level of reliability is a result of solid engineering and “constant vigilance.” The best service providers engage in a process of continuous improvement, similar to the Japanese “Kaizen” philosophy that was popularized by Deming [2].

In this book, we address the challenges faced by service providers and the approaches that they use to deliver reliable services to their users. Before delving into the solution, we ask ourselves, why is it so hard to build highly reliable networks and networked application services? We can characterize the difficulty as resulting from three primary causes. The first challenge is scale and complexity; the second is that the services operate in the presence of constant change. These challenges are inherent to large-scale networks. The third challenge is less fundamental but still important. It relates to challenges with measurement and data.

1.2.1 Scale and Complexity Challenges

Scale and complexity challenges are fundamental to any large network or service infrastructure. As Steve Bellovin remarked, “Things break. Complex systems break in complex ways” [8]. In particular, large service provider networks contain hundreds of thousands of network elements distributed around the world, and tens of thousands of different models of equipment. These network elements are interconnected and must interoperate correctly to offer services to the network users. Failures in one part of the network can impact other parts of the network. Even if we consider only the infrastructure needed to provide basic IP connectivity services, it consists of a vast number of complex building blocks: routers, multiplexers, transmission equipment, servers, systems software, load balancers, storage, firewalls, application software, etc.
At any given point in time, some network elements have failed, have been taken out of service, or are operating at a degraded performance level.

The preceding description only hints at the challenges. Despite the careful engineering and modeling that is done through all stages of the service life cycle, if we look at the service infrastructure as a system, we note that the system does not always behave as expected. There are many reasons for this, including:

• Software defects in network elements
• Inadequate modeling of dependencies
• Complex software-support systems

The vast majority of the elements involved in providing a network service contain software, which can be buggy, particularly when the software function is complex. If a bug is triggered, a piece of equipment can behave in unexpected ways. Even though the correct operation of router software is critical to service, we have seen design flaws in the way that the router operating system handles resource management and scheduling, which manifest themselves as latent outages. The history of the telephone network contains examples of major network outages caused by software faults, such as the famous “crash” of the AT&T long-distance telephone network in 1990 [3]. Similarly, the network elements that make up the IP network infrastructure contain complex control-plane software implementing distributed protocols that must interoperate properly for the network to work. When compared to the telephone switching software, the control-plane software of IP networks changes more frequently and is far more likely to be subject to undetected software faults. These faults occasionally result in unexpected behaviors that can lead to outages or degraded performance.

In a large complex infrastructure, operators do not have a comprehensive model of all of the dependencies between systems supporting a given service: they rely on simplifying abstractions such as network layering and administrative separation of concerns.
These abstractions can break down in unexpected ways. For example, there are complex interactions between network layers, such as the transport and IP layers, that affect reliability. Consider a link between two routers that is transported over a SONET ring. Networks are typically designed so that protection switching at the SONET layer is transparent to the IP layer. However, several years ago, AT&T experienced problems in the field, whereby a SONET “protection switching event” triggered a router software bug that caused several minutes of unexpected customer downtime. Since the protection switch occurred correctly, the problem did not trigger an alarm and was only uncovered by correlating customer trouble tickets with network event data. This cross-layer interaction is an example of the kinds of dependency that can be difficult to anticipate and troubleshoot.

In addition to the scale of the network and the complexity of the network equipment, correct operation depends on the operation of complex software systems that manage the network and support customer care. Router configuration files contain a large number of parameters that must be configured correctly. Incorrect configuration of an access control list can create security vulnerabilities, or alternatively, can cause traffic to be “blackholed” by blocking legitimate traffic. If there is a mismatch between the Quality of Service settings on a customer-edge router and those on the provider-edge router that it connects, some applications may experience performance problems under heavy load. An inconsistency between the network inventory database and the running network can lead to stranded network capacity, service degradations, network outages, etc.
These problems sometimes manifest themselves weeks or months after the inconsistency appeared – for this reason, they are sometimes referred to as “time bombs.”

1.2.2 Constant Change

The second challenge relates to the fact that any large-scale service infrastructure undergoes constant change. Maintenance and customer-provisioning activities in a large global network are ongoing, spanning multiple time zones. On a typical workday, new customers are being provisioned, service for departing customers is being turned down, and change orders that modify some service characteristic are being processed for existing customers. Capacity augmentation and traffic grooming, whereby private-line connections are rearranged to use network resources more efficiently, take place daily. Routine maintenance activities such as software upgrades also take place during predefined maintenance “windows.” More complex maintenance activities, such as network migrations, also occur periodically. Examples of network migration include moving a customer connection from one access router to another, replacing a backbone router, or consolidating all of a regional network’s traffic onto a national backbone network in order to retire an older backbone. Replacing a backbone router in a service provider network requires careful planning and execution of a sequence of moves of the “uplinks” from access routers in order to minimize the amount of traffic that is dropped. Decision-support tools are used to model the traffic that impinges on all of the affected links at every step of the move to ensure that links are not congested.

In the midst of these day-to-day changes, network failures can occur at any time. The network is designed to automatically restore service after a failure. However, during planned maintenance activities, it is possible that some network capacity has been removed from service temporarily, potentially leaving the network more vulnerable to specific failures.
Under normal conditions, maintenance to repair the failed network element is scheduled to occur later at a convenient time, after which the network traffic may revert to its original path.

Finally, in addition to the day-to-day changes of new customers, or the occasional changes that come from major network migrations, there are also architectural changes. These changes might result from the introduction of new features and services, or new protocols. An example might be the addition of a new “class of service” in the backbone. Another example might be turning up support for multicast services in MPLS-based VPNs. The first example (class of service) involves configuration changes that may touch every router in the network. The second example involves introducing a new architectural element (i.e., a PIM rendezvous point), enabling a new protocol (i.e., PIM), validating the operation of multicast monitoring tools, etc. All of these changes would have been tested in the lab prior to the First Field Application (FFA), which is typically the first time that everything comes together in an operational network carrying live customer traffic. If there are problems during the FFA with the new feature that is being deployed, network operations will execute procedures to gracefully back out of the change until the root cause of the problem is analyzed and corrected.

1.2.3 Measurement and Data Challenges

The third challenge in building reliable networks relates to measurement and data. Vendor products deployed by service providers often suffer from an inadequate implementation of basic telemetry functions that are necessary to monitor and manage the equipment. In addition, because of the complexity of the operating environment described earlier, there are many, diverse data sources, with highly variable data quality. We present two examples.
Despite the maturity of SNMP [4], AT&T has seen an implementation of a commercial SNMP poller that did not correctly handle the data impacts of router reboots or loss of data in transit. Ideally, problems like this are discovered in the lab, but occasionally they are not discovered until the equipment is deployed and supporting live service. Data problems are not limited to network layer equipment: vendor-developed software components running on servers may not support monitoring agents that export the data necessary to implement a comprehensive performance-monitoring infrastructure. When these software components are combined in a complex, multitiered application, the workflow and dependencies among the components may not be fully understood even by the vendor. When such a system is deployed, even with well-designed server instrumentation, it may be difficult to determine exactly which component is the bottleneck limiting system throughput.

Another issue is that data are often “locked up” in management system “silos.” This can result from selecting a vendor’s proprietary element-management system. Typically, proprietary systems are not designed to make data export easy, since the vendor seeks to lock the service provider into a complete “solution.” Data silos can also result from internal implementations. These often result from organizational silos: a management system is specified and built to address a specific set of functions, without the involvement of subject matter experts from other domains. Whatever the cause, the end result is that the data necessary to monitor and manage the infrastructure may not exist or may be difficult to access by analysts who are trying to understand the system.

1.3 Toward Network and Service Reliability

The examples in Section 1.2 give only a glimpse into the complex challenges faced by service providers who seek to provide reliable services.
Despite these complexities, the vast majority of users receive good service. How is this achieved? At the highest level, network and service reliability involve both good engineering design and good operational practices. These practices are inextricably linked: no matter how good the operations team is, good operational practices cannot make up for a poorly thought out design. Likewise, a good design that is implemented or operated poorly will not result in reliable service.

It should be obvious that reliable services start with good design and engineering. The service design process relies on extensive domain knowledge and a good understanding of the business and service-level objectives. Network engineers develop detailed requirements for each network element in light of the end-to-end objectives for reliability, availability, and operability. Network elements are selected carefully. After a detailed paper and lab evaluation, an engineering team selects a specific product to meet a particular need. Once the product is selected, it enters a change control process where differences between the requirements and the product’s capabilities are managed by the service provider in conjunction with the vendor. The service designers, working closely with test engineers, develop comprehensive engineering rules for each of the network elements, including safe operating limits for resources such as bandwidth or CPU utilization. Detailed engineering documents are developed that describe how the network element is to be used, its engineering limits, etc. Network management requirements for the new network element are developed in conjunction with operations personnel and delivered to the IT team responsible for the operations-support systems (OSSs). Before the FFA of the new element, the element and the OSSs undergo an Operations Readiness Test (ORT), which verifies that the element and the associated OSSs work as expected, and can be managed by network operations.
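To make the idea of engineering rules with safe operating limits concrete, the following is a minimal sketch in Python. It is an illustration only, not a description of any service provider's tooling: the `EngineeringRule` type, the resource names, the limit values, and the router names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineeringRule:
    """Safe operating limit for one resource (hypothetical structure)."""
    resource: str      # e.g. "link_utilization" or "cpu_utilization"
    safe_limit: float  # fraction of capacity in [0, 1]


def find_violations(rules, measurements):
    """Return (element, resource, value, limit) for each reading above its limit."""
    limits = {rule.resource: rule.safe_limit for rule in rules}
    violations = []
    for element, readings in sorted(measurements.items()):
        for resource, value in sorted(readings.items()):
            limit = limits.get(resource)
            # Resources without a rule are ignored rather than flagged.
            if limit is not None and value > limit:
                violations.append((element, resource, value, limit))
    return violations


# Hypothetical limits and readings, chosen only for illustration.
RULES = [
    EngineeringRule("link_utilization", 0.80),
    EngineeringRule("cpu_utilization", 0.60),
]
MEASUREMENTS = {
    "router-a": {"link_utilization": 0.72, "cpu_utilization": 0.65},
    "router-b": {"link_utilization": 0.91, "cpu_utilization": 0.40},
}

for violation in find_violations(RULES, MEASUREMENTS):
    print(violation)
```

Keeping the limits in data rather than in code is one plausible design choice here: the same check can then be run in the lab, during an ORT, and against live telemetry without modification.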
The preceding paragraph gives a brief overview of some of the engineering “best practices” involved in building a reliable network. In addition, reliability and capacity modeling must be done for the network as a whole. The network architecture includes the appropriate recovery mechanisms to address potential failures. Reliability modeling tools are used to model the impact on the network of failures in light of both current and forecast demands. Where possible, the tools model cross-layer dependencies between IP layer links and the underlying transport or physical layer network, such as the existence of “shared risk groups,” that is, links or elements that may be subject to simultaneous failure. By simulating all possible failure scenarios, these tools allow the network designers to trade off network cost against survivability. The network design also includes a comprehensive security design that considers the important threats to the network and its customers, and implements appropriate access controls and other security detection and mitigation strategies.

An operations organization is typically responsible for managing the network or service on a day-to-day basis. The operations team is supported by the operations-support systems mentioned earlier. These include configuration-management systems responsible for maintaining network inventory data and configuring the network elements, and service assurance systems that collect telemetry data from the network to support fault and performance management functions. The fault and performance management systems are the operations team’s “eyes” into the service infrastructure: when problems occur, they are used to figure out what needs to be repaired. We can consider fault and performance management systems as involving the following areas: the instrumentation layer, the data management layer, and the management application layer.
We start thinking about the instrumentation layer by asking what telemetry or measurement data need to be collected to validate that the service is meeting its service-level objectives (or to troubleshoot problems if it is not). Standardized router MIB data provide a base level of information, but additional instrumentation is needed to manage large networks supporting complex applications. Passive monitoring techniques support collection of data directly from network elements and dedicated passive monitoring devices, but active monitoring, involving the injection and monitoring of synthetic traffic, is also required and is commonly used. Since the correct operation of the IP forwarding layer (data plane) critically depends on the correct operation of the IP control plane, both data plane and the control-plane monitoring are important. In software-based application services, the telemetry frequently does not adequately capture “soft” failure modes, such as transaction timeouts between devices or errors in software settings and parameters. Both the servers supporting application software and the applications themselves need to be instrumented and monitored for both faults and key performance parameters. Large service providers typically have a significant number of data sources that are relevant to service management, and the data management layer needs to be able to handle large volumes of telemetry and alarm data. As a result, the data-collection and data-management infrastructure presents challenging systems design problems. A good design allows data-source-specific collectors to be easily integrated. It also provides a framework for data normalization, so that common fields such as timestamps, router names, etc., can be normalized to a common key during data ingest so that application developers are spared some of the complexity of understanding details of the raw data streams. 
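The normalization step described above can be illustrated with a minimal sketch. It is not taken from any particular collector: the record fields (`router`, `timestamp`, `metric`, `value`), the short-hostname convention, and the choice of UTC ISO 8601 strings as the common key are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def normalize_router_name(raw):
    """Map a raw router name to an assumed common key: lower-case short hostname."""
    return raw.strip().lower().split(".")[0]


def normalize_timestamp(value):
    """Convert epoch seconds or an ISO 8601 string to a UTC ISO 8601 string."""
    if isinstance(value, (int, float)):
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        dt = datetime.fromisoformat(value)
        if dt.tzinfo is None:
            # Assumption for this sketch: naive timestamps are already in UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        dt = dt.astimezone(timezone.utc)
    return dt.isoformat()


def normalize_record(record):
    """Return a copy of a raw telemetry record with normalized common fields."""
    return {
        "router": normalize_router_name(record["router"]),
        "timestamp": normalize_timestamp(record["timestamp"]),
        "metric": record["metric"],
        "value": record["value"],
    }


if __name__ == "__main__":
    # Two hypothetical sources reporting the same router and instant differently.
    from_poller = {"router": " CR1.NYC.Example.NET ", "timestamp": 1243850400,
                   "metric": "cpu_util", "value": 0.42}
    from_syslog = {"router": "cr1", "timestamp": "2009-06-01T12:00:00+02:00",
                   "metric": "cpu_util", "value": 0.42}
    print(normalize_record(from_poller) == normalize_record(from_syslog))  # True
```

Once both sources agree on the router key and the time base, an application can join their records without knowing how each raw stream was encoded.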
Ideally, the design of the data management layer supports a common real-time and archival data store that is accessed by a range of applications.

The management applications supported on top of the data management layer support routine operations functions such as fault and performance management, in addition to supporting more complex analyses. Given the vast quantity of event data that is generated by the network, the event management system must appropriately filter the information that must be acted upon by the operations team to avoid flooding them with spurious information. The impact of alarm storms (and the importance of alarm filtering) can be illustrated by the story of Three Mile Island, in which the computer system noted 700 distinct error conditions within the first minute of the problem, followed by thousands of error reports and updates [4]. The operators were drowning in a sea of information at a time when they needed a small number of actionable items to work on.

Management applications also enable operations personnel to control the network, including performing routine tasks such as resetting a line card on a router as well as more complex tasks. Standard tasks are handled through an operations interface to an operations-support system. Ad hoc tasks that involve a complex workflow may require operations staff to use a scripting language that accesses the network inventory database and sends commands to network elements or element-management systems. Ideally, the operations-support systems automate most of the routine tasks to a large extent, audit the results of these tasks, and back them out if there are problems.

It is useful to note that operations personnel are typically organized in multiple response tiers. The lower tiers of operations staff work on immediate problems, following established procedures.
The tools that they use have constrained functionality, targeted at the functions that they are expected to perform. The highest tier of operations personnel consists of senior operations staff charged with diagnosing complex problems in real-time or performing postmortem analysis of complex, unresolved problems that occurred in the past. These investigations may take more time than lower-tier operations staff can afford to spend on a specific problem. When there are serious problems affecting major customers or the network as a whole, engineers from the network engineering team are also called upon to assist. In these cases, one or more analysts do exploratory data mining (EDM) using data exploration tools [6] that support data drill down, statistical data analysis, and data visualization. Well-designed data exploration tools can make a huge difference when analysts are faced with the “needle in the haystack” problem: trying to sort through huge quantities of telemetry data to draw meaningful conclusions. When analysts uncover the root cause of a particular problem, this information can be used to eliminate the problem, e.g., by pressing a vendor to fix a software bug, by repairing a configuration error, etc.

As we mentioned in Section 1.2, a broad goal of both the network designers and network operations is to maintain and continuously improve network reliability, availability, and performance, despite the challenges. “Holding the gains” or staying flat on network performance is insufficient to meet increasingly tight customer and application requirements. There is evidence that the principles and best practices presented in this book have results. Figure 1.1 shows measured Defects-per-Million (DPM) on the AT&T IP Backbone since the AT&T Managed Internet Service was first offered in 1999. This chart plots the total number of minutes of port outages during a year (i.e., the number of minutes each customer port was out of service), divided by the number of port minutes in that year (i.e., the number of ports times the number of minutes each was in service), times a normalization factor of 1,000,000. The points are measured data; the smooth curve resembles a classic improvement curve. Over the first 2 years of the service, DPM was reduced significantly as vendor problems were addressed, architectural improvements were put in place, and operations processes were matured. Further improvements continue to be achieved. While DPM is only one of the many fault and performance metrics that must be tracked and managed, this chart illustrates how good design and good operations pay off.

Fig. 1.1 Unplanned DPM for AT&T IP Backbone (vertical axis: unplanned DPM, linear scale; horizontal axis: year, 1999 to 2008)

The principles that underlie design and operation of reliable networks are also critical to the design and operation of reliable application services. However, there are also many differences between these two domains, including wide differences in the domain knowledge of the typical network engineer and the typical software developers. The life cycle of reliable software starts with understanding the requirements, and involves every step of the development process, including field support and application monitoring. As in networks, capacity and performance engineering of application services rely on both modeling and data collection.

This section has described some of the design and network management practices that are performed by large service providers that run reliable networks and services. In Section 1.4, we provide an overview of the material that is covered in the book.
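The DPM definition given in this section (total port outage minutes divided by total in-service port minutes, times 1,000,000) can be written as a short calculation. The sketch below is illustrative only: the function name, the per-port list inputs, and the sample figures are assumptions, not measured AT&T data.

```python
def defects_per_million(outage_minutes, in_service_minutes):
    """Return DPM: outage port-minutes per million in-service port-minutes.

    outage_minutes     -- minutes each customer port was out of service
    in_service_minutes -- minutes each customer port was in service
    """
    total_outage = sum(outage_minutes)
    total_in_service = sum(in_service_minutes)
    if total_in_service <= 0:
        raise ValueError("total in-service port minutes must be positive")
    return total_outage / total_in_service * 1_000_000


if __name__ == "__main__":
    # Hypothetical example: three ports, each in service for a 365-day year.
    minutes_per_year = 365 * 24 * 60            # 525,600 minutes
    outages = [30, 0, 75]                       # outage minutes per port
    in_service = [minutes_per_year] * 3
    print(round(defects_per_million(outages, in_service), 2))  # 66.59
```

Because the metric is normalized by port minutes, it can be compared from year to year even as the number of customer ports grows.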
1.4 A Bird’s Eye View of the Book

The book consists of seven parts, covering both reliable networks and reliable network application services.

1.4.1 Part I: Reliable Network Design

Part I introduces the challenges of building reliable networks and services, and provides background for the rest of the book. Following this chapter, Chapter 2 presents an overview of the structure of a large ISP backbone network. Since IP network reliability is tied intimately to the underlying transport network layers, this chapter presents an overview of these technologies. Section 2.4 provides an overview of the IP control plane, and introduces Multi-Protocol Label Switching (MPLS), a routing and forwarding technology that is used by most large ISPs to support Internet and Virtual Private Network (VPN) services on a shared backbone network. Section 2.5 introduces network restoration, which allows the network to rapidly recover from failures. This section provides a performance analysis of the limitations of OSPF failure detection and recovery to motivate the deployment of MPLS Fast Reroute. The chapter concludes with a case study of an IP network supporting IPTV services that links together many of the concepts.

1.4.2 Part II: Reliability Modeling and Network Planning

Part II of the book covers network reliability modeling, and its close cousin, network planning. Chapter 3 starts with an overview of the main router elements (e.g., routing processors, line cards, switching fabric, power supply, and cooling system), and their failure modes. Section 3.2 introduces redundancy mechanisms for router elements, as they are important for availability modeling. Section 3.3 shows how to compute the reliability metrics of a single router with and without redundancy mechanisms. Section 3.4 extends the reliability model from a single router to a large network of edge routers and presents reliability metrics that consider device heterogeneity.
The chapter also provides an overview of the challenges in measuring end-to-end availability, which is the focus of Chapter 4.

Chapter 4 provides a theoretical grounding in performance and reliability (performability) modeling in the context of a large-scale network. A fundamental challenge is that the size of the state space is exponential in the number of network elements. Section 4.2 presents a hierarchical network model used for performability modeling. Section 4.3 discusses the performability evaluation problem in general and presents the state-generation approach. The chapter also introduces the nperf network performability analyzer, a software package developed at AT&T Labs Research. Section 4.4 concludes by presenting two case studies that illustrate the material of this chapter, the first involving an IPTV distribution network, and the second dealing with architecture choices for network access.

Chapter 5 focuses on network planning. Since capacity planning depends on utilization and traffic data, the chapter takes a systems view. Because network measurements are of varying quality, the modeling process must be robust to data-quality problems while giving useful estimates that can be used for planning: “Essentially, all models are wrong, but some are useful.” This chapter is organized around the key steps in network planning. Sections 5.2 and 5.3 cover measurements, analysis, and modeling of network traffic. Section 5.4 covers prediction, including both incremental planning and green-field planning. Section 5.5 presents optimal network planning. Section 5.6 covers robust planning.

1.4.3 Part III: Interdomain Reliability and Overlay Networks

Part III extends beyond the design of a large backbone network to interdomain and overlay networks. Chapter 6 provides an overview of interdomain routing. Section 6.3 highlights the limitations of the BGP routing protocol.
For example, the protocol design does not guarantee that routing will converge to a stable route. Section 6.4 presents measurement results that quantify the impact of interdomain routing impairments on end-to-end path performance. Section 6.5 presents a detailed overview of the existing solutions to achieve reliable interdomain routing, and Section 6.6 points out possible future research directions.

Overlay networks are discussed in Chapter 7 as a way of providing end-to-end reliability at the application or service layer. The overlay topology can be tailored to application requirements; overlay routing may choose application-specific policies; and overlay networks can emulate functionality not supported by the underlying network. This chapter surveys overlay applications with a focus on how they are used to increase network resilience. The chapter considers how overlay networks can make a distributed application more resilient to flash crowds, to component failures and churn, to network failures and congestion, and to denial-of-service attacks.

1.4.4 Part IV: Configuration Management

Network design is just one part of building a reliable network or service infrastructure; configuration management is another critical function. Part IV discusses this topic.

Chapter 8 discusses network configuration management, presenting a high-level view of the software system involved in managing a large network of routers in support of carrier class services. Section 8.2 reviews key concepts to structure the types of data items that the system must deal with. Section 8.3 describes the subcomponents of the system and the requirements of each subcomponent. This section also discusses two approaches that are commonly used for router configuration, policy-based and template-based, and highlights the different requirements associated with provisioning consumer and enterprise services.
Section 8.4 gives an overview of one of the key challenges in designing a configuration-management system, which is handling changes. Finally, the chapter presents a step-by-step overview of the subscriber provisioning process.

While a well-designed configuration-management system does configuration auditing, Chapter 9 looks at auditing from a different perspective, describing the need for bottom-up, network-wide configuration validation. Section 9.2 provides a case study of the challenges of configuring a multi-organization “collaboration network,” the types of vulnerabilities caused by configuration errors, the reasons these arise, and the benefits derived from using a configuration validation system. Section 9.3 abstracts from experience and proposes a reference design of a validation system. Section 9.4 discusses the IPAssure system and the design choices it has made to realize this design. Section 9.5 surveys related technologies for realizing this design. Section 9.6 discusses the experience with using IPAssure to assist a US government agency with compliance with FISMA requirements.

1.4.5 Part V: Network Measurement

While measurement was not a priority in the original design of the Internet, the complexity of networks, traffic, and the protocols that mediate them now require detailed measurements to manage the network, to verify that performance meets the required goals, and to diagnose performance degradations when they occur. Part V covers network measurement, with a focus on reliability and performance monitoring.

Chapter 10 covers data plane measurements. Sections 10.2–10.5 describe a spectrum of passive traffic measurement methods that are currently employed in provider networks, and also describe some newer approaches that have been proposed or may even be deployed in the medium term. Section 10.6 covers active measurement tools. Sections 10.7–10.8 review IP performance metrics and their usage in service-level agreements.
Section 10.9 presents multiple approaches to deploy active measurement systems.

The control plane in an IP network controls the overall flow of traffic in the network, and is critical to its operation. Chapter 11 covers control-plane measurements. Section 11.2 gives an overview of the key protocols that make up the “unicast” control plane (OSPF and BGP), describes how they are monitored, and surveys key applications of the measurement data. Section 11.3 presents the additional challenges that arise in performing multicast monitoring.

1.4.6 Part VI: Network and Security Management, and Disaster Preparedness

Chapter 12 focuses on the network management systems and the tasks involved in supporting the day-to-day operations of an IP network. The goal of network operations is to keep the network up and running, and performing at or above designed levels of service performance. Section 12.2 covers fault and performance management: detecting, troubleshooting, and repairing network faults and performance impairments. Section 12.3 examines how process automation is incorporated in fault and performance management to automate many of the tasks that were originally executed by humans. Process automation is the key ingredient that enables a relatively small Operations group to manage rapidly growing numbers of network elements and customer ports, along with growing complexity. Section 12.4 discusses tracking and managing network availability and performance over time, looking across larger numbers of network events to identify opportunities for performance improvements. Section 12.5 then focuses on planned maintenance. The chapter also presents areas for innovation and a set of best practices.

Chapter 13 presents a service provider’s view of network security. Section 13.2 provides an exposition of the network security threats and their causes.
A fundamental concern is that in the area of network security, the economic balance is heavily skewed in favor of bad actors. Section 13.3 presents a framework for network security, including the means of detecting security incidents. Section 13.4 deals with the importance of developing good network security intelligence. Section 13.5 presents a number of operational network security systems used for the detection and mitigation of security threats. Finally, Section 13.6 summarizes important insights and then briefly considers important new and developing directions and concerns in network security as an indication of where resources should be focused both tactically and strategically.

Chapter 14 discusses disaster preparedness as the critical factor that determines an operator’s ability to recover from a network disaster. For network operators to effectively recover from a disaster, a significant investment must be made to prepare before the disaster occurs, so that network operations are prepared to act quickly and efficiently. This chapter describes the creation, exercise, and management of disaster recovery plans. With good disaster preparedness, disaster recovery becomes the disciplined management of the execution of disaster recovery plans.

1.4.7 Part VII: Reliable Application Services

Large-scale networks exist to connect users to applications. Part VII expands the scope of the book to the software and servers that support network applications.

Chapter 15 presents an approach to the design and development of reliable network application software. This chapter presents the entire life cycle of what it takes to build reliable network applications, including software development process, requirements development, architecture, design and implementation, testing methodology, support, and reporting.
This chapter also discusses techniques that aid in troubleshooting failed systems as well as techniques that tend to minimize the duration of a failure. The chapter presents best practices for building reliable network applications.

Chapter 16 provides a comprehensive overview of capacity and performance engineering (C/PE), which is especially critical to the successful deployment of a networked service platform. At the highest level, the goal is to ensure that the service meets all performance and reliability requirements in the most cost-effective manner, where “cost” encompasses such areas as hardware/software resources, delivery schedule, and scalability. The chapter uses e-mail as an illustrating example. Section 16.4 covers the architecture assessment phase of the C/PE process, including the flow of critical transactions. Section 16.5 covers the workload/metric assessment phase, including the workload placed on platform elements and the service-level performance/reliability metrics that the platform must meet. Sections 16.6 and 16.7 develop analytic models to predict how a proposed platform will handle the workload while meeting the requirements (reliability/availability assessment and capacity/performance assessment). Sections 16.8 and 16.9 develop engineering guidelines to size the platform initially (scalability assessment) and to maintain service capacity, performance, and reliability post deployment (capacity/performance management). Best practices of C/PE are given at the end of the chapter.

1.5 Conclusion

With our society’s increasing dependence on networks and networked application services, the importance of reliability and performance engineering has never been greater. Unfortunately, large-scale networks and services present significant challenges: scale and complexity, the need for correct operation in the presence of constant change, as well as measurement and data challenges.
Addressing these challenges requires good design and sound operational practices. Network and service engineers start with a firm understanding of the design objectives, the technology, and the operational environment for the service; follow a comprehensive service design process; and develop capacity and performance engineering models. Network and service management rely on a well-thought-out measurement design, a data collection and storage infrastructure, and a suite of management tools and applications. When done right, the end result is a network or service that works well. As customers and applications become more demanding, this “raises the bar” for reliability and performance, ensuring that this field will continue to provide opportunities for research and improvements in practice.

References

1. A History of the ARPANET. Bolt, Beranek, and Newman, 1981.
2. Deming, W. E. (2000). The new economics for government, industry and education (2nd ed.). Cambridge, MA: MIT Press. ISBN 0-262-54116-5.
3. AT&T statement (1990). The Risks Digest, 9(63).
4. Wilson, A. M. (1998). Alarm management and its importance in ensuring safety, Best practices in alarm management, Digest 1998/279.
5. Stallings, W. (1999). SNMP, SNMPv2, SNMPv3, and RMON 1 and 2 (3rd ed.). Reading, MA: Addison-Wesley.
6. Mahimkar, A., Yates, J., Zhang, Y., Shaikh, A., Wang, J., Ge, Z., et al. (December 2008). Troubleshooting chronic conditions in large IP networks. Proceedings of the 4th ACM international conference on emerging Networking Experiments and Technologies (CoNEXT).
7. Telemark Survey. http://www.telemarkservices.com/
8. Schwartz, J. (2007). Who needs hackers? New York Times, September 12, 2007.

Chapter 2
Structural Overview of ISP Networks

Robert D. Doverspike, K.K.
Ramakrishnan, and Chris Chase

R.D. Doverspike: Executive Director, Network Evolution Research, AT&T Labs Research, 200 S. Laurel Ave, Middletown, NJ 07748, USA, e-mail: rdd@research.att.com
K.K. Ramakrishnan: Distinguished Member of Technical Staff, Networking Research, AT&T Labs Research, Shannon Labs, 180 Park Avenue, Florham Park, NJ 07932, USA
C. Chase: AT&T Labs, 9505 Arboretum Blvd, Austin, TX 78759, USA, e-mail: chase@labs.att.com

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_2, © Springer-Verlag London Limited 2010

2.1 Introduction

An Internet Service Provider (ISP) is a telecommunications company that offers its customers access to the Internet. This chapter specifically covers the design of a large Tier 1 ISP that provides services to both residential and enterprise customers. Our primary focus is on a large IP backbone network in the continental USA, though similarities arise in smaller networks operated by telecommunication providers in other parts of the world. This chapter is principally motivated by the observation that in large carrier networks, the IP backbone is not a self-contained entity; it co-exists with numerous access and transport networks operated by the same or other service providers. In fact, how the IP backbone interacts with its neighboring networks and the transport layers is fundamental to understanding its structure, operation, and planning.

This chapter is a hands-on description of the practical structure and implementation of IP backbone networks. Our goal is complicated by the complexity of the different network layers, each of which has its own nomenclature and concepts. Therefore, one of our first tasks is to define the nomenclature we will use, classifying the network into layers and segments. Once this partitioning is accomplished, we identify where the IP backbone fits and describe its key surrounding layers and networks.

This chapter is motivated by three aspects of the design of large IP networks. The first aspect is that the design of an IP backbone is strongly influenced by the details of the underlying network layers. We will illustrate how the evolution of customer access through the metro network has influenced the design of the backbone. We also show how the evolution of the Dense Wavelength-Division Multiplexing (DWDM) layer has influenced core backbone design. The second aspect presents the use of Multiprotocol Label Switching (MPLS) in large ISP networks. The separation of routing and forwarding provided by MPLS allows carriers to support Virtual Private Networks (VPNs) and Traffic Engineering (TE) on their backbones much more simply than with traditional IP forwarding. The third aspect is how network outages manifest in multiple network layers and how the network layers are designed to respond to such disruptions, usually through a set of processes called network restoration. This is of prime importance because a major objective of large ISPs is to provide a known level of quality of service to their customers through Service Level Agreements (SLAs). Network disruptions occur from two major sources: failure of network components and maintenance activity. Network restoration is accomplished through preplanned network design processes and real-time network control processes, as provided by an Interior Gateway Protocol (IGP) such as Open Shortest Path First (OSPF). We present an overview of OSPF reconvergence and the factors that affect its performance. As customers and applications place more stringent requirements on restoration performance in large ISPs, the assessment of OSPF reconvergence motivates the use of MPLS Fast Reroute (FRR).

Beyond the motivations described above, the concepts defined in this chapter lay useful groundwork for the succeeding chapters.
Section 2.2 provides a structural basis by providing a high-level picture of the network layers and segments of a typical, large nationwide terrestrial carrier. It also provides nomenclature and technical background about the equipment and network structure of some of the layers that have the largest impact on the IP backbone. Section 2.3 provides more details about the architecture, network topology, and operation of the IP backbone (the IP layer) and how it interacts with the key network layers identified in Section 2.2. Section 2.4 discusses routing and control protocols and their application in the IP backbone, such as MPLS. The background and concepts introduced in Sections 2.2–2.4 are utilized in Section 2.5, where we describe network restoration and planning. Finally, Section 2.6 describes a “case study” of an IPTV backbone. This section unifies many of the concepts presented in the earlier sections and how they come together to allow network operators to meet their network performance objectives. Section 2.7 provides a summary, followed by a reference list, and a glossary of acronyms and key terms.

2.2 The IP Backbone Network in Its Broader Network Context

2.2.1 Background and Nomenclature

From the standpoint of large telecommunication carriers, the USA and most large countries are organized into metropolitan areas, which are colloquially referred to as metros. Large intrametro carriers place their transmission and switching equipment in buildings called Central Offices (COs). Business and residential customers typically obtain telecommunication services by connecting to a designated first CO called a serving central office. This connection occurs over a feeder network that extends from the CO toward the customer plus a local loop (or last mile) segment that connects from the last equipment node of the feeder network to the customer premise.
Equipment in the feeder network is usually housed in above-ground huts, on poles, or in vaults. The feeder and last-mile segments usually consist of copper, optical fiber, coaxial cable, or some combination thereof. Coaxial cable is typical of a cable company, also called a Multiple System Operator (MSO). While we will not discuss metro networks in detail in this chapter, it is important to discuss the aspects of them that affect the IP backbone. However, the metro networks we describe coincide mostly with those of carriers whose origins are in large telephone companies (sometimes called “Telcos”).

Almost all central offices today are interconnected by optical fiber. Once a customer’s data or voice enters the serving central office, if it is destined outside that serving central office, it is routed to other central offices in the same metro area. If the service is bound for another metro, it is routed to one or more gateway COs. If it is bound for another country, it eventually routes to an international gateway. A metro gateway CO is often called a Point of Presence (POP). While POPs were originally defined for telephone service, they have evolved to serve as intermetro gateways for almost all telecommunication services. Large intermetro carriers have one or more POPs in every large city.

Given this background, we now employ some visualization aids. Networks are organized into network layers, which we depict in Fig. 2.1 as two network graphs stacked vertically on top of one another. Each of the network layers can be considered to be an overlay network with respect to the network below.

Fig. 2.1 Conceptual network layers and segmentation (an intermetro network connecting Metro 1 through Metro 5)

We can further organize these layers into access, metro, and core network segments. Figure 2.1 shows the core segment connected to multiple metro segments.
Each metro segment represents the network layers of the equipment located in the central offices of a given metropolitan area. The access segment represents the feeder network and loop network associated with a given metro segment. The core segment represents the equipment in the POPs and network structures that connect them for intermetro transport and switching. In this chapter, we focus on the ISP backbone network, which is primarily associated with the core segment. We refer only briefly to access architectures and will discuss portions of the metro segment to the extent to which they interact and connect to the core segment. Also, in this chapter we will not discuss broader telecommunication contexts, such as international networks (including undersea links), satellite, and wireless networks. More detail on the various network segments and their network layers and a historical description of how they arose can be found in [11]. Unfortunately, there is a wide variety of terminology used in the industry, which presents a challenge for this chapter because of our broad scope. Some of the terminology is local to an organization, application, or network layer and, thus, when used in a broader context can be confused with other applications or layers. Within the context of network-layering descriptions, we will use the term IP layer. However, we use the term “IP backbone” interchangeably with “IP layer” in the context of the core network segment. The terms Local Area Network (LAN), Metropolitan Area Network (MAN), and Wide Area Network (WAN) are also sometimes used and correlate roughly with the access, metro, and core segments defined earlier; however, LAN, MAN, and WAN are usually applied only in the context of packet-based networks. Therefore, in this chapter, we will use the terms access, metro, and core, since they apply to a broader context of different network technologies and layers. 
Other common terms for the various layers within the core segment are long-distance and long-haul networks.

2.2.2 Simple Graphical Model of Network Layers

The following simple graph-oriented model is helpful when modeling routing and network design algorithms, to understand how network layers interact and, in particular, how to classify and analyze the impact of potential network disruptions. This model applies to most connection-oriented networks and, thus, will apply to some higher-layer protocols that sit on top of the IP layer. The IP layer itself is connectionless and does not fit exactly in this model. However, this model is particularly helpful to understand how lower network layers and neighboring network layers interact.

In the layered model, a network layer consists of nodes, links (also called edges), and connections. The nodes represent types of switches or cross-connect equipment that exchange data in either digital or analog form via the links that connect them. Note that at the lowest layer (such as fiber) nodes represent equipment, such as fiber-optic patch panels, in which connections are switched manually by cross-connecting fiber patch cords from one interface to another. Links can be modeled as directed (unidirectional) or undirected (bidirectional). Connections are cross-connected (or switched) by the nodes onto the links, and thus form paths over the nodes and links of the graph. Note that the term connection often has different names at different layers and segments. For example, in most telecommunication carriers, a connection (or portions thereof) is called a circuit in many of the lower network layers, often referred to as transport layers. Connections can be point-to-point (unidirectional or bidirectional), point-to-multipoint or, more rarely, multipoint-to-multipoint. Generally, connections arise from two sources.
First, telecommunication services can arise “horizontally” (relative to our conceptual picture of Fig. 2.1) from a neighboring network segment. Second, connections in a given layer can originate from edges of a higher network layer. In this way, each layer provides a connection “service” that gives connectivity to the layer immediately above it. Sometimes, a “client/server” model is referenced, such as the User-Network Interface (UNI) model [29] of the Optical Internetworking Forum (OIF), wherein the links of higher-layer networks are “clients” and the connections of lower-layer networks are “servers”. For example, see G.7713.2 [19] for more discussion of connection management in lower-layer transport networks. Recall that the technology layers we define are differentiated by the nodes, which represent actual switching or cross-connect equipment, rather than more abstract entities, such as protocols within each of these technology layers that can create multiple protocol sublayers. An early manifestation of protocol layering is the OSI model developed by the ISO standards organization [37] and the resulting classification of packet layering, such as Layer 1, Layer 2, Layer 3, which subsequently emerged in the industry. Although these layering definitions can be somewhat strained in usage, the industry generally associates IP with Layer 3 and MPLS or Ethernet VLANs with Layer 2 (which will be described later in the chapter). Layer 1, or the Physical Layer (PHY layer) of the OSI stack, covers multiple technology layers that we will cover in the next section. We illustrate this graphical network-layering model in Fig. 2.2, which depicts two layers. Note that for simplicity, we depict the edges in Fig. 2.2 as undirected. The cross-connect equipment represented by the nodes of Layer U (“upper layer”) connect to their counterpart nodes in Layer L (“lower layer”) by interlayer links, depicted as lightly dashed vertical lines.
While this model has no specific geographical correlation, we note that the switching or cross-connect equipment represented in Layer U usually are colocated in the same buildings/locations (central offices in carrier networks) as their lower-layer counterparts in Layer L. In such representations, the interlayer links are called intra-office links. The links of Layer U are transported as connections in lower Layer L. For example, Fig. 2.2 highlights a link between nodes 1 and 6 of Layer U. This link is transported via a connection between nodes 1 and 6 of Layer L. The path of this connection is shown through nodes (1, 2, 3, 4, 5, 6) at Layer L.

[Fig. 2.2 Example of network layering: nodes of Layer U and Layer L are co-located (same central office); each highlighted Layer-U link is transported as a connection in Layer L]

Another example is given by the link between nodes 3 and 5 of Layer U. This routes over nodes (3, 4, 5) in Layer L. As this layered model illustrates, the concept of a “link” is a logical construct, even in lower “physical layer(s)”. Along these lines, we identify some interesting observations in Fig. 2.2:

1. There are more nodes in Layer L than in Layer U.
2. When viewed as separate abstract graphs, the degree of logical connectivity in Layer L is less than that for Layer U. For example, there are at most three edge-diverse paths between nodes 1 and 6 in Layer U. However, there are at most two edge-diverse paths between the corresponding pair of nodes in Layer L.
3. When we project the links of Layer U onto their connection paths in Layer L, we see some overlap. For example, the two logical links highlighted in Layer U overlap on links (3, 4) and (4, 5) of Layer L.

These observations generalize to the network layers associated with the IP backbone and affect how network layers are designed and how network failures at various layers affect higher-layer networks.
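The projection described in observation 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary holds just the two Layer-U links highlighted in Fig. 2.2 with the Layer-L node paths quoted in the text, and the function names are hypothetical rather than part of any tool discussed in this chapter.

```python
from collections import defaultdict

# Layer-U links highlighted in Fig. 2.2, mapped to the node paths
# they follow in Layer L (paths as quoted in the text).
U_LINK_PATHS = {
    (1, 6): [1, 2, 3, 4, 5, 6],
    (3, 5): [3, 4, 5],
}

def l_links_on_path(path):
    """Return the undirected Layer-L links traversed by a node path."""
    return [tuple(sorted(pair)) for pair in zip(path, path[1:])]

def shared_risk_groups(u_link_paths):
    """Map each Layer-L link to the set of Layer-U links routed over it."""
    groups = defaultdict(set)
    for u_link, path in u_link_paths.items():
        for l_link in l_links_on_path(path):
            groups[l_link].add(u_link)
    return dict(groups)

def failed_u_links(u_link_paths, failed_l_link):
    """Return the Layer-U links that go down when one Layer-L link fails."""
    key = tuple(sorted(failed_l_link))
    return shared_risk_groups(u_link_paths).get(key, set())

if __name__ == "__main__":
    groups = shared_risk_groups(U_LINK_PATHS)
    # Layer-L links carrying more than one Layer-U link
    print(sorted(l for l, us in groups.items() if len(us) > 1))
    # Layer-U links lost if Layer-L link (3, 4) fails
    print(sorted(failed_u_links(U_LINK_PATHS, (3, 4))))
```

With the two paths from Fig. 2.2, the shared Layer-L links are (3, 4) and (4, 5), so a single failure on either one removes both highlighted Layer-U links at once.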
The second observation says that while the logical topology of an upper-layer network, such as the IP layer, looks like it has many alternate paths to accommodate network disruptions, this can be deceiving unless one incorporates the lower-layer dependencies. For example, if link 3–4 of Layer L fails, then both links 1–6 and 3–5 of Layer U fail. Put more generally, failures of links of lower-layer networks usually cause multiple link failures in higher-layer networks. Specific examples will be described in Section 2.3.2.

2.2.3 Snapshot of Today’s Core Network Layers

Figure 2.3 provides a representation of the set of services that might be provided by a large US-based carrier, and how these services map onto different network layers in the core segment. This figure is borrowed from [11] and depicts a mixture of legacy network layers (i.e., older technologies slowly being phased out) and current or emerging network layers. For a connection-oriented network layer (call it layer L), demand for connections comes from two sources: (1) links of higher network layers that route over layer L and (2) demand for telecommunications services provided by layer L but which originate outside layer L’s network segment. The second source of demand is depicted by rounded rectangles in Fig. 2.3. Note that Fig. 2.3 is a significant simplification of reality; however, it does capture most predominant layers and principal interlayer relationships relevant to our objectives. An important observation in Fig. 2.3 is that links of a given layer can be spread over multiple lower layers, including “skipping” over intermediate lower layers.

Before we describe these layers, we provide some preliminary background on Time Division Multiplexing (TDM), whose signals are often used to transport links of the IP layer. Table 2.1 summarizes the most common TDM transmission rates.
[Fig. 2.3 Example of core-segment network layers. Services (rounded rectangles): Frame Relay & ATM, Private Line (DS3 to OC-12), Residential IPTV, Voice over IP, ISP & Business VPN, Ethernet Services, Circuit-switched Voice, DS1 Private Line, Ethernet Private Line, Gigabit Ethernet Private Line, Wavelength Services. Network layers: Ethernet, IP, ATM, Circuit-Switched, W-DCS, DCS-3/3, Intelligent Optical Switch (IOS), SONET Ring, Pre-SONET Transmission, ROADM / Pt-to-pt DWDM, Fiber. Key: service, layer-to-layer connections, network layer, legacy layer.]

Table 2.1 Time division multiplexing (TDM) digital hierarchy (partial list)

Approximate rate  DS-n    Plesiochronous   SONET    SDH       OTN wrapper
64 Kb/s           DS-0    E0
1.5 Mb/s          DS-1
2.0 Mb/s                  E-1
34 Mb/s                   E-3
45 Mb/s           DS-3
51.84 Mb/s                                 STS-1    VC-3
155.5 Mb/s                                 OC-3     STM-1
622 Mb/s                                   OC-12    STM-4
2.5 Gb/s                                   OC-48    STM-16    ODU-1
10 Gb/s                                    OC-192   STM-64    ODU-2
40 Gb/s                                    OC-768   STM-256   ODU-3
100 Gb/s                                                      ODU-4

Kb/s = kilobits per second; Mb/s = megabits per second; Gb/s = gigabits per second. OTN line rates are higher than payload. ODU-2 includes 10 GigE and ODU-3 includes 40 GigE (under development). ODU-4 only includes 100 GigE.

The Synchronous Optical Network (SONET) digital-signal standard [35], pioneered by Bellcore (now Telcordia) in the early 1990s, is shown in the fourth column of Table 2.1. SONET is the existing higher-rate digital-signal hierarchy of North America. Synchronous Digital Hierarchy (SDH) is a similar digital-signal standard later pioneered by the International Telecommunication Union (ITU-T) and adopted by most of the rest of the world. The DS-n column represents the North American pre-SONET digital-signal rates, most of which originated in the Bell System. The Plesiochronous column represents the pre-SDH rates used mostly in Europe. However, after nearly 30 years, both DS-n and Plesiochronous signals are still quite abundant and their related private-line services are still sold actively.
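As a small worked example of how the SONET and SDH columns relate, the sketch below derives OC-n line rates as multiples of the 51.84 Mb/s STS-1 rate and names the SDH signal of equal rate (OC-n lines up with STM-m when n = 3m). The function names are hypothetical and chosen only for illustration; the rates are nominal line rates, not payload.

```python
STS1_MBPS = 51.84  # nominal STS-1 line rate in Mb/s (Table 2.1)

def oc_rate_mbps(n: int) -> float:
    """Nominal line rate of SONET OC-n in Mb/s (n times the STS-1 rate)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return round(n * STS1_MBPS, 2)

def sdh_equivalent(n: int) -> str:
    """Name of the SDH STM-m signal with the same rate as OC-n (n = 3m)."""
    if n < 3 or n % 3 != 0:
        raise ValueError("STM-m equivalents exist only when n is a multiple of 3")
    return f"STM-{n // 3}"

if __name__ == "__main__":
    for n in (3, 12, 48, 192, 768):
        print(f"OC-{n}: {oc_rate_mbps(n)} Mb/s, same rate as {sdh_equivalent(n)}")
```

The computed values (for example, 622.08 Mb/s for OC-12 and 9,953.28 Mb/s for OC-192) are what the table rounds to 622 Mb/s and 10 Gb/s.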
Finally, in the last column, we show the more recent Optical Transport Network (OTN) signals, also standardized by the ITU-T [18]. Development of the OTN signal standards was originally motivated by the need for a more robust standard to achieve very high bit rates in DWDM technologies; for example, it was needed to incorporate and standardize various bit-error recovery techniques, such as Forward Error Correction (FEC). As such, the OTN rates were originally termed “digital wrappers” to contain high-rate SONET, SDH, or Ethernet signals, plus provide the extra fault notification information needed to reliably transport the high rates. Although there are many protocol layers in OTN, we just show the Optical channel Data Unit (ODU) rates in Table 2.1. To minimize confusion, in the rest of this chapter, we will mostly give examples in terms of DS-n and SONET rates.

Referring back to the layered network model of the previous section, Table 2.2 gives some examples of the nodes, links, and connections in Fig. 2.3. We only list those layers that have relevance to the IP layer. We will briefly describe these layers in the following sections.

Table 2.2 Examples of nodes, links, and connections for network layers of Fig. 2.3

Core layer      Typical node                        Typical link                    Typical connection
IP              Router                              SONET OC-n, 1/10 gigabit        IP is connection-less
                                                    Ethernet, ODU-n
Ethernet        Ethernet switch or router with      1/10 Gigabit Ethernet or        Ethernet can refer to both
                Ethernet functionality              rate-limited Ethernet private   connection-less and
                                                    line                            connection-oriented Ethernet services
Asynchronous    ATM switch                          SONET OC-12/48                  Permanent virtual circuit (PVC),
transfer mode                                                                       switched virtual circuit (SVC)
(ATM)
W-DCS           Wideband digital cross-connect      SONET STS-1 (channelized)       DS1
                system (DCS)
SONET Ring      SONET add-drop multiplexer (ADM)    SONET OC-48/192                 SONET STS-n, DS-3
IOS             Intelligent optical switch (IOS)    SONET OC-48/192                 SONET STS-n
                or broadband digital cross-connect
                system (DCS)
DWDM            Point-to-point DWDM terminal or     DWDM signal                     SONET, SDH, or 1/10/100
                reconfigurable optical add-drop                                     gigabit Ethernet
                multiplexer (ROADM)
Fiber           Fiber patch panel or cross-connect  Fiber optic strand              DWDM signal or SONET, SDH,
                                                                                    or Ethernet signal

2.2.4 Fiber Layer

The commercial intercity fiber layer of the USA is privately owned by multiple carriers. In addition to owning fiber, carriers lease bundles of fiber from one another using various long-term Indefeasible Right of Use (IROU) contracts to cover needed connectivity in their networks. Fiber networks differ significantly between metro and rural areas. In particular, in carrier metro networks, optical fiber cables are usually placed inside PVC pipes, which are in turn placed inside concrete conduits. Additionally, fiber for core networks is often corouted in conduit or along rights-of-way with metro fiber. Generally, in metro areas, optical cables are routed and spliced between central offices. In the central office, most carriers prefer to connect the fibers to a fiber patch panel. Equipment that uses (or will eventually use) the interoffice fibers is also cross-connected into the patch panels.
This gives the carrier flexibility to connect equipment by simply connecting fiber patch cords on the patch panels. Rural areas differ in that there are often long distances between central offices and, as such, intermediate huts are used to splice fibers and place equipment, such as optical amplifiers.

2.2.5 DWDM Layer

Although many varieties of DWDM systems exist, we show a simplified view of a (one-way) point-to-point DWDM system in Fig. 2.4. Here, Optical Transponders (OTs) are Optical-to-Electrical-to-Optical (O-E-O) converters that input optical digital signals from routers, switches, or other transmission equipment using a receive device, such as a photodiode, on the add/drop side of the OT. The input signal has a standard intra-office wavelength, denoted by λ0. The OT converts the signal to electrical form. Various other physical layer protocols may be applied at this point, such as incorporating various handshaking called Link Management Protocols (LMPs) between the transmitting equipment and the receiving OT. A transponder is in clear channel mode if it does not change the transport protocols of the signal that it receives and essentially remains invisible to the equipment connecting to it. For example, Gigabit Ethernet (GigE) protocols from some routers or switches sometimes incorporate signaling messages to the far-end switch in the interframe gaps. If clear channel transmission is employed by the OT, such messages will be preserved as they are routed over the DWDM layer. After conversion to electrical form, the signal is retransmitted using a laser on the network or line side of the OT. However, as is typical of traditional point-to-point systems, the wavelength of the laser is fixed to correspond to the wavelength assigned to a specific channel of the DWDM system, λk.
The output light pulses from multiple OTs at different wavelengths are then multiplexed into a single fiber by sending them through an optical multiplexer, such as an Arrayed Waveguide Grating (AWG) or similar device. If the distance between the DWDM terminals is sufficiently long, optical amplifiers are used to boost the power of the signal. However, power balancing among the DWDM channels is a major concern of the design of the DWDM system, as are other potential optical impairments. These topics are beyond the scope of this chapter.

[Fig. 2.4 Simplified view of point-to-point DWDM system. Client signals (SONET, Ethernet) enter at λ0. Optical Transponder (OT): inputs standard intra-office wavelength (λ0), electrically regenerates signal, and outputs specific wavelength for long-distance transport (λk over channel k). Optical multiplexer: combines input optical signals with different wavelengths (from one optical fiber each) to output on a single optical fiber; can be implemented with an optical grating. An optical amplifier and optical demultiplexer follow. Receive-side OT: inputs λk, electrically regenerates signal, and outputs λ0.]

On the right side of Fig. 2.4, typically, the same (or similar) optical multiplexer is used in reverse, in which case, it becomes an optical demultiplexer. The OTs on the right side (the receive direction of the DWDM system) basically work in reverse to the transmit direction described above, by receiving the specific interoffice wavelength, λk, converting to electrical, and then using a laser to generate the intra-office wavelength, λ0. Carrier-based DWDM systems are usually deployed in bidirectional configurations. To see this, the reader can visually reproduce the entire system in Fig. 2.4 and then flip it (mirror it) right to left. The multiplexed DWDM signal in the opposite direction is transmitted over a separate fiber.
Therefore, even though the electronics and lasers of the one-way DWDM system in the reverse direction operate separately from the shown direction, they are coupled operationally. For example, the two fiber ports (receive and transmit) of the OT are usually deployed on the same line card and arranged next to one another. Optical amplification is used to extend the distance between terminals of a DWDM system. However, multiple systems are required to traverse the continental USA. Connections can be established between different point-to-point DWDM systems in an intermediate CO via an intermediate-regenerator OT (not pictured in Fig. 2.4). An intermediate-regenerator OT has the same effect on a signal as back-to-back OTs. Since the signal does not have to be cross-connected elsewhere in the intermediate central office, cost savings can be achieved by omitting the intermediate lasers and receivers of back-to-back OTs. However, we note that most core DWDM networks have many vintages of point-to-point systems from different equipment suppliers. Typically, an intermediate-regenerator OT can only be used to connect between DWDM systems of the same equipment supplier.

A difficulty with deploying point-to-point DWDM systems is that in central offices that interface multiple fiber spans (i.e., the node in the fiber layer has degree > 2), all connections demultiplex in that office and pass through OTs. OTs are typically expensive and it is advantageous to avoid their deployment where possible. A better solution is the Reconfigurable Optical Add-Drop Multiplexer (ROADM). We show a simplified diagram of a ROADM in Fig. 2.5. The ROADM allows for multiple interoffice fibers to connect to the DWDM system. Appropriately, it is often called a multidegree ROADM or n-degree ROADM. As Fig.
2.5 illustrates, the ROADM is able to optically (i.e., without use of OTs) cross-connect channel k (transmitting at wavelength λk) arriving on one fiber to channel k (wavelength λk) outgoing on another fiber. Note that the same wavelength must be used on the two fibers. This is called the wavelength continuity constraint. The ROADM can also be configured to terminate (or “drop”) a connection at that location, in which case it is cross-connected to an OT to connect to routers, switches, or transmission equipment. A “dropped” connection is illustrated by λ2 on the second fiber from the top on the left in Fig. 2.5 and an “added” connection is illustrated by λn on the bottom fiber on the left.

[Fig. 2.5 Simplified view of Reconfigurable Optical Add-Drop Multiplexer (ROADM). Optical Transponders (OTs) are also provided in bidirectional mode for regeneration at intermediate nodes.]

As with the point-to-point DWDM system, optical properties of the system impose distance (also called reach) constraints. Many transmission technologies, including optical amplification, are used to extend the distance between the optical add/drop points of a DWDM system. Today, this separation is designed to be about 1,500 km for a long-distance DWDM system, as a trade-off between cost and the all-optical distance for a US-wide network. Longer connections have to regenerate their signals, usually with an intermediate-regenerator OT. As with point-to-point DWDM systems, connections crossing ROADMs from different equipment suppliers usually must add/drop and connect through OTs. We illustrate a representative ROADM layer for the continental USA in Fig. 2.6. The links represent fiber spans between ROADMs. As described above, routing a connection over the network of Fig. 2.6 may require points of regeneration.
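The two routing constraints just described, wavelength continuity and optical reach, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: span names, channel numbers, and span lengths are invented, the 1,500 km default simply echoes the figure quoted above, and the greedy placement of regeneration points is a simplification that ignores optical impairments and equipment-supplier boundaries.

```python
def common_free_channels(path_spans, free_channels):
    """Channels that are free on every span of the path.

    Under the wavelength continuity constraint, an all-optical
    connection must use one of these channels end to end.
    """
    sets = [free_channels[span] for span in path_spans]
    return set.intersection(*sets) if sets else set()

def regeneration_points(span_lengths_km, reach_km=1500.0):
    """Greedy choice of nodes at which to regenerate the signal.

    Node i sits at the start of span i (node 0 is the source).
    Returns the indices of intermediate nodes where regeneration
    is needed so that no all-optical segment exceeds reach_km.
    """
    regen = []
    travelled = 0.0
    for i, length in enumerate(span_lengths_km):
        if length > reach_km:
            raise ValueError("a single span exceeds the optical reach")
        if travelled + length > reach_km:
            regen.append(i)   # regenerate at node i, before span i
            travelled = 0.0
        travelled += length
    return regen

if __name__ == "__main__":
    free = {"A-B": {1, 2, 5}, "B-C": {2, 3, 5}, "C-D": {5, 7}}
    print(common_free_channels(["A-B", "B-C", "C-D"], free))
    print(regeneration_points([800, 600, 500, 900]))
```

In the sample data only channel 5 is free on all three spans, and the 2,800 km path needs one regeneration point, at the node between the second and third spans.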
We also note, though, that today’s core transport carriers usually have many vintages of DWDM technology and, thus, there may be several ROADM networks from different equipment suppliers, plus several point-to-point DWDM networks. All this complexity must be managed when routing higher-layer links, such as those of the IP backbone, over the DWDM layer.

[Fig. 2.6 Example of ROADM layer topology (labeled nodes include Seattle, Portland, Salt Lake City, and Chicago). Note: This figure is a simplified illustration. It does not represent the specific design of any commercial carrier.]

We finish this introduction of the DWDM layer with a few observations. While most large carriers have DWDM technology covering their core networks, this is not generally true in the metro segment. The metro segment typically consists of a mixture of DWDM spans and fiber spans (i.e., spans with no DWDM). In fact, in metro areas usually only a fraction of central office fiber spans have DWDM technology routed over them. This affects how customers interface to the IP backbone network for higher-rate interfaces. Finally, we note that while most of the connections for the core DWDM layer arise from links of the IP layer, many of the connections come from what many colloquially call “wavelength services” (denoted by the rounded rectangle in Fig. 2.3). These come from high-rate private-line connections emanating from outside the core DWDM layer. Examples are links between switches of large enterprise customers that are connected by leased-line services.

2.2.6 TDM Cross-Connect Layers

In this section, we will briefly describe the TDM cross-connect layers. TDM cross-connect equipment can be basically categorized into two common types: a SONET/SDH Add-Drop Multiplexer (ADM) or a Digital Cross-Connect System (DCS).
Consistent with our earlier remark about the use of terminology, the latter often goes by a variety of colloquial or outmoded model names of equipment suppliers, such as DCS-3/1, DCS-3/3, DACS, and DSX. A TDM cross-connect device interfaces multiple high-rate digital signals, each of which uses time division multiplexing to break the signal into lower-rate channels. These channels carry lower-rate TDM connections and the TDM cross-connect device cross-connects the lower-rate signals among the channels of the different high-rate signals. Typically, an ADM only interfaces two high-rate signals, while a DCS interfaces many. However, over time these distinctions have blurred. Telcordia classified DCSs into three layers: a narrowband DCS (N-DCS) cross-connects at the DS-0 rate, a wideband DCS (W-DCS) cross-connects at the DS-1 rate, and a broadband DCS (B-DCS) cross-connects at the DS-3 rate or higher. ADMs are usually deployed in SONET/SDH self-healing rings. The IOS and SONET Ring layers are shown in Fig. 2.3, encircled by the (broader) ellipse that represents the TDM cross-connect devices. More details on these technologies can be found in [11]. Self-healing rings and DCSs will be relevant when we illustrate how services access the wide-area ISP network layer later in this chapter.

Despite the word “optical” in its name, an Intelligent Optical Switch (IOS) is a type of B-DCS. Examples can be found in [6, 34]. The major differentiator of the IOS over older B-DCS models is its advanced control plane. An IOS network can route connection requests under distributed control, usually instigated by the source node. This requires mechanisms for distributing topology updates and internodal messaging to set up connections. Furthermore, an IOS usually can restore failed connections by automatically rerouting them around failed links. More detail is given when we discuss restoration methods.
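The idea of rerouting a connection around a failed link can be sketched as a path search that skips failed links. The Python below is a deliberately simplified, centralized illustration and is not how an IOS control plane is implemented: a real IOS network uses distributed topology updates and signaling and must also account for available capacity. The topology and node names are hypothetical.

```python
from collections import deque

def shortest_path(adjacency, src, dst, failed_links=()):
    """Fewest-hop path from src to dst that avoids failed links.

    adjacency maps each node to its neighbors; links are undirected.
    Returns the path as a list of nodes, or None if dst is unreachable.
    """
    failed = {frozenset(link) for link in failed_links}
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nbr in adjacency.get(node, ()):
            if nbr in parents or frozenset((node, nbr)) in failed:
                continue
            parents[nbr] = node
            queue.append(nbr)
    return None

if __name__ == "__main__":
    # Hypothetical five-node topology with two routes between A and C.
    topology = {
        "A": ["B", "D"],
        "B": ["A", "C"],
        "C": ["B", "E"],
        "D": ["A", "E"],
        "E": ["D", "C"],
    }
    print(shortest_path(topology, "A", "C"))
    print(shortest_path(topology, "A", "C", failed_links=[("B", "C")]))
```

Before the failure the connection takes the two-hop route through B; once link B-C is marked as failed, the search returns the three-hop route through D and E.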
Many of the connections for the core TDM cross-connect layers (ring layers, DCS layers, IOS layer) come from higher layers of the core network. For example, many connections of the IOS layer are links between W-DCSs, ATM networks, or lower-rate portions of IP layer networks. However, much of their demand for connections comes from subwavelength private-line services, shown by the rounded rectangle in Fig. 2.3. A portion of this private-line demand is in the form of Ethernet Private Line (EPL) services. These services usually represent links between Ethernet switches or routers of large enterprise customers. For example, the Gigabit Ethernet signal from an enterprise customer’s switch is transported over the metro network and then interfaces an Ethernet card either residing on the IOS itself or on an ADM that interfaces directly onto the IOS. The Ethernet card encapsulates the Ethernet frames inside n concatenated STS-1 signals that are transported over the IOS layer. The customer can choose the rate of transport, and hence the value of n, that they wish to purchase. The ADM Ethernet card polices the incoming Ethernet frames to the transport rate of n STS-1.

2.2.7 IP Layer

The nodes of the IP layer shown in Fig. 2.3 represent routers that transport packets among metro area segments. IP generally defines pairwise adjacencies between ports of the routers. In the IP backbone, these adjacencies are typically configured over SONET, SDH, Ethernet, or OTN interfaces on the routers. As described above, these links are then transported as connections over the interoffice lower-layer networks shown in Fig. 2.3. Note that different links can be carried in different lower-layer networks. For example, lower-rate links may be carried over the TDM cross-connect layers (IOS or SONET Ring), while higher-rate links may be carried directly over the DWDM layer, thus “skipping” the TDM cross-connect layers. We will describe the IP layer in more detail in subsequent sections.
2.2.8 Ethernet Layer

The Ethernet layer in Fig. 2.3 refers to several applications of Ethernet technology. For example, Ethernet supports a number of physical layer standards that can be used for Layer 1 transport. Ethernet also refers to connection-oriented Layer 2 pseudowire services [16] and connection-less transparent LAN services. For example, intra-office links between routers often use an Ethernet physical layer riding on optical fiber.

An important application of Ethernet today is providing wide-area Layer 2 Virtual Private Network (VPN) services for enterprise customers. Although many variations exist, these services generally support enterprise customers that have Ethernet LANs at multiple locations and need to interconnect their LANs within a metro area or across the wide area. Most large carriers provide these services as an overlay on their IP layer, which is why we show this layering in Fig. 2.3. Prior to the ability to provide such services over the IP layer, Ethernet private lines were supported by TDM cross-connect layers (i.e., Ethernet frames encapsulated over Layer 1 TDM private lines as described in Section 2.2.6). However, analogous to why wide-area Frame Relay displaced wide-area DS-0 private lines in the 1990s, wide-area packet networks are often more efficient than private lines to connect LANs of enterprise customers.

The principal approach that intermetro carriers use to provide wide-area Ethernet private network services is Virtual Private LAN Service (VPLS) [24, 25]. In this approach, carriers provide such Ethernet services with routers augmented with appropriate Ethernet capabilities. The reason for this approach is to provide the robust carrier-grade network capabilities provided by routers. With wide-area VPLS, the enterprise customer is connected via the metro network to the edge routers on the edge of the core IP layer.
We describe how the metro network connects to the core IP layer network in the next section. The VPLS architecture is described in more detail in Section 2.4.2 when we describe MPLS. We conclude this section with the comment that standards organizations and industry forums (e.g., IEEE, IETF, and Metro Ethernet Forum) have explored the use of Ethernet switches with upgraded carrier-grade network control protocols rather than using routers as nodes in the IP layer. For example, see Provider Backbone Transport (PBT) [27] and Provider Backbone Bridge – Traffic Engineering (PBB-TE) [15]. However, most large ISPs are deploying MPLS-based solutions. Therefore, we concentrate on the layering architecture shown in Fig. 2.3 in the remainder of this chapter.

2.2.9 Miscellaneous/Legacy Layers

For completeness, we depict other “legacy” network layers with dashed ovals in Fig. 2.3. These technologies have been around for decades in most carrier-based core networks. They include network layers whose nodes represent ATM switches, Frame Relay switches, DCS-3/3s (a B-DCS that cross-connects DS3s), voice switches (DS-0 circuit switches), and pre-SONET ADMs. Most of these layers are not material to the spirit of this chapter and we do not discuss them here.

2.3 Structure of Today’s Core IP Layer

2.3.1 Hierarchical Structure and Topology

In this chapter, we further break the IP layer into Access Routers (ARs) and Backbone Routers (BRs). Customer equipment homes to access routers, which in turn home onto backbone routers. An AR is either colocated with its backbone routers or not; the latter is called a Remote Access Router (RAR). Of course, there are alternate terminologies. For example, the IETF defines similar concepts to customer equipment, access routers, and backbone routers with its definitions, respectively, of Customer-Edge (CE) equipment, Provider-Edge (PE) routers, and Provider (P) routers.
A simplified picture of a typical central office containing both ARs and BRs is shown in Fig. 2.7. Access routers are dual-homed to two backbone routers to enable higher levels of service availability. The links between routers in the same office are typically Ethernet links over intra-office fiber. While we show only two ARs in Fig. 2.7, note that typically there are many ARs in large offices. Also, due to scaling and sizing limitations, there may be more than two backbone routers or switches per central office used to further aggregate AR traffic before it enters the BRs. Moreover, we show a remote access router that homes to one of the BRs.

Fig. 2.7 Legacy central office interconnection diagram (Layer 3). Legend: BR = Backbone Router; (R)AR = (Remote) Access Router. The figure shows ARs connected to BRs over intra-office fiber (m-GigE), BRs connected to the core ROADM layer network over SONET OC-n (e.g., n = 768), and an example of DS1 access circuits multiplexed over a channelized OC-12 interface through the intra-office and access/metro TDM layers.

Figure 2.8 illustrates this homing arrangement in a broader network example, where small circles represent ARs, diamonds represent RARs, and large squares represent BRs. Note that remote ARs are homed to BRs in different offices. Homing remote ARs to BRs in different central offices raises network availability.

Fig. 2.8 Example of IP layer switching hierarchy. Legend: Core Router; Intra-building Access / Edge Router; Remote Access / Edge Router. Note: This figure is a simplified illustration. It does not represent the specific design of any commercial carrier.
However, a stronger motivation for doing this is that RAR–BR links are usually routed over the DWDM layer, which generally does not offer automatic restoration, and so the dual-homing serves two purposes: (1) protect against BR failure or maintenance activity and (2) protect against failure or maintenance of a RAR–BR link.

While the homing scheme described here is typical of large ISPs, other variations exist. For example, there are dual-homing architectures where (nonremote) ARs are homed to a BR colocated in the same central office and then a second BR in a different central office. While this latter architecture provides a slightly higher level of network availability against broader central office failure, it can be more costly owing to the need to transport the second AR–BR link. However, the latter architecture allows more load balancing across BRs because of the extra flexibility in homing ARs. Improved load balancing can offer other advantages, including lower BR costs. Also, for ISPs with many scattered locations, but less total traffic, this latter architecture may be more cost-effective than colocating two BRs in each BR-office.

The right side of Fig. 2.7 also shows the metro/access network-layer clouds to connect customer equipment to the ARs. In particular, we illustrate DS1 customer interfaces. The left side of Fig. 2.7 also shows the lower-layer DWDM clouds to connect the interoffice links between BRs. We will expand these clouds in the next sections.

The reasons for segregating the IP topology into access and backbone routers are manifold:

- Access routers aggregate lower-rate interfaces from various customers or other carriers. This function requires significant equipment footprint and processor resources for customer-related protocols. As a result, major central offices consist of many access routers to accommodate the low-rate customer interfaces. Without the aggregation function of the backbone router, each such office would contain a myriad of tie links between access routers and interoffice links.
- Access routers are often segregated by different services or functions. For example, general residential ISP service can be segregated from high-priority enterprise private VPN service. As another example, some access routers are segregated to be peering points with other carriers.
- Backbone routers are primarily designed to be IP-transport switches equipped only with the highest speed interfaces. This segregation allows the backbone routers to be optimally configured for interoffice IP forwarding and transport.

2.3.2 Interoffice Topology

Figure 2.9 expands the core lower ROADM Layer cloud of Fig. 2.7. It shows ports of interoffice links between BRs connecting to ports on ROADMs. These links are transported as connections in the ROADM network. For example, today these links go up to 40 gigabits per second (Gb/s) or SONET OC-768. These connections are routed optically through intermediate ROADMs and regenerated where needed, as described in Section 2.2.5. Also, we note that the links between the remote ARs and BRs are routed over the same ROADM network, although a RAR–BR link may run at a lower rate, such as 10 Gb/s.

Figure 2.10 shows a network-wide example of the IP layer interoffice topology. There are some network-layering principles illustrated in Fig. 2.10 that we will describe. First, if we compare the IP layer topology of Fig. 2.8 with that of the DWDM layer (ROADM layer) of Fig. 2.10, we note that there is more connectivity in the IP layer graph than the DWDM layer. The reason for this is the existence of what many IP layer planners call express links. If we examine the link labeled “direct link” between Seattle and Portland, we find that when we route this link over the DWDM layer topology, there are no intermediate ROADMs. In fact, there are two types of direct links.
The first type connects through no intermediate ROADMs, as illustrated by the Seattle–Portland link. The second type connects through intermediate ROADMs, but encounters no BRs in those intermediate central offices, as illustrated by the Seattle–Chicago link. In contrast, if we examine the express link between Portland and Salt Lake City, we find that any path in the DWDM layer connecting the routers in that city pair bypasses routers in at least one of its intermediate central offices.

Express links are primarily placed to minimize network costs. For example, it is more efficient to place express links between well-chosen router pairs with high network traffic (enough to raise the link utilization above a threshold level); otherwise the traffic will traverse multiple routers. Router interfaces can be the most-expensive single component in a multilayered ISP network; therefore, costs can usually be minimized by optimal placement of express links.

Fig. 2.9 Core ROADM Layer diagram. Legend: ROADM = Reconfigurable Optical Add-Drop Multiplexer; (R)AR = (Remote) Access Router; BR = Backbone Router; CO = Central Office; OT = ROADM Optical Transponder, used for transport of links of the IOS layer or of high-rate private line service. The figure shows ROADM layer connections among central offices A to D transporting IP layer links.

Fig. 2.10 Example of IP layer interbackbone topology. Labels: direct link, express link, and aggregate link among core routers in Seattle, Portland, Salt Lake City, and Chicago. Note: This figure is a simplified illustration. It does not represent the specific design of any commercial carrier.

It is also important to consider the impact of network layering on network reliability. Referring to the generic layering example of Fig. 2.2, we note that the placement of express links can cause a single DWDM link to be shared by different IP layer links.
This gives rise to complex network disruption scenarios, which must be modeled using sophisticated network survivability modeling tools. This is covered in more detail in Section 2.5.3.

Returning to Fig. 2.10, we also note the use of aggregate links. Aggregate links also go by other names, such as bundled links and composite links. An aggregate link bundles multiple physical links between a pair of routers into a single virtual link from the point of view of the routers. For example, an aggregate link could be composed of five OC-192 (or 10 GigE) links. Such an aggregate link would appear as one link with 50 Gb/s of capacity between the two routers. Generally, aggregate links are implemented by a load-balancing algorithm that transparently switches packets among the individual links. Usually, to reduce jitter or packet reordering, packets of a given IP flow are routed over the same component link.

The main advantage of aggregate links arises as IP networks grow large and come to contain many lower-speed links between a pair of routers: aggregating all these links into one simplifies routing and topology protocols. If one of the component links of an aggregate link fails, the aggregate link remains up; consequently, the number of topology updates due to failure is reduced and network rerouting (called reconvergence) is less frequent. Network operators seek to achieve network stability, and therefore shy away from frequent network reconvergence events; aggregate links result in fewer of them.

On the downside, if only one link of a (multiple link) aggregate link fails, the aggregate link remains “up”, but with reduced capacity. Since many network routing protocols are capacity-insensitive, packet congestion could occur over the aggregate link. To avoid this situation, router software is designed with capacity thresholds for aggregate links that the network operator can set.
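The aggregate-link behavior discussed in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any router's implementation: the class name `AggregateLink`, the field `min_capacity_gbps`, and the use of a CRC-32 hash over the flow tuple are all hypothetical choices made for the example. It shows two ideas from the text: packets of one flow stay on one component link, and the bundle is treated as out of service once its surviving capacity drops below the operator-set threshold.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class AggregateLink:
    """Hypothetical model of an aggregate (bundled) link between two routers."""
    member_gbps: list            # nominal capacity of each component link
    min_capacity_gbps: float     # operator-set capacity threshold (assumed name)
    up: list = field(default_factory=list)  # per-member up/down state

    def __post_init__(self):
        if not self.up:
            self.up = [True] * len(self.member_gbps)

    def capacity(self) -> float:
        """Capacity of the surviving component links."""
        return sum(c for c, ok in zip(self.member_gbps, self.up) if ok)

    def in_service(self) -> bool:
        """The whole bundle is withdrawn once capacity falls below the threshold."""
        return self.capacity() >= self.min_capacity_gbps

    def pick_member(self, flow: tuple):
        """Map a flow to one live member so its packets are not reordered."""
        live = [i for i, ok in enumerate(self.up) if ok]
        if not live or not self.in_service():
            return None
        # Illustrative hash only; real routers use their own hash functions.
        return live[zlib.crc32(repr(flow).encode()) % len(live)]


# Five 10 Gb/s members appear as one 50 Gb/s link; the threshold is assumed.
bundle = AggregateLink([10.0] * 5, min_capacity_gbps=30.0)
flow = ("192.0.2.1", "198.51.100.7", 6, 40000, 443)
member = bundle.pick_member(flow)  # the same flow always maps to the same member
```

With these assumed numbers, losing one or two members leaves the bundle in service at reduced capacity, while losing three takes the entire bundle out of service, which trades the remaining capacity for protection against congestion.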
If the aggregate capacity falls below the threshold, the entire aggregate link is taken out of service. While the network “loses” the capacity of the surviving links in the bundle when the aggregate link is taken out of service, the alternative is potentially significant packet loss due to congestion on the remaining links.

2.3.3 Interface with Metro Network Segment

Figure 2.11 is a blowup of the clouds on the right side of Fig. 2.7. It provides a simplified example of how three business ISP customers gain access to the IP backbone. These could be enterprise customers with multiple branches who subscribe to a VPN service. Each access method consists of a DS1 link encapsulating IP packets that is transported across the metro segment. In carrier vernacular, using packet/TDM links to access the IP backbone is often called TDM backhaul. We do not show the inner details of the metro network here. Detailed examples can be found in [11].

Even suppressing the details of the complex metro network, the TDM backhaul is clearly a complicated architecture. To aid understanding, we suggest that the reader refer back to the TDM hierarchy shown in Table 2.1. The customer’s DS-1 (which carries encapsulated IP packets) interfaces to a low-speed multiplexer located in the customer building, such as a small SONET ADM. This ADM typically serves as one node of a SONET ring (usually a 2-node ring). Each link of the ring is routed over diverse fiber, usually at OC-3 or OC-12 rate. Eventually, the DS-1 is routed to a SONET OC-48 or OC-192 ring that has one of its ADMs in the POP. The DS-1 is transported inside an STS-1 signal that is divided into 28 time slots called channels (a channelized STS-1), as specified by the SONET standard. The ADM routes all the SONET STS-1s carrying DS-1 traffic bound for the core carrier to a metro W-DCS.
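The channel arithmetic behind a channelized interface can be worked through in a short Python sketch. The figure of 28 DS-1 channels per STS-1 comes from the text above; the DS-1 and STS-1 line rates and the rule that an OC-n carries n STS-1s are the standard T-carrier and SONET values, stated here as assumptions of the example. The function names are hypothetical.

```python
# Standard rates (assumed here; compare with the TDM hierarchy in Table 2.1).
DS1_MBPS = 1.544      # DS-1 line rate
STS1_MBPS = 51.84     # STS-1 line rate
DS1_PER_STS1 = 28     # channels in a channelized STS-1 (from the text)


def ds1_capacity(oc_n: int) -> int:
    """Number of DS-1 channels a fully channelized OC-n interface can carry."""
    return oc_n * DS1_PER_STS1  # an OC-n carries n STS-1s


def sts1_needed(num_ds1: int) -> int:
    """Number of channelized STS-1s needed to carry num_ds1 DS-1 circuits."""
    return -(-num_ds1 // DS1_PER_STS1)  # ceiling division


# Fraction of the STS-1 line rate occupied by 28 DS-1 signals.
ds1_fill = DS1_PER_STS1 * DS1_MBPS / STS1_MBPS

oc12_channels = ds1_capacity(12)   # 12 * 28 = 336 DS-1s
three_customers = sts1_needed(3)   # three DS-1 access circuits fit in one STS-1
```

Under these assumptions a single channelized OC-12 interface can terminate up to 336 DS-1 access circuits, which illustrates why low-rate customer interfaces are groomed into channelized STS-1s before they reach the router.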
Note that there are often multiple core carriers in a POP, and hence, the metro W-DCS cross-connects all the DS-1s destined for a given core carrier into channelized STS-1s and hands them off to the core W-DCS(s) of that core carrier. However, note that this handoff does not occur directly between the two W-DCSs, but rather passes through a higher-rate B-DCS, in this case the Intelligent Optical Switch (IOS) introduced in Section 2.2.6. The IOS cross-connects most of the STS-1s (multiplexed into OC-n interfaces) in a central office. Also, notice that the IOS is fronted with Multi-Service Platforms (MSPs). An MSP is basically an advanced form of SONET ADM that gathers many types of lower-speed TDM interfaces and multiplexes them up to OC-48 or OC-192 for the IOS. It usually also has Ethernet interfaces that encapsulate IP packets into TDM signals (e.g., for Ethernet private line discussed earlier). The purpose of such a configuration is to minimize the cost and scale of the IOS by avoiding using its interface bay capacity for low-speed interfaces.

Fig. 2.11 Legacy central office interconnection diagram (intra-office TDM layers). Legend: (R)AR = (Remote) Access Router; MSP = Multi-service Platform (multiplexes low-rate TDM circuits); W-DCS = Wideband Digital Cross-Connect System; ADM = Add-Drop Multiplexer. The figure shows an example of three DS1 access circuits carried from customer locations (DS1/DS3) through metro SONET ADMs (OC-12), the metro W-DCS, the core IOS with its MSPs, and the core W-DCS to a channelized OC-12 interface on the AR.

Finally, the core W-DCS cross-connects the DS1s destined for the access routers in the central office onto channelized STS-1s. Again, these STS-1s are routed to the AR via the IOS and its MSPs. The DS-1s finally reach a channelized SONET card on the AR (typically OC-12).
This card on the AR de-multiplexes the DS-1s from the STS-1, de-encapsulates the packets, and creates a virtual interface for each of our three example customer access links in Fig. 2.11. The channelized SONET card is colloquially called a CHOC card (CHannelized OC-n).

Note that the core and metro carriers depicted in Fig. 2.11 may be parts of the same corporation. However, this complex architecture arose from the decomposition of long-distance and local carriers that was dictated by US courts and the Federal Communications Commission (FCC) at the breakup of the Bell System in 1984. It persists to this day.

If we reexamine the above TDM metro access descriptions, we find that there are many restoration mechanisms, such as dual homing of the ARs to the BRs and SONET rings in the metro network. However, there is one salient point of potential failure. If an AR customer-facing line card or entire AR fails or is taken out of service for maintenance in Fig. 2.11, then the customer’s service is also down. Carriers offer service options to protect against this. The most common provide two TDM backhaul connections to the customer’s equipment, often called Customer Premise Equipment (CPE), each of which terminates on a different access router. This architecture significantly raises the availability of the service, but does incur additional cost. An example of such a service is given in [1].

For accuracy, we make a final technical comment on the example of Fig. 2.11. Although we show direct fiber connections between the various TDM and packet equipment, in fact, most of these connections usually occur via a fiber patch panel. This enables a craftsperson to connect the equipment via a simple (and well-organized) patch cord or cross-connect. This minimizes expense, simplifies complex wiring, and expedites provisioning work orders in the CO.

Figure 2.12 depicts how customers access the AR via emerging metro packet network layers instead of TDM.
Here, instead of the traditional TDM network, the customer accesses the packet core via Ethernet. The most salient difference is the substantially simplified architecture. Although many different types of services are possible, we describe two fundamental types of Ethernet service: Ethernet virtual circuits and Ethernet VPLS. Most enterprise customers will use both types of services.

Fig. 2.12 Central office interconnection diagram (metro Ethernet interface). Legend: AR = Access Router; RE = Router | Ethernet Switch; NTE = Network Terminating Equipment. The figure shows an Ethernet Virtual Private Line (a combination of VLAN and pseudowire) from the NTE at each customer location (FE | GigE), through the metro RE, over n-GigE links to the core RE, which plays a dual role as access router and Ethernet switch and homes to the BRs.

There are three basic types of connectivity for Ethernet virtual circuits: (1) intrametro, (2) ISP access via establishment of Ethernet virtual circuits between the customer location and IP backbone, and (3) intermetro. Since our main focus is the core IP backbone, we discuss the latter two varieties.

For ISP access, in the example of Fig. 2.12, the customer’s CPE interfaces to the metro network via Fast Ethernet (FE) or GigE into a small Ethernet switch placed by the metro carrier called Network Terminating Equipment (NTE). The NTE is the packet analog of the small ADM in the TDM access model in Fig. 2.11. For most metro Ethernet services, customers can usually choose the policed access rate they wish to purchase, in increments of 1 Mb/s or similar. For example, a customer may choose 100 Mb/s for the Committed Information Rate (CIR) and various options for the Excess Information Rate (EIR). The EIR options control how the customer’s bandwidth bursts are handled or shared when they exceed the CIR.
The metro packet network uses Virtual Local Area Network (VLAN) identifiers [14] and pseudowires or MPLS LSPs to route the customer’s Ethernet virtual circuit to the metro Ethernet switch/router in the POP, as shown in Fig. 2.12. VLANs can also be used to segregate a particular customer’s services, such as the two fundamental services (VPLS vs Internet access) described here. The metro Ethernet switch/router has high-speed links (such as 10 Gb/s) to the core Ethernet switch/router. However, the core Ethernet switch/router is fundamentally an access router with the features and configurations needed to provide Ethernet and VPLS, and thus homes to backbone routers as any other access router does. Thus, the customer’s virtual circuit is mapped to a virtual port on the core AR/Ethernet switch and from that point onward is treated similarly to the TDM DS-1 virtual port in Fig. 2.11. If an intermetro Ethernet virtual circuit is needed, then an appropriate pseudowire or tunnel can be created between the ARs in different metros. Such a service can eventually substitute for traditional private-line service as metro packet networks are deployed.

The second basic type of Ethernet service is generally provided through the VPLS model described in Section 2.2.8. For example, the customer might have two LANs in metro-1, one LAN in metro-2 and another LAN in metro-3. Wide-area VPLS interconnects these LANs into a large transparent LAN. This is achieved using pseudowires (tunnels) between the ARs in metros 1, 2, and 3. Since the core access router has a dual role as access router and Ethernet VPLS switch, it can route customer Ethernet frames among the pseudowires to the remote access routers.

Besides enterprise Ethernet services, connection of cellular base stations to the IP backbone network is another important application of Ethernet metro access.
Until recently, this was achieved by installing DS-1s from cell sites to circuit switches in Mobile Telephone Switching Offices (MTSOs) to provide voice service. However, with the advent and rapid growth of cellular services based on 3G or 4G technology, there is a growing need for high-speed packet-based transport from cell sites to the IP backbone. The metro Ethernet structure for this is similar to that of the enterprise customer access shown in Fig. 2.12. The major differences occur in the equipment at the cell site, the equipment at the MTSO, and then how this equipment connects to the access router/Ethernet switch of the IP backbone.

2.4 Routing and Control in ISP Networks

2.4.1 IP Network Routing

The IP/MPLS routing protocols are an essential part of the architecture of the IP backbone, and are key to achieving network reliability. This section introduces these control protocols. An Interior Gateway Protocol (IGP) disseminates routing and topology information within an Autonomous System (AS). A large ISP will typically segment its IP network into multiple autonomous systems. In addition, an ISP’s network interconnects with its customers and with other ISPs. The Border Gateway Protocol (BGP) is used to exchange global reachability information with ASs operated by the same ISP, by different ISPs, and by customers. In addition, IP multicast is becoming more widely deployed in ISP networks, using one of several variants of the Protocol-Independent Multicast (PIM) routing protocol.

2.4.1.1 Routing with Interior Gateway Protocols

As described earlier, Interior Gateway Protocols are used to disseminate routing and topology information within an AS. Since IGPs disseminate information about topology changes, they play a critical role in network restoration after a link or node failure. Because of the importance of restoration to the theme of this chapter, we discuss this further in Section 2.5.2.
The two types of IGPs are distance vector and link-state protocols. In link-state routing [32], each router in the AS maintains a view of the entire AS topology and computes routes over it using a Shortest Path First (SPF) algorithm. Since link-state routing protocols such as Open Shortest Path First (OSPF) [26] and Intermediate System–Intermediate System (IS–IS) [30] are the most commonly used IGPs among large ISPs, we will not discuss distance vector protocols further. For the purposes of this chapter, which focuses on network restoration, the functionality of OSPF and IS–IS is similar. We will use OSPF to illustrate how IGPs handle failure detection and recovery.

The view of network topology maintained by OSPF is conceptually a directed graph. Each router represents a vertex in the topology graph and each link between neighboring routers represents a unidirectional edge. Each link also has an associated weight (also called cost) that is administratively assigned in the configuration file of the router. Using the weighted topology graph, each router computes a shortest path tree (SPT) with itself as the root, and applies the results to build its forwarding table. This assures that packets are forwarded along the shortest paths in terms of link weights to their destinations [26]. We will refer to the computation of the shortest path tree as an SPF computation, and the resultant tree as an SPF tree.

As illustrated in Fig. 2.13, the OSPF topology may be divided into areas, typically resulting in a two-level hierarchy. Area 0, known as the “backbone area”, resides at the top level of the hierarchy and provides connectivity to the nonbackbone areas (numbered 1, 2, etc.). OSPF typically assigns a link to exactly one area. Links may be in multiple areas, and multi-area links are addressed in more detail in Chapter 11 (Measurements of Control Plane Reliability and Performance by Aman Shaikh and Lee Breslau). Routers that have links to multiple areas are called border routers.
For example, routers E, F and I are border routers in Fig. 2.13. Every router maintains its own copy of the topology graph for each area to which it is connected. The router performs an SPF computation on the topology graph for each area and thereby knows how to reach nodes in all the areas to which it connects. To improve scalability, OSPF was designed so that routers do not need to learn the entire topology of remote areas. Instead, routers only need to learn the total weight of the path from one or more area border routers to each node in the remote area. Thus, after computing the SPF tree for the area it is in, the router knows which border router to use as an intermediate node for reaching each remote node.

Fig. 2.13 OSPF topology: areas and hierarchy. Legend: Internal IGP Router; Border Router (between OSPF Areas); AS Border Router. The figure shows routers with weighted links grouped into Area 0, Area 1, and Area 2.

Every router running OSPF is responsible for describing its local connectivity in a Link-State Advertisement (LSA). These LSAs are flooded reliably to other routers in the network, which allows them to build their local view of the topology. The flooding is made reliable by each router acknowledging the receipt of every LSA it receives from its neighbors. The flooding is hop-by-hop and hence does not depend on routing. The set of LSAs in a router’s memory is called a Link-State Database (LSDB) and conceptually forms the topology graph for the router.

OSPF uses several types of LSAs for describing different parts of topology. Every router describes links to all its neighbor routers in a given area in a Router LSA. Router LSAs are flooded only within an area and thus are said to have an area-level flooding scope. Thus, a border router originates a separate Router LSA for every area to which it is connected.
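The relationship between the link-state database, the topology graph, and the SPF computation can be illustrated with a small Python sketch. It assumes a heavily simplified LSDB in which each Router LSA is just a list of (neighbor, weight) pairs, and it uses Dijkstra's algorithm for the SPF step. The function names, the data layout, and the four-router topology are hypothetical and are not taken from Fig. 2.13 or from the OSPF packet formats.

```python
import heapq


def build_graph(lsdb):
    """Turn a simplified LSDB {router: [(neighbor, weight), ...]} into a
    directed, weighted topology graph {router: {neighbor: weight}}."""
    graph = {}
    for router, links in lsdb.items():
        graph.setdefault(router, {})
        for neighbor, weight in links:
            graph[router][neighbor] = weight   # one unidirectional edge
            graph.setdefault(neighbor, {})
    return graph


def spf(graph, root):
    """Dijkstra's algorithm rooted at `root`.

    Returns (dist, first_hop): the total weight of the shortest path to each
    reachable router, and the neighbor of `root` used to reach it, which is
    what a forwarding table needs."""
    dist = {root: 0}
    first_hop = {}
    done = set()
    heap = [(0, root, None)]
    while heap:
        d, node, hop = heapq.heappop(heap)
        if node in done:
            continue                            # stale heap entry
        done.add(node)
        if hop is not None:
            first_hop[node] = hop
        for neighbor, weight in graph[node].items():
            nd = d + weight
            if neighbor not in done and nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                next_hop = neighbor if node == root else hop
                heapq.heappush(heap, (nd, neighbor, next_hop))
    return dist, first_hop


# Hypothetical single-area topology; weights are administratively assigned.
lsdb = {
    "A": [("B", 1), ("C", 5)],
    "B": [("A", 1), ("C", 1), ("D", 10)],
    "C": [("A", 5), ("B", 1), ("D", 1)],
    "D": [("B", 10), ("C", 1)],
}
dist, first_hop = spf(build_graph(lsdb), "A")
```

In this example router A reaches C and D through B, because the path A-B-C-D has total weight 3, lower than either the direct A-C link or the B-D link would give. A border router would run the same computation once per attached area.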
Border routers summarize information about one area and distribute this information to adjacent areas by originating Summary LSAs. It is through Summary LSAs that other routers learn about nodes in the remote areas. Summary LSAs have an area-level flooding scope like Router LSAs.

OSPF also allows routing information to be imported from other routing protocols, such as BGP. The router that imports routing information from other protocols into OSPF is called an AS Border Router (ASBR). Routers A and B are ASBRs in Fig. 2.13. An ASBR originates External LSAs to describe the external routing information. The External LSAs are flooded in the entire AS irrespective of area boundaries, and hence have an AS-level flooding scope. While the capability exists to import external routing information from protocols such as BGP, the number of such routes can be very large. As a result, this can lead to overheads in both communication (flooding the External LSAs) and computation (SPF computation scales with the number of routes). As a consequence of the scalability problems they pose, the importing of external routes is rarely utilized.

Two routers that are neighbor routers have link-level connectivity between each other. Neighbor routers form an adjacency so that they can exchange routing information with each other. OSPF allows a link between the neighbor routers to be used for forwarding only if these routers have the same view of the topology, i.e., the same link-state database. This ensures that forwarding data packets over the link does not create loops. Thus, two neighbors have to make sure that their link-state databases are synchronized, and they do so by exchanging parts of their link-state databases when they establish an adjacency. The adjacency between a pair of routers is said to be “full” once they have synchronized their link-state databases.
While sending LSAs to a neighbor, a router bundles them together into a Link-State Update packet. We will re-examine the OSPF reconvergence process in more detail when we discuss network disruptions in Section 2.5.2.1.

Although elegant and simple, basic OSPF is insensitive to network capacity and routes packets hop-by-hop along the SPF tree. As mentioned in Section 2.3.2, this has some potential shortcomings when applied to aggregate links. While aggregate-link capacity thresholds can be tuned to minimize this potentially negative effect, a better approach may be to use capacity-sensitive routing protocols, often called Traffic Engineering (TE) protocols, such as OSPF-TE [21]. Alternatively, one may use routing protocols with a greater degree of routing control, such as MPLS-based protocols. Traffic Engineering and MPLS are discussed later in this chapter.

2.4.1.2 Border Gateway Protocol

The Border Gateway Protocol is used to exchange routing information between autonomous systems, for example, between ISPs or between an ISP and its large enterprise customers. When BGP is used between ASs, it is referred to as Exterior BGP (eBGP). When BGP is used within an AS to distribute external reachability information, it is referred to as Interior BGP (iBGP). This section provides a brief summary of BGP. It is covered in much greater detail in Chapters 6 and 11.

BGP is a connection-oriented protocol that uses TCP for reliable delivery. A router advertises Network Layer Reachability Information (NLRI) consisting of an IP address prefix, a prefix length, a BGP next hop, along with path attributes, to its BGP peer. Packets matching the route will be forwarded toward the BGP next hop. Each route announcement can also have various attributes that can affect how the peer will prioritize its selection of the best route to use in its routing table. One example is the AS_PATH attribute, which is a list of ASs through which the route has been relayed.
Withdrawal messages are sent to remove NLRI that are no longer valid. For example, in Fig. 2.14, A|z denotes an advertisement of NLRI for IP prefix z, and W|s,r denotes that routes s and r are being withdrawn and should be removed from the routing table. If an attribute of the route changes, the originating router announces it again, replacing the previous announcement. Because BGP is connection-oriented, there is no refreshing or reflooding of routes during the lifetime of the BGP connection, which makes BGP simpler than a protocol like OSPF. However, like OSPF, BGP has various timers affecting behavior like hold-offs on route installation and route advertisement.

Fig. 2.14 BGP message exchange. The figure shows routers R1 and R2, each with a BGP process and a RIB, exchanging the messages W|s,r and A|z over a BGP adjacency.

BGP maintains tables referred to as Routing Information Bases (RIBs) containing BGP routes and their attributes. The Loc-RIB table contains the router’s definitive view of external routing information. Besides routes that enter the RIB from BGP itself, routes enter the RIB via distribution from other sources, such as static or directly connected routes or routing protocols such as OSPF. While the notion of a “route” in BGP originally meant an IPv4 prefix, with the standardization of Multiprotocol BGP (MP-BGP) it can represent other kinds of reachability information, referred to as address families. For example, a BGP route can be an IPv6 prefix or an IPv4 prefix within a VPN.

External routes advertised in BGP must be distributed to every router in an AS. The hop-by-hop forwarding nature of IP requires that a packet address be looked up and matched against a route at each router hop. Because the address information may match external networks that are only known in BGP, every router must have the BGP information. However, we describe later how MPLS removes the need for every interior router to have external BGP route state.
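The announce and withdraw semantics described above can be sketched as a small table keyed by prefix. This is a minimal illustration, assuming one table per BGP peer; the class name `PeerRib` and its methods are hypothetical and do not correspond to any BGP implementation's API. The prefixes and AS numbers are taken from the ranges reserved for documentation.

```python
class PeerRib:
    """Hypothetical table of the routes currently advertised by one BGP peer."""

    def __init__(self):
        self.routes = {}  # prefix -> attributes of the latest announcement

    def announce(self, prefix, next_hop, as_path):
        # A new announcement for a prefix replaces the previous one, so no
        # periodic refresh is needed while the session stays up.
        self.routes[prefix] = {"next_hop": next_hop, "as_path": tuple(as_path)}

    def withdraw(self, *prefixes):
        # A withdrawal removes NLRI that is no longer valid.
        for prefix in prefixes:
            self.routes.pop(prefix, None)


# Replay of the exchange sketched in Fig. 2.14 with example prefixes:
# routes s and r are withdrawn (W|s,r) and a route for z is advertised (A|z).
s, r, z = "192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"
rib = PeerRib()
rib.announce(s, "10.0.0.1", [64500])
rib.announce(r, "10.0.0.1", [64500, 64501])
rib.withdraw(s, r)
rib.announce(z, "10.0.0.1", [64500, 64502])
# An attribute change is signalled by announcing the same prefix again.
rib.announce(z, "10.0.0.1", [64500, 64503, 64502])
```

After this sequence only the route for z remains, carrying the attributes of its most recent announcement.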
Within an AS, the BGP next hop will be the IP address of the exit router or exit link from the AS through which the packet must route. BGP is used by the exit router to distribute the routes throughout the AS. To avoid creating a full mesh of iBGP sessions among the edge and interior routers, BGP can use a hierarchy of Route Reflectors (RR). Figure 2.15 illustrates how BGP connections are constructed using a Route Reflector.

Fig. 2.15 BGP connections in an ISP with Route Reflectors (RR). Legend: PE = Provider Edge router (Access Router); CE = Customer Edge router; RR = Route Reflector; iBGP = Interior BGP; eBGP = Exterior BGP. The figure shows CEs connected to PEs by eBGP, PEs acting as iBGP clients of RRs, and iBGP sessions among the RRs.

BGP routes may have their attributes manipulated when received and before sending to peers, according to policy design decisions of the operator. Of the BGP routes received by a BGP router, BGP first determines the validity of a route (e.g., is the BGP next hop reachable) and then chooses the best route among valid duplicates with different paths. The best route is decided by a hierarchy of tiebreakers among route attributes such as IGP metric to the next hop and BGP path attributes such as AS_PATH length. The best route is then relayed to all peers except the originating one. One variation of this relay behavior is that any route received from an iBGP peer by a router that is not a route reflector is not relayed to any other iBGP peer.

2.4.1.3 Protocol-Independent Multicast

IP Multicast is very efficient when a source sends data to multiple receivers. By using multicast at the network layer, a packet traverses a link only once, and therefore the network bandwidth is utilized optimally. In addition, the processing at routers (forwarding load) as well as at the end-hosts (discarding unwanted packets) is reduced.
Multicast applications generally use UDP as the underlying transport protocol, since there is no unique context for the feedback received from the various receivers for congestion control purposes. We provide a brief overview of IP Multicast in this section. It is covered in greater detail in Chapter 11.
IP Multicast uses group addresses from the Class “D” address space (in the context of IPv4). The range of IP addresses that are used for IP Multicast group addresses is 224.0.0.0 to 239.255.255.255. When a source sends a packet to an IP Multicast group, all the receivers that have joined that group receive it. The typical protocol used between the end-hosts and routers is the Internet Group Management Protocol (IGMP). Receivers (end-hosts) announce their presence (join a multicast group) by sending an IGMP report. From the first router, the indication of the intent of an end-host to join the multicast group is forwarded through routers upwards along the shortest path to the root of the multicast tree. The root for an IP Multicast tree can be a source in a source-based distribution tree, or it may be a “rendezvous point” when the tree is a shared distribution tree.
The routing protocol used in conjunction with IP multicast is called Protocol-Independent Multicast (PIM). PIM has several variants, which differ in how they form the multicast tree that forwards traffic from a source (or sources) to the receivers. A router forwards a multicast packet only if it was received on the upstream interface to the source or to a rendezvous point (in a shared tree). Thus, a packet sent by a source follows the distribution tree. To avoid loops, if a packet arrives on an interface that is not on the shortest path toward the source or rendezvous point, the packet is discarded (and thus not forwarded). This is called Reverse Path Forwarding (RPF), a critical aspect of multicast routing. RPF avoids loops by not forwarding duplicate packets.
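
The RPF rule can be illustrated with a minimal sketch. It assumes a simplified router whose unicast table maps each tree root (a source, or a rendezvous point for a shared tree) directly to one outgoing interface; the names rpf_check and forward_multicast, the table layout, and the interface names are hypothetical and are not taken from any PIM implementation.

```python
def rpf_check(unicast_routes, root, arrival_iface):
    """Return True if the packet arrived on the interface this router
    would itself use to reach the tree root (source or rendezvous point)."""
    return root in unicast_routes and unicast_routes[root] == arrival_iface


def forward_multicast(unicast_routes, downstream_ifaces, root, arrival_iface):
    """Apply the RPF check, then replicate onto the downstream interfaces.

    Returns the interfaces the packet is copied to; an empty list means the
    packet failed the RPF check and was discarded.
    """
    if not rpf_check(unicast_routes, root, arrival_iface):
        return []  # not on the reverse path toward the root: drop
    # Never send the packet back out of the interface it arrived on.
    return sorted(i for i in downstream_ifaces if i != arrival_iface)


# Hypothetical router: unicast routing reaches source 10.0.0.1 via eth0.
routes = {"10.0.0.1": "eth0"}
print(forward_multicast(routes, {"eth1", "eth2"}, "10.0.0.1", "eth0"))  # accepted
print(forward_multicast(routes, {"eth1", "eth2"}, "10.0.0.1", "eth2"))  # dropped
```

A real router would resolve the root with a longest-prefix match and keep per-group outgoing interface lists; the sketch keeps only the accept-or-discard decision that RPF contributes.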
PIM relies on the SPT created by the traditional routing protocols such as OSPF to find the path back to the multicast source using RPF. IP Multicast uses soft-state to keep the multicast forwarding state at the routers in the network. There are two broad approaches for maintaining multicast state. The first is termed PIM-Dense Mode, wherein traffic is first flooded throughout the network, and the tree is “pruned” back along branches where the traffic is not wanted. The underlying assumption is that there are multicast receivers for this group at most locations, and hence flooding is appropriate. The flood and prune behavior is repeated, in principle, once every 3 min. However, this results in considerable overhead (as the traffic would be flooded until it is pruned back) each time. Every router also ends up keeping state for the multicast group. To avoid this, the router downstream of a source periodically sends a “state refresh” message that is propagated hop-by-hop down the tree. When a router receives the state refresh message on the RPF interface, it refreshes the prune state, so that it does not forward traffic received subsequently, until a receiver joins downstream on an interface. While PIM-Dense Mode is desirable in certain situations (e.g., when receivers are likely to exist downstream of each of the routers – densely populated groups – hence the name), PIM-Sparse Mode (PIM-SM) is more appropriate for wide-scale deployment of IP multicast for both densely and sparsely populated groups. With PIM-SM, traffic is sent only where it is requested, and receivers are required to explicitly join a multicast group to receive traffic. While PIM-SM uses both a shared tree (with a rendezvous point, to allow for multiple senders) as well as a per-source tree, we describe a particular mode, PIM-Source Specific Multicast (PIM-SSM), which is more commonly used for IPTV distribution. 
More details regarding PIM-SM, including PIM using a shared tree, are described in Chapter 11. PIM-SSM is adopted when the end-hosts know exactly which source and group, typically denoted (S,G), to join to receive the multicast transmissions from that source. In fact, by requiring that receivers signal the combination of source and group to join, different sources could share the same group address and not interfere with each other. Using PIM-SSM, a receiver transmits an IGMP join message for the (S,G) and the first-hop router sends an (S,G) join message directly along the shortest path toward the source. The shortest path tree is rooted at the source.
One of the key properties of IP Multicast is that the multicast routing operates somewhat independently of the IGP routing. Changes to the network topology are reflected in the unicast routing using updates that operate on short time scales (e.g., transmission of LSAs in OSPF reflects a link or node failure immediately). However, IP Multicast routing reflects the changed topology only when the multicast state is refreshed. For example, with PIM-SSM, the updated topology is reflected only when the join is issued periodically (which can be up to a minute or more) by the receiver to refresh the state. We will examine the consequence of this for wide-area IPTV distribution later in this chapter.
2.4.2 Multiprotocol Label Switching
2.4.2.1 Overview of MPLS
Multiprotocol Label Switching (MPLS) is a technology developed in the late 1990s that added new capabilities and services to IP networks. It was the culmination of various IP switching technology efforts such as multiprotocol over ATM, Ipsilon’s IP Switching, and Cisco’s tag switching [7,20]. The key benefits provided by MPLS to an ISP network are:
1. Separation of routing (the selection of paths through the network) from forwarding/switching via IP address header lookup
2. An abstract hierarchy of aggregation
To understand these concepts, we first consider how normal IP routing in an ISP network functions. In an IP network without MPLS, there is a topology hierarchy with edge and backbone routers. There is also a routing hierarchy with BGP carrying external reachability information and an IGP like OSPF carrying internal reachability information. BGP carries the information about which exit router (BGP next hop) is used to reach external address space. OSPF picks the paths across the network between the edges (see Fig. 2.16). It is important to note that every OSPF router knows the complete path to reach all the edges. The internal paths that OSPF picks and the exit routers from BGP are determined before the first packet is forwarded. The connectionless and hop-by-hop forwarding behavior of IP routing requires that every router have this internal and external routing information present.
Fig. 2.16 Traditional IP routing with external routes distributed throughout backbone. Legend: P = Provider router (Backbone Router); PE = Provider Edge router (Access Router); CE = Customer Edge switch. Annotations: packet forwarded using hop-by-hop route lookup; routes chosen using OSPF interior routing protocols
Consider the example in Fig. 2.16, where a packet enters on the left with address A.1 destined to the external network A on the upper right. When the first packet arrives, the receiving provider edge router (PE) looks up the destination IP address. From BGP, it learns that the exit router for that address is the upper right PE. From OSPF, the path to reach that exit PE is determined. Even though the ingress PE knows the complete path to reach the exit PE, it simply forwards the packet to the next-hop backbone router, labeled as a P-router (P) in the figure.
The backbone router then repeats the process: using the packet IP address, it determines the exit from BGP and the path to the exit from OSPF to forward the packet to the next-hop backbone router. The process repeats again until the packet reaches the exit PE.
The repeated lookup of the packet destination to find the external exit and internal path appears to be unnecessary. The lookup operation itself is not expensive; the issue is the unnecessary state and binding information that must be carried inside the network. The ingress router knows the path to reach the exit. If the packet could somehow be bound to the path itself, then the successive next-hop routers would only need to know the path for the packet and not its actual destination. This is what MPLS accomplishes.
Consider Fig. 2.17, where MPLS sets up an end-to-end Label Switched Path (LSP) by assigning labels to the interior paths to reach exits in the network. The LSP might look like the one shown in Fig. 2.18. The backbone routers are now called Label Switch Routers (LSR). Via MPLS signaling protocols, the LSR knows how to forward a packet carrying an incoming label for an LSP to an outgoing interface and outgoing label; this is called a “swap” operation. The PE router also acts as an LSR, but is usually at the head (start) or end (tail) of the LSP where, respectively, the initial label is “pushed” onto the data or “popped” (removed) from the data.
Fig. 2.17 Routing with MPLS creates Label Switched Paths (LSP) for routes across the network. Legend: LSR = Label Switch Router; PE = Provider Edge router (Access Router); CE = Customer Edge router. Annotations: route lookup done once per LSP and the associated label assigned to the packet; routes chosen using OSPF interior routing protocols
Fig. 2.18 Within an LSP, labels are assigned at each hop by the downstream router (the figure shows a PUSH at the head end, SWAP operations at intermediate hops, and a POP at the tail end, with example labels 233, 666, and 417)
In the example of Fig. 2.17, external BGP routing information such as routes to network A is only needed in the edges of the network. The interior LSRs only need to know the interior path among the edges as determined by OSPF. When the packet with address A.1 arrives at the ingress PE, the same lookup operation is done as previously: the egress PE is determined from BGP and the interior path to reach the egress is found from OSPF. But this time the packet is given a label for the LSP matching the OSPF path to the egress. The internal LSRs now forward the packet hop-by-hop based on the labels alone. At the exit PE, the label is removed and the packet is forwarded toward its external destination.
In this example, the binding of a packet to paths through the network is only done once, at the entrance to the network. The assignment of a packet to a path through the network is separated from the actual forwarding of the packet through the network (this is the first benefit that was identified above). Further, a hierarchy of forwarding information is created: the external routes are only kept at the edge of the network while the interior routers only know about interior paths. At the ingress router, all received packets needing to exit the same point of the network receive the same label and follow the same LSP.
MPLS takes these concepts and generalizes them further. For example, the LSP to the exit router could be chosen differently from the IGP shortest path. IPv4 provides a method for explicit path forwarding in the IP header, but it is very inefficient. With MPLS, explicit routing becomes very efficient and is the primary tool for traffic engineering in IP backbones. In the previous example, if an interior link were heavily utilized, the operator might wish to divert some traffic around that link by taking a longer path, as shown in Fig. 2.19.
Normal IP shortest path forwarding does not allow for this kind of traffic placement.
The forwarding hierarchy can be used to create provider-based VPNs. This is illustrated in Fig. 2.20. Virtual private routing contexts are created at the PEs, one per customer VPN. The core of the network does not need to maintain state information about individual VPN routes. The same LSPs for reaching the exits of the network are used, but there are additional labels assigned for separating the different VPN states.
Fig. 2.19 MPLS with Traffic Engineering can use alternative to the IGP shortest path. Legend: LSR = Label Switch Router; PE = Provider Edge router (Access Router); CE = Customer Edge router. Annotation: routes chosen using OSPF interior routing protocols
Fig. 2.20 MPLS VPNs support separated virtual routing contexts in PEs interconnected via LSPs. Legend: LSR = Label Switch Router; PE = Provider Edge router; CE = Customer Edge router
In summary, the advantages to the IP backbone of decoupling routing and forwarding are:
- It achieves efficient explicit routing.
- Interior routers do not need any external reachability information.
- Packet header information is only processed at the head of the LSP (e.g., at the edges of the network).
- It is easy to implement nested or hierarchical identification (such as with VPNs).
2.4.2.2 Internet Route Free Core
The ability of MPLS to remove the external BGP information plus Layer 3 address lookup from the interior of the IP backbone is sometimes referred to as an Internet Route Free Core. The “interior” of the IP backbone starts at the left-side (BR-side) port of the access routers in Fig. 2.7. Some of the advantages of Internet Route Free Core include:
- Traffic engineering using BGP is much easier.
- Route reflectors no longer need to be in the forwarding plane, and thus can be dedicated to IP layer control plane functions or even placed on a server separate from the routers.
- Denial of Service (DoS) attacks and security holes are better controlled because BGP routing decisions only occur at the edges of the IP backbone.
- Enterprise VPN and other priority services can be better isolated from the “Public Internet”.
The last advantage deserves some clarification. Many enterprise customers, such as financial companies or government agencies, are concerned about mixing their priority traffic with that of the public Internet. Of course, all packets are mixed on links between backbone routers; however, VPN traffic can be functionally segregated via LSPs. In particular, since denial of service attacks from compromised hosts on the public Internet rely on reachability from the Internet, the private MPLS VPN address space isolates VPN customers from this threat. Further, enterprise premium VPN customers are sometimes clustered onto access routers dedicated to the VPN service. In addition, better performance (such as lower packet loss or latency) for premium VPN services can be provided by implementing priority queueing or by providing them with bandwidth-sensitive LSPs (discussed later). A similar approach can be used to provide other performance-sensitive services, such as Voice-over-IP (VoIP).
2.4.2.3 Protocol Basics
MPLS encapsulates IP packets in an MPLS header consisting of one or more MPLS labels, known as a label stack. Figure 2.21 shows the most commonly used MPLS encapsulation type. The first 20 bits are the actual numerical label. There are three bits for inband signaling of class of service type, followed by an End-of-Stack bit (described later) and a time-to-live field, which serves the same function as an IP packet time-to-live field.
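
The field layout just described (a 20-bit label, 3 class-of-service bits, a 1-bit End-of-Stack flag, and an 8-bit time-to-live) fits in one 32-bit word per label. The following is a minimal sketch of packing and unpacking such entries, assuming network byte order and the field order given in the text; the function names, the default TTL of 64, and the label values are illustrative choices, not part of any particular router implementation.

```python
import struct


def pack_entry(label, cos, bottom, ttl):
    """Pack one label stack entry: label(20) | CoS(3) | end-of-stack(1) | TTL(8)."""
    if not (0 <= label < 2**20 and 0 <= cos < 8 and 0 <= ttl < 256):
        raise ValueError("field out of range")
    word = (label << 12) | (cos << 9) | (int(bool(bottom)) << 8) | ttl
    return struct.pack("!I", word)  # 4 bytes, network byte order


def unpack_entry(data):
    """Inverse of pack_entry; returns (label, cos, bottom, ttl)."""
    (word,) = struct.unpack("!I", data)
    return word >> 12, (word >> 9) & 0x7, bool((word >> 8) & 0x1), word & 0xFF


def pack_stack(labels, cos=0, ttl=64):
    """Pack labels outermost first; only the last entry sets end-of-stack."""
    last = len(labels) - 1
    return b"".join(pack_entry(l, cos, i == last, ttl) for i, l in enumerate(labels))


def unpack_stack(data):
    """Read entries until the end-of-stack bit; return (entries, payload)."""
    entries, offset = [], 0
    while True:
        if offset + 4 > len(data):
            raise ValueError("ran out of data before the end-of-stack bit")
        entry = unpack_entry(data[offset:offset + 4])
        entries.append(entry)
        offset += 4
        if entry[2]:
            return entries, data[offset:]


# Two stacked labels in front of an opaque payload.
frame = pack_stack([233, 417]) + b"payload"
print(unpack_stack(frame))
```

Because the header carries no length or protocol field, the receiver in this sketch finds the start of the payload only by walking the stack until it sees the End-of-Stack bit.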
MPLS encapsulation does not define a framing mechanism to determine the beginning and end of packets; it relies on existing underlying link-layer technologies.
Fig. 2.21 Generic MPLS encapsulation and header fields (Layer 2 header with PID, followed by MPLS labels 1 to n, followed by the Layer 3 packet; each MPLS label entry consists of Label (20 bits), CoS (3 bits), Stack (1 bit), and TTL (8 bits))
Existing protocols such as Ethernet, Point-to-Point Protocol (PPP), ATM, and Frame Relay have been given new protocol IDs or new link-layer control fields to allow them to directly encapsulate MPLS-labeled packets. Also, MPLS does not have a protocol ID field to indicate the type of packet encapsulated, such as IPv4, IPv6, Ethernet, etc. Instead, the protocol type of the encapsulated packet is implied by the label and communicated by the signaling protocol when the label is allocated.
MPLS defines the notion of a Forwarding Equivalence Class (FEC) (not to be confused with Forward Error Correction (FEC) in lower network layers defined earlier). All packets with the same forwarding requirements, such as path and priority queuing treatment, can belong to the same FEC. Each FEC is assigned a label. Many FEC types have been defined by the MPLS standards: IPv4 unicast route, VPN IPv4 unicast route, IPv6 unicast route, Frame Relay permanent virtual circuit, ATM virtual circuit, Ethernet VLAN, etc.
Labels can be stacked, with the last label in the stack marked by the end-of-stack bit. This allows hierarchical nesting of FECs, which permits VPNs, traffic engineering, and hierarchical routing to be created simultaneously in the same network. Consider the previous VPN example, where an outer label may represent the interior path to reach an exit and an inner label may represent a VPN context.
MPLS is called “multiprotocol” because it can be carried over almost any transport as mentioned above, ironically even IP itself, and because it can carry the payload for many different packet types, namely all the FEC types mentioned above.
Signaling of MPLS FECs and their associated labels among routers and switches can be done using many different protocols. A new protocol, the Label Distribution Protocol (LDP), was defined specifically for MPLS signaling. However, existing protocols have also been extended to signal FECs and labels: Resource Reservation Protocol (RSVP) [3] and BGP, for example.
2.4.2.4 IP Traffic Engineering and MPLS
The purpose of IP traffic engineering is to enable efficient use of backbone capacity. That is, the goal is to ensure that links and routers in the network are neither congested nor underutilized. Traffic engineering may also mean ensuring that certain performance parameters such as latency or minimum bandwidth are met.
To understand how MPLS traffic engineering plays a role in ISP networks, we first explain the generic problem to be solved (the multicommodity flow problem) and how it was traditionally solved in IP networks versus how MPLS can solve the problem. Consider an abstract network topology with traffic demands among nodes. There are:
- Demands d(i,j) from node i to node j
- Constraints: link capacity b(i,j) between nodes
- Link costs C(i,j)
- A path p(k), or route, for each demand
The traffic engineering problem is to find paths for the demands that fit the link constraints. The problem can be specified at different levels of difficulty:
1. Find any feasible solution, regardless of the path costs.
2. Find a solution that minimizes the costs for the paths.
3. Find a feasible or a minimum cost solution after deleting one or more nodes and/or links.
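
One common way to write the first two levels of the problem is sketched below. The notation beyond d, b, C, and p is an assumption made for illustration: demands are indexed k = 1, ..., K, d_k is the volume of demand k (that is, d(i,j) for its endpoint pair), and each demand follows a single unsplit path p(k) viewed as a set of links. Summing unweighted link costs along each path is also just one possible objective.

```latex
\begin{align*}
\text{(1) feasibility:}\quad
  & \sum_{k \,:\, (i,j) \in p(k)} d_k \;\le\; b(i,j)
  && \text{for every link } (i,j) \\[4pt]
\text{(2) minimum cost:}\quad
  & \min_{p(1),\dots,p(K)} \; \sum_{k=1}^{K} \; \sum_{(i,j) \in p(k)} C(i,j)
  && \text{subject to (1)}
\end{align*}
```

Level 3 repeats (1) or (2) on the topology that remains after removing the failed nodes or links.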
Traffic Engineering an IP Network
In an IP network, the capacities represent link bandwidths between routers and the costs might represent delay across the links. Sometimes, we only want to find a feasible solution, such as in a multicast IPTV service. Sometimes, we want to minimize the maximum path delay, such as in a Voice-over-IP service. And sometimes, we want to ensure a design that is survivable (meaning it is still feasible to carry the traffic) for any single- or dual-link failure.
Consider how a normal ISP without traffic engineering might try to solve the problem. The tools available on a normal IP network are:
- Metric manipulation, i.e., pick OSPF weights to create a feasible solution.
- Simple topology or link augmentation: this tends to overengineer the network and restrict the possible topology.
- Source or policy routing using the IPv4 header option or router-based source routes. Source routes are very inefficient, greatly reducing router capacity, and they are not robust, making the network very difficult to operate.
Figure 2.22 illustrates a network with a set of demands and an example of the way that particular demands might be routed using OSPF. Although the network has sufficient total capacity to carry the demands, it is not possible to find a feasible solution (with no congested links) by only setting OSPF weights. A small ISP facing this situation without technology like MPLS would probably resort to installing more link capacity on the A-D-C node path.
Fig. 2.22 IP routing is limited in its ability to meet resource demands. It cannot successfully route the demands within the link bandwidths in this example (nodes A, B, C, D with attached endpoints 1 to 4; all link capacities = 1 unit, except C-3 = 2 units; Demand (2,3) = 0.75 units; Demand (1,3) = 0.4 units; Demand (1,4) = 0.4 units)
The generic solution to an arbitrary traffic engineering problem requires specifying the explicit route (path) for each demand. This is a complex problem that can take an indeterminate time to solve. But there are other approaches that can solve a large subset of problems. One suboptimal approach is Constraint-based Shortest Path First (CSPF). CSPF has been implemented in networks with ATM Private Network-to-Network Interface (P-NNI) and IP MPLS. For currently defined MPLS protocols, the constraints can be bandwidths per class of service for each link. Also, links can be assigned a set of binary values, which can be used to include or exclude the links from routing a given demand.
CSPF is implemented in a distributed fashion where all nodes have full knowledge of network resource allocation. Then, each node routes its demands independently by:
1. Pruning the network to only feasible paths
2. Picking the shortest of the feasible paths on the pruned network
Although CSPF routing is suboptimal when compared with a theoretical multicommodity flow solution, it is a reasonable compromise for solving many traffic engineering problems in which the nodes route their demands independently of each other. For more complex situations where CSPF is inadequate, network planners must use explicit paths computed by an offline system. The next section discusses explicit routing in more detail.
Traffic Engineering Using MPLS
The main problems with traffic engineering an IP backbone with only a Layer 3 IGP routing protocol (such as OSPF) are (1) lack of knowledge of resource allocation and (2) no efficient explicit routing. The previous example of Fig. 2.22 shows how OSPF would route all demands onto a link that does not have the necessary capacity. Another example problem arises when a direct link is needed for a small demand between nodes to meet certain delay requirements: OSPF cannot prevent other traffic demands from routing over this smaller link and causing congestion.
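
The two CSPF steps listed earlier (prune, then shortest path) can be sketched as follows. This is an illustration under stated assumptions rather than any router's implementation: the link table layout, the function names cspf and reserve, and the four-node topology with its costs are hypothetical, loosely inspired by the A-D-C and A-B-C paths mentioned in the text and not a reproduction of Fig. 2.22. Dijkstra's algorithm is used for the shortest-path step.

```python
import heapq


def cspf(links, src, dst, bandwidth):
    """links maps a directed link (u, v) to {"cost": c, "avail": bw}.
    Returns the cheapest path with enough bandwidth on every link, or None."""
    # Step 1: prune links that cannot carry the requested bandwidth.
    adj = {}
    for (u, v), attrs in links.items():
        if attrs["avail"] >= bandwidth:
            adj.setdefault(u, []).append((v, attrs["cost"]))
    # Step 2: Dijkstra shortest path on the pruned topology.
    dist, prev, done = {src: 0.0}, {}, set()
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v, cost in adj.get(u, []):
            if d + cost < dist.get(v, float("inf")):
                dist[v], prev[v] = d + cost, u
                heapq.heappush(heap, (d + cost, v))
    if dst not in done:
        return None  # no feasible path survives the pruning
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def reserve(links, path, bandwidth):
    """Deduct a placed demand from the available bandwidth along its path."""
    for hop in zip(path, path[1:]):
        links[hop]["avail"] -= bandwidth


# Hypothetical topology: A-D-C is cheaper (cost 2) than A-B-C (cost 3).
links = {
    ("A", "D"): {"cost": 1, "avail": 1.0}, ("D", "C"): {"cost": 1, "avail": 1.0},
    ("A", "B"): {"cost": 1, "avail": 1.0}, ("B", "C"): {"cost": 2, "avail": 1.0},
}
for demand in (0.75, 0.4, 0.4):
    path = cspf(links, "A", "C", demand)
    reserve(links, path, demand)
    print(demand, path)
```

In this sketch the first demand takes the cheaper path, and the later demands are steered onto the longer path only because pruning removes the links that no longer have room. Plain shortest-path routing would have placed all three demands on A-D-C.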
MPLS addresses both problems with extensions to OSPF (OSPF-TE) [21], which provide resource allocation knowledge, and RSVP-TE [2], which provides efficient signaling of explicit routes to use those resources.
See Fig. 2.23 for a simple example of how an explicit path is created. RSVP-TE can create an explicit hop-by-hop path in the PATH message downstream. The PATH message can request resources such as bandwidth. The return message is a RESV, which contains the label that the upstream node should use at each link hop. In this example, a traffic-engineered LSP is created along path A-B-C for 0.4 Mb/s.
Fig. 2.23 RSVP messaging to set up explicit paths (a PATH message requesting 0.4 Mbps, answered by a RESV message carrying labels)
Fig. 2.24 MPLS-TE enables efficient capacity usage through traffic engineering to solve the example in Fig. 2.22 (all link capacities = 1 unit, except C-3 = 2 units; Demand (2,3) = 0.75 units; Demand (1,3) = 0.4 units; Demand (1,4) = 0.4 units)
These LSPs are referred to as traffic engineering tunnels. Tunnels can be created and differentiated for many purposes (including restoration, to be defined in later sections). But in general, primary (service route) tunnels can be considered as a routing mechanism for all packets of a given FEC between a given pair of routers or router interfaces. Using this machinery, Fig. 2.24 illustrates how MPLS-TE can be used to solve the capacity overload problem in the network shown in Fig. 2.22.
The explicit path used in RSVP-TE signaling can be computed by an offline system and automatically configured in the edge routers, or the routers themselves can compute the path. In the latter case, the edge routers must be configured with the IP prefixes and their associated bandwidth reservations that are to be traffic-engineered to other edges of the network.
Because the routers do this without knowledge of other demands being routed in the network, the routers must receive periodic updates about bandwidth allocations in the network.
OSPF-TE provides a set of extensions to OSPF to advertise traffic engineering resources in the network. For example, bandwidth resources per class of service can be allocated to a link. Also, a link can be assigned binary attributes, which can be used for excluding or including a link for routing an LSP. These resources are advertised in an opaque LSA via OSPF link-state flooding and are updated dynamically as allocations change. Given the knowledge of link attributes in the topology and the set of demands, the router performs an online CSPF to calculate the explicit paths. The path outputs of the CSPF are given to RSVP-TE to signal in the network. As TE tunnels are created in the network, the link resources change, i.e., available bandwidth is reduced on a link after a tunnel is allocated using RSVP-TE. Periodically, OSPF-TE will advertise the changes to the link attributes so that all routers can have an updated view of the network.
2.4.2.5 VPNs with MPLS
Figure 2.20 illustrates the key concept in how MPLS is used to create VPN services. VPN services here refer to carrier-based VPN services, specifically the ability of the service provider to create private network services on top of a shared infrastructure. For the purposes of this text, VPNs are of two basic types: a Layer 3 IP routed VPN or a Layer 2 switched VPN. Generalized MPLS (GMPLS) [19] can also be used for creating Layer 1 VPNs, which will not be discussed here.
A Layer 3 IP VPN service looks to customers of the VPN as if the provider built a router backbone for their own use, like having their own private ISP. VPN standards define the PE routers, CE routers, and backbone P-routers interconnecting the PEs.
Although the packets share (are mixed over) the ISP’s IP layer links, routing information and packets from different VPNs are virtually isolated from each other.
A Layer 2 VPN provides either point-to-point connection services or multipoint Ethernet switching services. Point-to-point connections can be used to support end-to-end services such as Frame Relay permanent virtual circuits, ATM virtual circuits, point-to-point Ethernet circuits (i.e., with no Media Access Control (MAC) learning or broadcasting) and even a circuit emulation over packet service. Interworking between connection-oriented services, such as Frame Relay to ATM interworking, is also defined. This kind of service is sometimes called a Virtual Private Wire Service (VPWS). Layer 2 VPN multipoint Ethernet switching services support a traditional Transparent LAN over a wide-area network, called Virtual Private LAN Service (VPLS) [24, 25].
Layer 3 VPNs over MPLS
As mentioned previously, Layer 3 VPNs maintain a separate virtual routing context for each VPN on the PE routers at the edge of the network. External CEs connect to the virtual routing context on a PE that belongs to a customer’s VPN.
Layer 3 VPNs implemented using MPLS are often referred to as BGP MPLS VPNs because of the important role BGP has in the implementation. BGP is used to carry VPN routes between the edges of the network. BGP keeps the potentially overlapping VPN address spaces unique by prepending onto the routes a route distinguisher (RD) that is unique to each VPN. The RD + VPN IPv4 prefix combination creates a new unique address space carried by BGP, sometimes called the VPNv4 address space.
VPN routes flow from one virtual routing instance into other virtual routing instances on PEs in the network using a BGP attribute called a Route Target (RT). An RT is an identifier configured by the ISP to identify all virtual routing instances that belong to a VPN.
RTs constrain the distribution of VPN routes among the edges of the network so that the VPN routes are only received by the virtual routing instances belonging to the intended (targeted) VPN.
We note that RDs and RTs are only used in the BGP control plane; they are not values that are applied to user packets themselves. Rather, for every advertised VPNv4 route, BGP also carries a label assignment that is unique to a particular virtual router on the advertising PE. Every VPN packet that is forwarded across the network receives two labels at the ingress PE: an inner label associated with the advertised VPNv4 route and an outer label associated with the LSP to reach the egress advertising PE (dictated by the BGP next-hop address).
See Fig. 2.25 for a simplified example. In this example, there is a VPN advertising a route Z, which enters the receiving virtual router (vr1) and is distributed by BGP to other PE virtual routers using RTs. A packet entering the VPN destined toward Z is looked up in the virtual routing instance, where the two labels are found: the outer label to reach the egress PE and the inner label for the egress virtual routing instance.
Fig. 2.25 In this VPN example, a virtual routing context (vr1) in the PEs contains the VPN label and routing information such as route target (RT1) and route distinguisher (RD1), attached CE interfaces, and next-hop lookup and label binding. VPN traffic is transported using a label stack of VPN label and interior route label
Layer 2 VPNs over MPLS
The implementation of Layer 2 VPNs over MPLS is similar to Layer 3 VPNs. Because there is no IP routing in the VPN service, there is instead a virtual switching context created on the edge PEs to isolate different VPNs.
These virtual switching contexts keep the address spaces of the edge services from conflicting with each other across different VPNs. Layer 2 VPNs use a two-label stack approach that is similar to Layer 3 VPNs. Reaching an egress PE from an ingress PE is done using the same network interior LSPs that the Layer 3 VPN service would use. And then, there is an inner label associated with either the VPWS or VPLS context at the egress PE. This inner label can be signaled using either LDP or BGP. The inner label and the packet encapsulation comprise a pseudowire, as defined in the PWE3 standards [16]. The pseudowire connects an ingress PE to an egress PE switching context and is identified by the inner label. The VPWS service represents a single point-to-point connection, so there will only be a single pseudowire setup in each direction. For VPLS however, carriers typically set up a full mesh of pseudowires/LSPs among all PEs belonging to that VPLS. Forwarding for a VPWS is straightforward: the CE connection is associated with the appropriate pseudowires in each direction when provisioned. For VPLS, forwarding is determined by the VPLS forwarding table entry for the destination Ethernet MAC address. Populating the forwarding table is based on source MAC address learning. The forwarding table records the inbound interface on which a source MAC was seen. If the destination MAC is not in the table, then the packet is flooded to all interfaces attached to the VPLS. Flooding of unknown destination MACs and broadcast MACs follows some special rules within a VPLS. All PEs within a backbone are assumed to be full mesh connected with pseudowires. So, packets received from the backbone are not flooded again into the backbone, but are only flooded onto CE interfaces. On the other hand, packets from a CE to be flooded are sent to all attached CE interfaces and all pseudowire interfaces toward the other backbone PEs. 
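
The learning and flooding rules above can be sketched as a small forwarding model. It is a simplified illustration under stated assumptions: the class name VplsInstance, the interface names, and the MAC addresses are hypothetical, and the sketch assumes that a flooded frame is never sent back out of the interface it arrived on (standard bridge behavior that the text does not spell out).

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class VplsInstance:
    """One VPLS context on a PE: CE-facing interfaces plus one pseudowire
    per remote PE (full mesh assumed, as in the text)."""

    def __init__(self, ce_ifaces, pw_ifaces):
        self.ce = set(ce_ifaces)
        self.pw = set(pw_ifaces)
        self.mac_table = {}  # source MAC -> interface it was last seen on

    def handle_frame(self, in_iface, src_mac, dst_mac):
        """Learn the source MAC, then return the list of output interfaces."""
        self.mac_table[src_mac] = in_iface  # source MAC address learning
        out = self.mac_table.get(dst_mac)
        if dst_mac != BROADCAST and out is not None:
            # Known unicast destination: forward on the learned interface.
            return [] if out == in_iface else [out]
        if in_iface in self.pw:
            # Came from the backbone: flood only onto CE interfaces.
            targets = self.ce
        else:
            # Came from a CE: flood to the other CEs and all pseudowires.
            targets = (self.ce | self.pw) - {in_iface}
        return sorted(targets)


pe1 = VplsInstance(ce_ifaces=["ce1", "ce2"], pw_ifaces=["pw_pe2", "pw_pe3"])
a, b = "00:00:00:00:00:0a", "00:00:00:00:00:0b"
print(pe1.handle_frame("ce1", a, b))     # unknown destination: flooded
print(pe1.handle_frame("pw_pe2", b, a))  # reply: a was learned on ce1
print(pe1.handle_frame("ce1", a, b))     # b is now known behind pw_pe2
```

Because every PE applies the same rule of never flooding backbone traffic back into the backbone, the full mesh of pseudowires stays loop-free without running a spanning tree among the PEs.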
There is also a VPLS variation called Hierarchical VPLS to constrain the potential explosion of mesh point-to-point LSPs needed among the PE routers. In this variation, a PE can act as a spoke with a single pseudowire attached to a core of meshed PEs. In this model, a flooded packet received at a mesh-connected PE from a spoke PE pseudowire is sent to all attached CEs and pseudowires. In such a model, the PE interconnectivity must be guaranteed to be loop-free or a spanning tree protocol may be run among the PEs for that VPLS.

2.5 Network Restoration and Planning

The design of an IP backbone is driven by the traffic demands that need to be supported, and network availability objectives. The network design tools model the traffic carried over the backbone links not only in a normal “sunny day” scenario, but also in the presence of network disruptions. Many carriers offer Service Level Agreements (SLAs). SLAs will vary across different types of services. For example, SLAs for private-line services are quite different from those for packet services. SLAs also usually differ among different types of packet services. The SLAs for general Internet, VPN, and IPTV services will generally differ. A packet-based SLA might be expressed in terms of Quality of Service (QoS) metrics. For example, the SLA for a premium IP service may cover up to three QoS metrics: latency, jitter, and packet loss. An example of a packet-loss objective is “averaged over time period Y, the customer will receive at least X% of his/her packets transmitted.” Some of these packet services may be further differentiated by offering different levels of service, also called Class of Service (CoS). To provide its needed SLAs, an ISP establishes internal network objectives. Network availability is a key internal metric used to control packet loss. Furthermore, network availability is also sometimes used as the key QoS metric for private-line services.
Network availability is often stated colloquially in “9s”. For example, “four nines” of availability means the service is available at least 0.9999 of the time. Stated conversely, the service should not be down more than 0.0001 of the time (approximately 53 min per year). Given its prime importance, we will concentrate on network availability in the remainder of this section. The most important factors in designing and operating the IP backbone such that it achieves its target network availability are modeling its potential network disruptions and the response of the network to those disruptions. Network disruptions most typically are caused by network failures and maintenance activities. Maintenance activities include upgrading of equipment software, replacement of equipment, and reconfiguration of network topologies or line cards. Because of the complex layering and segmentation of networks surrounding the IP backbone and because of the variety and vintage of equipment that accumulates over the years, network planners, architects, network operators, and engineers spend considerable effort to maintain network availability. In this section, we will briefly describe the types of restoration methods we find at the various network layers. Then, we will describe how network disruptions affect the IP backbone, the types of restoration methods used to handle them, and finally how the network is designed to meet the needed availability. Table 2.3 summarizes typical restoration methods used in some of today’s network core layers that are most relevant to the IP backbone. See [11] for descriptions of restoration methods used in other layers shown in Fig. 2.3. In the next sections, we will describe the rows of this table. Note that the table is approximate and does not apply universally to all telecommunication carriers.
Table 2.3 Example of core-segment restoration methods

Network layer | Restoration method(s) against network failures that originate at that layer or lower layers | Exemplary restoration time scale
Fiber | No automatic rerouting | Hours (manual)
DWDM | 1) Manual; 2) 1+1 restoration (also called dedicated protection) | 1) Hours (manual); 2) 3–20 ms
SONET Ring | Bidirectional Line-Switched Rings (BLSR) | 50–100 ms
IOS (DCS) | Distributed path-based mesh restoration | Sub-second to seconds
W-DCS | No automatic rerouting | Hours
IP backbone | 1) IGP reconfiguration; 2) MPLS Fast Reroute (FRR) | 1) 10–60 s; 2) 50–100 ms

2.5.1 Restoration in Non-IP Layers

2.5.1.1 Fiber Layer

As we described earlier, in most central offices today, optical interfaces on switching or transport equipment connect to fiber patch panels. Some carriers have installed an automated fiber patch panel, also called a Fiber Cross-Connect (FXC), which allows an operator to control the cross-connects remotely. Some of the enabling technologies include physical crossbars using optical collimators and Micro-Electro-Mechanical Systems (MEMS). A good overview of these technologies can be found in [12]. When disruptions occur to the fiber layer, most commonly from construction activity, network operators can reroute around the failed fiber by using a patch panel to cross-connect the equipment onto undamaged fibers. This may require coordination of cross-connects at intermediate central offices to patch a path through alternate COs if an entire cable is damaged. Of course, this typically is a slow manual process, as reflected in Table 2.3, and so higher-layer restoration is usually utilized for disruptions to the fiber layer.

2.5.1.2 DWDM Layer

Some readers may be surprised to learn that carriers have deployed few (if any) automatic restoration methods in their DWDM layers (neither metro nor core segment).
The one type of restoration occasionally deployed is one-by-one (1:1) or one-plus-one (1+1) tail-end protection switching, which switches at the endpoints of the DWDM layer connection. With 1+1 switching, the signal is duplicated and transmitted across two (usually) diversely routed connections. The path of the connection during the nonfailure state is usually called the working path (also called the primary or service path); the path of the connection during the failure state is called the restoration path (also called protection path or backup path). The receiver consists of a simple detector and switch that detects failure of the signal on the working path (more technically, detects performance errors such as average BER threshold crossings) and switches to the restoration path upon alarm. Once adequate signal performance is again achieved on the signal along the working path (including a time-out threshold to avoid link “flapping”), it switches back to the working path. In 1:1 protection switching, there is no duplication of signal, and thus the restoration connection can be used for other transport in nonfailure states. The transmitted signal is switched to the restoration path upon detection of failure of the service path and/or notification from the far end. Technically speaking, in ROADM or point-to-point DWDM systems, 1+1 or 1:1 protection switching is usually implemented electronically via the optical transponders. Consequently, these methods can be implemented at other transport layers, such as DCS, IOS, and SONET. The major advantage of 1+1 or 1:1 methods is that they can trigger in as little as 3–20 ms. However, because these methods require restoration paths that are dedicated (one-for-one) for each working connection, the resulting restoration capacity cannot be shared among other working connections for potential failures.
Furthermore, the restoration paths are diversely routed and are often much longer than their working paths. Consequently, 1+1 and 1:1 protection switching tend to be the costliest forms of restoration.

2.5.1.3 SONET Ring Layer

The two most common types of deployed SONET or SDH self-healing ring technology are Unidirectional Path Switched Ring (UPSR-2F) and Bidirectional Line-Switched Ring (BLSR-2F). The “2F” stands for “2-Fibers”. For simplicity, we will limit our discussion to SONET rings, but there is a very direct analogy for SDH rings. However, note that ADM-ADM ring links are sometimes transported over a lower DWDM layer, thus forming a “connection” that is routed over channels of DWDM systems, instead of direct fiber. Although there is no inherent topographical orientation in a ring, many people conceptually visualize each node of a SONET self-healing ring as an ADM with an east bidirectional OC-n interface (i.e., a transmit port and a receive port) and a west OC-n interface. Typically, n = 48 or 192. An STS-k SONET-Layer connection enters at an add/drop port of an ADM, routes around the ring on k STS-1 channels of the ADM–ADM links and exits the ring at an add/drop port of another ADM. The UPSR is the simplest of the devices and works similarly to the 1+1 tail-end switch described in Section 2.5.1.2, except that each direction of transmission of a connection routes counterclockwise on the “outer” fiber around the ring (west direction) and therefore an STS-k connection uses the same k STS-1 channels on all links around the ring. At each add/drop transmit port, the signal is duplicated in the opposite direction on the “inner” fiber. The selector responds to a failure as described above. The BLSR-2F partitions the bidirectional channels of its East and West high-speed links in half. The first half is used for working (nonfailure) state, and the second half is reserved for restoration. When a failure to a link occurs,
the surrounding ADMs loop back that portion of the connection paths onto the restoration channels around the opposite direction of the ring. The UPSR has very rapid restoration, but suffers the dedicated-capacity condition described in Section 2.5.1.2; as a consequence, today UPSRs are now confined mostly to the metro network, in particular to the portion closest to the customer, often extending into the feeder network. Because BLSR signaling is used to advertise failures among ADMs and real-time intermediate cross-connections have to be made, a BLSR restores more slowly than a UPSR. However, the BLSR is capable of having multiple connections share restoration channels over nonsimultaneous potential network failures, and is thus almost always deployed in the middle of the metro network or parts of the core network. Rings are described in more detail in [11]. 2.5.1.4 IOS Layer The typical equipment that comprise today’s IOS layer use distributed control to provision (set-up) connections. Here, links of the IOS network (SONET bidirectional OC-n interfaces) are assigned routing weights. When a connection is provisioned over the STS-1 channels of an IOS network, its source node (IOS) computes its working path (usually along a minimum-weight path) plus also computes its restoration path that is diversely routed from the working path. After the connection is set up along its working path, the restoration path is stored for future use. The nodes communicate the state of the network connectivity via topology update messages transmitted over the SONET overhead on the links between the nodes. When a failure occurs, the nodes flood advertisement messages to all nodes indicating the topology change. The source node for each affected connection then instigates the restoration process for its failed connections by sending connection request messages along the links of the (precalculated) restoration path, seeking spare STS-1 channels to reroute its connections. 
Various handshakes among nodes of the restoration paths are implemented to complete the rerouting of the connections. Note that in contrast to the dedicated and ring methods, the restoration channels are not prededicated to specific connections and, therefore, connections from a varied set of source/destination pairs can potentially use them. Such a method is called shared restoration because a given spare channel can be used by different connections across nonsimultaneous failures. Shared mesh restoration is generally more capacity-efficient than SONET rings in mesh networks (i.e., networks with average connectivity greater than 2). We now delve a little more into IOS restoration to make a key point that will become relevant to the IP backbone, as well. The example in Fig. 2.2 shows two higher-layer connections routing over the same lower-layer link. In light of the discussion above about the restoration path being diverse from the working path in the IOS layer, the astute reader may ask “diverse relative to what?” The answer is that, in general, the path should be diverse all the way down through the DWDM and Fiber Layers. This requires that the IOS links contain information about how they share these lower-layer links. Often, this is accomplished via a mechanism called “bundle groups”. That is, a bundle group is created for each lower-layer link, but is expressed as a group of IOS links that share (i.e., route over) that link. Diverse restoration paths can be discovered by avoiding IOS links that belong to the same bundle group as a link on the working path. Of course, the equipment in the IOS Layer cannot “see” its lower layers, and consequently has no idea how to define and create the bundle groups. Therefore, bundle groups are provisioned in the IOSs using an Operations Support System (OSS) that contains a database describing the mapping of IOS links to lower-layer networks.
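The bundle-group rule can be made concrete with a small Python sketch. It computes a minimum-weight working path and then a restoration path that avoids every IOS link sharing a bundle group with the working path. The topology, link names, weights, and the bundle group `fiber1` are invented for illustration, and the two-step heuristic (working path first, then restoration path) is an assumption of this sketch rather than the algorithm of any particular IOS; it can fail to find a diverse pair in some topologies even when one exists.

```python
import heapq

def shortest_path(links, src, dst, banned=frozenset()):
    """Dijkstra over links = {link_id: (node_a, node_b, weight)}, skipping banned links.
    Returns the list of link ids on a minimum-weight path, or None if unreachable."""
    adj = {}
    for lid, (a, b, w) in links.items():
        if lid in banned:
            continue
        adj.setdefault(a, []).append((b, w, lid))
        adj.setdefault(b, []).append((a, w, lid))
    heap, seen = [(0, src, [])], set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w, lid in adj.get(node, []):
            if nxt not in seen:
                heapq.heappush(heap, (cost + w, nxt, path + [lid]))
    return None

def diverse_restoration_path(links, bundle_groups, src, dst):
    """Return (working, restoration); restoration avoids all links that share a
    bundle group (i.e., a lower-layer link) with any link on the working path."""
    working = shortest_path(links, src, dst)
    if working is None:
        return None, None
    banned = set(working)
    for members in bundle_groups.values():
        if members & set(working):
            banned |= members
    return working, shortest_path(links, src, dst, banned)

# Hypothetical IOS topology: links AB and AC ride the same fiber span out of A.
links = {"AB": ("A", "B", 1), "AC": ("A", "C", 1), "CB": ("C", "B", 1),
         "AD": ("A", "D", 2), "DB": ("D", "B", 2)}
bundle_groups = {"fiber1": {"AB", "AC"}}

working, restoration = diverse_restoration_path(links, bundle_groups, "A", "B")
print(working, restoration)
```

Without the bundle group, the cheapest alternative to AB would be A-C-B, which looks diverse at the IOS layer but shares a fiber with the working path.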
This particular example illustrates the importance of understanding network layering; otherwise, we will not have a reliable method to plan and engineer the network to meet the availability objective. This point will be equally important to the IP backbone. A set of bundled links is also referred to as a Shared Risk Link Group (SRLG) in the telecommunications industry, since it refers to a group of links that are subject to a shared risk of disruption.

2.5.1.5 W-DCS Layer and Ethernet Layer

There are few restoration methods provided at the W-DCS layer itself. This is because most disruptions to a W-DCS link arise from a disruption of (1) a W-DCS line card or (2) a component in a lower layer over which the link routes. Disruptions of type (1) are usually handled by providing 1:1 restorable intra-office links between the W-DCS and TDM node (IOS or ADM). Disruptions of type (2) are restored by the lower TDM layers. This only leaves failure or maintenance of the W-DCS itself as an unrestorable network disruption. However, a W-DCS is much less sophisticated than a router and less subject to failure. Restoration of Layer 2 VPNs in an IP/MPLS backbone is discussed in Section 2.5.2. We note here that restoration in enterprise Ethernet networks is typically based on the Rapid Spanning Tree Protocol (RSTP). When enterprise Ethernet VPNs are connected over the IP backbone (such as VPLS), an enterprise customer who employs routing methods such as RSTP expects it to work in the extended network. Encapsulating the customer’s Ethernet frames inside pseudowires ensures that the client’s RSTP control packets are transported transparently across the wide area. For example, a client VPN may choose to restore local link disruptions by routing across other central offices or even distant metros. Since all this appears as one virtual network to the customer, such applications may be useful.
2.5.2 IP Backbone

There are two main restoration methods we describe for the IP layer: IGP reconfiguration and MPLS Fast Reroute (FRR).

2.5.2.1 OSPF Failure Detection and Reconvergence

In a formal sense, the IGP reconvergence process responds to topology changes. Such topology changes are usually caused by four types of events:

1. Maintenance of an IP layer component
2. Maintenance of a lower-layer network component
3. Failure of an IP layer component (such as a router line card or common component)
4. Failure of a lower-layer network component (such as a link)

When network operations staff perform planned maintenance on an IP layer link, it is typical to raise the OSPF administrative weight of the link to ensure that all traffic is diverted from the link (this is often referred to as “costing out” the link). In the second case, most carriers have a maintenance procedure where organizations that manage the lower-layer networks schedule their daily maintenance events and inform the IP layer operations organization. The IP layer operations organization responds by costing out all the affected links before the lower-layer maintenance event is started. In the first two cases (planned maintenance activity), the speed of the reconvergence process is usually not an issue. This is because the act of changing an IGP routing weight on a link causes LSAs to be issued. During the process of updating the link status and recomputation of the SPF tree, the affected links remain in service (i.e., “up”). Therefore, once the IGP reconfiguration process has settled, the routers can redirect packets to their new paths. While there may be a transient impact during the “costing out” period, in terms of transient loops and packet loss, the service impact is kept to a minimum by using this costing out technique to remove a link from the topology for performing maintenance.
In the last two cases (failures), once the affected links go down, packets may be lost or delayed until the reconvergence process completes. Such a disruption may be unacceptable to delay- or loss-sensitive applications. This motivates us to examine how to reduce the time required for OSPF to converge from unexpected outages. This is the focus of the remainder of this section. While most large IP backbones route over lower layers, such as DWDM, those do not provide restoration. Layer 1 failure detection is a key component of the IP layer restoration process. A key component of the overall failure recovery time in OSPF-based networks is the failure detection time. However, lower-layer failure detection mechanisms sometimes do not coordinate well with higher-layer mechanisms and do not detect disruptions that originate in the IP layer control plane. As a result, OSPF routers periodically exchange Hello messages to detect the loss of a link adjacency with a neighbor. If a router does not receive a Hello message from its neighbor within a RouterDeadInterval, it assumes that the link to its neighbor has failed, or the neighbor router itself is down, and generates a new LSA to reflect the changed topology. All such LSAs generated by the routers affected by the failure are flooded throughout the network. This causes the routers in the network to redo the SPF calculation and update the next-hop information in their respective forwarding tables. Thus, the time required to recover from a failure consists of: (1) the failure detection time, (2) LSA flooding time, and (3) the time to complete the new SPF calculations and update the forwarding tables. To avoid a false indication that an adjacency is down because of congestion-related loss of Hello messages, the RouterDeadInterval is usually set to be four times the HelloInterval, the interval between successive Hello messages sent by a router to its neighbor.
With the RFC suggested default values for these timers (HelloInterval value of 10 s and RouterDeadInterval value of 40 s), the failure detection time can take anywhere between 30 and 40 s. LSA flooding times consist of propagation delay and additional pacing delays inserted by the router. These pacing delays serve to rate-limit the frequency with which LSUpdate packets are sent on an interface. Once a router receives a new LSA, it schedules an SPF calculation. Since the SPF calculation using Dijkstra’s algorithm (see e.g., [8]) constitutes a significant processing load, a router typically waits for additional LSAs to arrive for a time interval corresponding to spfDelay (typically 5 s) before doing the SPF calculation on a batch of LSAs. Moreover, routers place a limit on the frequency of SPF calculations (governed by a spfHoldTime, typically 10 s, between successive SPF calculations), which can introduce further delays. From the description above, it is clear that reducing the HelloInterval can substantially reduce the Hello protocol’s failure detection time. However, there is a limit to which the HelloInterval can be safely reduced. As the HelloInterval becomes smaller, there is an increased chance that network congestion will lead to loss of several consecutive Hello messages and thereby cause a false alarm that an adjacency between routers is lost, even though the routers and the link between them are functioning. The LSAs generated because of a false alarm will lead to new SPF calculations by all the routers in the network. This false alarm would soon be corrected by a successful Hello exchange between the affected routers, which then causes a new set of LSAs to be generated and possibly new path calculations by the routers in the network. Thus, false alarms cause an unnecessary processing load on routers and sometimes lead to temporary changes in the path taken by network traffic. 
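The timer arithmetic in this discussion can be written down as a short Python sketch. It is a simplified model under two stated assumptions: the dead timer is restarted by the last Hello received, which was sent at most one HelloInterval before the failure, and an SPF run is scheduled spfDelay after an LSA arrives but never sooner than spfHoldTime after the previous run. The function names are illustrative, and real routers differ in how they implement these timers.

```python
def detection_window(hello_interval, dead_multiplier=4):
    """Return (earliest, latest) failure detection time in seconds.

    The dead timer restarts at the last Hello received, which was sent
    between 0 and hello_interval seconds before the failure occurred.
    """
    dead_interval = dead_multiplier * hello_interval
    return dead_interval - hello_interval, dead_interval

def next_spf_time(lsa_arrival, last_spf=None, spf_delay=5.0, spf_hold=10.0):
    """Return when the SPF calculation runs for an LSA arriving at lsa_arrival."""
    run_at = lsa_arrival + spf_delay
    if last_spf is not None:
        # spfHoldTime enforces a minimum gap between successive SPF runs.
        run_at = max(run_at, last_spf + spf_hold)
    return run_at

print(detection_window(10))                  # default timers: 30 to 40 s
print(detection_window(0.5))                 # HelloInterval of 0.5 s: 1.5 to 2 s
print(next_spf_time(100.0))                  # no recent SPF: spfDelay applies
print(next_spf_time(100.0, last_spf=98.0))   # recent SPF: spfHoldTime dominates
```

The last call shows why an earlier false alarm can postpone recovery: an SPF run triggered shortly before the real failure pushes the next calculation out by the hold time.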
If false alarms are frequent, routers have to spend considerable time doing unnecessary LSA processing and SPF calculations, which may significantly delay important tasks such as Hello processing, thereby leading to more false alarms. False alarms can also be generated if a Hello message gets queued behind a burst of LSAs and thus cannot be processed in time. The possibility of such an event increases with the reduction of the RouterDeadInterval. Large LSA bursts can be caused by a number of factors such as simultaneous refresh of a large number of LSAs or several routers going down/coming up simultaneously. Choudhury [5] studies this issue and observes that reducing the HelloInterval lowers the threshold (in terms of number of LSAs) at which an LSA burst will lead to generation of false alarms. However, the probability of LSA bursts leading to false alarms is shown to be quite low. Since the loss and/or delayed processing of Hello messages can result in false alarms, there have been proposals to give such packets prioritized treatment at the router interface as well as in the CPU processing queue [5]. An additional option is to consider the receipt of any OSPF packet (e.g., an LSA) from a neighbor as an indication of the good health of the router’s adjacency with the neighbor. This provision can help avoid false loss of adjacency in the scenarios where Hello packets get dropped because of congestion, caused by a large LSA burst, on the link between two routers. Such mechanisms may help mitigate the false alarm problem significantly. However, it will take some time before these mechanisms are standardized and widely deployed. It is useful to make a realistic assessment regarding how small the HelloInterval can be, to achieve faster detection and recovery from network failures while limiting the occurrence of false alarms. We summarize below the key results from [13].
This assessment was done via simulations on the network topologies of commercial ISPs using a detailed implementation of the OSPF protocol in the NS2 simulator. The work models all the important OSPF protocol features as well as various standard and vendor-introduced delays in the functioning of the protocol. These are shown in Table 2.4. Goyal [13] observes that with the current default settings of the OSPF parameters, the network takes several tens of seconds before recovering from a failure. Since the main component in this delay is the time required to detect a failure using the Hello protocol, Goyal [13] examines the impact of lower HelloInterval values on failure detection and recovery times. Table 2.5 shows typical results for failure detection and recovery times after a router failure. As expected, the failure detection time is within the range of three to four times the value of HelloInterval. Once a neighbor detects the router failure, it generates a new LSA about 0.5 s after the failure detection. The new LSA is flooded throughout the network and will lead to scheduling of an SPF calculation 5 s (spfDelay) after the LSA receipt. This is done to allow one SPF calculation to take care of several new LSAs. Once the SPF calculation is done, the router takes about 200 ms more to update the forwarding table. After including the LSA propagation and pacing delays, one can expect the failure recovery to take place about 6 s after the ‘earliest’ failure detection by a neighbor router. Notice that many entries in Table 2.5 show the recovery to take place much sooner than 6 s after failure detection. This is partly an artifact of the simulation because the failure detection times reported by the simulator are the “latest” ones rather than the “earliest”. In one interesting case (seed 2, HelloInterval 0.75 s), the failure recovery takes place about 2 s after the ‘latest’ failure detection. 
This happens because the SPF calculation scheduled by an earlier false alarm takes care of the LSAs generated because of router failure. There are also many cases in which failure recovery takes place more than 6 s after failure detection (notice entries for HelloInterval 0.25 s, seeds 1 and 3). Failure recovery can be delayed because of several factors. The SPF calculation frequency of the routers is limited by spfHoldTime (typically 10 s), which can delay the new SPF calculation in response to the router failure. The delay caused by spfDelay is also a contribution. Finally, the routers with a low degree of connectivity may not get the LSAs in the first try because of loss due to congestion. Such routers may have to wait for 5 s (RxmtInterval) for the LSAs to be retransmitted.

Table 2.4 Various delays affecting the operation of OSPF protocol

Standard configurable delays
RxmtInterval | The time delay before an un-acked LSA is retransmitted. Usually 5 s.
HelloInterval | The time delay between successive Hello packets. Usually 10 s.
RouterDeadInterval | The time delay since the last Hello before a neighbor is declared to be down. Usually four times the HelloInterval.

Vendor-introduced configurable delays
Pacing delay | The minimum delay enforced between two successive Link-State Update packets sent down an interface. Observed to be 33 ms. Not always configurable.
spfDelay | The delay between the shortest path calculation and the first topology change that triggered the calculation. Used to avoid frequent shortest path calculations. Usually 5 s.
spfHoldTime | The minimum delay between successive shortest path calculations. Usually 10 s.

Standard fixed delays
LSRefreshTime | The maximum time interval before an LSA needs to be reflooded. Set to 30 min.
MinLSInterval | The minimum time interval before an LSA can be reflooded. Set to 5 s.
MinLSArrival | The minimum time interval that should elapse before a new instance of an LSA can be accepted. Set to 1 s.

Router-specific delays
Route install delay | The delay between the shortest path calculation and update of forwarding table. Observed to be 0.2 s.
LSA generation delay | The delay before the generation of an LSA after all the conditions for the LSA generation have been met. Observed to be around 0.5 s.
LSA processing delay | The time required to process an LSA including the time required to process the Link-State Update packet before forwarding the LSA to the OSPF process. Observed to be less than 1 ms.
SPF calculation delay | The time required to do shortest path calculation. Observed to be 0.00000247x^2 + 0.000978 s on Cisco 3600 series routers; x being the number of nodes in the topology.

Table 2.5 Failure detection time (FDT) and failure recovery time (FRT) for a router failure with different HelloInterval values

Hello interval (s) | Seed 1 FDT (s) | Seed 1 FRT (s) | Seed 2 FDT (s) | Seed 2 FRT (s) | Seed 3 FDT (s) | Seed 3 FRT (s)
10 | 32.08 | 36.60 | 39.84 | 46.37 | 33.02 | 38.07
2 | 7.82 | 11.68 | 7.63 | 12.18 | 7.79 | 12.02
1 | 3.81 | 9.02 | 3.80 | 8.31 | 3.84 | 10.11
0.75 | 2.63 | 7.84 | 2.97 | 5.08 | 2.81 | 7.82
0.5 | 1.88 | 6.98 | 1.82 | 6.89 | 1.79 | 6.85
0.25 | 0.95 | 10.24 | 0.84 | 6.08 | 0.99 | 13.41

The results in Table 2.5 show that a smaller value of HelloInterval speeds up the failure detection but is not effective in reducing the failure recovery times beyond a limit because of other delays like spfDelay, spfHoldTime, and RxmtInterval. Failure recovery times improve as the HelloInterval reduces down to about 0.5 s. Beyond that, as a result of more false alarms, we find that the recovery times actually go up. While it may be possible to further speed up the failure recovery by reducing the values of these delays, eliminating such delays altogether is not prudent. Eliminating spfDelay and spfHoldTime will result in potentially additional SPF calculations in a router in response to a single failure (or false alarm) as the different LSAs generated because of the failure arrive one after the other at the router.
The resulting overload on the router CPUs may have serious consequences for routing stability, especially when there are several simultaneous changes in the network topology. Failure recovery below the range of 1–5 s is difficult with OSPF. In summary, OSPF recovery time can be lowered by reducing the value of HelloInterval. However, too small a value of HelloInterval will lead to many false alarms in the network, which cause unnecessary routing changes and may lead to routing instability. The optimal value for the HelloInterval that will lead to fast failure recovery in the network, while keeping the false alarm occurrence within acceptable limits for a network, is strongly influenced by the expected congestion levels and the number of links in the topology. While the HelloInterval can be much lower than the current default value of tens of seconds, it is not advisable to reduce it to the millisecond range because of potential false alarms. Further, it is difficult to prescribe a single HelloInterval value that will perform optimally in all cases. The network operator needs to set the HelloInterval conservatively, taking into account both the expected congestion as well as the number of links in the network topology.

2.5.2.2 MPLS Fast Reroute

MPLS Fast Reroute (FRR) was designed to improve restoration performance using the additional protocol layer provided by MPLS LSPs [17]. Primary and alternate (backup) LSPs are established. Fast rerouting over the alternate paths after a network disruption is achieved using preestablished router forwarding table entries. Equipment suppliers have developed many flavors of FRR, some of which are not totally compliant with standardized MPLS FRR. This section provides an overview of the basic concept. There are two basic varieties of backup path restoration in MPLS FRR, called next-hop and next-next-hop. The next-hop approach identifies a unidirectional link to be protected and a backup (or bypass) unidirectional LSP that routes around the link if it fails. The protected link can be a router–router link adjacency or even another layer of LSP tunnel itself. The backup LSP routes over alternate links. The top graph in Fig. 2.26 illustrates a next-hop backup path for the potential failure of a given link (designated with an “X”). For now ignore the top path labeled “MPLS secondary LSP tunnel”, which will be discussed later. With the next-next-hop approach, the primary entities to protect are two-link working paths. The backup path is an alternate path over different links and routers than the protected entity. In general, a next-hop path is constructed to restore against individual link failures while next-next-hop paths are constructed to restore against both individual link failures and node failures. The trade-off is that next-hop paths are simpler to implement because all flows routing over the link can be rerouted similarly, whereas next-next-hop requires more LSPs and routing combinations. This is illustrated in the lower example of Fig. 2.26, wherein the first router along the path carries flows that terminate on different second hop routers, and therefore must create multiple backup LSPs that originate at that node.

Fig. 2.26 Example of Fast Reroute backup paths (legend: MPLS secondary LSP tunnel; MPLS primary LSP tunnels; PHY layer links; MPLS next-hop backup path; MPLS next-next-hop backup paths; “X” marks the failed link)

We will briefly describe an implementation of the next-hop approach to FRR. A primary end-to-end path is chosen by RSVP. This path is characterized by the Forwarding Equivalence Class (FEC) discussed earlier and reflects packets that are to be corouted and have similar CoS queuing treatment and ability to be restored with FRR. Often, a mesh of fully connected end-to-end LSPs between the backbone routers (BRs) is created. As discussed in earlier sections, an LSP is identified in forwarding tables by mappings of pairs of label and interface: (In-Label, In-Interface) → (Out-Label, Out-Interface). An end-to-end LSP is provisioned (set up) by choosing and populating these entries at each intermediate router along the path by a protocol such as RSVP-TE. For the source router of the LSP, the “In-Label” variable is equivalent to the FEC. As a packet hops along routers, the labels are replaced according to the mapping until it reaches the destination router, in which case, the MPLS shim headers are popped and packets are placed on the final output port. With next-hop, facility-based FRR, a backup (or bypass) LSP is set up for each link. For example, consider a precalculated backup path to protect a link between routers A and B, say (A-1, B-1), where A-1 is the transmit interface at router A, B-1 is the receive interface at router B, and L-1 is the MPLS label for the path over this link. The forwarding table entries are of form (L-i, A-k) → (L-1, A-1) at router A and (L-1, B-1) → (L-j, B-s) at router B. When this link fails, a Layer 1 alarm is generated and forwarded to the router controller or line card at A and B. For packets arriving at router A, mapping entries in the forwarding table with the Out-Interface = A-1 have another (outer) layer of label pushed on the MPLS stack to coincide with the backup path. This action is preloaded into the forwarding table and triggered by the alarm. Forwarding continues along the routers of this backup LSP by processing the outer layer labels as with any MPLS packet. The backup path ends at router B and, therefore, when the packets arrive at router B, their highest (exterior) layer label is popped.
Then, from the point of view of router B, after the outer label is popped, the MPLS header is left with (In-Label, In-Interface) = (L-1, B-1), and therefore the packets continue their journey beyond router B just as they would if link (A-1, B-1) were up. In this way, all LSPs that route over the particular link are rerouted (hence the term “facility based”). Various other specifications can be made to restrict the backup path to given classes of LSPs, for example, to provide restoration for some IP CoSs and not others.

Another common implementation of next-hop FRR defines 1-hop pseudowires for each key link. Each pseudowire has a defined primary LSP and backup LSP (a capability found in most routers). If the link fails, a similar alarm mechanism causes the pseudowire to reroute over the backup LSP. When the primary LSP is again declared up, the pseudowire switches back to the primary path. An advantage of this method is that the pseudowire appears as a link to the IGP routing algorithm. Weights can be used to control how packets route over it or the underlying Layer 1 link. Section 2.6 illustrates this method for an IPTV backbone network.

Many vendors and carriers have demonstrated that MPLS FRR works very rapidly (in less than 100 ms) in response to single-link (IP layer PHY link) failures. Most FRR implementations behave similarly during the small interval immediately after the failure and before IGP reconvergence. However, implementations differ in what happens after IGP reconvergence. We describe two main approaches in the context of next-hop FRR here.

In the first approach, the backup LSP stays in place until the link goes back into service and IGP reconverges back to its nonfailure state. This is most common when a separate LSP or pseudowire is associated with each link in next-hop FRR. In this case, the link-LSP is rerouted onto its backup LSP and stays that way until the primary LSP is repaired.
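The facility-based, next-hop label operations described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a router implementation. Routers A and B, interfaces A-1, A-k, B-1, B-s, and labels L-i, L-1, L-j come from the text; the `Router` class, the intermediate bypass router C, its interfaces (A-2, C-1, C-2, B-2), and the bypass labels BP-1 and BP-2 are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    name: str
    # Normal label swap: (in_label, in_interface) -> (out_label, out_interface)
    swap: dict = field(default_factory=dict)
    # Preloaded FRR action: protected out_interface -> (bypass_label, bypass_interface)
    bypass: dict = field(default_factory=dict)
    # Bypass labels that end at this router -> receive interface of the protected link
    bypass_tail: dict = field(default_factory=dict)
    # Out-interfaces currently in alarm
    failed: set = field(default_factory=set)

    def forward(self, labels, in_if):
        """Process one packet; labels is the MPLS stack with the top label last."""
        labels = list(labels)
        if labels[-1] in self.bypass_tail:
            # End of the bypass LSP: pop the outer label and treat the packet
            # as if it had arrived on the protected link's receive interface.
            in_if = self.bypass_tail[labels.pop()]
        out_label, out_if = self.swap[(labels[-1], in_if)]
        labels[-1] = out_label
        if out_if in self.failed:
            # Protected link is down: push the bypass label, use the backup interface.
            bypass_label, out_if = self.bypass[out_if]
            labels.append(bypass_label)
        return labels, out_if


def send(link_failed):
    """Send one packet with label L-i into router A; return what leaves router B."""
    a = Router("A", swap={("L-i", "A-k"): ("L-1", "A-1")},
               bypass={"A-1": ("BP-1", "A-2")})
    c = Router("C", swap={("BP-1", "C-1"): ("BP-2", "C-2")})  # hypothetical bypass hop
    b = Router("B", swap={("L-1", "B-1"): ("L-j", "B-s")},
               bypass_tail={"BP-2": "B-1"})
    if link_failed:
        a.failed.add("A-1")  # Layer 1 alarm for link (A-1, B-1)
    labels, out_if = a.forward(["L-i"], "A-k")
    if out_if == "A-2":
        # Detour A -> C -> B over the bypass LSP.
        labels, _ = c.forward(labels, "C-1")
        labels, out_if = b.forward(labels, "B-2")
    else:
        # Direct link (A-1, B-1).
        labels, out_if = b.forward(labels, "B-1")
    return labels, out_if
```

Calling `send(False)` and `send(True)` yields the same label and output interface at router B, which is the point of the facility-based scheme: everything downstream of B is unaware that the link failed.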
In the second approach, FRR provides rapid restoration and then, after a short settling period, the network recomputes its paths [4]. Here, each primary end-to-end LSP is recomputed during the first IGP reconfiguration process after the failure. Since the IGP knows about the failed link(s), it reroutes the primary end-to-end LSPs around them and the backup LSPs become moot. This is illustrated in the three potential paths in the topmost diagram of Fig. 2.26. The IP flow routes along the primary LSP during the nonfailure state. Then, the given link fails and the path of the flow over the failed link deviates along the backup LSP, as shown by the lower dashed line. After the first IGP reconfiguration process, the end-to-end LSP path is recomputed, illustrated by the topmost dashed line.

When a failed component is repaired or a maintenance procedure is completed, the disrupted links are put back into service. The process to return the network to its nonfailure state is often called normalization. During the normalization process, LSAs are broadcast by the IGP and the forwarding tables are recalculated. The normalization process is often controlled by an MPLS route mechanism/timer. A similar procedure would occur for next-next-hop.

The reason for the second approach is that, while FRR enables rapid restoration, its backup paths are segmental “patches” to the primary paths, so the alternate route is often long and capacity-inefficient. With the first approach, IP flows continue routing over the backup paths until the repair is completed and alarms clear, which may span hours or days. Another reason is that if multiple link failures occur, then some of the backup FRR paths may fail; some response is needed to address this situation. These limitations of the first approach were early key inhibitors to implementation of FRR in large ISPs.
The key to implementing this second FRR strategy is that the switch from FRR backup paths to new end-to-end paths must be hitless (i.e., incur negligible packet loss); otherwise, we may suffer three hits from each single failure (the failure itself, the process to reroute the end-to-end paths immediately after the failure, and then the process to revert to the original paths after repair). If the alternate end-to-end LSPs are set up in advance and the forwarding table changes are implemented efficiently in most routers (often using pointers), this process is essentially hitless for most IP unicast (point-to-point) applications. We note, however, that today’s multicast does not typically enjoy hitless switchover to the new forwarding table because most multicast trees are built via join and prune request messages issued backwards (upstream) from the destination nodes. It is expected that different implementations of multicast will fix this problem in the future. We discuss this again in Section 2.6 and refer the reader to [36] for more discussion of hitless multicast.

In the network design phase of implementing next-hop FRR, each link (say L) along the primary path needs a predefined backup path whose routing is diverse in lower layers. That is, the paths of all lower-layer connections that support the links of the backup path are disjoint from the path of the lower-layer connection for link L. The key is in predefining the backup tunnels. While next-next-hop paths can also be used to restore against single-link failures, the network becomes more complex to design if there is a high degree of lower-layer link overlap. More generally, the major difficulty for the FRR approach is defining the backup LSPs so that the service paths can be rerouted, given a predefined set of lower-layer failures.
Furthermore, when multiple lower-layer failures occur and MPLS backup paths fail, FRR does not work and the network must revert to the slower primary path recalculation approach (the second approach described above).

2.5.3 Failures Across Multiple Layers

Now that the reader is armed with background on network layering and restoration methods, we are poised to delve deeper into the factors and carrier decision variables that shape the availability of the IP backbone. Let us briefly revisit Fig. 2.9, which gives a simple example of the core ROADM Layer Diagram. Consider a backbone router (BR) in central office B with a link to one of the backbone routers in central office A. Furthermore, consider the remote access router (RAR) that is homed to the backbone router in office A. However, let us add a twist wherein the link between the RAR and BR routes over the IOS layer instead of directly onto the ROADM (DWDM layer) as pictured in Fig. 2.9. This can occur for RAR–BR links with lower bandwidth. This modification will illustrate more of the potential failure modes. In particular, we have constructed this simple example to illustrate several key points:

• Computing an estimate of the availability of the IP backbone involves analysis of many network layers.
• Network disruptions can originate from many different sources within each layer.
• Some lower layers may provide restoration and others do not; how does this affect the IP backbone?

Figure 2.27 gives examples of the types of individual component disruptions (“down events”) that might cause links to fail in this network example, but still only shows a few of the many disruptions that can originate at these layers. As one can see, this is a four-layer example, and some of the layers are skipped. Note that for simplicity, we illustrate point-to-point DWDM systems at the DWDM layer; however, the concepts apply equally well for ROADMs.
Some readers may think that the main source of network failures is fiber cuts and, therefore, that the entire area of multilayer restoration can be reduced to analyzing fiber cuts. However, this oversimplifies the problem. For example, an amplifier failure can often be as disruptive as a fiber cable cut and will likely result in the failure of multiple IP layer links. Furthermore, amplifier failures are more frequent. Let us examine the effect of some of the failures illustrated in Fig. 2.27.

[Fig. 2.27 Example of component disruptions (failure or maintenance activity) at multiple layers. Layers shown: IP Layer; IOS Layer; ROADM/Point-to-point DWDM Layer; Fiber Layer. Disruptions shown: router common component; router line card; IOS common component; IOS line card; OT; DWDM common component or amplifier; intra-office fiber; fiber cable. BR = Backbone Router; AR = Access Router; ROADM = Reconfigurable Optical Add/Drop Multiplexer; OT = Optical Transponder; DWDM = (Dense) Wavelength Division Multiplexer; IOS = Intelligent Optical Switch.]

IOS interface failure: The IOS network has restoration capability, as described in earlier sections. Consequently, the IOS layer reroutes its failed SONET STS-n connection that supports the RAR–BR link onto its restoration path. In this case, once the SONET alarms are detected by the two routers (the RAR and BR), they take the link out of service and generate appropriate LSAs to the correct IGP administrative areas or control domains to announce the topology change. Assuming that the IOS-layer restoration is successful, the RAR–BR link comes back after a short time (as specified in the IOS layer of Table 2.3) and the SONET alarm clears. Perhaps after an appropriate time-out on the routers to avoid link flapping, the link is brought back up by the router and the topology change is announced via LSAs.
We note that in a typical AR/BR homing architecture, the LSAs from an AR–BR link are only announced in subareas and so do not affect other ARs or BRs.

Fiber cut: In the core network, the probability of a fiber cut is roughly proportional to the length of the fiber. Fiber cuts are less frequent than many of the other failures, but they are highly disruptive, usually causing many IP layer links to fail simultaneously because of the concentration of capacity enabled by DWDM.

Optical Transponder: OT failure is the most common of the failures shown in Fig. 2.27. However, a single OT failure only affects individual IP backbone links. Some of the more significant problems with OT failures are (1) performance degradation, where bit errors occasionally trip BER threshold crossing alerts, and (2) a nonnegligible probability of multiple failures in the network, in which an OT fails while another major failure is in progress or vice versa.

DWDM terminal or amplifier: Amplifier failure is usually the most disruptive of failures because of its impact (multiple wavelengths) and the sheer quantity of amplifiers, which are often placed every 50–100 miles, depending on the vintage and bit rate of the wavelengths of the DWDM equipment. Failure of the DWDM terminal equipment not associated with amplifiers and OTs is less probable because of the increased use of passive (nonelectrical or unpowered) components. Note that in Fig. 2.27, for the OT, fiber cut, and amplifier failure, the affected connections at their respective layers are unrestored. Thus, the IP layer must reroute around its lost link capacity.

Intra-office fiber: These disruptions usually occur from maintenance, reconfiguration, and provisioning activity in the central office.
This has been minimized over the years due to the use of fiber patch panels; however, when significant network capacity expansion or reconfiguration occurs, especially for the deployment of new technologies, architectures, or services, downtime from this class of failures typically spikes. It is typical to lump the intra-office fiber disruptions into the downtime for a line card or port and model them as one unit.

Router: These network disruptions include failure of router line cards, failure of router common equipment, and maintenance or upgrade of all or parts of the router. Note that for these disruptions that originate at the IP layer, no lower-layer restoration method can help because rerouting the associated connections at the lower layers will not bring the affected link back up. However, in the dual-homing AR–BR architecture, all the ARs that home to the affected router can alternatively reroute through the mate BR. The method of rerouting the AR traffic to the surviving AR–BR links differs per carrier. Usually, IGP reconfiguration is used. However, this can be unacceptably slow for some high-priority services, as evidenced by Table 2.3. Therefore, other faster techniques are sometimes used, such as Ethernet link load balancing or MPLS FRR.

We generalize some simple observations on multilayer restoration illustrated by Fig. 2.27 and its subsequent discussion:

1. Because of the use of express links, a single network failure or disruption at a lower layer usually results in multiple link failures at higher layers.
2. Failures that originate at an upper layer cannot be restored at a lower layer.
3. To meet most ISP network availability objectives, some form of restoration (even if rudimentary) must be provided in upper layers.

2.5.4 IP Backbone Network Design

Network design is covered in more detail in Chapter 5.
However, to tie together the concepts of network layering, network failure modeling, and restoration, we provide a brief description of IP network design here to illustrate its importance in meeting network availability targets and to show how these factors are accommodated in the design. To illustrate this, we describe a very simplified network design (or network planning) process as follows. This process would occur every planning period or whenever major changes to the network occur:

1. Derive a traffic matrix.
2. Input the existing IP backbone topology and compute any needed changes. That is, determine the homing of AR locations to the BR locations and determine which BR pairs are allowed to have links placed between them.
3. Determine the routing of BR–BR links over the lower-layer networks (e.g., DWDM, IOS, fiber).
4. Route the traffic matrix over the topology and size the links. This results in an estimate of network cost across all the needed layers.
5. Resize the links by finding their maximum needed capacity over all possible events in the Failure Set, which models potential network disruptions (both component failures and maintenance activity). This step simulates each failure event, determining which IP layer links or nodes fail after lower-layer restoration, if it exists, is applied and determining the capacity needed after traffic is rerouted using IP layer restoration.
6. Re-optimize the topology by going back to step 2 and iterating with the objective of lowering network cost.

Note in steps 2 and 3 that most carriers are reluctant to make large changes to the existing IP backbone topology, since these can be very disruptive and costly events. Therefore, steps 2 and 3 usually incur small topology changes from one planning period to the next. We will not describe algorithms for the above in detail here.
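Steps 4 and 5 of the process above can be sketched as follows. This is a minimal illustration, assuming shortest-path (IGP-style) routing, a failure set made of failed IP layer links only, and a single utilization target; the function names, the three-node topology, the demands, and the `max_util` parameter are all hypothetical and are not taken from the chapter or from [22, 23]. For simplicity, the load on a link is the sum of both directions.

```python
import heapq
from collections import defaultdict


def shortest_path(adj, src, dst):
    """Dijkstra over adj = {node: {neighbor: weight}}; returns a node list or None."""
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def link_loads(links, demands, failed=frozenset()):
    """Route every demand over the surviving links and return the load per link.

    links:   {(u, v): igp_weight}, keys are sorted node pairs
    demands: {(src, dst): load}
    failed:  set of link keys that are down in this failure event
    """
    adj = defaultdict(dict)
    for (u, v), w in links.items():
        if (u, v) not in failed:
            adj[u][v] = w
            adj[v][u] = w
    loads = defaultdict(float)
    for (s, t), load in demands.items():
        path = shortest_path(adj, s, t)
        if path is None:
            continue  # demand cannot be restored in this event; ignored in this sketch
        for u, v in zip(path, path[1:]):
            loads[tuple(sorted((u, v)))] += load
    return dict(loads)


def size_links(links, demands, failure_set, max_util=0.8):
    """Step 5: capacity per link = worst-case load over all events / utilization target."""
    worst = defaultdict(float)
    for event in [frozenset()] + list(failure_set):  # include the nonfailure state
        for link, load in link_loads(links, demands, event).items():
            worst[link] = max(worst[link], load)
    return {link: load / max_util for link, load in worst.items()}


# Hypothetical triangle of backbone routers with equal IGP weights.
links = {("A", "B"): 1, ("A", "C"): 1, ("B", "C"): 1}
demands = {("A", "B"): 10.0, ("B", "C"): 5.0}
failure_set = [frozenset({link}) for link in links]  # all single-link failures
capacity = size_links(links, demands, failure_set)
```

In this toy case the nonfailure load on link A–B is 10, but a failure of link B–C pushes the B-to-C demand over A–B as well, so the worst-case load (15) rather than the normal load drives the capacity. This is why simply watching link loads in the nonfailure state, as discussed next, can mislead.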
Approaches to the above problem can be found in [22, 23].

The traffic matrix can come in a variety of forms, such as the peak 5-min average loads between AR pairs, or average loads. Unfortunately, many organizations responsible for IP network design have little or no data about their current or future traffic matrices. In fact, many engineers who manage IP networks expand their network by simply observing link loads. When a link load exceeds some threshold, they add more capacity. Given no knowledge or high uncertainty of the true, stochastic traffic matrix, this may be a reasonable approach. However, network failures and their subsequent restorations are the phenomena that cause the greatest challenges with such a simple approach. Because of the extensive rerouting that can occur after a network failure, there is no simple or intuitive parameter to determine the utilization threshold for each link. Traffic matrix estimation is discussed in detail in Chapter 5.

A missing ingredient in the above network design algorithm is that we did not describe how to model the network availability an ISP needs to achieve its SLAs. Theoretically, even if we assume the traffic matrix (present and/or future) is completely accurate, to achieve the network design availability objective, all the component failure modes and all the network layering must be modeled to design the IP backbone. The decision variables are the layers where we provide restoration (including what type of restoration should be used) and how much capacity should be deployed at each layer to meet the QoS objectives for the IP layer. This is further complicated by the fact that while network availability objectives for transport layers are often expressed in worst-case or average-case connection uptimes, IP backbone QoS objectives often use packet-loss metrics.
However, we can approximate the packet loss constraints in large IP layer networks by establishing maximum link utilization targets. For example, through separate analysis it might be determined that every flow can achieve the objective maximum packet loss target by not exceeding 90% utilization on any 40 Gb/s link, with perhaps lower utilization maxima needed on lower-rate links. Then, one can model when this utilization condition is met over the set of possible failures, including subsequent restoration procedures. By modeling the probabilities of the failure set, one can compute a network availability metric appropriate for packet networks. The probabilities of events in the failure set can be computed using Markov models and the Mean Time Between Failures (MTBF) and the Mean Time to Repair (MTTR) of the component disruptions. These parameters are usually obtained from a combination of equipment-supplier specifications, network observation/data, and carrier policies and procedures. A major stumbling block with this theoretical approach is that the failure event space is exponential in size. Even for very small networks and a few layers, it is intractable to compute all potential failures, let alone the subsequent restoration and network loss. An approach to probabilistic modeling to solve this problem is presented in more detail in Chapter 4 and in [28].

Armed with this background, we conclude this section by revisiting the issue of why we show the IP backbone routing over an unrestorable DWDM layer in the network layering of Fig. 2.3. This at first may seem counterintuitive because it is generally true that, per unit of capacity, the cost of links at lower layers is less than that of higher layers. Some of the reasons for this planning decision, which is consistent with most large ISPs, were hinted at in Section 2.5.3. We summarize them here.

1. Backbone router disruptions (failures or maintenance events) originate within the IP layer and cannot be restored at lower layers. Extra link capacity must be provided at the IP layer for such disruptions. Once placed, this extra capacity can then also be used for IP layer link failures that originate at lower layers. This obviates most of the cost advantages of lower-layer restoration.
2. Under nonfailure conditions, there is spare capacity available in the IP layer to handle uncertain demand. For example, restoration requirements aside, to handle normal service demand, IP layer links could be engineered to run below 80% utilization during peak intervals of the traffic matrix and well below that at off-peak intervals. If we allow higher utilization levels during network disruption events, then this provides an existing extra buffer during those events. Furthermore, there may be little appreciable loss during network disruptions during off-peak periods.

As QoS and CoS features are deployed in the IP backbone, there is yet another advantage to IP layer restoration. Namely, the IP layer can assign different QoS objectives to different service classes. For example, one such distinction might be to plan network restoration so that premium services receive better performance than best-effort services during network disruptions. In contrast, the DWDM layer cannot make such fine-grain distinctions; it either restores or does not restore the entire IP layer link, which carries a mixture of different classes of services.

2.6 IPTV Backbone Example

Some major carriers now offer nationwide digital television, high-speed Internet, and Voice-over-IP services over an IP network. These services typically include hundreds of digital television channels. Video content providers deliver their content to the service provider in digital format at select locations called super hub offices (SHOs).
This in turn requires that the carrier have the ability to deliver high-bandwidth IP streaming to its residential customers on a nationwide basis. If such content is delivered all the way to residential set-top boxes over IP, it is commonly called IPTV.

There are two options for providing such an IPTV backbone. The first option is to create a virtual network on top of the IP backbone. Since video service consists mostly of streaming channels that are broadcast to all customers, IP multicast is usually the most cost-effective protocol to transport the content. However, users have high expectations for video service and even small packet losses negatively impact video quality. This requires the IP backbone to be able to transport multicast traffic at a very high level of network availability and efficiency. The first option results in a mixture of best-effort traffic and traffic with very high quality of service on the same IP backbone, which in turn requires comprehensive mechanisms for restoration and priority queuing. Consequently, some carriers have followed the second option, wherein they create a separate overlay network on top of the lower-layer DWDM or TDM layers. In reality, this is another (smaller) IP layer network, with specialized traffic, network structure, and restoration mechanisms. We describe such an example in this section. Because of the high QoS objectives needed for broadcast TV services, the reader will find that this section builds on most of the previous material in this chapter.

2.6.1 Multicast-Based IPTV Distribution

Meeting the stringent QoS required to deliver a high-quality video service (such as low latency and loss) requires careful consideration of the underlying IP-transport network, network restoration, and video and packet recovery methods. Figure 2.28 (borrowed from [9]) illustrates a simplified architecture for a network providing IPTV service.
The SHO gathers content from the national video content providers, such as TV networks (mostly via satellite today), and distributes it to a large set of receiving locations, called video hub offices (VHOs). Each VHO in turn feeds a metropolitan area. IP routers are used to transport the IPTV content in the SHO and VHOs. The combination of SHO and VHO routers plus the links that connect them constitutes the IPTV backbone. The VHO combines the national feeds with local content and other services and then distributes the content to each metro area. The long-distance backbone network between the SHO and the VHO includes a pair of redundant routers that are associated with each VHO. This allows for protection against router component failures, router hardware maintenance, or software upgrades.

[Fig. 2.28 Example nationwide IPTV network. Elements shown: SHO and VHO routers; edges of the multicast tree; dashed links used for restoration; metro intermediate office; video serving office; access DSLAM; RG; set-top box. S/VHO = Super/Video Hub Office; DSLAM = Digital Subscriber Loop Access Multiplexer; RG = Residential Gateway.]

IP multicast is used for delivery as it provides economic advantages for the IPTV service to distribute video. With multicast, packets traverse each link at most once. The video content is encoded using an encoding standard such as H.264. Video frames are packetized and are encapsulated in the Real-Time Transport Protocol (RTP) and UDP. In this example, PIM-SSM is used to support IP multicast of the video content. Each channel from the national live feed at the SHO is assigned a unique multicast group. There are typically hundreds of channels assigned to standard-definition (SD) (1.5 to 3 Mb/s) and high-definition (HD) (6 to 10 Mb/s) video signals plus other multimedia signals, such as “picture-in-picture” channels and music. So, the live feed can be multiple gigabits per second in aggregate bandwidth.
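As a rough check on the aggregate bandwidth figure, the per-channel rates quoted above can be combined with a channel lineup. The rates below are the SD and HD ranges from the text; the lineup of 300 SD and 150 HD channels is a hypothetical assumption chosen only for illustration, and other multimedia signals are ignored.

```python
# Per-channel rate ranges from the text, in Mb/s: (low, high)
SD_RATE_MBPS = (1.5, 3.0)
HD_RATE_MBPS = (6.0, 10.0)


def aggregate_gbps(sd_channels, hd_channels):
    """Return the (low, high) aggregate live-feed bandwidth in Gb/s."""
    low = sd_channels * SD_RATE_MBPS[0] + hd_channels * HD_RATE_MBPS[0]
    high = sd_channels * SD_RATE_MBPS[1] + hd_channels * HD_RATE_MBPS[1]
    return low / 1000.0, high / 1000.0


# Hypothetical lineup: 300 SD channels and 150 HD channels.
low, high = aggregate_gbps(sd_channels=300, hd_channels=150)
print(f"Aggregate live feed: {low:.2f} to {high:.2f} Gb/s")
```

Under these assumed channel counts the feed falls between 1.35 and 2.40 Gb/s, which is consistent with the "multiple gigabits per second" estimate. Because multicast sends each packet over a link at most once, this is also the approximate load on each link of the multicast tree.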
2.6.2 Restoration Mechanisms

The IPTV network can use various restoration methods to deliver the needed video QoS to end-users. For example, it can recover from relatively infrequent and short bursts of loss using a combination of video and packet recovery mechanisms and protocols, including the Society of Motion Picture and Television Engineers (SMPTE; www.smpte.org/standards) 2022-1 Forward Error Correction (FEC) standard, retransmission approaches based on RTP/RTCP [33] and Reliable UDP (R-UDP) [31], and video player loss-concealment algorithms in conjunction with set-top box buffering. R-UDP supports retransmission-based packet-loss recovery. In addition to protecting against video impairments due to last-mile (loop) transmission problems in the access segment, a combination of these methods can recover from a network failure (e.g., fiber link or router line card) of 50 ms or less. Repairing network failures usually takes far more than 50 ms (potentially several hours), but when combined with link-based FRR, this restoration methodology could meet the stringent requirements needed for video against single-link failures.

Figure 2.29 (borrowed from [9]) illustrates how we might implement link-based FRR in an IPTV backbone by depicting a network segment with four node pairs that have defined virtual links (or pseudowires). This method is the pseudowire, next-hop FRR approach described in Section 2.5.2.2. For example, node pair E-C has a lower-layer link (such as SONET OC-n or Gigabit Ethernet) in each direction and a pseudowire in each direction (a total of four unidirectional logical links) used for FRR restoration. The medium dashed line shows the FRR backup path for the pseudowire E→C. Note that links such as E-A are for restoration and, hence, have no pseudowires defined. Pseudowire E→C routes over a primary path that consists of the single lower-layer link E→C (see the solid line in Fig. 2.29).
If a lower-layer link in the primary path, such as C-E, fails, then the router at node E attempts to switch to the backup path using FRR. The path from the root to node A will switch to the backup path at node E (E-A-B-C). Once it reaches node C, it will continue on its previous (primary) path to node A (C-B-F-A). The entire path from E to A during the failure is shown by the outside dotted line. Although the path retraces itself between the routers B and C, the multicast traffic does not overlap because of the links’ unidirectionality. Also, although the IGP view of the topology realizes that the lower-layer links between E and C have gone “down,” because the pseudowire from E→C is still “up” and has the least weight, the shortest path tree remains unchanged. Consequently, the multicast tree remains unchanged. The IGP is unaware of the actual routing over the backup path. Note that these backup paths are precomputed by comprehensively analyzing all possible link failures a priori.

[Fig. 2.29 Fast Reroute in IPTV backbone. Panels: IGP view of multicast tree; path of flow from Root to node A, with the backup path for pseudowire E→C. Nodes: Root, A, B, C, D, E, F. Legend: Layer 1 link (high weight); Layer 1 link (high weight, used for restoration only); pseudowire (low weight, sits on top of Layer 1 solid black link).]

If we route the pseudowire FRR backup path on a lower-layer path that is diverse from its primary path, FRR operates rapidly (say, in around 50 ms), and we set the hold-down timers appropriately, IGP will not detect the effect of any single fiber or DWDM layer link failure. Therefore, the multicast tree will remain unaffected, reducing the outage time of any single-link failure from tens of seconds to approximately 50 ms. This order of restoration time is needed to achieve the stringent IPTV network availability objectives.
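The claim that the IGP shortest path tree stays unchanged while the pseudowire remains up can be checked with a small shortest-path computation. The sketch below is illustrative only and rests on several assumptions: the figure itself is not reproduced here, so the topology is pieced together from the paths named in the text (pseudowire pairs E-C, C-B, B-F, F-A from the primary path; restoration-only links E-A and A-B from the backup path); node E stands in for the root, whose upstream attachment is omitted; and the weights 1 and 100 are arbitrary stand-ins for "low" and "high".

```python
import heapq

LOW, HIGH = 1, 100  # hypothetical IGP weights: pseudowire vs. Layer 1 link


def build_links():
    """Return directed IGP links as (u, v, weight, kind) tuples."""
    links = []
    # Node pairs that carry both a Layer 1 link and a pseudowire (assumed set).
    for u, v in [("E", "C"), ("C", "B"), ("B", "F"), ("F", "A")]:
        for a, b in [(u, v), (v, u)]:
            links.append((a, b, HIGH, "layer1"))
            links.append((a, b, LOW, "pseudowire"))
    # Restoration-only Layer 1 links: no pseudowire defined (assumed set).
    for u, v in [("E", "A"), ("A", "B")]:
        for a, b in [(u, v), (v, u)]:
            links.append((a, b, HIGH, "layer1"))
    return links


def spf_tree(links, root):
    """Dijkstra; returns {node: parent} for the shortest path tree from root."""
    adj = {}
    for u, v, w, _kind in links:
        adj.setdefault(u, []).append((v, w))
    dist, parent = {root: 0}, {}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return parent


def without(links, pair, kinds):
    """Drop links of the given kinds between the two nodes in pair (both directions)."""
    return [l for l in links if not ({l[0], l[1]} == set(pair) and l[3] in kinds)]


links = build_links()
tree_before = spf_tree(links, "E")
# Layer 1 link E-C fails, but FRR keeps the pseudowire up over its backup path.
tree_with_frr = spf_tree(without(links, ("E", "C"), {"layer1"}), "E")
# Same failure without FRR: the pseudowire goes down together with the link.
tree_without_frr = spf_tree(without(links, ("E", "C"), {"layer1", "pseudowire"}), "E")
```

With the pseudowire still advertised, `tree_with_frr` equals `tree_before`, so PIM has no reason to rebuild the multicast tree. Without FRR the tree is rerooted through the restoration links, which is the slow IGP reconvergence the pseudowire approach is designed to avoid.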
2.6.3 Avoiding Congestion from Traffic Overlap

A drawback of restoration using next-hop FRR is that, since it reroutes traffic on a link-by-link basis, it can suffer traffic overlap during link failures, thus requiring more link capacity to meet the target availability. Links are deployed bidirectionally, and traffic overlap means that the packets of the same multicast flows travel over the same link (in the same direction) two or more times. If we avoid overlap, we can run the links at higher utilization and thus design more cost-effective networks. This requires that the multicast tree and backup paths be constructed so that traffic does not overlap.

To illustrate traffic overlap, Fig. 2.30a shows a simple network topology with node S as the source and nodes d1 to d8 as the destinations. Here, each router is connected by a pair of directed links (in opposite directions). The two links of the pair are assigned the same IGP weight and the multicast trees are derived from these weights. Figure 2.30a illustrates two sets of link weights. Figure 2.30b shows the multicast tree derived from the first set of weights. In this case, there exists a single-link failure that causes traffic overlap. For example, the dotted line shows the backup route for link d1–d4. If link d1–d4 fails, then the rerouted traffic will overlap with other traffic on links S–d2 and d2–d6, thereby resulting in congestion on those links. Client routers downstream of d2 and d6 will see impairments as a result of this congestion. It is desirable to avoid this congestion wherever possible by constructing a multicast tree such that the backup path for any single-link failure does not overlap with any downstream link on the multicast tree. This is achieved by choosing OSPF link weights suitably. The tree derived from the second set of weights is shown in Fig. 2.30c.
In this case, the backup paths do not cause traffic overlap in response to any single-link failure. The multicast tree link is now from d6 to d2. The backup path for link d1–d4 is the same as in Fig. 2.30b. Observe that traffic on this backup path does not travel in the same direction as any link of the multicast tree. An algorithm to define FRR backup paths and IGP weights so that the multicast tree does not overlap from any single failure can be found in [10].

[Fig. 2.30 Example of traffic overlap from single-link failure. Panels: (a) topology, with two weights shown on each link; (b) multicast tree with first set of weights; (c) multicast tree with second set of weights.]

2.6.4 Combating Multiple Concurrent Failures

The algorithm and protocol in [10] help avoid traffic overlap of the multicast tree during single-link failures. However, multiple link failures can still cause overlap. An example is shown in Fig. 2.31. Assume that links d1–d4 and d3–d8 are both down. If the backup path for edge d1–d4 is d1-S-d2-d6-d5-d4 (as shown in Fig. 2.30b and in Fig. 2.31) and the backup path for edge d3–d8 is d3-S-d2-d6-d7-d8, the rerouted traffic will overlap on edges S–d2 and d2–d6. There would be significant traffic loss due to congestion if the links of the network are sized to handle only a single stream of multicast traffic.

[Fig. 2.31 Example of traffic overlap from multiple link failures.]

This situation arises because MPLS FRR operates at Layer 2, so the IGP is unaware of the FRR backup paths. Furthermore, the FRR backup paths are precalculated, and there is no real-time (dynamic) accommodation for different combinations of multiple-link failures. In reality, multiple (double and even triple) failures can happen. When they occur, they can have a large impact on the performance of the network.
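The double-failure example above can be reproduced mechanically. The sketch below, which assumes nothing beyond the two backup paths quoted in the text, counts how many rerouted streams use each directed link and reports the links that carry more than one. The helper names `link_load` and `overlapping_links` are hypothetical.

```python
from collections import Counter


def link_load(paths):
    """Count how many times each directed link is used across all paths."""
    load = Counter()
    for path in paths:
        load.update(zip(path, path[1:]))
    return load


def overlapping_links(paths):
    """Directed links that would carry more than one copy of the multicast stream."""
    return sorted(link for link, count in link_load(paths).items() if count > 1)


backup_d1_d4 = ["d1", "S", "d2", "d6", "d5", "d4"]   # backup path for link d1-d4
backup_d3_d8 = ["d3", "S", "d2", "d6", "d7", "d8"]   # backup path for link d3-d8

print(overlapping_links([backup_d1_d4, backup_d3_d8]))
# [('S', 'd2'), ('d2', 'd6')]
```

A design tool could run the same check over every pair of link failures to find the combinations for which links sized for a single stream would congest.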
Yuksel et al. [36] describe an approach that builds on the FRR mechanism but limits its use to a short period. When a single link fails and a pseudowire’s primary path fails, the traffic is rapidly switched over to the backup path as described above. However, soon afterwards, the router sets the virtual link weight to a high value and thus triggers the IGP reconvergence process – this is colloquially called “costing out” the link. Once IGP routing converges, a new PIM tree is built automatically. This avoids long periods where routing occurs over the FRR backup paths, which are unknown to the IGP. It ensures rapid restoration from single-link failures while allowing the multicast tree to adapt dynamically to any additional failures that might occur during a link outage. Only during the short, transient period between the start of FRR and the end of IGP reconvergence could another failure expose the network to overlapping paths on the same link. The potential downside of this approach is that it incurs two more network reconvergence processes: one right after FRR has occurred and another when the failure is repaired. If it is not carefully executed, this alternative approach can cause many new video interruptions due to small “hits” after single failures. Yuksel et al. [36] propose a careful multicast recovery methodology that accomplishes this approach yet avoids such drawbacks. A key component of the method is the make-before-break change of the multicast tree, that is, the requirement to switch traffic hitlessly from the old multicast tree to the new multicast tree. When the failure is repaired, the method normalizes the multicast tree to its original shortest path tree, again in a hitless manner.
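To make the effect of “costing out” a link concrete, the following sketch recomputes a shortest path tree after one link weight is raised. It is an illustration under stated assumptions, not the method of [36]: the four-node topology, its weights, and the value 65535 used as the “high” weight are all hypothetical, and plain Dijkstra stands in for the IGP's SPF computation.

```python
import heapq

COSTED_OUT = 65535  # assumed "high" weight used to cost out a link


def symmetric(weights):
    """Give both directions of every link the same weight."""
    both = dict(weights)
    both.update({(v, u): w for (u, v), w in weights.items()})
    return both


def shortest_path_tree(weights, root):
    """Dijkstra's algorithm; returns {node: parent} for the tree rooted at `root`."""
    adjacency = {}
    for (u, v), w in weights.items():
        adjacency.setdefault(u, []).append((v, w))
    dist, parent, heap = {root: 0}, {root: None}, [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adjacency.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return parent


def cost_out(weights, link):
    """Return a copy of `weights` with both directions of `link` set to COSTED_OUT."""
    u, v = link
    updated = dict(weights)
    for key in ((u, v), (v, u)):
        if key in updated:
            updated[key] = COSTED_OUT
    return updated


# Hypothetical topology: S is the root, C normally hangs off A.
weights = symmetric({("S", "A"): 1, ("S", "B"): 1, ("A", "C"): 1, ("B", "C"): 5})
before = shortest_path_tree(weights, "S")
after = shortest_path_tree(cost_out(weights, ("A", "C")), "S")
print(before["C"], after["C"])  # A B
```

In this toy case, node C moves from parent A to parent B once the link A-C is costed out, so the tree that the IGP knows about again matches the path the traffic can actually take.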
The key modification to the multicast tree-building process (pruning and joining nodes) is that a router does not send the prune message that removes the branch to its previous parent until it receives PIM-SSM data packets from its new parent for the corresponding (S,G) group. A further motivation for this modification is that current PIM-SSM multicast has no explicit acknowledgement of a join request. Only by receiving a data packet on that interface does the node know that the join request was successfully received and processed at the upstream node. The soft-state approach of IP Multicast (refreshing the state by periodically sending join requests) is also used to ensure consistency. This principle guides the tree reconfiguration process at a node in reaction to a failure. In this way, routers do not lose data packets during the switchover period. Of course, this primarily works in the PIM-SSM case, where there is a single source.

As we can observe from the description above, building an IPTV backbone with high network availability draws on most of the protocols, multilayer failure models, and restoration machinery described in the previous sections of the chapter. In particular, given the underlying probabilities of network failures plus these complex failure and restoration mechanisms, such an approach must include a network design methodology to evaluate and estimate the theoretical network availability of the IPTV backbone. Without such a methodology, a carrier would run the risk of dissatisfying its video customers because of inadequate network availability.

2.7 Summary

This chapter presented an overview of the layered network design that is typical in a large ISP backbone. We emphasized three aspects that influence the design of an IP backbone.
The first aspect is that the IP network design is strongly influenced by its relationship with the underlying network layers (such as DWDM and TDM layers) and the network segments (core, metro, and access). ISP networks use a hierarchy of specialized routers, generally called access and backbone routers. At the edge of the network, the location of access routers, and the types of interfaces that they need to support are strongly influenced by the way the customers connect to the backbone through the metro network. In the core of a large carrier network, backbone routers are interconnected using DWDM transmission technology. As IP traffic is the dominant source of demand for the DWDM layer, the backbone demands drive requirements for the DWDM layer. The need for multiple DWDM links has driven the evolution of aggregate links in the core. The second aspect is that ISP networks have evolved from traditional IP forwarding to support MPLS. The separation of routing and forwarding and the ability to support a routing hierarchy allow ISPs to support new functionality including Layer 2 and Layer 3 VPNs and flexible traffic engineering that could not be as easily supported in a traditional IP network. Finally, this chapter provided an overview of the issues that affect IP network reliability, including the impact of network disruptions at multiple network layers and, conversely, how different network layers respond to disruptions through network restoration. We described how failures and maintenance events originate at various network layers and how they impact the IP backbone. We presented an overview of the performance of OSPF failure recovery to motivate the need for MPLS Fast Reroute. We summarized the interplay between network restoration and the network design process. To tie these concepts together, we presented a “case study” of an IPTV backbone. 
An IPTV network can be thought of as an IP layer with a requirement for very high performance, essentially high network availability and low packet loss. This requires the interlacing of multiple protocols, such as R-UDP, MPLS Fast Reroute, IP Multicast, and Forward Error Correction. We described how lower-layer failures (including multiple failures) affect the IP layer and how these IP layer routing and control protocols respond. Understanding the performance of network restoration protocols and the overall availability of a given network design requires careful modeling of the types and likelihood of network failures, as well as the behavior of the restoration protocols.

This chapter endeavored to lay a good foundation for reading the remaining chapters of this book. We conclude by alerting the reader to an important observation about IP network design. Telecommunications and its technologies undergo constant change. Therefore, this chapter describes a point in time. The contents of this chapter are different from what they would have been 5 years ago. There will be further changes over the next 5 years and, consequently, the chapter written 5 years from now may look quite different.

References

1. AT&T (2003). Managed Internet Service Access Redundancy Options. http://www.pnetcom.com/AB-0027.pdf. Accessed 15 April 2009.
2. Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V., & Swallow, G. (2001). RSVP-TE: Extensions to RSVP for LSP Tunnels. IETF RFC 3209, Dec. http://tools.ietf.org/html/rfc3209. Accessed 29 January 2010.
3. Braden, R., Zhang, L., Berson, S., Herzog, S., & Jamin, S. (1997). Resource ReSerVation Protocol (RSVP) – Version 1 Functional Specification. IETF RFC 2205, Sept. http://tools.ietf.org/html/rfc2205. Accessed 29 January 2010.
4. Chiu, A., Choudhury, G., Doverspike, R., & Li, G. (2007). Restoration design in IP over reconfigurable all-optical networks. NPC 2007, Dalian, P.R. China, September 2007.
5. Choudhury, G. (Ed.) (2005). Prioritized Treatment of Specific OSPF Version 2 Packets and Congestion Avoidance. IETF RFC 4222, Oct.
6. Ciena Core Director. http://www.ciena.com/products/products_coredirector_product_overview.htm. Accessed 13 April 2009.
7. Cisco (1999). Tag Switching. In Internetworking Technology Handbook, Chapter 23. http://www.cisco.com/en/US/docs/internetworking/technology/handbook/Tag-Switching.pdf. Accessed 26 December 2009.
8. Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms, second edition (pp. 595–601). Cambridge: MIT Press, New York: McGraw-Hill. ISBN 0-262-03293-7. Section 24.3: Dijkstra’s algorithm.
9. Doverspike, R., Li, G., Oikonomou, K. N., Ramakrishnan, K. K., Sinha, R. K., Wang, D., et al. (2009). Designing a reliable IPTV network. IEEE Internet Computing Magazine, May/June, pp. 15–22.
10. Doverspike, R., Li, G., Oikonomou, K., Ramakrishnan, K. K., & Wang, D. (2007). IP backbone design for multimedia distribution: architecture and performance. INFOCOM-2007, Anchorage, Alaska, April 2007.
11. Doverspike, R., & Magill, P. (2008). Commercial optical networks, overlay networks and services. In I. Kaminow, T. Li, & A. Willner (Eds.), Chapter 13 in Optical fiber telecommunications VB. San Diego, CA: Academic.
12. Feuer, M., Kilper, D., & Woodward, S. (2008). ROADMs and their system applications. In I. Kaminow, T. Li, & A. Willner (Eds.), Chapter 8 in Optical fiber telecommunications VB. San Diego, CA: Academic.
13. Goyal, M., Ramakrishnan, K. K., & Feng, W. (2003). Achieving Faster Failure Detection in OSPF Networks. IEEE International Conference on Communications (ICC 2003), Alaska, May 2003.
14. IEEE 802.1Q-2005 (2005). Virtual Bridged Local Area Networks. ISBN 0-7381-3662-X.
15. IEEE: 802.1Qay – Provider Backbone Bridge Traffic Engineering. http://www.ieee802.org/1/pages/802.1ay.html. Accessed 7 October 2008.
16. IETF PWE3: Pseudo Wire Emulation Edge to Edge (PWE3) Working Group. http://www.ietf.org/html.charters/pwe3-charter.html. Accessed 7 November 2008.
17. IETF RFC 4090 (2005). Fast Reroute Extensions to RSVP-TE for LSP Tunnels. http://www.ietf.org/rfc/rfc4090.txt. May 2005. Accessed 7 November 2008.
18. ITU-T G.709. Interfaces for the Optical Transport Network. March 2003.
19. ITU-T G.7713.2. Distributed Call and Connection Management: Signalling mechanism using GMPLS RSVP-TE.
20. Kalmanek, C. (2002). A Retrospective View of ATM. ACM Sigcomm CCR, Vol. 32, Issue 5, Nov. ISSN 0146-4833.
21. Katz, D., Kompella, K., & Yeung, D. (2003). IETF RFC 3630: Traffic Engineering (TE) Extensions to OSPF Version 2. http://tools.ietf.org/html/rfc3630. Accessed 4 May 2009.
22. Klincewicz, J. G. (2005). Issues in link topology design for IP networks. SPIE Conference on performance, quality of service and control of next-generation communication networks III, SPIE Vol. 6011, Boston, MA.
23. Klincewicz, J. G. (2006). Why is IP network design so difficult? Eighth INFORMS telecommunications conference, Dallas, TX, March 30–April 1, 2006.
24. Kompella, K., & Rekhter, Y. (2007). IETF RFC 4761: Virtual private LAN service (VPLS) using BGP for auto-discovery and signaling. http://tools.ietf.org/html/rfc4761. Accessed 26 December 2009.
25. Lasserre, M., & Kompella, V. (2007). IETF RFC 4762: Virtual private LAN service (VPLS) using label distribution protocol (LDP) signaling. http://tools.ietf.org/html/rfc4762. Accessed 26 December 2009.
26. Moy, J. (1998). IETF RFC 2328: OSPF Version 2. http://tools.ietf.org/html/rfc2328. Accessed 26 December 2009.
27. Nortel (2007). Adding scale, QoS and operational simplicity to Ethernet. http://www.nortel.com/solutions/collateral/nn115500.pdf. Accessed 26 December 2009.
28. Oikonomou, K., Sinha, R., & Doverspike, R. (2009). Multi-Layer Network Performance and Reliability Analysis. The International Journal of Interdisciplinary Telecommunications and Networking (IJITN), Vol. 1 (3), pp. 1–29, Sept.
29. Optical Internetworking Forum (OIF) (2008). OIF-UNI-02.0-Common – User Network Interface (UNI) 2.0 Signaling Specification: Common Part. http://www.oiforum.com/public/documents/OIF-UNI-02.0-Common.pdf.
30. Oran, D. (1990). IETF RFC 1142: OSI IS-IS intra-domain routing protocol. http://tools.ietf.org/html/rfc1142.
31. Partridge, C., & Hinden, R. (1990). Version 2 of the Reliable Data Protocol (RDP). IETF RFC 1151, April.
32. Perlman, R. (1999). Interconnections: Bridges, Routers, Switches, and Internetworking Protocols, 2e. Addison-Wesley Professional Computing Series.
33. Schulzrinne, H., Casner, S., Frederick, R., & Jacobson, V. (2003). RTP: A Transport Protocol for Real-Time Applications. IETF RFC 3550. http://www.ietf.org/rfc/rfc3550.txt. Accessed 26 December 2009.
34. Sycamore Intelligent Optical Switch (2009). http://www.sycamorenet.com/products/sn16000.asp. Accessed 13 April 2009.
35. Telcordia GR-253-CORE (2000). Synchronous Optical Network (SONET) Transport Systems: Common Generic Criteria.
36. Yuksel, M., Ramakrishnan, K. K., & Doverspike, R. (2008). Cross-layer failure restoration for a robust IPTV service. LANMAN-2008, Cluj-Napoca, Romania, September.
37. Zimmermann, H. (1980). OSI reference model – the ISO model of architecture for open systems interconnection. IEEE Transactions on Communications, 28(Suppl. 4), 425–432.
Glossary of Acronyms and Key Terms

1:1: One-by-one (signal switched to restoration path on detection of failure)
1+1: One-plus-one (signal duplicated across both service path and restoration path; receiver chooses surviving signal upon detection of failure)
Access Network Segment: The feeder network and loop segments associated with a given metro segment
ADM: Add/Drop Multiplexer
Administrative Domain: Routing area in IGP
Aggregate Link: Bundles multiple physical links between a pair of routers into a single virtual link from the point of view of the routers. Also called bundled or composite link
AR: Access Router
AS: Autonomous System
ASBR: Autonomous System Border Router
ATM: Asynchronous Transfer Mode
AWG: Arrayed Waveguide Grating
B-DCS: Broadband Digital Cross-connect System (cross-connects at DS-3 or higher rate)
Backhaul: Using TDM connections that encapsulate packets to connect customers to packet networks
BER: Bit Error Rate
BGP: Border Gateway Protocol
BLSR: Bidirectional Line-Switched Ring
BR: Backbone Router
Bundled Link: See Aggregate Link
CE switch: Customer-Edge switch
Channelized: A TDM link/connection that multiplexes lower-rate signals into its time slots
CHOC Card: CHannelized OC-n card
CIR: Committed Information Rate
CO: Central Office
Composite Link: See Aggregate Link
Core Network Segment: Equipment in the POPs and network structures that connect them for intermetro transport and switching
CoS: Class of Service
CPE: Customer Premises Equipment
CSPF: Constraint-based Shortest Path First
DCS: Digital Cross-connect System
DDoS: Distributed Denial of Service (security attack on router)
DoS: Denial of Service (security attack on router)
DS-0: Digital Signal – level 0 (a pre-SONET signal carrying one voice-frequency channel at 64 kb/s)
DS-1: Digital Signal – level 1 (a 1.544 Mb/s signal). A channelized DS-1 carries 24 DS0s
DS-3: Digital Signal – level 3 (a 44.736 Mb/s signal). A channelized DS-3 carries 28 DS1s
DWDM: Dense Wavelength-Division Multiplexing
E-1: European plesiochronous (pre-SDH) rate of 2.0 Mb/s
eBGP: External Border Gateway Protocol
EGP: Exterior Gateway Protocol
EIGRP: Enhanced Interior Gateway Routing Protocol
EIR: Excess Information Rate
EPL: Ethernet Private Line
FCC: Federal Communications Commission
FE: Fast Ethernet (100 Mb/s)
FEC: Forward Error Correction – bit-error recovery technique in TDM transmission and some IPs
FEC: Forwarding Equivalence Class – classification of flows defined in MPLS
Feeder Network: The portion of the access network between the loop and first metro central office
FRR: Fast Re-Route
FXC: Fiber Cross-Connect
Gb/s: Gigabits per second (1 billion bits per second)
GigE: Gigabit Ethernet (nominally 1 Gb/s)
GMPLS: Generalized MPLS
HD: High definition (short for HDTV)
HDTV: High-definition TV (television with resolution exceeding 720 × 1280)
Hitless: Method of changing network connections or routes that incurs negligible loss
iBGP: Interior Border Gateway Protocol
IETF: Internet Engineering Task Force
IGP: Interior Gateway Protocol
Internet Route Free Core: Where MPLS removes external BGP information plus Layer 3 address lookup from the interior of the IP backbone
IGMP: Internet Group Management Protocol
Inter-office Links: Links whose endpoints are contained in different central offices
Intra-office Links: Links that are totally contained within the same central office
IOS: Intelligent Optical Switch
IP: Internet Protocol
IPTV: Internet Protocol television (i.e., entertainment-quality video delivered over IP)
IROU: Indefeasible Right of Use
IS-IS: Intermediate-System-to-Intermediate-System (IP routing and control plane protocol)
ISO: International Organization for Standardization (not an acronym)
ISP: Internet Service Provider
ITU: International Telecommunication Union
Kb/s: Kilobits per second (1,000 bits per second)
LAN: Local Area Network
LATA: Local Access and Transport Area
Layer n: A colloquial packet protocol layering model, with origins in the OSI reference model. Today, roughly Layer 3 corresponds to IP packets, Layer 2 to MPLS LSPs, pseudowires, or Ethernet-based VLANs, and Layer 1 to all lower-layer transport protocols
LDP: Label Distribution Protocol
LMP: Link Management Protocol
Local Loop: The portion of the access segment between the customer and feeder network. Also called “last mile”
LSA: Link-State Advertisement
LSDB: Link-State Database
LSP: Label Switched Path
LSR: Label Switch Router
MAC: Media Access Control
MAN: Metropolitan Area Network
Mb/s: Megabits per second (1 million bits per second)
MEMS: Micro-Electro-Mechanical Systems
Metro Network Segment: The network layers of the equipment located in the central offices of a given metropolitan area
MPEG: Moving Picture Experts Group
MPLS: Multiprotocol Label Switching
MSO: Multiple System Operator (typically coaxial cable companies)
MSP: Multi-Service Platform – a type of ADM enhanced with many forms of interfaces
MTBF: Mean Time Between Failure
MTSO: Mobile Telephone Switching Office
MTTR: Mean Time to Repair
Multicast: Point-to-multipoint flows in packet networks
N-DCS: Narrowband Digital Cross-connect System (cross-connects at DS0 rate)
n-degree ROADM: A ROADM that can fiber to more than three different ROADMs (also called multidegree ROADM)
Next-hop: Method in MPLS FRR that routes around a down link
Next-next-hop: Method in MPLS FRR that routes around a down node
Normalization: Step in network restoration after all failures are repaired to bring the network back to its normal state
NTE: Network Terminating Equipment
OC-n: Optical Carrier – level n (designation of optical transport of a SONET STS-n)
ODU: Optical channel Data Unit – protocol data unit in ITU OTN
O-E-O: Optical-to-Electrical-to-Optical
OIF: Optical Internetworking Forum
OL: Optical Layer
OSPF: Open Shortest Path First
OSPF-TE: Open Shortest Path First – Traffic Engineering
OSS: Operations Support System
OT: Optical Transponder
OTN: Optical Transport Network – ITU optical protocol
P Router: Provider Router
PBB-TE: Provider Backbone Bridge – Traffic Engineering
PBT: Provider Backbone Transport
PE Router: Provider-Edge Router
PIM: Protocol-Independent Multicast
PL: Private Line
P-NNI: Private Network-to-Network Interface (ATM routing protocol)
POP: Point Of Presence
PPP: Point-to-Point Protocol
PPPoE: Point-to-Point Protocol over Ethernet
Pseudowire: A virtual connection defined in the IETF PWE3 that encapsulates higher-layer protocols
PVC: Permanent Virtual Circuit
PWE3: Pseudo-Wire Emulation Edge-to-Edge
QoS: Quality of Service
RAR: Remote Access Router
RD: Route Distinguisher
Reconvergence: IGP process to update network topology and adjust routing tables
RIB: Router Information Base
ROADM: Reconfigurable Optical Add/Drop Multiplexer
RR: Route Reflector
RSTP: Rapid Spanning Tree Protocol
RSVP: Resource Reservation Protocol
RT: Route Target (also Remote Terminal in metro TDM networks)
RTP: Real-Time Protocol
SD: Standard Definition (television with resolution of about 640 × 480)
SDH: Synchronous Digital Hierarchy (a synchronous optical networking standard used outside North America, documented by the ITU in G.707 and G.708)
Serving CO: The first metro central office to which a given customer homes
SHO: Super Hub Office
SLA: Service Level Agreement
SRLG: Shared Risk Link Group
SONET: Synchronous Optical Network (a synchronous optical networking standard used in North America, documented in GR-253-CORE from Telcordia)
SONET/SDH self-healing rings: Typically UPSR or BLSR rings
SPF: Shortest Path First
STS-n: Synchronous Transport Signal – level n (a signal level of the SONET hierarchy with a data rate of n × 51.84 Mb/s)
SVC: Switched Virtual Circuit
TCP: Transmission Control Protocol
TDM: Time Division Multiplexing
UDP: User Datagram Protocol
UNI: User-Network Interface
Unicast: Point-to-point flows in packet networks
UPSR: Unidirectional Path-Switched Ring
VHO: Video Hub Office
VLAN: Virtual Local Area Network
VoD: Video on Demand
VoIP: Voice-over-Internet Protocol
VPLS: Virtual Private LAN Service (i.e., Transparent LAN Service)
VPN: Virtual Private Network
WAN: Wide Area Network
Wavelength continuity: A restriction in DWDM equipment that a through connection must be optically cross-connected to the same wavelength on both fibers
W-DCS: Wideband Digital Cross-connect System (cross-connects at DS-1, SONET VT-n or higher rate)
WDM: Wavelength-Division Multiplexing

Part II Reliability Modeling and Network Planning

Chapter 3 Reliability Metrics for Routers in IP Networks

Yaakov Kogan

3.1 Introduction

As the Internet has become an increasingly critical communication infrastructure for business, education, and society in general, the need to understand and systematically analyze its reliability has become more important.
Internet Service Providers (ISPs) face the challenge of continuously upgrading the network and growing network capacity while providing a service that meets stringent customer reliability expectations. While telecommunication companies have long experience providing reliable telephone service, the challenge for an ISP is more difficult because changes in Internet technology, particularly router software, are significantly more frequent and less rigorously tested than was the case in circuit-switched telephone networks. ISPs cannot wait until router technology matures: a large ISP has to meet high reliability requirements for critical applications like financial transactions, Voice over IP, and IPTV using commercially available technology. The need to use less mature technology has resulted in a variety of redundancy solutions at the edge of the network, and in well-thought-out designs for a resilient core network that is shared by traffic from all applications.

The reliability objective for circuit-switched telephone service of “no more than 2 hours downtime in 40 years” has been applied to voice communication since 1964 [1]. It has been achieved using expensive redundancy solutions for both switches and transmission facilities. Though routers are less reliable than circuit switches, commercial IP networks have three main advantages over legacy telephone networks when designing for reliability. First, packet switching is a far more economically efficient mechanism for multiplexing network resources than circuit switching, given the bursty nature of data traffic. Second, protocols like Multi-Protocol Label Switching (MPLS) support a range of network restoration options that restore service after failures of transmission facilities more economically than traditional 1:1 redundancy. Third, commercial IP networks can provide different levels of redundancy to different commercial customers, for example, by offering access diversity or multihoming options and pricing the service according to its reliability. This allows Internet service providers to satisfy customers who are price-sensitive [2] while recovering the high cost of redundancy from customers who require increased reliability to support mission-critical applications.

The reliability of modern provider edge routers, which have a large variety of interface cards, cannot be accurately characterized by a single downtime or reliability metric, because such a metric averages the contributions of the various line cards and may hide the poor reliability of some components. We address this challenge by introducing granular metrics for quantifying the reliability of IP routers. Section 3.2 provides an overview of the main router elements and redundancy mechanisms. In Section 3.3, we use a simplified router reliability model to demonstrate the application of different reliability metrics. In Section 3.4, we define metrics for measuring the reliability of IP routers in production networks. Section 3.5 provides an overview of challenges with measuring end-to-end availability.

Y. Kogan, AT&T Labs, 200 S. Laurel Ave, Middletown, NJ 07748, USA. e-mail: yaakovkogan@att.com
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_3, © Springer-Verlag London Limited 2010

3.2 Redundancy Solutions in IP Routers

This section provides an overview of the primary elements of a modern router and associated redundancy mechanisms, which are important for availability modeling of services in IP networks. A high-speed IP router is a special multiprocessor system with two types of processors, each with its own memory and CPU: Route Processors (RPs) and Line Cards (LCs).
Each line card receives packets from other routers via one or more logical interfaces and performs forwarding operations by sending them to outbound logical interfaces using information in its local Forwarding Information Base (FIB). The route processor controls the operation of the entire router, runs the routing protocols, maintains the necessary databases for route processing, and updates the FIB on each line card. This separation implies that each LC can continue forwarding packets based on its copy of the FIB when the RP fails. Figure 3.1 provides a simplified illustration of router hardware architecture, where two route processors (active and backup) and multiple line cards are interconnected through a switch fabric. The Monitor bus is used exclusively for transmission of error and management messages that help isolate the fault when a component is faulty and restore normal operation of the router if the failed component is backed up by a redundant unit. Data traffic never goes through the Monitor bus; it crosses the switch fabric. These hardware (HW) components operate under the control of an Operating System (OS). Additional details for Cisco and Juniper routers can be found in [3, 4] and [5], respectively.

[Fig. 3.1 Generic router hardware architecture. Components shown: line cards 1 to n, switch fabric, active RP, backup RP, Monitor bus, cooling system, power supplies.]

A typical Mean Time Between Failures (MTBF) for both RPs and LCs is about 100,000 h (see, e.g., Table 9.3 in [6]). This MTBF accounts only for hard failures requiring replacement of the failed component, in contrast with soft failures, from which the router can recover, for example, by card reset. A typical example of a soft hardware failure is a parity error. Router vendors do not usually provide an MTBF for the OS, as it varies over a wide range.
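As a rough illustration of what a hardware MTBF of about 100,000 h implies, the sketch below applies the standard steady-state formula, availability = MTBF / (MTBF + MTTR), to a single nonredundant component. The 4 h repair time is an assumed value chosen only for the example, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one repairable component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(mtbf_hours, mttr_hours):
    """Expected annual downtime implied by the availability above."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR * 60


MTBF_HARD_FAILURE = 100_000   # hours, the typical figure quoted in the text
ASSUMED_REPLACEMENT_TIME = 4  # hours, assumption for illustration only

a = availability(MTBF_HARD_FAILURE, ASSUMED_REPLACEMENT_TIME)
d = downtime_minutes_per_year(MTBF_HARD_FAILURE, ASSUMED_REPLACEMENT_TIME)
print(f"availability = {a:.6f}, downtime = {d:.1f} min/year")
```

Under these assumptions a single card contributes on the order of 20 minutes of expected downtime per year from hard failures alone. Soft failures and OS errors are not included in this figure.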
In our experience, a new OS version may have an MTBF well below 100,000 h as a result of undetected software errors that are first encountered after the OS is deployed to the field. The MTBF for a stable OS is typically above 100,000 h, though even with a stable OS, changes in the operating environment can trigger latent software errors.

Without redundancy solutions at the edge of the network, component failures interrupt customer traffic until the failed component is recovered by reset, which may take about a minute, or until it is replaced, which can take hours. To reduce failure impacts, shared HW components whose failure would impact the entire router (e.g., RP, switch fabric, power supply, and cooling system) are typically redundant. In this case, the restoration time (assuming a successful failover to the redundant component) is defined by the failover time. For example, in Cisco 12000 series routers [3] and the Juniper T640 router [7], the switch fabric consists of five cards, four of which are active while the fifth provides redundancy with a subsecond restoration time when an active card fails. Failure of one power supply or cooling element does not have any impact on service.

RP redundancy is provided by a configuration with two RP cards: primary and backup. A first attempt at reducing the failover time was to run the backup RP in standby mode with partial synchronization between the active and standby RPs, which enables the standby RP to maintain all Layer 1 and Layer 2 sessions and recover the routing database from adjacent nodes when the primary RP fails. However, when a primary RP fails, BGP adjacencies with adjacent routers go down. The loss of BGP adjacency has the same effect on network routing as failure of the entire router until the standby RP comes on-line and re-establishes BGP adjacencies with its neighbors. During this time, the routing protocols reconverge to another route and then back again, causing transient packet
During this time, the routing protocols reconverge to another route and then back again, which causes transient packet loss, a phenomenon known as "route flapping." (Route flapping occurs when a router alternately advertises a network destination via one route and then another, or as unavailable and then available again, in quick sequence [8].) To prevent the adjacent routers from declaring the failed router out of service and removing it from their routing tables and forwarding databases, vendors have developed high availability (HA) routing protocol extensions, which allow a router to restart its routing software gracefully in such a way that packet forwarding is not disrupted when the primary RP fails. If the routers adjacent to a given router support these extensions, they will continue to advertise routes from the restarting router during the grace period. Cisco's and Juniper's HA routing protocol extensions are known as Non-Stop Forwarding (NSF) [9] and Graceful Restart (GR) [10], respectively. A detailed description of the Cisco NSF support for the BGP, OSPF, IS-IS, and EIGRP routing protocols, as well as for MPLS-related protocols, can be found in [9]. Here, we describe the BGP protocol extension procedures that follow the implementation specification provided in the IETF proposed standard "Graceful Restart Mechanism for BGP" [11]. Let R1 be the restarting router and R2 be a peer. The goal is to restart a BGP session between R1 and its peering routers without redirecting traffic around R1.

1. R1 and R2 signal each other that they understand Graceful Restart in their initial exchange of BGP OPEN messages, when the initial BGP connection is established between R1 and R2.
2. An RP failover occurs, and the R1 BGP process starts on the newly active RP. R1 does not have a routing information base and must reacquire it from its peer routers. R1 continues to forward IP packets destined for (or through) peer routers (R2) using the last updated FIB.
3. When R2 detects that the TCP session with R1 is cleared, it marks the routes learned from R1 as STALE but continues to use them to forward packets. R2 also initializes a Restart-timer for R1. R2 will remove all STALE routes unless it receives an OPEN message from R1 within the specified Restart-time.
4. R1 establishes a new TCP session with R2 and sends an OPEN message to R2, indicating that its BGP software has restarted. When R2 receives this OPEN message, it resets its own Restart-timer and starts a Stalepath-timer.
5. Both routers re-establish their session. R2 begins to send UPDATE messages to R1. R1 starts an Update-delay timer and waits up to 120 s to receive End-of-RIB (EOR) from all its peers.
6. When R1 receives EOR from all its peers, it begins the BGP Route Selection Process.
7. When this process is complete, R1 begins to send UPDATE messages to R2. R1 indicates completion of updates by EOR, and R2 starts its Route Selection Process.
8. While R2 waits for an EOR, it also monitors the Stalepath-timer. If the timer expires, all STALE routes are removed and the "normal" BGP process takes effect. When R2 has completed its Route Selection Process, any STALE entries are refreshed with newer information or removed from the BGP RIB and FIB. The network is now converged.

One drawback of NSF/GR is that there is a potential for transient routing loops or packet loss if a restarting router loses its forwarding state (e.g., owing to a power failure). A second drawback of NSF/GR is that it can prolong delays of network-layer re-routing in cases where the service is not restored by RP failover. In addition, to be effective in a large ISP backbone, NSF/GR extensions would need to be deployed on all of the peering routers. However, the OSPF NSF extension is Cisco proprietary.
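The peer-side behavior in steps 3 to 8 of the graceful restart procedure above can be made concrete with a small sketch. The following Python model is illustrative only and is not an implementation of [11]: the class and method names are hypothetical, timers are reduced to boolean flags, "resetting" the Restart-timer is modeled as stopping it, and route selection is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class HelperPeer:
    """Hypothetical model of R2's view of the routes learned from restarting router R1."""
    routes: dict = field(default_factory=dict)   # prefix -> "FRESH" or "STALE"
    restart_timer_running: bool = False
    stalepath_timer_running: bool = False

    def session_lost(self):
        # Step 3: keep forwarding on R1's routes, but mark them STALE.
        for prefix in self.routes:
            self.routes[prefix] = "STALE"
        self.restart_timer_running = True

    def open_received(self):
        # Step 4: R1 is back; Restart-timer is reset (modeled as stopped),
        # and the Stalepath-timer starts.
        self.restart_timer_running = False
        self.stalepath_timer_running = True

    def update_received(self, prefix):
        # Step 7: a re-advertised route replaces its stale copy.
        self.routes[prefix] = "FRESH"

    def end_of_rib_received(self):
        # Step 8: R1 has finished; anything still STALE is removed.
        self._drop_stale()
        self.stalepath_timer_running = False

    def timer_expired(self):
        # Restart-timer or Stalepath-timer ran out: revert to normal BGP behavior.
        self._drop_stale()
        self.restart_timer_running = False
        self.stalepath_timer_running = False

    def _drop_stale(self):
        self.routes = {p: s for p, s in self.routes.items() if s != "STALE"}


# Example with two made-up prefixes; only one is re-advertised after the restart.
r2 = HelperPeer(routes={"10.0.0.0/8": "FRESH", "192.0.2.0/24": "FRESH"})
r2.session_lost()
r2.open_received()
r2.update_received("10.0.0.0/8")
r2.end_of_rib_received()
print(r2.routes)
```

In this run the prefix that R1 does not re-advertise is dropped once End-of-RIB arrives, while forwarding on both prefixes continues during the restart itself.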
The drafts describing this extension were submitted to the IETF but were not approved as standards. Since most large ISP networks use routers from multiple vendors, the lack of standardization and universal adoption by vendors limits the usefulness of the NSF and GR extensions.

Another approach to router reliability, called Non-Stop Routing (NSR), is free from the drawbacks of graceful restart. It is a self-contained solution that does not require protocol extensions and has a faster failover time. With NSR, the standby RP runs its own version of each protocol, and the active and standby RPs are continuously synchronized to the extent that the standby RP can take over when the active RP fails without any disruption to the existing peering sessions. The first implementation of NSR was done by Avici Systems [12] in 2003 in the Terabit Switch Router (TSR) that was used in the AT&T core network. Later, other router vendors implemented their versions of NSR (see, e.g., [13]).

It is important to note that router outages can be divided into two categories: planned and unplanned outages. Much of the preceding discussion focused on RP failures, that is, unplanned outages. Planned outages are caused by scheduled maintenance activities, which include software and hardware upgrades as well as card replacement and installation of additional line-cards. Router vendors are developing a software solution on top of NSR to support in-service software upgrade, or ISSU (see, e.g., [13–15]). The goal of ISSU is a significant reduction in downtime due to software upgrades, potentially eliminating this category of downtime if both the old and new SW versions support ISSU.

We now turn our attention to line-card failures.
Line-card failures are distinct from link failures: while link failures can often be recovered by the underlying transport technology, e.g., a SONET ring (see Chapter 2), line-card failures require traffic to be handled by a redundant line-card provisioned on the same or a different router. Line-card redundancy is particularly important for reducing the outage duration of PE (provider-edge) routers that terminate thousands of low-speed customer ports. The first candidate for redundancy is an uplink LC that is used for connection to a P (core) router. Without redundancy, any uplink LC downtime will cause PE router isolation. In addition, a redundant uplink LC allows us to connect a PE router to two P routers using physically diverse transport links. This configuration results in the near elimination of PE router downtime caused by periodic maintenance activities on P routers, under the assumption that maintenance is not performed on these two P routers simultaneously. PE router downtime is nearly eliminated in this case because the probability of PE isolation caused by the failure of the second uplink or the other P router is negligibly small if the maintenance window is short. Restoration from an uplink LC failure is provided at the IP layer, with a restoration time of the order of 10 s, as described in Chapter 2.

SONET interfaces on IP routers may support the ability to automatically switch traffic from a failed line-card to a redundant line-card, using a technique called Automatic Protection Switching (APS) [16]. Implementation of APS requires installation of two identical line-cards; one card is designated as primary, the other as secondary. A port on the primary LC is configured as the working interface, and the port with the same port number on the secondary LC is configured as the protection interface. The ports form a single virtual interface.
Ports on the secondary LC cannot be configured with services; they can only be configured as protection ports for the corresponding ports on the primary LC. The protection and working interfaces are connected to a SONET ADM (Add-Drop Multiplexer), which sends the same signal payload to the working and protection interfaces. When the working interface fails, its traffic is switched to the protection interface. According to our experience, the switchover time is of the order of 1 min. Hitless switchover requires protocol synchronization between the line-cards, which was not available at the time of writing of this chapter. APS is only available in a 1:1 configuration. As a result, it is considered to be expensive.

An alternative line-card redundancy approach developed at AT&T [17] is based on a new ISP edge architecture called RouterFarm. RouterFarm utilizes 1:N redundancy, in which a single PE backup router can support multiple active routers. The RouterFarm architecture supports customer access links that connect to PE routers over a dynamically reconfigurable access network. When a PE router fails or is taken out of service for planned maintenance, control software rehomes the customer access links from the affected router to a selected backup router and copies the appropriate router configuration data to the backup router. Service is provided by the backup router once the rehoming is complete. After the primary router is repaired or the required maintenance is performed, customers can be rehomed back to the primary router.

3.3 Router Reliability Modeling

As described in Section 3.2, router outages can be divided into two categories: planned and unplanned. Planned outages are caused by scheduled maintenance activities. Customers with a single connection to an ISP edge router are notified in advance about planned maintenance. Outages outside of the maintenance window are referred to as unplanned.
The common practice is to evaluate router reliability metrics for planned and unplanned outages separately. Table 3.1 provides an example¹ of a downtime calculation for software (SW) and hardware (HW) upgrades that require the entire router to be taken out of service. The downtime is calculated from the upgrade frequency per year in the second column and the mean upgrade duration in the third column. The total mean downtime per year for planned outages is 42 min.

Table 3.1 Planned downtime for SW and HW upgrades
Activity      Freq/year   Duration (min)   Downtime (min)
SW upgrade    2           15               30
HW upgrade    0.2         60               12

¹ All examples are for illustrative purposes only and are not meant to model or describe any network or vendor's product.

The router downtime is close to 0 for unplanned outages if the router supports RP and LC redundancy. If LC redundancy is not supported, unplanned router downtime depends on the ratio $r_{LC}/m_{LC}$, where $r_{LC}$ and $m_{LC}$ denote the LC MTTR (Mean Time To Repair) and MTBF, respectively. Using the fact that $r_{LC} \ll m_{LC}$, one can approximate the downtime probability by $r_{LC}/m_{LC}$ and calculate the average unplanned router downtime per year as

$$d_{LC} = (r_{LC}/m_{LC}) \times 525{,}600 \ \text{(min/year)}.$$

The factor $525{,}600 = 365 \times 24 \times 60$ is the number of minutes in a 365-day year. With stable hardware and software, $r_{LC}/m_{LC} \approx 4 \times 10^{-5}$ and the unplanned downtime $d_{LC}$ is around 21 min, which is less than the planned downtime due to upgrades by a factor of 2.

The reliability improvement due to RP and LC redundancy for unplanned outages can be evaluated using the following simplified router reliability model, described by a system consisting of two independent components representing the LC and RP. Component 1 corresponds to the LC and component 2 corresponds to the RP. Each component alternates between periods when it is up and periods when it is down. The system is working if both components are up.
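Before the model is developed further, the planned and unplanned downtime figures quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative; the variable and function names are our own, and the inputs are the values of Table 3.1 and the ratio of LC MTTR to MTBF given in the text.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

# Table 3.1: (activity, upgrades per year, mean duration in minutes)
planned = [("SW upgrade", 2, 15), ("HW upgrade", 0.2, 60)]
planned_downtime = sum(freq * duration for _, freq, duration in planned)


def unplanned_downtime(mttr_over_mtbf):
    """Approximate yearly downtime in minutes, valid when MTTR << MTBF."""
    return mttr_over_mtbf * MINUTES_PER_YEAR


d_lc = unplanned_downtime(4e-5)
print(f"planned: {planned_downtime:.0f} min/year, unplanned: {d_lc:.1f} min/year")
```

The two results correspond to the 42 min of planned downtime and the roughly 21 min of unplanned downtime discussed in the text.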
For a nonredundant component $i$, $i = 1, 2$, denote the MTBF and MTTR by $m_i$ and $r_i$, respectively. For a component consisting of primary and backup units, we assume that once a primary unit fails, the backup unit starts to function with probability $p_i$ after a random delay with mean $\delta_i \ll r_i$. With probability $1 - p_i$, the switchover to the backup unit fails, in which case the mean downtime is $r_i$. Thus, the MTTR for a redundant component is

$$b_i = p_i \delta_i + (1 - p_i)\, r_i. \qquad (3.1)$$

Two important particular cases correspond to $p_i = 0$ (no redundancy) and $\delta_i = 0$ (instantaneous switchover). The MTBF for a redundant component is

$$c_i = \begin{cases} m_i & \text{if } \delta_i > 0, \\ m_i/(1 - p_i) & \text{if } \delta_i = 0. \end{cases} \qquad (3.2)$$

The steady state probability that the system (component) is working is referred to as availability. The complementary probability is referred to as unavailability. Based on our assumptions, the availability of component $i$ is

$$A_i = \frac{c_i}{c_i + b_i} \qquad (3.3)$$

and the system availability is

$$A = A_1 A_2. \qquad (3.4)$$

In our case, $r_i \ll m_i$, which allows us to obtain the following simple approximation for the system unavailability:

$$U = 1 - A_1 A_2 = 1 - (1 - U_1)(1 - U_2) \approx U_1 + U_2 \qquad (3.5)$$

where $U_i = b_i/(c_i + b_i)$ is the unavailability of component $i$. Another important reliability metric is the rate $f_s$ at which the system fails. In our case (see, e.g., 7c in [18])

$$f_s \approx 1/c_1 + 1/c_2. \qquad (3.6)$$

Redundancy without instantaneous switchover decreases the mean component downtime $b_i$ and hence the component and system unavailability. However, the system failure rate does not decrease, because the component uptime $c_i = m_i$ remains unchanged if $\delta_i > 0$. Instantaneous switchover decreases both the unavailability and the system failure rate. The availability of LCs and RPs with no redundancy is typically better than 0.9999 (four nines) but worse than 0.99999 (five nines). We can compute an estimate of the improvement due to redundancy using Eq. (3.1).
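As a numerical sketch of Eqs. (3.1) to (3.6), the following Python fragment evaluates a nonredundant and a redundant component and then a two-component system. The unit MTBF of 100,000 h and MTTR of 4 h are assumed values chosen only for illustration, as are the function names; the switchover probability 0.95 and a mean switchover delay equal to 5% of the repair time are the values used in the discussion of Eq. (3.1). The code assumes a switchover probability strictly below 1.

```python
def component(m, r, p=0.0, delta=None):
    """Return (b, c): MTTR and MTBF of a component per Eqs. (3.1)-(3.2).

    m, r  : MTBF and MTTR of a single (nonredundant) unit, in the same time unit
    p     : probability that switchover to the backup succeeds (0 = no redundancy)
    delta : mean switchover delay; ignored when p == 0
    """
    if p == 0.0:
        return r, m
    b = p * delta + (1 - p) * r            # Eq. (3.1)
    c = m if delta > 0 else m / (1 - p)    # Eq. (3.2), requires p < 1
    return b, c


def unavailability(b, c):
    """U_i = b_i / (c_i + b_i), the complement of Eq. (3.3)."""
    return b / (c + b)


# Assumed unit figures (illustrative only): MTBF 100,000 h, MTTR 4 h.
m, r = 100_000.0, 4.0
plain = component(m, r)
redundant = component(m, r, p=0.95, delta=0.05 * r)

u_plain = unavailability(*plain)
u_red = unavailability(*redundant)
improvement = u_plain / u_red

# System of two components, one nonredundant and one redundant.
u_system = 1 - (1 - u_plain) * (1 - u_red)   # exact form of Eq. (3.5)
u_approx = u_plain + u_red                   # approximation in Eq. (3.5)
f_s = 1 / plain[1] + 1 / redundant[1]        # failures per hour, Eq. (3.6)

print(f"unavailability without redundancy: {u_plain:.2e}")
print(f"unavailability with redundancy:    {u_red:.2e}")
print(f"improvement factor:                {improvement:.1f}")
print(f"system unavailability (exact, approx): {u_system:.3e}, {u_approx:.3e}")
```

With these assumed inputs the redundant component's unavailability falls by roughly a factor of 10, while the system failure rate is unchanged because the switchover delay is nonzero.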
If the redundancy of component $i$ is characterized by a probability of successful switchover $p_i = 0.95$ and a mean switchover delay of $\delta_i = 0.05\, r_i$, then the mean component downtime $b_i$, and therefore its unavailability, would decrease by about a factor of 10, resulting in a component availability exceeding five nines. The system availability would be limited by the availability of any nonredundant component.

3.4 Reliability Metrics for Routers in Access Networks

Figure 3.2 depicts a typical Layer 3 access topology for enterprise customers. It includes n provider-edge routers PE1, ..., PEn and two core or backbone routers P1 and P2, which are responsible for delivering traffic from customer edge (CE) routers at a customer location into the commercial IP network backbone.

[Fig. 3.2 Access network elements: CE routers, provider-edge routers PE1, ..., PEn, core routers P1 and P2, and the backbone]

The service provided by an ISP to an enterprise customer is typically associated with a customer "access port." An access port is a logical interface on the line-card in a PE, where the link from a customer's CE router terminates. In general, a PE has a variety of line-cards with different port densities depending on the port speed. For example, a channelized OC-12 card provides up to 336 T1/E1 ports, while a channelized OC-48 card can provide up to either 48 T3 ports, or 16 OC3 ports, or 4 OC12 ports. In Fig. 3.2, each PE is dual-homed to two different P (core) routers using two physically diverse transport links terminating on different line-cards at the PE router. (These transport links are referred to as uplinks.) The links that connect P routers at different nodes are generally provided by an underlying transport network. Dual-homing is used to reduce the impact of outages on the customer, from a potentially long repair interval to short-duration packet loss caused by protocol reconvergence.
Dual-homing is used to address the following outage scenarios:

• Outage of uplink transport equipment
• Outage of an uplink line-card at PE routers
• Outage of an uplink line-card at P routers
• Outage of one P router or its associated backbone links

Customer downtime can be caused by a failure in a PE component, such as a failed interface or line-card, or by a total PE outage. Our goal in this section is to provide a practical way of applying traditional reliability metrics such as availability and MTBF to a large network of edge routers. The calculation of these metrics is straightforward in the case of $K$ identical systems $s_1, \ldots, s_K$, where each system alternates between periods when it is up and periods when it is down. Assume that $k \le K$ different systems $s_{i_1}, \ldots, s_{i_k}$ failed during a time interval of length $T$, and let $t_j$ be the total outage duration of system $j$. The unavailability $U_j$ of system $j$ can be estimated as

$$U_j = t_j / T \quad \text{for } j = i_1, \ldots, i_k \qquad (3.7)$$

and $U_j = 0$ otherwise. Then, the average unavailability is

$$U = \frac{\sum_{j=1}^{K} U_j}{K} = \frac{\sum_{j=1}^{k} t_{i_j}}{KT} \qquad (3.8)$$

and the average availability is

$$A = 1 - U. \qquad (3.9)$$

Finally, the average time between failures is estimated as $KT/L$, where $L \ge k$ is the total number of failures during the time interval $T$.

There are two main difficulties with extending these estimates to routers. First, routers experience failures of a single line-card in addition to entire router failures. Second, routers may not be identical. The initial approach to overcoming these difficulties was to assign to each failure a weight that represents the fraction of the access network impacted by the failure. Such an approach is adequate for access networks consisting of routers of the same type and line-cards with port speeds in a sufficiently narrow range, which was the case for early access networks with Cisco's 7500 routers.
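The estimates in Eqs. (3.7) to (3.9) for a population of identical systems are easy to compute from an outage log. The sketch below assumes a hypothetical log format, a list of (system identifier, outage duration) records, and made-up data; the function name is our own.

```python
def fleet_metrics(outages, K, T):
    """Estimate fleet-wide metrics from outage records observed during T.

    outages : list of (system_id, duration) records, one per failure
    K       : number of identical systems in the population
    T       : length of the observation interval (same unit as duration)
    Returns (average unavailability, average availability,
             average time between failures).
    """
    total_outage = {}
    for system_id, duration in outages:
        total_outage[system_id] = total_outage.get(system_id, 0.0) + duration
    # Eq. (3.7): U_j = t_j / T for systems that failed, 0 otherwise.
    u_avg = sum(t / T for t in total_outage.values()) / K   # Eq. (3.8)
    a_avg = 1 - u_avg                                        # Eq. (3.9)
    L = len(outages)                                         # total failures
    mtbf = K * T / L if L else float("inf")
    return u_avg, a_avg, mtbf


# Hypothetical data: 100 systems observed for 1,000 h, three outages on two systems.
records = [("s7", 2.0), ("s7", 1.0), ("s42", 0.5)]
u, a, mtbf = fleet_metrics(records, K=100, T=1000.0)
print(f"U = {u:.2e}, A = {a:.6f}, time between failures = {mtbf:,.0f} h")
```

Note that the number of failures L (three here) can exceed the number of failed systems k (two here), which is why the time between failures is based on L rather than k.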
Modern access networks may consist of several router platforms, and high-speed routers may have line-cards with port speeds varying over a wide range. For these networks, averaging failures over various router platforms and line-cards with different port speeds is not sufficient. We start by presenting the existing averaging techniques and demonstrating their deficiencies, and then describe a granular approach where availability is described by a vector with components representing the availability for each type of access line-card.

Two frequently used expressions for calculating the fraction of the impacted access network are based on different parameterizations of impacted access ports in service and have the following forms [19]:

$$f = \frac{\text{Number of impacted access ports in service}}{\text{Total number of all access ports in service}} \qquad (3.10)$$

and

$$f = \frac{\text{Total bandwidth of impacted access ports in service}}{\text{Total bandwidth of all access ports in service}} \qquad (3.11)$$

Having the fraction $f_i$ of access ports impacted and the failure duration $D_i$ for each failure $i$, $i = 1, \ldots, L$, during a time interval of length $T$, we can estimate the average access unavailability and availability as

$$U_{\text{access}} = \sum_{i=1}^{L} \frac{f_i D_i}{T} \quad \text{and} \quad A_{\text{access}} = 1 - U_{\text{access}} \qquad (3.12)$$

respectively. Formally, one can use Eq. (3.12) with port-weighting or bandwidth-weighting fractions $f_i$ for estimating the average unavailability (availability) of any access network with different router platforms. However, there are several problems with these averaging techniques that limit their usefulness:

• The port-weighted fraction (3.10) emphasizes line-card failures with low-speed ports, while failures of high-speed ports are heavily discounted because the port density on a line-card is inversely proportional to the port speed.
• The bandwidth-weighted fraction (3.11) assigns lower weight to failures of line-cards with low-speed ports because they do not utilize the entire bandwidth of the line-card.
• Any averaging over different router platforms, or even over one router platform with a variety of line-cards that have different quality of hardware and software, may hide defects.

These issues are illustrated by the following example. Consider an access network consisting of 100 Cisco gigabit switch routers (GSRs) and assume that each router has two access line-cards of each of the following three types:

• Channelized OC12 with up to 336 T1 ports
• Channelized OC48: one card has up to 48 T3 ports, while the other card has either up to 16 OC3 ports (50 routers) or up to 4 OC12 ports (50 routers)
• 1-port OC48

The total number of ports in service and their respective bandwidth (BW) are shown in Table 3.2. The number of ports in the third column of Table 3.2 is obtained by multiplying the number of ports in service given in the second column of Table 3.3 by the total number of cards with the respective port speed. For T1 and OC48, the total number of cards of each type is $200 = 2 \times 100$. For T3, OC3, and OC12, the total number of cards is 100, 50, and 50, respectively. In Table 3.3, we use Eqs. (3.10) and (3.11) to calculate the port-weight and bandwidth-weight for the failure of one line-card, depending on the number of ports in service given in the second column. The bandwidth of a line-card is obtained as the product of the number of ports in service, given in the second column of Table 3.3, and the respective speed given in the second column of Table 3.2. One can see that port-weighting practically disregards failures of line-cards with OC48 and OC12 ports, while the contribution of failures of line-cards with T3 and OC3 ports is discounted relative to T1 ports by a factor of 6.7 and 20, respectively. As a result, the availability of the access network is dominated by the availability of the channelized OC12 card with T1 ports. As one could expect, bandwidth-weighting is biased toward failures of line-cards with an OC48 port.
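The port-weights and bandwidth-weights of this example follow directly from Eqs. (3.10) and (3.11). The short sketch below recomputes them from the card counts given in the text, the port speeds of Table 3.2, and the ports in service per card listed in Table 3.3; the data layout and variable names are our own.

```python
# port type: (speed in Mbps, ports in service per card, number of cards)
cards = {
    "T1":   (1.5,  200, 200),
    "T3":   (45,    30, 100),
    "OC3":  (155,   10,  50),
    "OC12": (622,    3,  50),
    "OC48": (2400,   1, 200),
}

total_ports = sum(ports * n for _, ports, n in cards.values())
total_bw = sum(speed * ports * n for speed, ports, n in cards.values())  # Mbps

# Weight of a single line-card failure under Eq. (3.10) and Eq. (3.11).
weights = {
    name: (ports / total_ports, speed * ports / total_bw)
    for name, (speed, ports, n) in cards.items()
}

print(f"total ports: {total_ports:,}, total bandwidth: {total_bw / 1000:.1f} Gbps")
for name, (p_w, bw_w) in weights.items():
    print(f"{name:5s} port-weight {p_w:.5f}  bandwidth-weight {bw_w:.5f}")
```

The ratio between the T1 and OC48 port-weights is 200, whereas the ratio between the OC48 and T1 bandwidth-weights is 8, which illustrates how strongly each scheme favors one end of the speed range.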
Under bandwidth-weighting, however, failures of the other line-cards, except for the channelized OC12 card with T1 ports, become more visible than under port-weighting.

Table 3.2 Total number of ports in service and their bandwidth
Port    Speed (Mbps)   Number of ports   BW (Gbps)
T1      1.5            40,000            60.0
T3      45             3,000             135.0
OC3     155            500               77.5
OC12    622            150               93.3
OC48    2,400          200               480.0
Total                  43,850            845.8

Table 3.3 Port-weight and bandwidth-weight per line-card
Port    In service   P-weight   BW-weight
T1      200          0.00456    0.00035
T3      30           0.00068    0.00160
OC3     10           0.00023    0.00183
OC12    3            6.8E-05    0.00221
OC48    1            2.3E-05    0.00284

As a result of these problems with port- and bandwidth-weighting techniques, a more useful approach is to evaluate average availability for each router platform and for each type of access LC separately. The increasing variety of edge routers and access line-cards justifies such an approach, since it allows the ISP to track the reliability with finer granularity. Consider a set of edge routers of the same type with $J$ types of access line-cards, which are monitored for failures during a time interval of length $T$. For each customer-impacting failure $i$, $i = 1, \ldots, L$, we record the number $n_{ij}$ of type $j$ cards affected and the respective failure duration $t_{ij}$. In the case of access line-card redundancy, only failures of the active (primary) line-card are counted, and then only if the failover to the backup line-card was not hitless. The average unavailability of a type $j$ access line-card is calculated as

$$U_j = \frac{\sum_{i=1}^{L} n_{ij} t_{ij}}{N_j T} \qquad (3.13)$$

where $N_j$ is the total number of type $j$ active cards. The average unavailability can be expressed as

$$U_j = \frac{R_j}{M_j} \qquad (3.14)$$

where

$$R_j = \frac{\sum_{i=1}^{L} n_{ij} t_{ij}}{\sum_{i=1}^{L} n_{ij}} \qquad (3.15)$$

is the average repair time for an LC of type $j$, and

$$M_j = \frac{N_j T}{\sum_{i=1}^{L} n_{ij}} \qquad (3.16)$$

can be interpreted as the average time between router failures impacting customers on access line-cards of type $j$. The metric $M_j$ can be considered an extension of the traditional field hardware MTBF.
For the field MTBF, only individual line-card failures that require card replacement are counted in the denominator. In $M_j$, we count all failures of type $j$ cards outside the maintenance window, including those caused by resets and software bugs, and all impacted cards of type $j$ in the case of an entire router failure. This distinction is important since we want a metric that accurately captures the customer impact caused by all HW and SW failures. For example, each reset of an active (primary) line-card can cause a protocol reconvergence event resulting in short-duration packet loss.

Metrics $R$, $M$, and $U$ can also be defined for the entire population of access line-cards without differentiating failures by LC type. Denote

$$N = \sum_{j=1}^{J} N_j, \quad n = \sum_{j=1}^{J} \sum_{i=1}^{L} n_{ij}, \quad t = \sum_{j=1}^{J} \sum_{i=1}^{L} n_{ij} t_{ij}. \qquad (3.17)$$

Then

$$R = \frac{t}{n}, \quad M = \frac{NT}{n} \qquad (3.18)$$

and the average unavailability is

$$U = \frac{R}{M}. \qquad (3.19)$$

The value of using $M_j$ in addition to the average unavailability is demonstrated by the following example.

Example 3.1. Consider a set of 400 routers and let $T = 1{,}000$ h. Each router has two cards of Type 1, three cards of Type 2, and five cards of Type 3. The number of failures for the entire router and for each card type, with their durations, is given in Table 3.4. In the case of single card failures, $n_{ij} = 1$ if an LC of type $j$ failed and $n_{ij} = 0$ otherwise. In the case of an entire router failure, $(n_{i1}, n_{i2}, n_{i3}) = (2, 3, 5)$. In this example, we assume a constant failure duration $t_{ij} = t_j$ for type $j$ cards and a constant duration of the entire router failure. The failure duration is measured in hours. The failure parameters in Table 3.4 are referred to as Scenario 1. We also consider a Scenario 2, in which the only difference from Scenario 1 is that the number of failures of entire routers is increased from 1 to 5. The reliability metrics for the two scenarios are given in Table 3.5.
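The entries of Table 3.5 can be cross-checked with a short script that applies Eqs. (3.13) to (3.19) to the data of Example 3.1. This is a sketch with our own variable names; the inputs are the router count, cards per router, failure counts, and durations stated in the example and in Table 3.4, and DPM is taken as the unavailability multiplied by 1,000,000.

```python
T = 1000.0                                    # observation interval, hours
ROUTERS = 400
CARDS_PER_ROUTER = {1: 2, 2: 3, 3: 5}         # LC type -> cards per router
LC_FAILURES = {1: (30, 0.8), 2: (6, 1.5), 3: (2, 0.5)}   # type: (count, hours)
ROUTER_OUTAGE_HOURS = 0.1


def metrics(router_failures):
    """Return {LC type or 'all': (R, M, DPM)} for a given number of router failures."""
    out, n_all, t_all, N_all = {}, 0, 0.0, 0
    for j, per_router in CARDS_PER_ROUTER.items():
        count, hours = LC_FAILURES[j]
        N_j = ROUTERS * per_router
        # Each whole-router failure impacts every type-j card on that router.
        n_j = count + router_failures * per_router
        t_j = count * hours + router_failures * per_router * ROUTER_OUTAGE_HOURS
        out[j] = (t_j / n_j, N_j * T / n_j, 1e6 * t_j / (N_j * T))
        n_all, t_all, N_all = n_all + n_j, t_all + t_j, N_all + N_j
    out["all"] = (t_all / n_all, N_all * T / n_all, 1e6 * t_all / (N_all * T))
    return out


for name, k in (("Scenario 1", 1), ("Scenario 2", 5)):
    print(name)
    for key, (R, M, dpm) in metrics(k).items():
        print(f"  {key!s:>3}: R={R:.2f} h  M={M:,.0f} h  DPM={dpm:.2f}")
```

Running the function with one and with five router failures gives the two halves of Table 3.5.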
The results in columns $R$ and $M$ for LC Type $j$, $j = 1, 2, 3$, and for All Cards are calculated using Eqs. (3.15), (3.16), and (3.18), respectively. The unavailability for LC Type $j$, $j = 1, 2, 3$, and for All Cards is calculated using Eqs. (3.14) and (3.19), respectively. The defects per million (DPM) is a commonly used metric that is obtained by multiplying the respective unavailability by 1,000,000. Note that for All Cards, the DPM is below 10 in both scenarios, implying a high availability exceeding 99.999% (five nines), while the average time between customer-impacting failures $M$ in Scenario 2 is almost half of that in Scenario 1. Therefore, DPM, in contrast with the average time between customer-impacting failures, is not sensitive to the frequency of short failures of the entire router.

Table 3.4 Failures and their duration: Scenario 1
Failure      # Failures   Duration
Router       1            0.1
LC type 1    30           0.8
LC type 2    6            1.5
LC type 3    2            0.5

Table 3.5 Reliability metrics
             Scenario 1                  Scenario 2
LC type      R      M         DPM        R      M        DPM
1            0.76   25,000    30.25      0.63   20,000   31.25
2            1.03   133,333   7.75       0.50   57,143   8.75
3            0.21   285,714   0.75       0.13   74,074   1.75
All Cards    0.73   83,333    8.75       0.44   45,455   9.75

If an ISP were only tracking DPM and router outages increased from one outage per 1,000 h to five outages per 1,000 h, it might miss the significant decrease in reliability as seen from the customer's perspective. The metrics in the All Cards row hide a low average time between failures and a high DPM for LC Type 1 in both scenarios. The average time between customer-impacting failures by LC type amplifies the difference between the two scenarios. For example, for LC Type 3, the average time between failures $M_3$ decreased by almost a factor of 4 in Scenario 2 in comparison with Scenario 1. This example illustrates the importance of measuring reliability metrics by the type of access line-card.
It also illustrates the significant impact that even short-duration outages of an entire router have on reliability. Furthermore, it shows why the nonstop routing and in-service software upgrade capabilities described in Section 3.2 are considered to be so important by ISPs.

3.5 End-to-End Availability

Evaluation of the end-to-end availability requires evaluation of the backbone availability in addition to the access availability discussed in Section 3.4. Given the scale and complexity of a large ISP backbone, there is no generally agreed upon approach for measuring and modeling end-to-end availability. Chapter 4 provides a fairly general approach for performance and reliability (performability) evaluation of networks consisting of independent components with a finite number of failure modes. Its application involves the steady state probability distribution that is used for calculating the expected value of a measure F defined on the set of network states. This section presents a brief overview of some results related to state aggregation and the selection of the function F for evaluating the backbone availability.

Large ISP backbones are typically designed to ensure that the network stays connected under all single-failure scenarios. Furthermore, the links are designed with enough capacity to carry the peak traffic load under all single-failure scenarios. Therefore, the majority of failures do not cause loss of backbone connectivity. Typically, when a failure happens, P routers detect the failure and trigger a failover to a backup path. If the failover were hitless and the backup path did not increase the end-to-end delay and also had enough capacity to carry all traffic, then the failure would not have any customer impact. Failures impacting customer traffic include the following events:

1. Loss of connectivity
2. Increased end-to-end delay on the backup path
3. Packet loss due to insufficient capacity of the backup path
4. Routing reconvergence triggered by the original failure. Such a reconvergence may cause packet loss for several seconds.

Assume that the duration of each event can be measured. Two approaches to measuring the backbone availability are based on knowing the actual point-to-point traffic demand matrix, which allows us to calculate the amount of impacted traffic for each event. In the first approach [20], only events 3 and 4 are included. The backbone unavailability is defined as the fraction of traffic lost over a given time period. In the second approach [21], all four events are included. Availability is measured for each origin-destination pair as the percentage of time that the network can satisfy a service-level agreement including 100% connectivity and thresholds on packet loss and delay. The main complexity in the implementation of either approach is in measuring event durations. The determination of event durations requires specially designed network instrumentation involving synthetic (active) measurements. Reference [22] describes a standardized point-to-point approach to path-level measurements, and reference [23] describes a novel approach that uses a single measurement host to collect network-wide one-way performance data. These approaches also require a well-thought-out data management infrastructure and computationally intensive processing of their output [24]. Application of the edge-to-edge availability distribution to the evaluation of VoIP (Voice over IP) reliability [25] is addressed in [26].

References

1. Malec, H. (1998). Communications reliability: A historical perspective. IEEE Transactions on Reliability, 47, 333–345.
2. Claffy, kc., Meinrath, S., & Bradner, S. (2007). The (un)economic Internet? IEEE Internet Computing, 11, 53–58.
3. Bollapragada, V., Murphy, C., & White, R. (2000). Inside Cisco IOS software architecture. Indianapolis, IN: Cisco Press.
4. Schudel, G., & Smith, D. (2008). Internet protocol operations fundamentals. In Router security strategies. Indianapolis, IN: Cisco Press.
5. Garrett, A., Drenan, G., & Morris, C. (2002). Juniper networks field guide and reference. Reading, MA: Addison-Wesley.
6. Oggerino, C. (2001). High availability network fundamentals: A practical guide to predicting network availability. Indianapolis, IN: Cisco Press.
7. T640 Internet router node overview, from http://www.juniper.net/techpubs/software/nog/noghardware/download/t640-router.pdf.
8. Route flapping, from http://en.wikipedia.org/wiki/Route_flapping.
9. Cisco nonstop forwarding with stateful switchover (2006). Deployment guide. Cisco Systems, from http://www.cisco.com/en/US/technologies/tk869/tk769/technologies white paper0900 aecd801dc5e2.html.
10. Graceful restart concepts, from http://www.juniper.net/techpubs/software/junos/junos93/swconfig-high-availability/graceful-restart-concepts.html#section-graceful-restart-concepts.
11. Sangli, S., Chen, E., Fernando, R., & Rekhter, Y. (2007). Graceful restart mechanism for BGP. RFC 4724. Internet Official Protocol Standards, from http://www.ietf.org/rfc/rfc4724.txt.
12. Kaplan, H. (2002). NSR Non-stop routing technology. White paper. Avici Systems Inc., from http://www.avici.com/technology/whitepapers/reliability series/NSRTechnology.pdf.
13. Router high availability for IP networks (2005). White paper. Alcatel, from http://www.telecomreview.ca/eic/site/tprp-gecrt.nsf/vwapj/Router HA for IP.pdf/$FILE/Router HA for IP.pdf.
14. ISSU: A planned upgrade tool (2009). White paper. Juniper Networks, from http://www.juniper.net/us/en/local/pdf/whitepapers/2000280-en.pdf.
15. Cisco IOS XE In Service Software Upgrade process (2009). Cisco Systems, from http://www.cisco.com/en/US/docs/ios/ios xe/ha/configuration/guide/ha-inserv updg xe.pdf.
16. Single-router APS for the Cisco 12000 series router, from http://www.cisco.com/en/US/docs/ios/12 0s/feature/guide/12ssraps.pdf.
17. Agrawal, M., Bailey, S., Greenberg, A., et al. (2006). RouterFarm: Towards a dynamic manageable network edge. In SIGCOMM'06 Workshops, Pisa, Italy.
18. Ross, S. (1989). Introduction to probability models. San Diego, CA: Academic.
19. Access availability of routers in IP-based networks (2003). Committee T1 tech rep T1.TR.78–2003.
20. Kogan, Y., Choudhury, G., & Tarapore, P. (2004). Evaluation of impact of backbone outages in IP networks. In ITCOM 2004, Philadelphia, PA.
21. Wang, H., Gerber, A., Greenberg, A., et al. (2007). Towards quantification of IP network reliability, from http://www.research.att.com/jiawang/rmodel-poster.pdf.
22. Ciavattone, L., Morton, A., & Ramachandran, G. (2003). Standardized active measurements on a Tier 1 IP backbone. IEEE Communications Magazine, 41, 90–97.
23. Burch, L., & Chase, C. (2005). Monitoring link delays with one measurement host. ACM SIGMETRICS Performance Evaluation Review, 33, 10–17.
24. Choudhury, G., Eisenberg, M., Hoeflin, D., et al. (2007). New reliability metrics and measurement techniques for IP networks. Proceedings of Distributed computer and communication networks, RAS, Moscow, 126–130.
25. Johnson, C., Kogan, Y., Levy, Y., et al. (2004). VoIP Reliability: A service provider perspective. IEEE Communications Magazine, 42, 48–54.
26. Lai, W., Levy, Y., & Saheban, F. (2007). Characterizing IP network availability and VoIP service reliability. Proceedings of Distributed computer and communication networks, RAS, Moscow, 126–130.

Chapter 4
Network Performability Evaluation
Kostas N. Oikonomou

4.1 Introduction

This chapter is an introduction to the area of performability evaluation of networks.
The term performability, which stands for performance plus reliability, was introduced in the 1980s in connection with the performance evaluation of fault-tolerant, degradable computer systems [23].¹ In network performability evaluation, we are interested in investigating a network’s performance not only in the “perfect” state, where all network elements are operating properly, but also in states where some elements have failed or are operating in a degraded mode (see, e.g., [8]). The following example introduces the main ideas.

Consider the network (graph) of Fig. 4.1. On the left, the network is in its perfect state, and on the right one node and one edge have failed.² Node and edge failures occur independently, according to certain probabilities, which we assume to be known. An assignment of “working” or “failed” states to the network elements defines a state of the network. By the independence assumption, the probability of that state is the product of the state probabilities of the elements. There are two traffic flows in this network: one from node 1 to node 5, and the other from 7 to 3. The flows are deterministic, of constant size, and there is no queuing at the nodes. Our interest is in the latency of each flow, defined as the minimum number of hops (edges) that the flow must traverse to get to its destination when it is routed on the shortest path. In each state of the network, a flow has a given latency: in the perfect state, both flows have latency 2 (hops), but in the example failure state the first flow has latency 3 and the second has latency $\infty$, because its source, node 7, has failed. The simplest characterization of the latency metric would be to find its expected value over the possible network states, of which there are $2^{17} \approx 130{,}000$. A more complete characterization would be to find its entire probability distribution. This would allow one to answer questions such as “what is the probability that the latency of flow 1 does not exceed 3?”, and “what upper bound on the latency of flow 2 can be guaranteed with probability 0.999?”. The answers to these questions (performability guarantees) are useful in setting performance targets for the network, or SLAs. This basic example illustrates several points, all of which will be covered in more detail in later sections.

[Fig. 4.1 A 7-node, 10-edge network with $2^{17}$ possible states. Left panel: perfect state. Right panel: failure of node 7 and edge (1,6). The performance metric is traffic latency, measured in hops.]

¹ Unfortunately, the terminology is not completely standard and some authors still use the term “reliability” for what we call performability; see, e.g., [1]. One may also encounter other terms such as “availability” or “dependability”.
² When a node fails, we consider that all edges incident to it also fail.

K.N. Oikonomou, 200 Laurel Ave, Middletown, NJ, 07748, e-mail: ko@research.att.com

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_4, © Springer-Verlag London Limited 2010

Reliability/Performance Trade-Off in the Analysis

A fundamental fact is that the size of the state space is exponential in the number of network elements. In the above example, if the number of network elements is doubled, the number of network states becomes about $1.7 \times 10^{10}$, and this is still a small network, with only 34 elements; a network model with several hundred elements would be much more typical. This means that for any realistic network model the state space is practically infinite, so the amount of work that can be done in each state to compute the performance metrics is limited. In other words, in performability analysis there is a fundamental trade-off between the reliability (state space) and performance aspects.
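As a quick check on these counts, the following Python sketch computes the size of the state space for the 17-element example and the probability of a single state as a product over independent elements. The element names and the uniform failure probability of 0.01 are assumptions made purely for illustration; the chapter does not give the actual values.

```python
from math import prod

def state_probability(failed, fail_prob):
    """Probability of the state in which exactly the elements in `failed` are down.

    Relies on the independence assumption: the state probability is the
    product of the per-element working/failed probabilities.
    """
    return prod(q if e in failed else 1.0 - q for e, q in fail_prob.items())

# 7 nodes + 10 edges = 17 elements (names are hypothetical labels).
elements = [f"node{i}" for i in range(1, 8)] + [f"edge{i}" for i in range(1, 11)]
fail_prob = {e: 0.01 for e in elements}  # assumed failure probability per element

num_states = 2 ** len(elements)                 # size of the state space
num_states_doubled = 2 ** (2 * len(elements))   # after doubling the element count
p_perfect = state_probability(set(), fail_prob)
p_example = state_probability({"node7", "edge1"}, fail_prob)

print(num_states, num_states_doubled)
print(p_perfect, p_example)
```

Even in this toy setting the perfect state alone carries most of the probability, which anticipates the "most probable states" idea discussed later in the chapter.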
A consequence of this trade-off is that the performance model cannot be as detailed as it would be in a pure performance analysis: in the example, we assumed constant traffic flows and no queuing at nodes. Another aspect of the trade-off is that only the investigation of the steady-state behavior of the model is, in general, feasible: in the example, we treated the network elements as two-valued random variables, not as two-state random processes. However, a mitigating factor is that the network states generally have very different probabilities, so that we may be able to calculate bounds on the performance metrics by computing their values only on a reasonable number of states, those with high probability. With this fundamental trade-off in mind, we now discuss ways in which the simple performability model of the example can be extended.

Enhancements to the Simple Model

To make the model presented in the example more useful for a realistic analysis, we could add capacities to the graph’s edges. We could also add sizes to the traffic flows, and have more sophisticated routing that allows only shortest paths that have enough capacity for a flow. Further, for a better latency measure, we could add lengths to the graph edges. Another category of enhancements would be aimed at representing failures more realistically. To begin with, the network elements could be allowed to have more than one failure mode, e.g., an edge could operate at full capacity, half capacity, or zero capacity (fail). We could separate the network elements from the entities that fail by introducing “components” that have failure modes and affect the graph elements in certain ways. For example, such a component could represent an optical fiber over which two graph edges are carried, and whose failure (cut) would fail both of these edges at the same time.
In Section 4.2 we describe a hierarchical network model that has all the features mentioned above, among others. Finally, we could allow different types of routing for traffic flows, and also introduce the notion of network restoration into the model. These additions are described in Section 4.3.

Network Performability in the Literature

A number of network performability studies have appeared in the literature. Levy and Wirth [21] investigate the call completion rate in a communications network. Alvarez et al. [4] study performability guarantees for the time required to satisfy a web request in a network with up to 50 nodes, where only nodes can fail, but without restoration. Levendovszky et al. [19] study the expected lost traffic in the Hungarian backbone SDH network with 52 nodes and 59 links, and no restoration. Carlier et al. [7] use a three-level network model, and study expected lost traffic in a 111-node, 180-link network using k-shortest path restoration. Gomes and Craveirinha [12] study a 46-node, 693-link representation of the Lisbon urban network with a three-level performability model, and compute blocking probabilities for a Poisson model of the network traffic, with no restoration. Finally, layered specification of a network for the purposes of performability evaluation has been used in [7, 12], which separate the network into a “physical” and a “functional” layer, and in [22], which uses a special-purpose separation into “node cluster” and “call-processing path” layers. Some further references are given in Section 4.4.3.

Chapter Outline

In Section 4.2 we describe a four-level, hierarchical network model, suited for performability analysis, and illustrate it with an IP-over-optical network example. In Section 4.3 we discuss the performability evaluation problem in general, give a mathematical formulation, present the state-generation approach to the performability evaluation of networks, and discuss basic performance measures and related issues. We also introduce the nperf network performability analyzer, a software package developed at AT&T Labs Research. In Section 4.4 we conclude by presenting two case studies that illustrate the material of this chapter, the first involving an IPTV distribution network, and the second dealing with architecture choices for network access.

4.2 Hierarchical Network Model

For the purpose of our performability modeling, we will think of a “real” network as consisting of three layers³: a traffic layer, a transport layer, and a physical layer. On the other hand, as shown in Fig. 4.2, our performability model is divided into four levels: traffic, graph, component, and reliability. (In terms of the ISO OSI reference model, both models address layers 1 through 3.) To illustrate the correspondence between the three network layers and the four model levels, we use the case of an IP-over-optical “real” network. The four-level performability model applies to many other types of real networks as well: for example, Oikonomou et al. [25] describe its application to a set of satellites that communicate among themselves and with a set of ground stations via microwave links, where the ground stations are interconnected by a terrestrial network.

4.2.1 IP-Over-Optical Network Example

A modern commercial packet network typically consists of IP routers connected by links, which are transported by an underlying optical network. We describe how we model the traffic, transport, and physical layers of such a network, and how we map them to the levels of the performability model in Fig. 4.2. (For more on this topic, see Chapter 2.)

Traffic Layer

Based on an estimate of the peak or average traffic pattern, we create a matrix giving the demand or “flow” between each pair of routers. (Methods for creating such a traffic matrix from measurements are described in Chapter 5.) A demand has a rate, a unit, and possibly a type or class associated with it.
³ We say “real” because any description is itself at some level of abstraction and omits aspects which may be important if one adopts a different viewpoint.

[Fig. 4.2 The four-level network performability model used by the nperf performability analyzer. Labels recoverable from the figure: point-to-point demand (traffic level); routing and restoration (graph level); component level; reliability level, drawn as two-state working/failed (W/F) processes with failure rates $\lambda_1, \ldots, \lambda_4$ and repair rates $\mu_1, \ldots, \mu_4$. F is the performance measure, discussed in Section 4.3.3.]

Transport Layer Nodes

A network node represents an IP router. At the component level this node expands into a data plane, a control plane, a hardware and software upgrade component, and a number of networking interfaces (line cards/ports). The data plane, or switching fabric, is responsible for routing packets, while the control plane computes routing tables and processes other network signaling protocols, such as OSPF or BGP. When a data plane component fails, all the links incident to its router fail. When a control plane component fails, the router continues to switch packets, but cannot participate in rerouting, including restoration. Failure of a port component fails the corresponding link(s). The “upgrade” component represents the fact that, periodically, the router is effectively down because it is undergoing an upgrade of its hardware or software. (This is by no means a very sophisticated router reliability model; see Chapter 3. It does, however, exemplify the performance-reliability trade-off discussed in Section 4.1.) Finally, fix one of the above classes of components, say router cards. At the reliability level we think of all these components as independent copies of a continuous-time Markov process (see, e.g., [5] or [6]) with failure transition rate $\lambda$ and repair transition rate $\mu$, which may be specified in terms of MTBF (mean time between failures, $= 1/\lambda$) and MTTR (mean time to repair, $= 1/\mu$).
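The steady-state working and failed probabilities implied by this two-state model follow from the standard balance equation for a two-state continuous-time Markov chain, $p\lambda = q\mu$ with $p + q = 1$. The short Python sketch below applies it; the function name and the MTBF/MTTR figures are hypothetical and chosen only to illustrate the arithmetic.

```python
def binary_component_probabilities(mtbf_hours, mttr_hours):
    """Steady-state (working, failed) probabilities of a two-state component.

    lambda = 1/MTBF is the failure rate and mu = 1/MTTR the repair rate.
    Balancing the flow between the two states gives
    p = mu / (lambda + mu) and q = lambda / (lambda + mu).
    """
    lam = 1.0 / mtbf_hours
    mu = 1.0 / mttr_hours
    p_working = mu / (lam + mu)
    q_failed = lam / (lam + mu)
    return p_working, q_failed

# Hypothetical router card: MTBF of 100,000 hours, MTTR of 4 hours.
p, q = binary_component_probabilities(100_000.0, 4.0)
print(f"working: {p:.6f}, failed: {q:.6f}")
```

Equivalently, $q = \mathrm{MTTR} / (\mathrm{MTBF} + \mathrm{MTTR})$, which is often the more convenient form when the inputs are given as times rather than rates.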
Transport Layer Links

A link between routers fails if either of the port components at its endpoints fails, if a data plane of one of the endpoint nodes fails, or if a lower-layer component over which the link is routed fails (e.g., a network span, discussed next). Two network nodes may be connected by multiple parallel links. These parallel links may be grouped into a type of virtual link called a composite or bundled link, whose capacity is the sum of the capacities of its constituent links. For the purposes of IP routing, the routers see only a single bundled link. When a constituent link fails, the capacity of the bundled link is reduced accordingly. A bundled link fails (or more precisely is “taken out of service”) when the aggregate capacity of its non-failed constituent links falls below a specified threshold.

Physical Layer Spans

We use the term “span” to refer to the network equipment and media (e.g., optical fiber) at the physical layer that carries the transport-layer links. Failure of a span component affects all transport-layer links which are routed over this span. When modeling an IP-over-optical layered network, the physical layer usually uses dense wavelength division multiplexing (DWDM), and a span consists of a concatenation of point-to-point DWDM systems called optical transport systems (OTS).⁴ In turn, an OTS is composed of many elements, such as optical multiplexers/demultiplexers, optical amplifiers, and optical transponders. Also, a basic constraint in commercial transport networks is that a span is considered to be working only if both of its directions are working. With this assumption, it is not difficult to compute the failure probability of a span based on the failure probabilities of its individual elements in both directions. Thus, for simplicity, we generally represent a network span by a single “lumped” component whose MTBF and MTTR are calculated as explained in [28].
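One simple way to carry out the span calculation mentioned above is to treat the span as a series system: it works only if every element in both directions works. The Python sketch below does this under the added assumption that the elements fail independently; the element failure probabilities are hypothetical, and the actual lumping procedure used for nperf is the one described in [28], which may differ.

```python
from math import prod

def span_failure_probability(direction_a, direction_b):
    """Failure probability of a span modeled as a series system.

    Each argument lists the failure probabilities of the elements
    (multiplexers, amplifiers, transponders, ...) in one direction.
    The span works only if all elements in both directions work;
    independence between elements is an assumption of this sketch.
    """
    p_working = prod(1.0 - q for q in direction_a) * prod(1.0 - q for q in direction_b)
    return 1.0 - p_working

# Hypothetical element failure probabilities for the two directions of a span.
east_to_west = [1e-4, 2e-4, 1e-4]
west_to_east = [1e-4, 2e-4, 1e-4]
q_span = span_failure_probability(east_to_west, west_to_east)
print(f"span failure probability: {q_span:.6e}")
```

For small element failure probabilities the result is close to the sum of the individual probabilities, which gives a useful sanity check on the lumped value.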
⁴ There are more complex DWDM systems with various optically-transparent “add/drop” capabilities, which, for simplicity, we do not discuss here.

Other Types of Components

A set of fibers that is likely to fail together because they are contained in a single conduit/bundle can be represented by a fiber cut component that brings down all network spans (hence all the higher IP-layer links) that include this fiber bundle. Other types of catastrophic failures of sets of graph nodes and edges may be similarly represented. So far we have mentioned only binary components, i.e., with just two modes of operation, “working” or “failed”. We discuss components with more than two modes in Section 4.2.2.2.

4.2.2 More on the Graph and Component Levels

4.2.2.1 Graph Element Attributes

The graph is the level of the performability model at which the network routing and restoration algorithms operate. Graph edges have associated capacities and (routing) costs. In general, an edge’s capacity can be a vector, and this vector has a capacity threshold associated with it, such that the edge is considered failed if the sum of the capacities of its non-failed elements falls below the threshold. An edge with vector capacity can directly represent a bundled link. The nperf performability analyzer presented in Section 4.3 also allows many other attributes for edges, such as lengths, latencies, etc., as well as operations on these attributes. These operations are covered in Section 4.2.2.3.

4.2.2.2 Multi-Mode Components

Each component, representing an independent failure or degradation mechanism, has a single working mode and an arbitrary number of failure modes. If it has a single failure mode it is referred to as a “binary” component, otherwise it is called “multi-mode”. In the nperf analyzer a component is represented by a star Markov process, as shown in Fig. 4.3.
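The steady-state mode probabilities of such a star process have a simple closed form. Writing $\lambda_i$ for the rate of entering failure mode $i$ from the working mode and $\mu_i$ for the rate of returning from it (the labels of Fig. 4.3), the balance equations $\pi_w \lambda_i = \pi_i \mu_i$ together with normalization give $\pi_w = 1/(1 + \sum_i \lambda_i/\mu_i)$ and $\pi_i = \pi_w \lambda_i/\mu_i$. The Python sketch below is only an illustration of this standard result, not nperf code, and the rates in it are hypothetical.

```python
def star_mode_probabilities(rates):
    """Steady-state probabilities [pi_w, pi_1, ..., pi_m] of a star Markov process.

    `rates` is a list of (lambda_i, mu_i) pairs, one per failure mode:
    lambda_i is the rate of entering mode i from the working mode,
    mu_i the rate of returning to the working mode.
    """
    ratios = [lam / mu for lam, mu in rates]
    pi_w = 1.0 / (1.0 + sum(ratios))
    return [pi_w] + [r * pi_w for r in ratios]

# Hypothetical three-mode component (one working mode, two failure modes).
probs = star_mode_probabilities([(1.0, 3.0), (1.0, 3.0)])
print(probs)
```

With a single (lambda, mu) pair the function reduces to the binary component case.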
At the reliability level, the $i$-th failure mode of a particular component is defined by its mean time between failures and its mean time to repair, by setting $\lambda_i = 1/\mathrm{MTBF}_i$ and $\mu_i = 1/\mathrm{MTTR}_i$. We now give some examples of using multi-mode components in network modeling.

[Fig. 4.3 A multi-mode component with $m$ failure modes $f_1, \ldots, f_m$ (left), and the special case of a binary component (right). The components are continuous-time Markov processes of the “star” form. The $i$-th mode is entered with (failure) rate $\lambda_i$ and exited with (repair) rate $\mu_i$.]

Router Upgrades

We mentioned in Section 4.2.1 (binary) software and hardware upgrade components for routers. Now suppose that there is an intelligent network maintenance policy in place, by which router upgrades are scheduled so that only one router in the network undergoes a software or hardware upgrade at any time. This policy cannot be modeled by using binary upgrade components associated with the routers because, by independence, nothing prevents more than one of them from failing at a time. However, for an $n$-router network, the mutually exclusive upgrade events can be represented by defining an $(n+1)$-mode component whose mode 1 corresponds to no upgrades occurring anywhere in the network, and each of the remaining $n$ modes corresponds to the upgrade of a single router.

Traffic Matrix

Suppose we want to take into account daily variations in traffic patterns/levels, e.g., for 60% of a typical day the traffic is represented by matrix $T_1$, for 20% by matrix $T_2$, and for another 20% by matrix $T_3$. This can be done by letting the traffic matrix be controlled by a multi-mode component whose modes $w, f_1, f_2$ have probabilities 0.6, 0.2, 0.2, respectively, and set the traffic matrix to $T_1, T_2, T_3$, respectively.

Restoration

Figure 4.2 implicitly assumes that network restoration happens at only one level.
However, multi-mode components afford the capability to model restoration occurring at more than one network layer. The details of how this is done, using the example of IP over SONET, can be found in [25].

4.2.2.3 Failure Mapping

Recall that failure of a binary component may affect a whole set of graph-level elements: the spans of Section 4.2.1 are an example. More generally, when a multi-mode component enters one of its failure modes, the effect on a graph element is to change some of the element’s attributes. For example, the capacity of an edge may decrease, or a node may become unable to perform routing. Depending on the final values of the attributes, e.g., total edge capacity $\le$ some threshold, the graph element may be considered “failed”. We refer to the effects of the components on the graph as the component-to-graph-level failure mapping. Some of the ways that a component can affect a graph element attribute are to add a constant to it, subtract a constant from it, multiply it by a constant, or set its value to a constant.

4.3 The nperf Network Performability Analyzer

In this section, we begin by discussing how the general (i.e., not specific to networks) performability evaluation problem can be defined mathematically, and then discuss various aspects of this definition. We then review the so-called state generation approach to performability evaluation, and some basic ingredients of the performance measures used when evaluating the performability of networks. We finally present an outline of the nperf network performability analyzer, a tool developed at AT&T Labs Research. Useful background on performability in general is in [16] and in [32]. A more extensive reference on the nperf analyzer itself and the material of this section is [28].

4.3.1 The Performability Evaluation Problem

It is useful to understand the mathematical formulation of the network performability evaluation problem.
Let $C = \{c_1, \ldots, c_n\}$ be a set of “components”, each of which is either working or failed. (As already mentioned in Section 4.2.2, components can be in more than two states, called “modes” to distinguish them from network states, but to simplify the exposition here we restrict ourselves to two-mode, or “binary”, components.) Abstractly, a component represents a failure or degradation mechanism; examples were given in Section 4.2.1. Component $c_i$ is in its working mode with probability $p_i$ and in its failed mode with probability $q_i = 1 - p_i$, both assumed known. Our basic assumption is that all components are independent of one another, so that, e.g., the probability that $c_i$ is down, $c_j$ is up, and $c_k$ is down is $q_i p_j q_k$. A network state is an assignment of a mode to every component in $C$ and can be represented by a binary $n$-vector. The set of all network states $S(C)$ has size $2^n$, and the probability of a particular state is the product of the mode probabilities of the $n$ components. Let $F$ be a vector-valued performance measure (a function) defined on $S(C)$, mapping each state to an $m$-tuple of real numbers; examples are given in Section 4.3.3. The performability evaluation problem consists in computing the expected value of the measure $F$ over the set $S(C)$ of network states:

$$\bar{F} = \sum_{s \in S(C)} F(s) \Pr(s). \tag{4.1}$$

There are various points to note here.

Complexity

It is well known that the exact evaluation of (4.1) is difficult, even if $F$ is very simple. Intuitively this is because the size of the state space $S(C)$ is exponential in the size of the set of components $C$. For a more precise demonstration of the complexity, suppose that each component corresponds to an edge of a graph, the graph’s nodes do not fail, and we want to know the probability that there is a path between two specific nodes $a$ and $b$ of the graph.
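To make the setup concrete, the Python sketch below evaluates (4.1) by brute force for exactly this situation: components are edges, nodes never fail, and $F(s)$ is taken to be 1 when $a$ and $b$ are connected in state $s$ and 0 otherwise. The three-edge graph and the working probability of 0.9 are hypothetical, and full enumeration of all $2^n$ states is shown only because $n$ is tiny here.

```python
from itertools import product

def connected(nodes, up_edges, a, b):
    """Depth-first search: is b reachable from a using only working edges?"""
    adj = {v: set() for v in nodes}
    for u, v in up_edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {a}, [a]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return b in seen

def expected_connectivity(nodes, edges, p_work, a, b):
    """Sum F(s) * Pr(s) over all 2^n edge states, as in (4.1)."""
    total = 0.0
    for state in product([True, False], repeat=len(edges)):
        prob = 1.0
        for e, up in zip(edges, state):
            prob *= p_work[e] if up else 1.0 - p_work[e]
        up_edges = [e for e, up in zip(edges, state) if up]
        total += (1.0 if connected(nodes, up_edges, a, b) else 0.0) * prob
    return total

# Hypothetical toy graph: a direct edge a-b plus a two-hop detour through c.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("a", "c"), ("c", "b")]
p_work = {e: 0.9 for e in edges}
reliability = expected_connectivity(nodes, edges, p_work, "a", "b")
print(reliability)
```

The loop body runs $2^n$ times, so each additional edge doubles the work; this is the exponential growth that rules out exact enumeration for realistic models.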
This is known as the TWO-TERMINAL NETWORK RELIABILITY evaluation problem, and in this case $F$ takes only two values: $F(s)$ is 1 if there is a path from $a$ to $b$ in the graph state $s$, and 0 otherwise. Despite the very simple $F$, this problem is known to be #P-complete (see, e.g., [15, 32], or [8]). A consequence of this computational complexity is that, in general, only approximate performability evaluation is feasible. We will return to this in Section 4.3.2.

Performability Guarantees

In practice, we are interested in computing more sophisticated characteristics of $F$ than its expectation $\bar{F}$, such as the probability, over the set of network states, that $F$ is less than some number $x$, or greater than some number $y$. For example, we may want to claim that “with probability at least 99.9%, at most 2% of the total traffic is down, and with probability at least 90% at most 10% of it is down”. Formally, such claims are statements of the type

$$\Pr(F < x_1) \ge P_1,\ \Pr(F < x_2) \ge P_2,\ \ldots \quad \text{or} \quad \Pr(F > y_1) \le Q_1,\ \Pr(F > y_2) \le Q_2,\ \ldots \tag{4.2}$$

that hold over the entire network state space; they are known as performability guarantees, and they can, for example, be used to set SLAs. The important point is that the computation of (4.2) reduces easily to just the computation of expectations of the type (4.1); see, e.g., [28].

Network

When we are using the formalism leading to (4.1) to evaluate the performability of a network, all the complexity is in the measure $F$. As Fig. 4.2 shows, $F$ then includes the failure mapping from the component to the graph level, the routing and restoration algorithms, and the traffic level.

Time

Recalling the reliability level of Fig. 4.2, each $c_i$ is in reality a two-state Markov process, whose state fluctuates in time. If so, what is the meaning of the expectation $\bar{F}$ of the measure $F$?
It can be shown that if we average $F$ over a long time as the network moves through its states, this average will approach $\bar{F}$, if we take the probabilities $p_i$ and $q_i$ associated with $c_i$ to be the steady-state probabilities of the working and failed states of the Markov process representing $c_i$.

Steady State

The reader familiar with the performance analysis of Markov reward models (see, e.g., [5, 11]) will recognize that the definition (4.1) of the performability evaluation problem is based on steady-state expectations of measures. In many cases, however, it is transient (also known as finite-time) measures that are of interest. The evaluation of such measures on very large state spaces is much more difficult than that of steady-state measures, and is outside the scope of the treatment in this chapter, but it is currently an area of further development of the nperf tool.

4.3.2 State Generation and Bounds

A number of approaches to computing the expectation $\bar{F}$ in (4.1) approximately have been developed. Without attempting to be comprehensive, they can be classified into (a) upper and lower bounds for certain $F$ such as connectivity (using the notions of cut and path sets), or special network/graph structures (see [16, 32]), (b) “most probable states” methods ([13, 14, 16, 17, 31–33]), (c) Monte Carlo sampling approaches ([7, 16]), and (d) probabilistic approximation algorithms for simple $F$, e.g., [18]. Methods of types (a) and (b) produce algebraic bounds on $\bar{F}$ (i.e., not involving any random sampling), while (c) and (d) yield statistical bounds. Here we will discuss the “most probable states” methods, which are algorithms for generating network states in order of decreasing probability. The rationale is that if the component failure probabilities are small, most of the probability mass is concentrated on a relatively small fraction of the state space.
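This concentration of probability mass is easy to quantify in a simplified setting. The Python sketch below assumes $n$ identical, independent binary components with a common failure probability $q$ (a simplification; real models have heterogeneous components) and reports how much probability is covered by the states with at most $k$ failed components. The values of $n$, $q$, and $k$ are hypothetical.

```python
from math import comb

def mass_with_at_most_k_failures(n, q, k):
    """Total probability of all states with at most k failed components,
    for n independent binary components that each fail with probability q."""
    return sum(comb(n, j) * q**j * (1.0 - q) ** (n - j) for j in range(k + 1))

def states_with_at_most_k_failures(n, k):
    """Number of states having at most k failed components."""
    return sum(comb(n, j) for j in range(k + 1))

n, q, k = 100, 0.001, 2          # hypothetical model size and failure probability
covered_mass = mass_with_at_most_k_failures(n, q, k)
covered_states = states_with_at_most_k_failures(n, k)
fraction_of_space = covered_states / 2**n

print(f"{covered_states} of 2^{n} states carry probability {covered_mass:.6f}")
print(f"fraction of state space examined: {fraction_of_space:.3e}")
```

With identical components, ordering states by number of failures coincides with ordering them by decreasing probability whenever $q < 1/2$, so this count is what a most-probable-states generator would visit first.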
Thus, as these methods generate states one by one and evaluate $F$ on them, they are attempting to update $\bar{F}$ with terms of highest value first. The most probable states methods are particularly well suited to evaluating the performability of complex networks because they make no assumptions (at least to first order) about what the performance measure $F$ might be or what properties it might have, which is especially important in view of the fact that the complexity of network routing and restoration schemes is included in $F$. The classical algorithms of [13, 33] apply to systems of only binary components, whereas the algorithms of [14, 17, 30] can handle arbitrary multi-mode components. nperf uses a hybrid state-generation algorithm described in [28], which handles arbitrary multi-mode components and is suited especially to “mostly binary” systems, that is, systems where the proportion of components with more than two modes is small. We find that such systems dominate performability models for practical networks.

To explain what we mean by “at least to first order”, let $\omega$ and $\alpha$ be the smallest and largest values of $F$ over $S(C)$, and suppose we generate the $k$ highest-probability elements of $S(C)$. If these states have total probability $P$, we have the algebraic lower and upper bounds on $\bar{F}$

$$\bar{F}_l = \sum_{i=1}^{k} F(s_i) \Pr(s_i) + (1 - P)\,\omega, \qquad \bar{F}_u = \sum_{i=1}^{k} F(s_i) \Pr(s_i) + (1 - P)\,\alpha, \tag{4.3}$$

first pointed out in [20]. The bounds (4.3) are valid for arbitrary $F$, but may sometimes require the generation of a large number of states to achieve a small enough $\bar{F}_u - \bar{F}_l = (1 - P)(\alpha - \omega)$. Tighter bounds are possible, but only by requiring $F$ to have some special property, such as monotonicity, limited growth, etc. See [27] for further details.

4.3.3 Performance Measures

There are two measures of fundamental importance in network performability analysis, both having to do with lost traffic.
These are

$$t_{lnr}(s) = \text{total traffic lost because of no route in } s, \qquad t_{lcg}(s) = \text{total traffic lost because of congestion in } s. \tag{4.4}$$

(We do not mean to imply that these are the only measures of importance. Depending on the application, the focus may shift to considerations other than lost traffic, e.g., to latency, or to many others.) To define terms, we refer to the IP-over-optical example of Section 4.2.1. A demand corresponds to a source-destination pair of routers; we use traffic to mean the size (volume) of a demand, or of a set of demands. The definition of $t_{lnr}$ is straightforward: a demand fails if a link (multi-edge) on its route fails, and a failed demand is lost because of no route if no path for it can be found after the network restoration process completes. $t_{lnr}(s)$ is the sum of the volumes of all lost demands in state $s$.

Our definition of $t_{lcg}$ is more involved.⁵ If the network routing allows congestion, a demand is congested if its route includes an edge with utilization that exceeds a threshold $U_c$. $t_{lcg}$ is a certain function (not the sum) of all congested demands. Suppose we fix a routing $R$ in state $s$; then we define $t_{lcg}$ to be the total traffic offered to the network minus the maximum possible total flow $F$ that can be carried in state $s$ using routing $R$ without congestion. Here “there is congestion under $R$” means “there is a (working) edge with utilization above the threshold $U_c$”. Equation (4.5) formalizes this definition. Note that if the network uses flow control, such as TCP in an IP network, the flow control will “throttle” traffic as soon as it detects congestion, so that few packets will be really lost; in that case it is more accurate to call our measure loss in network bandwidth. Now using the “link-path” formulation [29], let $D$ be the set of all subdemands (path flows) and $D(e, R)$ be the set of subdemands using the non-failed edge $e$ under the routing $R$. Also let $f_d$ be the flow corresponding to subdemand $d$.
Then $F$ is the solution of the linear program

$$F = \max \sum_{d \in D} f_d \tag{4.5}$$

$$\text{subject to} \quad \forall e:\ \sum_{d \in D(e,R)} f_d \le U_c\, c_e, \qquad f_d \le v_d,$$

where $c_e$ is the capacity of edge $e$ and $v_d$ the volume of demand $d$.

⁵ This definition is by no means unique; we claim only that it is useful in a wide variety of contexts.

Consistent with what we noted in Section 4.3.1, the above discussion centered around steady-state expectations of measures as the quantities of interest. In the context of the case study in Section 4.4.2 we will touch on one interesting sub-class of finite-time measures, event counts.

4.3.4 Network Routing and Restoration

The presence of network routing and restoration in the performance measure makes the performability analysis of networks different from other such analyses. The nperf analyzer incorporates three main kinds of network routing methods:

Uncapacitated Minimum-Cost: This is meant to represent routing by, e.g., the OSPF (Open Shortest Path First) protocol [24]. Link costs correspond to OSPF administrative weights. OSPF path computation does not take into account the capacities or utilizations of the links. Another main IP-layer routing protocol, IS-IS (Intermediate System–Intermediate System), behaves similarly for our purposes.

“Optimal” Routing: This routing is based on multi-commodity flows ([2, 29]). nperf incorporates both integral and non-integral (“real”) multi-commodity flow methods. These methods could be regarded as representing variants of OSPF-TE. Details are in [28].

Multicast Routing: This type of routing sends the traffic originating from a source node on a shortest-path tree rooted at this node and spanning a set of destination nodes. The shortest paths to the destinations are determined by so-called reverse-path forwarding.

These routing methods are not meant to be emulations of real network protocols; they include only the features of these protocols that are important for the kind of analysis that nperf is aimed at.
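As an illustration of the first method, the Python sketch below computes an uncapacitated minimum-cost path with Dijkstra's algorithm and recomputes it after an edge failure. It is a generic textbook routine, not nperf's implementation; the three-node graph and its costs are hypothetical.

```python
import heapq

def min_cost_path(graph, src, dst):
    """Dijkstra's algorithm on {node: {neighbor: cost}} with non-negative costs.

    Capacities and utilizations are ignored, as in uncapacitated routing.
    Returns (cost, path), or (inf, []) if dst is unreachable.
    """
    dist = {src: 0.0}
    prev = {}
    done = set()
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v, cost in graph.get(u, {}).items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return float("inf"), []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]

# Hypothetical graph with administrative link costs.
perfect = {"A": {"B": 1, "C": 5}, "B": {"A": 1, "C": 1}, "C": {"A": 5, "B": 1}}
# Same graph in a state where edge B-C has failed.
degraded = {"A": {"B": 1, "C": 5}, "B": {"A": 1}, "C": {"A": 5}}

print(min_cost_path(perfect, "A", "C"))
print(min_cost_path(degraded, "A", "C"))
```

Like the routing methods it stands in for, the sketch captures only path selection and leaves out protocol mechanics.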
In particular, a lot of details associated with timing and signaling are absent (another instance of the reliability/performance trade-off noted in Section 4.1).

4.3.5 Outline of the nperf Analyzer

With the above material in mind, Fig. 4.4 depicts the structure of the core of the nperf tool. At the top we have the most probable state generation algorithms of [13, 28, 33], mentioned in Section 4.3.2. The “routers” at the bottom of the figure are the routing methods discussed in Section 4.3.4: “iMCF” corresponds to integral multi-commodity flow, “rMCF” to non-integral (“real” or “fractional”) multi-commodity flow, and “USP” to uncapacitated shortest paths. The four-level network model is specified by a set of plain text files, listed in Table 4.1.

[Fig. 4.4 Structure of the core nperf software. Blocks recoverable from the figure: state generation algorithms (YK, Hybrid, GC); the hierarchical network model defining $F = (f_1, \ldots, f_m)$, consisting of reliability level R, component level C, failure map C → G, graph level G, failure map G → D, and demand (traffic) level D; the iMCF, rMCF, USP, and multicast tree routers; and the output measure $F$ with bounds on $\Pr(f_i \le x_i)$.]

Table 4.1 Network model specification files
net.graph            Specifies the network graph (nodes and edges)
net.dmd, net.units   Specify the traffic demands, if the network has a traffic layer
net.comp             Specifies the network components and the C → G failure mapping
net.rel              Lists (MTBF, MTTR) pairs for the modes of the components
net.perf             Parameters for the performance measure(s)

The MTBFs for the components are typically obtained from a combination of manufacturer data and in-house testing. The MTTRs are usually determined by network maintenance policies, except for some special types of repairs, such as a software reboot. (Of course, one always has the freedom to use hypothetical values when performing a “what-if” analysis.)
Uncertainties in the MTBFs and MTTRs may be dealt with by repeating an analysis with different values of MTBFs and/or MTTRs, and nperf has some facilities to ease this task. A more sophisticated alternative is to assign uncertainties (prior probability distributions) to the MTBFs and MTTRs and propagate them to posterior distributions on $\bar{F}$ via a Bayesian analysis. However, this is outside the scope of this chapter.

4.3.6 Related Tools

Performance and reliability analyses of systems are vast areas with many ramifications. At this point there exist a number of tools that are, in one way or another, related to some of what nperf does. Table 4.2 mentions some of the author’s favorites, all in the public domain; the interested reader may pursue them further. Vis-a-vis these tools, the main distinguishing features of nperf are that it is geared toward networks (hierarchical model, routing, restoration), and represents them by large numbers of relatively simple independent (noninteracting) components.

Table 4.2 Publicly-available tools that have some relation to nperf. Web sites valid as of 2009
PTOLEMY       Modeling and design of concurrent, real-time, embedded systems
              http://ptolemy.eecs.berkeley.edu/
TANGRAM II    Computer and communication system modeling
              http://www.land.ufrj.br/tools/tangram2/tangram2.html
MOBIUS        Model-based environment for validation of system reliability, availability, security, and performance
              http://www.mobius.uiuc.edu/
PRISM         Probabilistic model checker
              http://www.prismmodelchecker.org/
TOTEM         Toolbox for Traffic Engineering Methods
              http://totem.run.montefiore.ulg.ac.be/

4.4 Case Studies

We conclude by presenting two case studies that, among other things, illustrate the application of the nperf tool. The first study is on a multicast network for IPTV distribution, and the second involves choosing among a set of topologies for network access.
4.4.1 An IPTV Distribution Network

In this study we analyzed a design for an IPTV distribution network similar to the one discussed in [9], but with 65 nodes distributed across the continental US. These nodes are called VHOs (Video Head Offices), and there is an additional node called an SHO (Super Hub Office), which is the source of all the traffic. The traffic stream from the SHO is sent to the VHOs by multicast (footnote 6): when a node receives a packet, it puts a copy of it on each of its outgoing links. Thus traffic flows on the edges of a multicast tree rooted at the SHO, and each VHO is a node on this tree. The tree forms a sub-network of the provider’s overall network. The multicast sub-network uses two mechanisms to deal with failures:

fast re-route: each edge of the tree has a predefined backup path, which uses edges of the encompassing network that are not on the tree.

tree re-computation: if a tree edge fails, and fast re-route is unable to protect it because the backup path itself has also failed, a new tree is computed. This computation is done by so-called reverse path forwarding: each VHO computes a shortest path from it to the SHO, and the SHO then sends packets along each such path in the reverse direction.

The advantage of fast re-route (FRR) is that it takes much less time than tree re-computation: milliseconds instead of seconds. Given a properly designed FRR capability, an interesting feature of the multicast network from the viewpoint of performability analysis is that it essentially tolerates any single link failure (footnote 7). Therefore, interesting behavior appears only under failures of higher multiplicity. Indeed, it turns out that multiple failures can result in congestion: the backup paths for different links are not necessarily disjoint, so when FRR is used to bypass a whole set of failed links, a particular network link belonging to more than one backup path may receive traffic belonging to more than one flow.
If the link capacity is such that this causes congestion, the congestion will last until the failure is repaired, which may take time on the order of hours. One way to deal with this problem is to compute a new multicast tree after FRR is done, and to begin using this new tree as soon as the computation is complete, as suggested in [9]. This retains the speed advantage of FRR and limits the duration of any congestion to the tree re-computation time.

For this network, performance must be guaranteed for every VHO (worst case), not just overall. So, in the terms of Section 4.3.3, the multicast performability measures are two 65-element vectors, one for loss due to no path and one for loss due to congestion, whose elements are computed on each network state.

We now summarize some of the results of this study. An initial network design, known as design A, was carried out by experienced network designers. Its performance, after normalizing the expectations of the measures by the total traffic and converting the result to time per year (footnote 8), is shown in Fig. 4.5, top. Since this was a well-designed network to begin with, its levels of traffic loss were quite low, better than “five 9s”. Within these low levels, Fig. 4.5 shows that the loss due to no path, the tlnr of (4.4), is dominant for most VHOs, but some of them also exhibit significant loss because of congestion (tlcg).

Footnote 6: Specifically by Protocol Independent Multicast (PIM).
Footnote 7: By “link” here we mean an edge at the graph level of the model of Fig. 4.2.
Footnote 8: For example, a traffic loss of 0.01% of the total translates to 1/10,000 of a year, i.e., about 52 min/year.

[Figure 4.5: two bar charts, each titled “τOSPF = 1 sec, τFRR = 0.05 sec”. Horizontal axis: VHO # (1 to 65). Vertical axis: time / yr. (0 to 2.5). Series: No path, Congestion.]
Fig. 4.5 Expected lost traffic, expressed in time per year, because of no path and congestion in design A (top), and in design C (bottom). These are the tlnr and tlcg defined in (4.4). Design C is A with tuned OSPF weights. For the purposes of comparing the two designs, the time unit of the y-axis is irrelevant

Even though the performability of this network was entirely acceptable, we decided to see if the loss due to congestion could be reduced. A detailed study of the network states generated by nperf that led to congestion in Fig. 4.5, top, revealed that they were double and triple failures. Further, we found that for VHOs 30 to 41 congestion could be practically eliminated by tuning a certain set of OSPF link weights. The result, known as design C, performed as shown in Fig. 4.5, bottom. It can be seen that a lot of congestion-induced losses were eliminated while the loss due to no path remained at the same level throughout, and this was achieved without adding to the cost of the network design at all. See [10] for more details on the subject of reliable IPTV network design.

4.4.2 Access Topology Choices

An issue that arose for a major Internet service provider was that traffic in its network was increasing, but the backbone routers had limited expansion capability (numbers of slots in the chassis). To get around this limitation it was proposed to introduce intermediate aggregation routers in the access part of the network, and the question was how this would affect the reliability of the access. The configuration of the provider’s backbone offices before the introduction of aggregation routers is shown in Fig. 4.6, top, and is referred to as “base”; there is a “local” variant in which all routers are located within a single office, and a “remote” variant in which the routers are in different offices.
In reality there are many access routers connecting to a pair of backbone routers, but showing just one in Fig. 4.6 is enough for our purposes. There were two proposals for introducing the aggregation routers, called the “box” and the “butterfly” designs, shown in Fig. 4.6 middle and bottom. These had local and remote variants as well. Further, there was a premium “diverse” option in the butterfly remote design in which the links between a backbone router and its two aggregation routers were carried on two separate underlying optical transport (DWDM) systems, instead of the same transport (the “common” option). It was clear that the box alternative was cheaper because of fewer links, but what was the reduction in availability relative to the costlier butterfly design? Also, how did either of these options compare with the existing base design? The failure modes of interest in all these designs were network spans, router ports, and software failures or procedural errors; these failure modes are depicted as components in Fig. 4.7. The metric chosen to compare the availabilities of the various designs was the mean time between access disconnections, i.e., situations where the access router A had no path to any backbone router BB. Note that network restoration is immaterial for such events. nperf models for the designs of Fig. 4.6 were constructed; given the metric of interest, the models did not include a traffic layer. Typical values for the reliability attributes of the components were selected as in Chapter 3. At a high level, note that the longer links between the aggregation and backbone routers in the remote designs are less reliable than the corresponding links in the local designs. The results of the study are summarized in Table 4.3. The mean access disconnection times are separated into two categories, of which “hardware” includes the first three types of components listed in Fig. 4.7. 
The most notable result in Table 4.3 is that, irrespective of the architecture, software and procedural errors are by far the dominant cause of access router isolations. These events are the ones that cannot be helped by redundancy. The second most important feature is that, compared to the base case, the introduction of aggregators doubles the risk of access router isolation due to software and procedural errors, again irrespective of the design. With respect to hardware failures in the local case, the box design increases the risk of isolation by a factor of 3 compared to the base case, but the butterfly design is just as good as the base. In the remote case, the box design is about twice as bad as the base, but the butterfly is in fact better, by at least a factor of 2.75.

[Figure 4.6: topology diagrams. Base (local, remote): access router A connected to backbone routers BB1 and BB2. Box (local, remote): A connected to aggregation routers AG1 and AG2, which connect to BB1 and BB2. Butterfly (local, remote diverse, remote common): A connected to AG1 and AG2, each of which connects to both BB1 and BB2.]
Fig. 4.6 “Base”, “box”, and “butterfly” access configurations. Each has a “local” and a “remote” version. The remote versions have routers spread among different offices (the enclosing blue boxes). BB are backbone routers, AG are aggregation routers, and A is an access router

[Figure 4.7: component diagrams for the “base” and “butterfly remote common” topologies. Legend: network equipment (DWDM) span; router port (module) pair; router port (module); software failure or procedural error.]
Fig. 4.7 Components for the simplest “base” and most complex “butterfly remote common” topologies. A component affects the edges or nodes which it overlaps in the diagram (the connection to the Z router is fictitious, representing the part of the network beyond the backbone routers, which is common to all alternatives)

Table 4.3 Mean access disconnection time (years), i.e., time between disconnections of access router A from both backbone routers BB, for the access topologies of Fig. 4.6
Variant   Design              Hardware   Software & procedural error
Local     Base                700        10
Local     Box                 232        5
Local     Butterfly           699        5
Remote    Base                120        10
Remote    Box                 61         5
Remote    Butterfly diverse   676        5
Remote    Butterfly common    329        5

Summarizing availability by reporting only means makes comparisons easy, but hides information that is important in assessing the risk. Making the reasonable assumption that the isolation events form a Poisson process with mean inter-event times as specified in Table 4.3, the 5-year mean corresponds to a rate of λ = 0.2 events per year. In a single year, exactly one isolation event then occurs with probability $\lambda e^{-\lambda} \approx 16\%$ and exactly two events with probability $(\lambda^2/2)\, e^{-\lambda} \approx 2\%$.

4.4.3 Other Studies

Besides what was presented above, nperf has been used in a variety of other studies: the performability of a backbone network under two different types of routing was analyzed in [3], the performability of a multimedia distribution network that tolerates any single link failure was studied in [9, 10], two-layer IP-over-SONET restoration in a satellite network was investigated in [25], and techniques for setting thresholds for bundled links in an IP backbone network were studied in [26].

4.5 Conclusion

This chapter presents an overview of analyzing the combined performance and reliability, known as performability, of networks. Performability analysis may be thought of as repeating a performance analysis in many different states (failures or degradations) of the network, and is thus much more difficult than either reliability or performance analysis on its own. Successful analysis rests on finding a point on the reliability–performance spectrum appropriate to the problem at hand.
Our particular approach to network performability analysis is based on a four-level hierarchical network model and on the nperf software tool. The tool embodies a number of methods known in the literature together with some new techniques developed by us, and is under active development at AT&T Labs Research (finite-time measures, quality-of-service additions to the traffic layer, etc.). We illustrated the ideas of analyzing performability by two case studies carried out with nperf and gave references to other studies in the literature.

References

1. Aven, T., & Jensen, U. (1999). Stochastic models in reliability. New York: Springer.
2. Ahuja, R., Magnanti, T., & Orlin, J. (1998). Network flows. Englewood Cliffs, NJ: Prentice-Hall.
3. Agrawal, G., Oikonomou, K. N., & Sinha, R. K. (2007). Network performability evaluation for different routing schemes. Proceedings of the OFC. Anaheim, CA.
4. Alvarez, G., Uysal, M., & Merchant, A. (2001). Efficient verification of performability guarantees. In PMCCS-5: The fifth international workshop on performability modelling of computer and communication systems. Erlangen, Germany.
5. Bolch, G., Greiner, S., de Meer, H., & Trivedi, K. S. (2006). Queueing networks and Markov chains. New Jersey: Wiley.
6. Bremaud, P. (2008). Markov chains, Gibbs fields, Monte Carlo simulation, and queues. New York: Springer.
7. Carlier, J., Li, Y., & Lutton, J. (1997). Reliability evaluation of large telecommunication networks. Discrete Applied Mathematics, 76(1–3), 61–80.
8. Colbourn, C. J. (1999). Reliability issues in telecommunications network planning. In B. Sansó (Ed.), Telecommunications network planning. Boston: Kluwer.
9. Doverspike, R. D., Li, G., Oikonomou, K. N., Ramakrishnan, K. K., & Wang, D. (2007). IP backbone design for multimedia distribution: architecture and performance. In Proceedings of the IEEE INFOCOM, Alaska.
10. Doverspike, R. D., Li, G., Oikonomou, K. N., Ramakrishnan, K. K., Sinha, R. K., Wang, D., & Chase, C. (2009).
Designing a reliable IPTV network. IEEE Internet Computing, 13(3), 15–22.
11. de Souza e Silva, E., & Gail, R. (2000). Transient solutions for Markov chains. In W. K. Grassmann (Ed.), Computational probability. Boston: Kluwer.
12. Gomes, T. M. S., & Craveirinha, J. M. F. (1997). A case study of reliability analysis of a multiexchange telecommunication network. In C. G. Soares (Ed.), Advances in safety and reliability. Elsevier Science.
13. Gomes, T. M. S., & Craveirinha, J. M. F. (April 1998). Algorithm for sequential generation of states in failure-prone communication network. IEE Proceedings-Communications, 145(2).
14. Gomes, T., Craveirinha, J., & Martins, L. (2002). An efficient algorithm for sequential generation of failures in a network with multi-mode components. Reliability Engineering & System Safety, 77, 111–119.
15. Garey, M., & Johnson, D. (1978). Computers and intractability: a guide to the theory of NP-completeness. San Francisco, CA: Freeman.
16. Harms, D. D., Kraetzl, M., Colbourn, C. C., & Devitt, J. S. (1995). Network reliability: experiments with a symbolic algebra environment. Boca Raton, FL: CRC Press.
17. Jarvis, J. P., & Shier, D. R. (1996). An improved algorithm for approximating the performance of stochastic flow networks. INFORMS Journal on Computing, 8(4).
18. Karger, D. (1995). A randomized fully polynomial time approximation scheme for the all-terminal network reliability problem. In Proceedings of the 27th ACM STOC.
19. Levendovszky, J., Jereb, L., Elek, Zs., & Vesztergombi, Gy. (2002). Adaptive statistical algorithms in network reliability analysis. Performance Evaluation, 48(1–4), 225–236.
20. Li, V. K., & Silvester, J. A. (1984). Performance analysis of networks with unreliable components. IEEE Transactions on Communications, 32, 1105–1110.
21. Levy, Y., & Wirth, P. E. (1989). A unifying approach to performance and reliability objectives. In Teletraffic science for new cost-effective systems, networks and services, ITC-12.
Elsevier Science.
22. Mendiratta, V. B. (2001). A hierarchical modelling approach for analyzing the performability of a telecommunications system. In PMCCS-5: the fifth international workshop on performability modelling of computer and communication systems.
23. Meyer, J. F. (1995). Performability evaluation: where it is and what lies ahead. In First IEEE computer performance and dependability symposium (IPDS), pp. 334–343. Erlangen, Germany.
24. Moy, J. T. (1998). OSPF: anatomy of an internet routing protocol. Reading, MA: Addison Wesley.
25. Oikonomou, K. N., Ramakrishnan, K. K., Doverspike, R. D., Chiu, A., Martinez Heath, M., & Sinha, R. K. (2007). Performability analysis of multi-layer restoration in a satellite network. Managing traffic performance in converged networks, ITC 20 (LNCS 4516). Springer.
26. Oikonomou, K. N., & Sinha, R. K. (2008). Techniques for probabilistic multi-layer network analysis. In Proceedings of the IEEE Globecom, New Orleans.
27. Oikonomou, K. N., & Sinha, R. K. (February 2009). Improved bounds for performability evaluation algorithms using state generation. Performance Evaluation, 66(2).
28. Oikonomou, K. N., Sinha, R. K., & Doverspike, R. D. (2009). Multi-layer network performance and reliability analysis. The International Journal of Interdisciplinary Telecommunications & Networking (IJITN), 1(3).
29. Pióro, M., & Medhi, D. (2004). Routing, flow, and capacity design in communication and computer networks. Morgan-Kaufmann.
30. Rauzy, A. (2005). An m log m algorithm to compute the most probable configurations of a system with multi-mode independent components. IEEE Transactions on Reliability, 54(1), 156–158.
31. Shier, D. R., Bibelnieks, E., Jarvis, J. P., & Lakin, R. J. (1990). Algorithms for approximating the performance of multimode systems. In Proceedings of IEEE Infocom.
32. Shier, D. R. (1991). Network reliability and algebraic structures. Oxford: Clarendon.
33. Yang, C. L., & Kubat, P.
(1990). An algorithm for network reliability bounds. ORSA Journal on Computing, 2(4), 336–345.

Chapter 5 Robust Network Planning

Matthew Roughan

M. Roughan, School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia. e-mail: matthew.roughan@adelaide.edu.au
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_5, © Springer-Verlag London Limited 2010

5.1 Introduction

Building a network encompasses many tasks: from network planning to hardware installation and configuration, to ongoing maintenance. In this chapter, we focus on the process of network planning. It is possible (though not always wise) to design a small network by eye, but automated techniques are needed for the design of large networks. The complexity of such networks means that any “ad hoc” design will suffer from unacceptable performance, reliability, and/or cost penalties.

Network planning involves a series of quantitative tasks: measuring the current network traffic and the network itself; predicting future network demands; determining the optimal allocation of resources to meet a set of goals; and validating the implementation. A simple example is capacity planning: deciding the future capacities of links in order to carry forecast traffic loads, while minimizing the network cost. Other examples include traffic engineering (balancing loads across our existing network) and choosing the locations of Points-of-Presence (PoPs), though we do not consider this latter problem in detail in this chapter because it depends on economic and demographic concerns rather than those of networking.

Many academic papers about these topics focus on individual components of network planning: for instance, how to make appropriate measurements, or particular optimization algorithms. In contrast, in this chapter we will take a system view. We will present each part as a component of a larger system of network planning. In the process of describing how the various components of network planning interrelate, we observe several recurring themes:

1. Internet measurements are of varying quality. They are often imperfect or incomplete and can contain errors or ambiguities. Measurements should not be taken at face value, but need to be continually recalibrated [48], so that we have some understanding of the errors, and can take them into account in subsequent processing. We will describe common measurement strategies in Section 5.2.

2. Analysis and modeling of data can allow us to estimate and predict otherwise unmeasurable quantities. However, in the words of Box and Draper, “Essentially, all models are wrong, but some are useful” [9]. We must be continually concerned with the quality of model-based predictions. In particular, we must consider where they apply, and the consequences of using an inaccurate model. A number of key traffic models are described in Section 5.3, and their use in prediction is described in Section 5.4.

3. Decisions based on quantitative data are at best as good as their input data, but can be worse. The quality of input data and resulting predictions are variable, and this can have consequences for the type of planning processes we can apply. Numerical techniques that are sensitive to such errors are not suitable for network engineering. Discussion of robust, quantitative network engineering is the main consideration of Sections 5.5 and 5.6.

Noting all of the above, it should not be surprising that a robust design process requires validation. The strategy of “set and forget” is not viable in today’s rapidly changing networking environment. The errors in initial measurements and predictions, and the possibility of mistakes in deployment, mean that we need to test whether the implementation of our plan has achieved our goals.
Moreover, actions taken at one level of operations may impact others. For example, Qiu et al. [51] noted that attempts to balance network loads by changing routing can cause higher-layer adaptive mechanisms such as overlay networks to change their decisions. These higher-level changes alter traffic, leading to a change of the circumstances that originally led us to reroute traffic.

Thus, the process of measure → analyze/predict → control → validate should not stop. Once we complete this process, the cycle begins again, with our validation measurements feeding back into the process as the input for the next round of network planning, as illustrated in Fig. 5.1. This cycle allows our planning process to correct problems, leading to a robust process. In many ways this resembles the more formal feedback used in control systems, though robust planning involves a range of tasks not typically modeled in formal control theory. For instance, the lead times for deploying network components such as new routers are still quite long. It can take months to install, configure, and test new equipment when done methodically. Even customers ordering access facilities can experience relatively long intervals from order to delivery, despite the obvious benefits to both parties of a quick startup. So if our network plan is incorrect, we cannot wait for the planning cycle to complete to redress the problem. We need processes where the cycle time is shorter.

[Figure 5.1: cycle diagram with three stages: measurement, analysis/prediction, and decision/control.]
Fig. 5.1 Robust network planning is cyclic

It is relatively simple to reroute traffic across a network. It usually requires only small changes to router configurations, and so can be done from day to day (or even faster if automated). Rebalancing traffic load in the short term, in the interim before the network capacities can be physically changed, can alleviate congestion caused by failures of traffic predictions.
This process is called traffic engineering. Another aspect of robust planning is incorporation of reliability analysis. Internet switches and routers fail from time to time, and must sometimes be removed from service for maintenance. The links connecting routers are also susceptible to failures, given their vulnerability to natural or man-made accident (the canonical example is the careless back-hoe driver). Most network managers plan for the possibility of node or link failures by including redundant routers and links in their network. A network failure typically results in traffic being rerouted using these redundant pathways. Often, however, network engineers do not plan for overloads that might occur as a result of the rerouted traffic. Again, we need a robust planning process that takes into account the potential failure loads. We call this approach network reliability analysis. We organize this chapter around the key steps in network planning. We first consider the standard network measurements that are available today. Their characteristics determine much of what we can accomplish in network planning. We then consider models and predictions, and then finally the processes used in making decisions, and controlling our network. As noted, robust planning does not stop there, we must continue to monitor our network, but there are a number of additional steps we can perform in order to achieve a robust network plan and we consider them in the final section of this chapter. The focus of this chapter is backbone networks. Though many of the techniques described here remain applicable to access networks, there are a number of critical differences. For instance, access network traffic is often very bursty, and this affects the approaches we should adopt for prediction and capacity planning. Nevertheless, the fundamental ideas of robust planning that we discuss here remain valid. 
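The idea of planning for failure loads, mentioned above under network reliability analysis, can be sketched in a few lines. The example below is a hypothetical illustration rather than a method prescribed in this chapter: it assumes shortest-path routing on static link weights with no splitting over equal-cost paths, considers only single-link failures, and uses an invented four-node topology with invented weights, capacities, and demands.

```python
import heapq

# Hypothetical 4-node network: (u, v) -> (routing weight, capacity in Gbps).
LINKS = {
    ("A", "B"): (1, 10.0),
    ("B", "C"): (1, 10.0),
    ("A", "D"): (2, 10.0),
    ("D", "C"): (2, 10.0),
}
# Hypothetical demands: (source, destination) -> volume in Gbps.
DEMANDS = {("A", "C"): 6.0, ("D", "C"): 5.0}


def shortest_path(links, src, dst):
    """Dijkstra over undirected links; returns the links used, or None."""
    adj = {}
    for (u, v), (w, _) in links.items():
        adj.setdefault(u, []).append((v, w, (u, v)))
        adj.setdefault(v, []).append((u, w, (u, v)))
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w, link in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                prev[v] = (u, link)
                heapq.heappush(heap, (d + w, v))
    if dst not in dist:
        return None
    path, node = [], dst
    while node != src:
        node, link = prev[node]
        path.append(link)
    return path


def link_loads(links, demands):
    """Route every demand on its shortest path and sum the load per link."""
    loads = {link: 0.0 for link in links}
    for (s, d), vol in demands.items():
        path = shortest_path(links, s, d)
        if path is None:
            continue  # demand is disconnected in this failure state
        for link in path:
            loads[link] += vol
    return loads


def overloads_under_single_failures(links, demands):
    """Map each failed link to the surviving links loaded above capacity."""
    result = {}
    for failed in links:
        surviving = {l: wc for l, wc in links.items() if l != failed}
        loads = link_loads(surviving, demands)
        result[failed] = {l: load for l, load in loads.items()
                          if load > surviving[l][1]}
    return result


for failed, over in overloads_under_single_failures(LINKS, DEMANDS).items():
    print("fail", failed, "-> overloaded:", over)
```

In this toy network no link is overloaded in normal operation, yet some single failures push rerouted traffic onto links that cannot carry it, which is the situation a robust plan tries to anticipate.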
5.2 Standard Network Measurements

Internet measurements are considered in more detail in Chapters 10 and 11, but a significant factor in network planning is the type of measurements available, and so we need some planning-specific discussions. In principle, it is possible to collect extremely good data, but in practice the measurements are often flawed, and the nature of the flaws is important when considering how to use the data.

The traffic data we might like to collect is a packet trace, consisting of a record of all packets on a subsection of a network along with timestamps. There are various mechanisms for collecting such a trace, for instance, placing a splitter into an optical fiber, using a monitor port on a router, or simply running tcpdump on one of the hosts on a shared network segment. A packet trace gives us all of the information we could possibly need but is prohibitively expensive at the scale we require for planning. The problem with a packet trace (apart from the cost of installing dedicated devices) is that the amount of data involved can be enormous; for example, on an OC48 (2.5 Gbps) link, one might collect more than a terabyte of data per hour. More importantly, a packet trace is overkill. For planning we do not need such detail, but we do need good coverage of the whole network. Packet traces are only used on lower speed networks, or for specific studies of larger networks.

There are several approaches we can use to reduce data to a more manageable amount. Filtering, so that we view only a segment of the traffic (say the HTTP traffic), is useful for some tasks, but not planning. A more useful approach is aggregation, where we only store records for some aggregated version of the traffic, thereby reducing the number of such records needed. A common form of aggregation is at the flow level, where we aggregate the traffic through some common characteristics. The definition of “flow” depends on the keys used for aggregation, but we mean here flows aggregated by the five-tuple formed from IP source and destination address, TCP port numbers, and protocol number. Flow data is typically collected within some time frame, for instance, 15 min periods. What is more, flow-level collection is often a feature of a router, and so does not require additional measurement infrastructure other than the Network Management Station (NMS) at which the data is stored. However, the volume of data can still be large (one network under study collected 500 GB of data per day), and the collection process may impact the performance of the router. As a result, flow-level data is often collected in conjunction with a third method for data reduction: sampling.

Sampling can be used both before the flows are created, and afterward. Prior to flow aggregation, sampling is used at rates of around 1:100–1:500 packets. That is, 1% of packets or fewer are sampled. This has the advantage that less processing is required to construct flow records (reducing the load on the router collecting the flows) and typically fewer flow records will be created (reducing memory and data transmission requirements). However, sampling prior to flow aggregation does have flaws; most obviously, it biases the data collection toward long flows. These flows (involving many packets) are much more likely to be sampled than short flows. However, this has rarely been seen as a problem in network planning, where we are not typically concerned with the flow length distribution. Sampling can also be used after flow aggregation to reduce the transmission and storage requirements for such data. The degree of sampling depends on the desired trade-off between accuracy of measurements and storage requirements for the data.
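The bias toward long flows can be quantified with a short calculation. The sketch below assumes, purely for illustration, that each packet is sampled independently with probability p (routers may instead use deterministic 1-in-N sampling); under that assumption a flow of n packets appears in the flow records with probability 1 − (1 − p)^n. The function names are hypothetical.

```python
def detection_probability(n_packets: int, sampling_rate: float) -> float:
    """Chance that at least one packet of an n-packet flow is sampled,
    assuming each packet is sampled independently at the given rate."""
    return 1.0 - (1.0 - sampling_rate) ** n_packets


def estimate_bytes(sampled_bytes: float, sampling_rate: float) -> float:
    """Scale a sampled byte count back up to an estimate of the true volume."""
    return sampled_bytes / sampling_rate


rate = 1 / 100  # 1:100 packet sampling
for n in (1, 10, 100, 1000):
    # Short flows are rarely seen at all; long flows are almost always seen.
    print(n, detection_probability(n, rate))
```

Dividing the sampled volume by the sampling rate, as in `estimate_bytes`, recovers total traffic volumes on average, which is the quantity that matters most for planning; it does not recover the flow-length distribution.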
Good statistical approaches for this sampling, and for estimating the resulting accuracy of the samples, are available [16, 17], though, as noted above, these are predominantly aimed at preserving details such as flow-length distributions, which are largely inconsequential for the type of planning discussed here, so sampling prior to flow construction is often sufficient for planning. Of more importance here is the fact that any type of sampling introduces errors into measurements. Any large-scale flow archive must involve significant sampling, and so will contain errors.

An alternative to flow-level data is data collected via the Simple Network Management Protocol (SNMP) [39]. Its advantage over flow-level data collection is that it is more widely supported, and less vendor specific. However, the data provided is less detailed. SNMP allows an NMS to poll MIBs (Management Information Bases) at routers. Routers maintain a number of counters in these MIBs. The widely supported MIB-II contains counters of the number of packets and bytes transmitted and received at each interface of a router. In effect, we can see the traffic on each link of a network. In contrast to flow-level data, SNMP can only see link volumes, not where the traffic is going.

SNMP has a number of other issues with regard to data collection. The polling mechanism typically uses UDP (the User Datagram Protocol), and SNMP agents are given low priority at routers. Hence SNMP measurements are not reliable, and it is difficult to ensure that we obtain uniformly sampled time series. The result is missing and error-prone data.

Flow-level data contains only flow start and stop times, not details of packet arrivals, and typically SNMP is collected at 5-min intervals. The limit on timescale of both data sets is important in network planning. We can only see average traffic rates over these periods, not the variations inside these intervals.
However, congestion and subsequent packet loss often occur on much shorter timescales. The result is that such average measurements must always be used with care. Typically some overbuild of capacity is required to account for the sub-interval variations in traffic. The exact overbuild will depend on the network in question, and has typically been derived empirically through ongoing performance and traffic measurements. Values are usually fairly conservative in major backbones, resulting in apparent underutilization (though this term is unfair, as it concerns average utilizations, not peak loads), and more aggressive in smaller networks.

In addition to traffic data, network planning requires a detailed view of any existing network. We need to know

• The (layer 3) topology (the locations of, and the links between, routers)
• The network routing policies (for instance, link weights in a shortest-path protocol, areas in protocols such as OSPF, and BGP policies where multiple interdomain links exist)
• The mapping between current layer 3 links and physical facilities (WDM equipment and optical fibers), and the details of the available physical network facilities and their associated costs

The topology and routing data is principally needed to allow us to map traffic to links. The mapping is usually expressed through the routing matrix. Formally, $A = \{A_{ir}\}$ is the matrix defined by

$$ A_{ir} = \begin{cases} F_{ir}, & \text{if traffic for } r \text{ traverses link } i, \\ 0, & \text{otherwise,} \end{cases} \tag{5.1} $$

where $F_{ir}$ is the fraction of traffic from source/destination pair $r = (s, d)$ that traverses link $i$. A network with $N$ nodes and $L$ links will have an $L \times N(N-1)$ routing matrix.

Network data is also used to assess how changes in one component will affect the network (e.g., how changes in OSPF link weights will impact link loads); to determine shared risk-of-failure between links; and to determine how to improve our network incrementally without completely rebuilding it in each planning cycle.
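The routing matrix of (5.1) can be illustrated on a toy network. The sketch below is only an illustration under stated assumptions: the three-node line topology, the single-path routes (so that every nonzero fraction is 1), the demand volumes, and the function names `routing_matrix` and `link_loads` are all made up for the example.

```python
from itertools import permutations

# Hypothetical 3-node line topology a - b - c with directed links.
nodes = ["a", "b", "c"]
links = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]
pairs = list(permutations(nodes, 2))  # all N(N-1) ordered (s, d) pairs

# Assumed single-path routing: the links traversed by each pair's traffic.
paths = {
    ("a", "b"): [("a", "b")],
    ("a", "c"): [("a", "b"), ("b", "c")],
    ("b", "a"): [("b", "a")],
    ("b", "c"): [("b", "c")],
    ("c", "a"): [("c", "b"), ("b", "a")],
    ("c", "b"): [("c", "b")],
}


def routing_matrix(links, pairs, paths):
    """Build the L x N(N-1) matrix A, where A[i][r] is the fraction of
    pair r's traffic carried on link i."""
    A = [[0.0] * len(pairs) for _ in links]
    for r, pair in enumerate(pairs):
        for link in paths[pair]:
            A[links.index(link)][r] = 1.0  # single path, so the fraction is 1
    return A


def link_loads(A, demands):
    """Map a demand vector (ordered like `pairs`) to link loads, y = A x."""
    return [sum(a * x for a, x in zip(row, demands)) for row in A]


A = routing_matrix(links, pairs, paths)
demands = [10.0, 5.0, 8.0, 3.0, 2.0, 4.0]  # made-up volumes, one per pair
loads = dict(zip(links, link_loads(A, demands)))
print(loads)
```

With multipath routing the entries of `A` would be fractions between 0 and 1 rather than 0 or 1; the multiplication that maps demands to link loads is unchanged.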
Incremental improvement is an important point because, although it might be preferable to rebuild a network from scratch, the capital value of legacy equipment usually prevents this option, except at rare intervals.

For a small, static network, the network data may be maintained in a database; however, best practice for large, complex, or dynamic networks is to use tools to extract the network structure directly from the network. There are several methods available for discovering this information. SNMP can provide this information through the use of various vendor tools (e.g., HP OpenView or Cisco NCM), but it is not the most efficient approach. A preferable approach for finding layer 3 information is to parse the configuration files of routers directly, for instance, as described in [22,24]. The technique has been applied in a number of networks [5,38]. The advantages of using configuration files are manifold. The detail of information available is unparalleled in other data sources. For instance, we can see details of the links (such as their composition, should a single logical link be composed of more than one physical link).

The other major approach for garnering topology and routing information is to use a route monitor. Internet routing is built on top of distributed computations supported by routing protocols. The distribution of these protocols is often considered a critical component in ensuring reliability of the protocols in the face of network failures. The distribution also introduces a hook for topology discovery. Because every router must be able to build its routing table from the routing information distributed through these protocols, that information must reveal a considerable amount about the network topology. Hence, we can place a dummy router into the network to collect such information. Such routing monitors have been deployed widely over the last few years. Their advantage is that they can provide an up-to-date dynamic view.
Examples of such monitors exist for OSPF [61, 62] and IS-IS [1, 30], as well as for BGP (the Border Gateway Protocol) [2, 3].

5.3 Analysis and Modeling of Internet Traffic

5.3.1 Traffic Matrices

We will now consider the analysis and modeling of Internet data, in particular, traffic data. When considering inputs to network planning, we frequently return to the topic of traffic matrices. These are the measurements needed for many network planning tasks, and thus the natural structure around which we shall frame our analysis.

A Traffic Matrix (TM) describes the amount of traffic (the number of packets or, more commonly, bytes) transmitted from one point in a network to another during some time interval. TMs are naturally represented by a three-dimensional data structure $T_t(i, j)$, which represents the traffic volume (in bytes or packets) from $i$ to $j$ during the time interval $[t, t + \Delta t)$. The locations $i$ and $j$ are generally considered to be physical geographic locations, making $i$ and $j$ spatial variables. However, in the Internet, it is common to associate $i$ and $j$ with logical structures related to the address structure of the Internet, i.e., IP addresses, or natural groupings of such by a common prefix corresponding to a subnet.

Origin/Destination Matrices

One natural approach to describing traffic matrices is with respect to traffic volumes between IP addresses or prefixes. We refer to this as an origin/destination TM because the IP addresses represent the closest approximation we have for the end points of the network (though HTTP proxies, firewalls, NAT, and other middle-boxes may obscure the true end-to-end semantics). IPv4 admits nearly $2^{32}$ potential addresses, so we cannot describe the full matrix at this level of granularity. Typically, such a traffic matrix would be aggregated into blocks of IP addresses (often using routing prefixes to form the blocks, as these are natural units for the control of traffic).
The origin/destination matrix is our ideal input for many network planning tasks, but the Internet is made up of many connected networks, and any one network operator sees only the traffic carried by its own network. This reduced visibility means that our observed traffic matrix is only a segment of the real network traffic, so we cannot really observe the origin/destination TM. Instead we typically observe the ingress/egress traffic matrix.

Ingress/Egress versus Origin/Destination

A more practical TM, the ingress/egress TM, provides traffic volumes from ingress link to egress link across a single network. Note that networks often interconnect at multiple points. The choice of which route to use for egress from a network can profoundly change the nature of ingress/egress TMs, so these may have quite different properties from the origin/destination matrix. Forming an ingress/egress TM from an origin/destination TM involves, in principle, a simple mapping of prefixes to ingress/egress locations in a network, but in practice this mapping can be difficult unless we monitor traffic as it enters the network. We can infer egress points of traffic using the routing data described above, but inferring ingress is more difficult [22, 23], so it is better to measure this directly.

Spatial Granularity of Traffic Matrices

As we have started to see with origin/destination traffic matrices, we can measure them at various levels of granularity (or resolution). The same is true of ingress/egress TMs. At the finest level, we measure traffic per ingress/egress link (or interface). However, it is common to aggregate this data to the ingress/egress router. We can often group routers into larger subgroups. A common such group is a Point-of-Presence (PoP), though there are other sub- and super-groupings (e.g., topologically equivalent edge routers are sometimes grouped, or we may form a regional group).
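Aggregating a router-level TM to a coarser grouping such as PoPs amounts to summing the entries whose endpoints fall in each pair of groups. The following is a minimal sketch; the router names, the router-to-PoP mapping, the volumes, and the function name `aggregate_tm` are hypothetical.

```python
def aggregate_tm(tm, group_of):
    """Aggregate a traffic matrix to coarser groups.

    tm       -- dict mapping (ingress, egress) location pairs to volumes
    group_of -- dict mapping each location to its group (e.g., router -> PoP)
    Returns a dict mapping (ingress group, egress group) to summed volume.
    """
    aggregated = {}
    for (i, j), volume in tm.items():
        key = (group_of[i], group_of[j])
        aggregated[key] = aggregated.get(key, 0.0) + volume
    return aggregated


# Made-up example: three routers in two PoPs.
group_of = {"r1": "PoP-A", "r2": "PoP-A", "r3": "PoP-B"}
router_tm = {
    ("r1", "r3"): 4.0,
    ("r2", "r3"): 6.0,
    ("r3", "r1"): 1.0,
    ("r1", "r2"): 2.0,  # stays inside PoP-A after aggregation
}
pop_tm = aggregate_tm(router_tm, group_of)
print(pop_tm)
```

Note that traffic between routers in the same PoP appears on the diagonal of the aggregated matrix; whether to keep or discard it depends on the planning task.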
Given subsets $S$ and $D$ of locations, we may simply aggregate a TM across these by taking

$$ T_t(S, D) = \sum_{i \in S} \sum_{j \in D} T_t(i, j). \tag{5.2} $$

Typical large networks might have 10s of PoPs and 100s of routers, and so such TMs are of a more workable size. In addition, as we aggregate traffic into larger groupings, statistical multiplexing reduces the relative variance of the traffic and allows us to make better estimates of traffic properties such as the mean and variance.

Temporal Granularity of Traffic Matrices

We cannot make instantaneous measurements of a traffic matrix. All such observations occur over some time interval $[t, t + \Delta t)$. It would be useful to make the interval $\Delta t$ smaller (for instance, for detecting anomalies), but typically we face a trade-off against the errors and uncertainties in our measurements. A longer time interval allows more “averaging-out” of errors, and minimizes the impact of missing data. The best choice of time interval for TMs is typically determined by the task at hand and the network under study, but a common choice is a 1-hour interval. In addition to being easily understood by human operators, this interval integrates enough SNMP or flow-level data to reduce the impact of (typical) missing data and errors, while still allowing us to observe important diurnal patterns in the traffic.

5.3.2 Patterns in Traffic

It is useful to have some understanding of the typical patterns we see in network traffic. Such patterns are only visible at a reasonable level of aggregation (otherwise random temporal variation dominates our view of the traffic), but for high degrees of aggregation (such as router-to-router traffic matrices on a large backbone network) the pattern can be very regular. There are two main types of patterns that have been observed: patterns across time, and patterns in the spatial structure. Each is discussed below.

Temporal Patterns

Internet traffic has been observed to follow both daily (diurnal) and weekly cycles [33–35,57,64].
The origin of these cycles is quite intuitive: they arise because most Internet traffic is currently generated by humans whose activities follow such cycles. Typical examples are shown in Figs. 5.2 and 5.3. Figure 5.2 shows an RRDTool graph (see footnote 1) of the traffic on a link of the Australian Academic Research Network (AARNet). Figure 5.3 shows the total traffic entering AT&T’s North American backbone network at a Point of Presence (PoP) over two consecutive weeks in May 2001.

Footnote 1: RRDTool (the Round Robin Database tool) [47] and its predecessor MRTG (the Multi-Router Traffic Grapher) [46] are perhaps the most common tools for collecting and displaying SNMP traffic data.

[Fig. 5.2 Traffic on one link in the Australian Academic Research Network (AARNet) for just over 1 week. The two curves show traffic in either direction along the link. Axes: bits per second against day of the week.]

[Fig. 5.3 Total traffic into a region over 2 consecutive weeks. The solid line is the first week’s data (starting on May 7), and the dashed line shows the second week’s data. The second figure zooms in on the shaded region of the first. Panels: “Traffic: 07-May-2001 (GMT)” and “Traffic: 08-May-2001 (GMT)”. Axes: traffic rate against time (GMT).]

The figure illustrates the daily and weekly variations in the traffic by overlaying the traffic from the 2 weeks. The striking similarity between traffic patterns from week to week is a reflection of the high level of aggregation that we see in a major backbone network.

The observation of cycles in traffic is not new. For many years they have been seen in telephony [13]. Typically telephone service capacity planning has been based on a “busy hour”, i.e., the hour of the day that has the highest traffic.
The time of the busy hour depends on the application and customer base. Access networks typically have many domestic consumers, and consequently their busy hour is in the evening when people are at home. On the other hand, the busy hour of business customers is typically during the day. Obviously, time zones have an effect on the structure of the diurnal cycle in traffic, and so networks with a wide geographic dispersion may experience different busy hours on different parts of their network.

In addition to cyclical patterns, Internet traffic has shown strong growth over many years [45]. This long-term trend has often been approximated by exponential growth, although care must be taken because sometimes such estimates have been based on poor (short or erratic) data [45]. Long-term trends should be estimated from multiple years of carefully collected data.

[Fig. 5.4 ABS traffic measurements showing Australian Internet traffic, with an exponential fit to the data from 2000 to 2005. Data is shown by ‘o’, and the fit by the straight line. Note that the line continuing past 2005 is a prediction based on the pre-2005 data, showing also the 95th percentile confidence bounds for the predictions. Axes: traffic (PB/quarter, logarithmic scale) against year, 2000–2009.]

One public example is the data collected by the Australian Bureau of Statistics (ABS; see footnote 2), which has collected historical data on Australian ISP traffic for many years. Figure 5.4 shows Australia’s network traffic in petabytes per quarter on a logarithmic y-axis. Exponential growth appears as a straight line on such a graph, so we can obtain simple predictions of traffic growth through linear regression. The figure shows such a prediction based on pre-2005 data. It is interesting to note that the most recent data point does not, as one might assume without analysis, represent a significant drop in traffic growth.
Relative to the long-term data, the last point simply represents a reversion to the long-term growth from rather exceptional traffic volumes over 2007. We will discuss such prediction in more detail in the following sections.

Footnote 2: www.abs.gov.au

Standard time-series analysis [10] can be used to build a model of traffic containing long-term trends, cyclical components (often called seasonal components in other contexts), and random fluctuations. We will use the following notation here:

$$ S(t) = \text{seasonal (cyclical) component}, \tag{5.3} $$

$$ L(t) = \text{long-term trend}, \tag{5.4} $$

$$ W(t) = \text{random fluctuations}. \tag{5.5} $$

The seasonal component is periodic, i.e., $S(t + kT_S) = S(t)$ for all integers $k$, where $T_S$ is the period (which is either 24 hours or 1 week). Before we can consider how to estimate the seasonal (and trend) components of the traffic, we must model these components (see footnote 3). At the most basic level, consider the traffic to consist of two components, a time-varying (but deterministic) mean $m(t)$ and a stochastic component $W(t)$. At this level we could construct the traffic by addition or multiplication of these components (both methods are used in econometric and census data). However, in traffic data, a more appropriate model [43, 56] is

$$ x(t) = m(t) + \sqrt{a\,m(t)}\,W(t), \tag{5.6} $$

where $a$ is called the peakedness of the traffic, $W(t)$ is a stochastic process with zero mean and unit variance, and $x(t)$ represents the average rate of some traffic (say a particular traffic matrix element) at time $t$. More highly aggregated traffic is smoother, and consequently would have a smaller value for $a$.

The reason for this choice of model lies in the way network traffic behaves when aggregated. When multiple flows are aggregated onto a non-congested link, we should expect them to obey the same model (though perhaps with different parameters).
Our model has this property: for instance, take $N$ traffic streams $x_i$ with mean $m_i$, peakedness $a_i$, and stochastic components $W_i$, which are independent realizations of a (zero mean, unit variance) Gaussian process. The multiplexed traffic stream is

$$ x = \sum_{i=1}^{N} m_i + \sum_{i=1}^{N} \sqrt{a_i m_i}\, W_i. \tag{5.7} $$

The mean of the new process is $m = \sum_{i=1}^{N} m_i$, and the peakedness (derived from the variance) is $a = \frac{1}{m} \sum_{i=1}^{N} a_i m_i$, which is a weighted average of the component peakednesses. The relative variance becomes

$$ V_x = \mathrm{Var}\{x\} / E\{x\}^2 = \frac{1}{m^2} \sum_{i=1}^{N} a_i m_i. \tag{5.8} $$

If we take identical streams, then the relative variance decreases as we multiplex more together, which is to be expected. The result is that in network traffic the level of aggregation is important in determining the relative variance: more highly aggregated traffic exhibits less random behavior. The data in Fig. 5.3 from AT&T shows an aggregate of a very large number of customers (an entire PoP of one of North America’s largest networks). The consequence is that the traffic is visibly very smooth. In contrast, the traffic shown in Fig. 5.2 is much less aggregated, and shows more random fluctuations.

The model described above is not perfect (none are), but it is useful because (i) it allows us to calculate variances for aggregated traffic streams in a consistent way and to use these when planning our network, and (ii) its parameters are relatively easy to measure, and therefore to use in traffic analysis. To do so, however, we find it useful to split the mean $m(t)$ into the cyclic component (which we denote $S(t)$) and the long-term trend $L(t)$ by taking the product

$$ m(t) = L(t)S(t). \tag{5.9} $$

We combine the two components through a product because as the overall load increases the range of variation in the size of cycles also increases.

Footnote 3: The reader should beware of methods that do not explicitly model the data, because in these methods there is often an implicit model.
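The aggregation property in (5.7) and (5.8) can be checked numerically. The sketch below simply evaluates the formulas for the aggregate mean, peakedness, and relative variance, assuming independent streams as in the text; the function name `multiplex` and the example values are made up.

```python
def multiplex(streams):
    """Combine independent streams that each follow x = m + sqrt(a m) W.

    streams -- list of (mean, peakedness) pairs
    Returns (aggregate mean, aggregate peakedness, relative variance).
    """
    mean = sum(m for m, _ in streams)
    variance = sum(a * m for m, a in streams)  # variances of independent streams add
    peakedness = variance / mean               # mean-weighted average of the a_i
    relative_variance = variance / mean ** 2   # Var{x} / E{x}^2
    return mean, peakedness, relative_variance


# Made-up example: identical streams with mean 100 and peakedness 4.
for n in (1, 10, 100):
    m, a, v = multiplex([(100.0, 4.0)] * n)
    print(f"N={n:3d}: mean={m:8.1f}  peakedness={a:.2f}  relative variance={v:.5f}")
```

For identical streams the peakedness is unchanged by multiplexing while the relative variance falls in proportion to 1/N, matching the observation that more highly aggregated traffic is smoother.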
When estimating parameters of our models, it is important to allow for unusual or anomalous events, for instance, a Denial of Service (DoS) attack. These events are rare (we hope), but it is important to separate them from the normal traffic. Such events can sometimes be very large, but we do not plan network capacity to carry DoS attacks! The network is planned around the paying customers. We separate anomalous events by including an impulsive term, $I(t)$, in the model, so that the complete model is

$$ x(t) = L(t)S(t) + \sqrt{a\,L(t)S(t)}\,W(t) + I(t). \tag{5.10} $$

We will further discuss this model in Section 5.4, where we will consider how to estimate its parameters, and to use it in prediction.

Spatial Patterns

Temporal models are adequate for many applications: for instance, where we consider dimensioning of a single bottleneck link (perhaps in the design of an access network). However, spatial patterns in traffic provide us with additional planning capabilities. For instance, if two traffic sources are active at different times, then clearly we can carry them both with less capacity than if they are active simultaneously.

Spatial patterns refer to the structure of a Traffic Matrix (TM) at a single time interval. It is common that TM elements are strongly correlated because they show similar diurnal (and weekly) patterns. For example, in a typical network (without wide geographic distribution) one will find that the busy hour is almost the same for all elements of the TM, but there is additional structure. For a start, TMs often come from skewed distributions. A common example is where the distribution follows a rough 80–20 law (80% of traffic is generated by the largest 20% of TM elements). Even more skewed distributions are often observed: for instance, 90–10 laws are not uncommon. However, the distribution is not “heavy-tailed”. Observed distributions have shown a lighter tail than the log-normal distribution [55].
Consequently, traffic matrix work often concentrates on these larger flows, but traditional (rather than heavy-tailed) statistical techniques are still applicable.

Another simple feature that one might naively expect of TMs, symmetry, is not present. Internet routing is naturally asymmetric, as is application traffic (a large amount of traffic still follows a client–server model, which results in strongly asymmetric traffic). Hence, the matrix will not (generally) be symmetric [21], i.e., $T(i, j) \neq T(j, i)$.

We observe some additional structure in these matrices. The simplest model that describes some of the observed structure is the gravity model. In network applications, gravity models have been used to model the volume of telephone calls in a network [31]. Gravity models take their name from Newton’s law of gravitation, and are commonly used by social scientists to model the movement of people, goods, or information between geographic areas [49,50,63]. In Newton’s law of gravitation the force is proportional to the product of the masses of the two objects divided by the distance squared. Similarly, in gravity models for interactions between cities, the relative strength of the interaction might be modeled as proportional to the product of the cities’ populations, so a general formulation of a gravity model is given by

$$ T(i, j) = \frac{R_i A_j}{f_{ij}}, \tag{5.11} $$

where $R_i$ represents the repulsive factors that are associated with leaving from $i$; $A_j$ represents the attractive factors that are associated with going to $j$; and $f_{ij}$ is a friction factor from $i$ to $j$. The gravity model was first used in the context of Internet traffic matrices in [67], where we can naturally interpret the repulsion factor $R_i$ as the volume of incoming traffic at location $i$, and the attractivity factor $A_j$ as the outgoing traffic volume at location $j$.
The friction matrix $f_{ij}$ encodes the locality information specific to different source–destination pairs; however, as locality is not as large a factor in Internet traffic as in the transport of physical goods, it is common to assume $f_{ij} = \text{const}$. The resulting gravity model simply states that the traffic exchanged between locations is proportional to the volumes entering and exiting at those locations. Formally, let $T^{in}(i)$ and $T^{out}(j)$ denote the total traffic that enters the network via $i$ and exits via $j$, respectively. The gravity model can then be computed by

$$ T(i, j) = \frac{T^{in}(i)\, T^{out}(j)}{T^{tot}}, \tag{5.12} $$

where $T^{tot}$ is the total traffic across the network. Implicitly, this model relies on a conservation assumption, i.e., traffic is neither created nor destroyed in the network, so that $T^{tot} = \sum_k T^{in}(k) = \sum_k T^{out}(k)$. The assumption may be violated, for instance, when congestion causes packet loss. However, in most backbones congestion is kept low, and so the assumption is reasonable.

In the form just described, the gravity model has distinct limitations. For instance, real traffic matrices may have non-constant $f_{ij}$ (perhaps as a result of different time zones). Moreover, even if an origin/destination traffic matrix matches the gravity model well, the ingress/egress TM may be systematically distorted [7]. Typically, networks use hot-potato routing, i.e., they choose the egress point closest to the ingress point, and this results in a systematic distortion of ingress/egress traffic matrices away from the simple gravity model. These distortions, and others related to the asymmetry of traffic and distance sensitivity, may be incorporated in generalizations of the gravity model where sufficient data exists to measure such deviations [13, 21, 67].

The use of temporal patterns in planning is relatively obvious. The use of spatial patterns such as the gravity model is more subtle.
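The simple gravity model of (5.12) is straightforward to compute from per-location totals. The following is a minimal sketch assuming constant friction and exact conservation of traffic; the location names, the volumes, and the function name `gravity_model` are hypothetical.

```python
import math


def gravity_model(t_in, t_out):
    """Estimate T(i, j) = T_in(i) * T_out(j) / T_tot for every pair (i, j).

    t_in  -- dict: location -> total traffic entering the network there
    t_out -- dict: location -> total traffic leaving the network there
    Assumes constant friction and conservation (the two totals agree).
    """
    total = sum(t_in.values())
    if not math.isclose(total, sum(t_out.values()), rel_tol=1e-6):
        raise ValueError("conservation assumption violated: totals differ")
    return {(i, j): t_in[i] * t_out[j] / total for i in t_in for j in t_out}


# Made-up ingress and egress totals (arbitrary units) for three locations.
t_in = {"A": 60.0, "B": 30.0, "C": 10.0}
t_out = {"A": 20.0, "B": 50.0, "C": 30.0}
tm = gravity_model(t_in, t_out)
print(tm[("A", "B")])
```

By construction the rows of the resulting matrix sum to the ingress totals and the columns to the egress totals, so the estimate is consistent with the measured per-location volumes even though no pairwise measurements were used.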
The spatial structure gives us the capability to fill in missing values of the traffic matrix when our data is not perfect. Hence we can still plan our network, even in the extreme case where we have no data at all.

5.3.3 Application Profile

We have so far discussed network traffic along two dimensions: the temporal and the spatial. There is a third aspect of traffic to consider: its application breakdown, or profile. Common applications on the Internet are email, web browsing (and other server-based interactions), peer-to-peer file transfers, video, and voice. Each may have a different traffic matrix, and as some networks move toward differentiated Quality of Service (QoS) for different classes of traffic, we may have to plan networks based on these different traffic matrices. Even where differentiated service is not going to be provided, a knowledge of the application classes in our network can be very useful. For instance

• Voice traffic is less variable than data, and so can require less overhead for sub-measurement-interval variations.
• Peer-to-peer applications typically generate more symmetric traffic than web traffic, and so upstream and downstream capacity requirements (downstream being toward customer eyeballs) are likely to be more balanced when peer-to-peer applications dominate.
• We may be planning to eliminate some types of traffic in future networks (e.g., peer-to-peer traffic has often been considered to violate service agreements that prohibit running servers).

The breakdown of traffic on a network is not trivial to measure. As noted, typical flow-level data collection includes TCP/UDP port numbers, and these are often associated with applications using the IANA (Internet Assigned Numbers Authority) list of registered ports (see footnote 4). However, the port numbers used today are often associated with incorrect applications because:

• Ports are not defined with IANA for all applications, e.g., some peer-to-peer applications.
• An application may use ports other than its well-known ports to circumvent access control restrictions, e.g., nonprivileged users often run WWW servers on ports other than port 80, which is restricted to privileged users on most operating systems, while port 80 is often used for applications other than HTTP in order to work around firewalls.
• In some cases server ports are dynamically allocated as needed. For example, FTP allows the dynamic negotiation of the server port used for the data transfer. This server port is negotiated on an initial TCP connection, which is established using the well-known FTP control port, but which would appear as a separate flow.
• Malicious traffic (e.g., DoS attacks) can generate a large volume of bogus traffic that should not be associated with the applications that normally use the affected ports.

Footnote 4: http://www.iana.org/assignments/port-numbers

In addition, there are some incorrect implementations of protocols, and ambiguous port assignments that complicate the problem. Better approaches to classification of traffic exist (e.g., [58]), but are not always implemented on commercial measurement systems.

Application profiles can be quite complex. Typical Internet providers will see some hundreds of different applications. However, there are two major simplifications we can often perform. The first is a clustering of applications into classes. QoS sometimes forms natural classes (e.g., real-time vs bulk-transfer classes), but regardless we can often group many applications into similarly structured classes, e.g., we can group a number of protocols (IMAP, POP, SMTP, etc.) into one class “email”. Common groupings are shown in Table 5.1, along with exemplar applications. There may be a larger number of application classes, and often there is a significant group of unknown applications, but a typical application profile is highly skewed. Again, it is common to see 80–20 or 90–10 rules.
In these cases, it is common to focus attention on those applications that generate the most traffic, reducing the complexity of the profile. However, care must be taken because some applications that generate relatively little traffic on average may be considered very important, and/or may generate high volumes of traffic for short bursts. There are several such examples in enterprise networks: consider, for instance, a CEO’s once-a-week company-wide broadcast, or nightly backups. Both generate a large amount of traffic, but in a relatively short time interval, so their proportion of the overall network traffic may be small. More generally, much of the control-plane traffic (e.g., routing protocol traffic) in networks is relatively low volume, but of critical importance.

Table 5.1 Typical application classes grouped by typical use

Class              Example applications
Bulk-data          FTP, FTP-Data
Database access    Oracle, MySQL
Email              IMAP, POP, SMTP
Information        finger, CDDBP, NTP
Interactive        SSH, Telnet
Measurement        SNMP, ICMP, Netflow
Network control    BGP, OSPF, DHCP, RSVP, DNS
News               NNTP
Online gaming      Quake, Everquest
Peer-to-peer       Kazaa, Bit-torrent
Voice over IP      SIP, Skype
www                HTTP, HTTPS

5.4 Prediction

There are two common scenarios for network planning:

1. Incremental planning for network evolution
2. Green-fields planning

In the first case, we have an existing network. We can measure its current traffic, and extrapolate trends to predict future growth. In combination with business data, quite accurate assessments of future traffic are possible. Typically, temporal models are sufficient for incremental network planning, though better results might be possible with recently developed full spatio-temporal models [52].

In green-fields planning, we have the advantage that we are not constrained in our network design. We may start with a clean slate, without concerning ourselves with a legacy network.
However, in such planning we have no measurements on which to base predictions. All is not lost, however, as we may exploit the spatial properties of traffic matrices in order to obtain predictions. We discuss each of these cases below.

There are other scenarios of concern to the network planner. For example

• Network mergers, for instance when two companies merge and subsequently combine their networks.
• Network migrations, for instance, as significant services such as voice or frame relay are migrated to operate on a shared backbone.
• Addition (or loss) of a large customer (say a broadband access provider, a major content provider, or a hosting center).
• A change in interdomain routing relationships. For instance, the conversion of a customer to a peer would mean that traffic no longer transits from that peer, altering traffic patterns.

The impact of these types of event is obviously dependent on the relative volume of the traffic affected. Such events can be particularly significant for smaller networks, but it is not unheard of for them to cause unexpected demands on the largest networks (for instance, the migration of an estimated half-million customers from Excite@home to AT&T in 2002; see footnote 5). However, the majority of such cases can be covered by one or both of the techniques below.

Footnote 5: http://news.cnet.com/ExciteHome-to-shut-down-ATT-drops-bid/2100-1033_3-276550.html

5.4.1 Prediction for Incremental Planning

Incremental planning involves extending, or evolving, a current network to meet changing patterns of demands, or changing goals. The problem involves prediction of future network demands, based on extrapolation of past and present network measurements. The planning problems we encounter are often constrained by the fact that we can make only incremental changes to our network, i.e., we cannot throw away the existing network and start from a clean slate, but let us first consider the problem of making successful traffic predictions.
Obviously, our planning horizon (the delay between our planning decisions and their implementation) is critical. The shorter this horizon, the more accurate our predictions are likely to be, but the horizon is usually determined by external factors such as delays between ordering and delivery of equipment, test and verification of equipment, planned maintenance windows, availability of technical staff, and capital budgeting cycles. These are outside the control of the network planner, so we treat the planning horizon as a constant. The planning horizon also suggests how much historical data is needed. It is a good idea to start with historical data extending several planning horizons into the past. Such a record allows not only better determination of trends, but also an assessment of the quality of our prediction process through analysis of past planning periods. If such data is unavailable, then we must consider green-fields planning (see Section 5.4.2), though informed by what measurements are available. Given such a historical record, our primary means for prediction is temporal analysis of traffic data. That is, we consider the traffic measurements of interest (often a traffic matrix) as a set of time-series. However, as noted earlier the more highly we aggregate traffic, the smaller its relative variance, and the easier it is to work with. As a result, it can be a good idea to predict traffic at a high level of aggregation, and then use a spatial model to break it into components. For instance, we might perform predictions for the total traffic in each region of our network, and then break it into components using the current traffic matrix percentages, rather than predicting each element of the traffic matrix separately. 
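The idea of predicting at a high level of aggregation and then splitting the result can be sketched as follows. This is only an illustration: it assumes that the relative shares of the current traffic matrix remain unchanged over the planning horizon, and the function name `split_forecast` and all volumes are made up.

```python
def split_forecast(predicted_total, current_tm):
    """Split a predicted aggregate volume across traffic matrix elements.

    predicted_total -- forecast of the total traffic (e.g., for one region)
    current_tm      -- dict: (i, j) -> currently measured volume
    Assumes each element keeps its current fraction of the total.
    """
    current_total = sum(current_tm.values())
    return {
        pair: predicted_total * volume / current_total
        for pair, volume in current_tm.items()
    }


# Made-up current measurements and an aggregate forecast 50% higher.
current_tm = {("A", "B"): 40.0, ("A", "C"): 10.0, ("B", "C"): 50.0}
forecast_tm = split_forecast(150.0, current_tm)
print(forecast_tm)
```

The design choice here is deliberate: only one relatively smooth aggregate time series has to be forecast, at the cost of assuming that the spatial structure is stable, which should itself be validated against historical data.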
There are many techniques for prediction, but we concentrate here on just one, which works reasonably well for a wide range of traffic. As in all of the work presented here, the key is not the individual algorithms but their robust application through a process of measurement, planning, and validation.

5.4.1.1 Extracting the Long-Term Trend

We will exploit the previously presented temporal model for traffic, and note that the key to providing predictions for use in planning is to estimate the long-term trend in the data. We could form such an estimate simply by aggregating our time series over periods of 1 week (to average away the diurnal and weekly cycles) and then performing standard trend analysis. However, knowledge of the cycles in traffic data is often useful. Sometimes we design networks to satisfy the demand during a "busy hour." More generally though, the busiest hours for different components of the traffic may not match (particularly in international networks distributed over several time zones), and so we need to plan our network to have sufficient capacity at all hours of the day or night.

Hence, the approach we present provides the capability to estimate both the long-term trend and the seasonal components of the traffic. It also allows an estimate of the peakedness, providing the ability to estimate the statistical variations around the expected traffic behavior. The method is hardly the only applicable time-series algorithm for this type of analysis (for another example see [44]), but it has the advantage of being relatively simple.

The method is based on a simple signal processing tool, the Moving Average (MA) filter, which we discuss in detail below. The moving average can be thought of as a simple low-pass filter as it "passes" low frequencies, or long-term behavior, but removes short-term variations. As such it is ideally suited to extracting the trend in our traffic data.
Although there are many forms of moving average, we shall restrict our attention to the simplest: a rectangular moving average

\[ \mathrm{MA}_x(t; n) = \frac{1}{2n+1} \sum_{s=t-n}^{t+n} x(s), \tag{5.13} \]

where $n$ is the width of the filter, and $2n + 1$ is its length. The length of the filter must be longer than the period of the cyclic component in order to filter out that component. Longer filters are often used to allow for averaging out of the stochastic variation as well. The shortest filter we should consider for extracting the trend is three times the period, which in Internet traffic data is typically 1 week. For example, given traffic data $x(t)$, measured in 1 hour intervals, we could form our estimate $\hat{L}(t)$ of the trend by taking a filter of length about 3 weeks (e.g., $2n + 1 = 505$, one sample longer than $24 \times 7 \times 3 = 504$), i.e., we might take $\hat{L}(t) = \mathrm{MA}_x(t; 252)$, where $\mathrm{MA}_x$ is defined in (5.13).

Care must always be taken around the start and end of the data. Within $n$ data points of the edges the MA filter will be working with incomplete data, and so these estimates should be discounted in further analysis.

Once we have obtained estimates for the long-term trend, we can model its behavior. Over the past decade, the Internet has primarily experienced exponential growth (for instance, see Fig. 5.4 or [45]), i.e.,

\[ L(t) = L(0)\, e^{\beta t}, \tag{5.14} \]

where $L(0)$ is the starting value, and $\beta$ is the growth rate. If exponential growth is suspected, the standard approach is to transform the data using the log function so that we see

\[ \log L(t) = \log L(0) + \beta t, \tag{5.15} \]

where we can now estimate $L(0)$ and $\beta$ from linear regression of the observed data. Care should obviously be taken that this model is reasonable. Regression provides diagnostic statistics to this end, but comparisons to other models (such as a simple linear model) can also be helpful. Such a model can be easily extrapolated to provide long-term predictions of traffic volumes.
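The trend extraction and growth fit above can be sketched in a few lines. This is a minimal illustration rather than a production method: the traffic series is synthetic (an assumed exponential trend with daily and weekly cycles), and the function names `moving_average` and `fit_exponential` are our own.

```python
import math

def moving_average(x, n):
    """Rectangular moving average of (5.13): half-width n, length 2n + 1.

    Entries within n samples of either edge are None, because the window
    there is incomplete and those estimates should be discounted.
    """
    length = 2 * n + 1
    out = [None] * len(x)
    for t in range(n, len(x) - n):
        out[t] = sum(x[t - n:t + n + 1]) / length
    return out

def fit_exponential(trend):
    """Ordinary least-squares fit of log L(t) = log L(0) + beta * t, as in (5.15)."""
    points = [(t, math.log(v)) for t, v in enumerate(trend) if v is not None]
    mean_t = sum(t for t, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    beta = sxy / sxx
    return math.exp(mean_y - beta * mean_t), beta

# Synthetic hourly traffic (an assumption for illustration only): a trend that
# doubles every 365 days, modulated by daily and weekly cycles.
TRUE_BETA = math.log(2) / (365 * 24)
x = [
    100.0 * math.exp(TRUE_BETA * t)
    * (1 + 0.3 * math.sin(2 * math.pi * t / 24) + 0.1 * math.sin(2 * math.pi * t / 168))
    for t in range(20 * 168)  # 20 weeks of hourly samples
]

trend = moving_average(x, 252)            # filter length 2 * 252 + 1 = 505 hours
l0, beta = fit_exponential(trend)
doubling_days = math.log(2) / beta / 24   # time for the fitted trend to double
print(f"L(0) ~ {l0:.1f}, beta ~ {beta:.2e} per hour, doubling ~ {doubling_days:.0f} days")
```

On real data the regression residuals should be inspected before the fitted rate is trusted, and the edge samples (the `None` entries) should be excluded from any later analysis.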
Standard diagnostics from the regression can also be used to provide confidence bounds for the predictions, allowing us to predict "best" and "worst" case scenarios for traffic growth. An example of such predictions is given in Fig. 5.4, using the data from 2000 to 2004 to estimate the trend, and then extrapolating this until 2009. The figure shows the extrapolated optimistic and pessimistic trend estimates. We can see that actual traffic growth from 2005 to 2007 was on the optimistic side of growth, but that in 2008 the measured traffic was again close to the long-term trend estimate. This example clearly illustrates that understanding the potential variations in our trend estimate is almost as important as obtaining the estimate in the first place. It also illustrates how instructive historical data can be in assessing appropriate models and prediction accuracy.

Often, in traffic studies, managers are keen to know the doubling time, the time it takes traffic to double. This can be easily calculated by estimating the value of $t$ such that $L(t) = 2L(0)$, or $e^{\beta t} = 2$. Again, taking logs we get the doubling time

\[ t = \frac{1}{\beta} \ln 2. \tag{5.16} \]

The Australian data shown in Fig. 5.4 has a doubling time of 477 days.

The trend by itself can inform us of growth rate, but modeling the cyclic variations in traffic is also useful. We do this by extending the concept of moving average to the seasonal moving average, but before doing so we broadly remove the long-term trend from the data (by dividing our measurements $x(t)$ by $\hat{L}(t)$).

5.4.1.2 Extracting the Cyclical Component

The goal of a Seasonal Moving Average (SMA) is to extract the cyclic component of our traffic. We know, a priori, the period (typically 7 days) and so the design of a filter to extract this component is simple. It resembles the MA used previously in that it is an average, but in this case it is an average of measurements separated in time by the period.
More precisely we form the SMA of the traffic with the estimated trend removed, e.g.,

\[ \hat{S}(t) = \frac{1}{N} \sum_{n=0}^{N-1} \frac{x(t + nT_S)}{\hat{L}(t + nT_S)}, \tag{5.17} \]

where $T_S$ is the period, and $N T_S$ is the length of the filter. In effect the SMA estimates the traffic volume for each time of day and week as if they were separate time series. It can be combined with a short MA filter to provide some additional smoothing of the results if needed.

The advantage of using an SMA as opposed to a straightforward seasonal average is that the cyclical component of network traffic can change over time. Using the SMA allows us to see such variability, while still providing a reasonably stable model for extrapolation. There is a natural trade-off between the length of the SMA, and the amount of change we allow over time (longer filters naturally smooth out transient changes). Typically, the length of filter desired depends on the planning horizon under which we are operating. We can extrapolate the SMA in various ways, but the simplest is to repeat the last cycle measured in our data into the future, as if the cyclical component remained constant into the future. Hence, when operating with a short planning horizon (say a week), we can allow noticeable week-to-week variations, and still obtain reasonable predictions, and so a filter length of three to four cycles is often sufficient. Where our planning horizon is longer (say a year) we must naturally assume that the week-to-week variations in the cyclical behavior are smaller in order to extrapolate, and so we use a much longer SMA, preferably at least of the order of the length of the planning horizon.

5.4.1.3 Estimating the Magnitude of Random Variations

Once we understand the periodic and trend components of the traffic, the next thing to capture is the random variation around the mean. Most metrics of variation used in capacity planning do not account for the time-varying component, and so are limited to busy-hour analysis.
In comparison, we now have an estimate of $\hat{m}(t) = \hat{L}(t)\hat{S}(t)$ and so can use (5.6) to estimate the stochastic or random component of our traffic by $z(t) = (x(t) - \hat{m}(t))/\sqrt{\hat{m}(t)}$. We can now measure the variability of the random component of the traffic using the variance of $z(t)$, which forms an estimate $\hat{a}$ for the traffic's peakedness. The estimator for $\hat{a}$ including the correction for bias is given in [57].

Note that it is also important to separate the impulsive, anomaly terms from the more typical variations. There are many anomaly detection techniques available (see [66] for a review of a large group of such algorithms). These algorithms can be used to select anomalous data points that can then be excluded from the above analysis.

5.4.1.4 From Traffic Matrix to Link Loads

Once we have predictions of a TM, we often need to use these to compute the link loads that would result. The standard approach is to write the TM in vectorized form $x$, where the vector $x$ consists of the columns of the TM (at a particular time) stacked one on top of another. The link loads $y$ can then be estimated through the equation

\[ y = Ax, \tag{5.18} \]

where $A$ is the routing matrix. The equation above can also be extended to project observations or predictions of a TM over time into equivalent link loads.

Although there are multiple time-series approaches that can be used to predict future behavior (e.g., Holt-Winters [11]), our approach has the advantage that it naturally incorporates multiplexing. As a result, Eq. 5.18 can be extended to other aspects of the traffic model. For instance, the variances of independent flows are additive (the variance of the multiplexed traffic is the sum of the variances of the components), and so the variance of link traffic follows the same relationship, i.e.,

\[ v_y = A v_x, \tag{5.19} \]

where $v_y$ and $v_x$ are the variances of the link loads and TM, respectively. We can use $v_y$ to deduce peakedness parameters for the link traffic using (5.7).
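The projections (5.18) and (5.19) amount to a matrix-vector product. The sketch below uses a hypothetical three-node line network with made-up demands; the helper name `project` is ours, and the routing matrix is assumed to have 0/1 entries (each demand follows a single path).

```python
def project(A, v):
    """Multiply the routing matrix A (links x demands) by a per-demand vector v."""
    return [sum(a * d for a, d in zip(row, v)) for row in A]

# Hypothetical line network 0 - 1 - 2, one direction only.
# Columns (demands): 0->1, 0->2, 1->2.  Rows (links): 0->1, 1->2.
A = [
    [1, 1, 0],  # link 0->1 carries demands 0->1 and 0->2
    [0, 1, 1],  # link 1->2 carries demands 0->2 and 1->2
]
x = [10.0, 4.0, 6.0]    # predicted mean demands (illustrative values)
v_x = [2.0, 1.0, 3.0]   # predicted demand variances (illustrative values)

y = project(A, x)       # link loads, as in (5.18)
v_y = project(A, v_x)   # link variances, as in (5.19), assuming independent flows

# Note: if a demand were split fractionally across paths, the variance of an
# independent sum would need the squared routing fractions instead.
print(y, v_y)
```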
So far, we have assumed that the network (at least the location of links, and the routing) is static. In reality, part of network planning involves changing the network, and so the matrix $A$ is really a potential variable. When we consider network planning, $A$ appears implicitly as one of our optimization variables. Likewise, $A$ may change in response to link or router failures. The reason traffic matrices are so important is that they are, in principle, invariant under changes to $A$. Hence predictions of link loads under the changes in $A$ can be easily made.

For example, imagine a traffic engineering problem where we wish to change routing in order to balance the load on a network's internal links more effectively. In doing so, the link loads are not invariant (the whole point of traffic engineering is to change these). However, the ingress/egress TM is invariant, and projecting this onto the links (via the routing matrix) will predict the link loads under proposed routing changes.

In reality, invariance is an approximation. Real TMs are not invariant under all network changes. For instance, if network capacities are chosen to be too small, congestion will result, and the Transmission Control Protocol (TCP) will act to alleviate this congestion by reducing the actual traffic carried on the network, thereby changing the traffic matrix. In general, different sets of measurements will have different degrees of invariance. For instance, an origin/destination TM is invariant to changes in egress points (due to routing changes), whereas an ingress/egress TM is not. It is clearly better to use the right data set for each planning problem, but the desired data is not always available.

The lack of true invariance is one of the key reasons for the cyclic approach to network planning. We seek to correct any problems caused by variations in our inputs in response to our new network design.
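Before leaving incremental planning, the seasonal and random components of Sections 5.4.1.2 and 5.4.1.3 can also be sketched briefly. The code below is an illustration under stated assumptions: the trend is taken as already known, the data are synthetic and generated as mean plus Gaussian noise whose variance is the peakedness times the mean, and no bias correction is applied (so the estimate is expected to be slightly low). The function names are ours.

```python
import math
import random
import statistics

def seasonal_moving_average(x, trend, period, n_cycles):
    """Seasonal average of (5.17): mean of x/trend over n_cycles samples one period apart."""
    usable = len(x) - (n_cycles - 1) * period
    return [
        sum(x[t + n * period] / trend[t + n * period] for n in range(n_cycles)) / n_cycles
        for t in range(usable)
    ]

def peakedness(x, mean):
    """Sample variance of z(t) = (x(t) - m(t)) / sqrt(m(t)); no bias correction."""
    z = [(xt - mt) / math.sqrt(mt) for xt, mt in zip(x, mean)]
    return statistics.variance(z)

# Synthetic hourly data (assumptions for illustration): weekly period,
# 20 weeks of samples, true peakedness 4.
random.seed(1)
PERIOD, CYCLES, TRUE_A = 168, 20, 4.0
n_samples = PERIOD * CYCLES
trend = [1000.0 * math.exp(2e-5 * t) for t in range(n_samples)]
season = [
    1 + 0.3 * math.sin(2 * math.pi * p / 24) + 0.1 * math.sin(2 * math.pi * p / PERIOD)
    for p in range(PERIOD)
]
mean = [trend[t] * season[t % PERIOD] for t in range(n_samples)]
x = [m + math.sqrt(TRUE_A * m) * random.gauss(0, 1) for m in mean]

s_hat = seasonal_moving_average(x, trend, PERIOD, CYCLES)      # one value per hour of week
m_hat = [trend[t] * s_hat[t % PERIOD] for t in range(n_samples)]
a_hat = peakedness(x, m_hat)
print(f"estimated peakedness ~ {a_hat:.2f} (true value {TRUE_A})")
```

With a filter spanning all 20 cycles the seasonal estimate is a single averaged week; a shorter `n_cycles` would yield a sliding estimate that can track changes in the cycle, at the cost of more noise.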
5.4.2 Prediction for Green-Fields Planning

The above section assumes that we have considerable historical data to which we apply time-series techniques to extrapolate trends, and hence predict the future traffic demands on our network. This has two major limitations:

1. IP traffic is constrained by the pipe through which it passes. TCP congestion control ensures that such traffic does not overflow by limiting the source transmission rate. In most networks our measurements only provide the carried load, not the offered load. If the network capacities change, the traffic may increase in response. This is a concern if our current network is loaded to near its capacity, and in this case we must discount our measurements, or at least treat them with caution.
2. When we design a new network there is nothing in place for us to measure.

We will start by considering available strategies for the latter case. We can draw inspiration from the spatial models previously presented. The fact that the simple gravity model describes, to some extent, the spatial structure of Internet traffic matrices presents us with a simple approach to estimate an initial traffic matrix.

The first step is to estimate the total expected traffic for the network, based on demographics and market projections. Let us take a simple example: in Australia the Australian Bureau of Statistics (ABS) measures Internet usage. Across a wide customer base the average usage per customer was roughly 3 GB/month (since 2006). The total traffic for our network is the usage per customer multiplied by the projected number of customers. We can derive traffic estimates per marketing region in the same fashion. Note that the figure used above is for the broad Australian market and is unlikely to be correct elsewhere (typical Australian ISPs have a tiered pricing structure). Where more detailed figures exist in particular markets these should be used.

The second step is to estimate the "busy-hour" traffic.
As we have seen previously the traffic is not uniformly distributed over time. In the absence of better data, we might look at existing public measurements (such as presented in Figs. 5.2 and 5.3, or as appears in [44]) where the peak to mean ratio is of the order of 3 to 2. Increasing our traffic estimates by this factor gives us an estimate of the peak traffic loads on the network. The third step is to estimate a traffic matrix. The best approach, in the absence of other information, to derive the traffic matrix is to apply the gravity model (5.12). In the simple case, the gravity model would be applied directly using the local regional traffic estimates. However, where additional information about the expected application profile exists, we might use this to refine the results using the “independent flow model” of [21]. Additional structural information about the network might allow use of the “generalized gravity model” of [68]. Each of these approaches allows us to use additional information, but in the absence of such information the simple gravity model gives us our initial estimate of the network traffic matrix. What about the case where we have historical network traffic measurements, but suspect that the network is congested so that the carried load is significantly below the offered load? In this case, our first step is to determine what parts of the traffic matrix are affected. If a large percentage of the traffic matrix is affected, then the only approach we have available is to go back through the historical record until we reach a point (hopefully) where the traffic is not capacity constrained. This has limitations: for one thing, we may not find a sufficient set of data where capacity constraints have left the measurements uncorrupted. Even where we do obtain sufficient data, the missing (suspect) measurements increase the window over which we must make predictions, and therefore the potential errors in these predictions. 
However, if only a small part of the traffic matrix is affected we may exploit techniques developed for traffic matrix inference to fill in the suspect values with more accurate estimates. These methods originated due to the difficulties in collecting flow-level data to measure traffic matrices directly. Routers (particularly older routers) may not support an adequate mechanism for such measurements (or suffer a performance hit when the measurements are used), and installation of stand-alone measurement devices can be costly. On the other hand, the Simple Network Management Protocol (SNMP) is almost ubiquitously available, and has little overhead. Unfortunately, it provides only link-load measurements, not traffic matrices. However, the two are simply related by (5.18). Inferring $x$ from $y$ is a so-called "network tomography" problem. For a typical network the number of link measurements is $O(N)$ (for a network of $N$ nodes), whereas the number of traffic matrix elements is $O(N^2)$, leading to a massively underconstrained linear inverse problem. Some type of side information is needed to solve such problems, usually in the form of a model that roughly describes a typical traffic matrix. We then estimate the parameters of this crude model (which we shall call $m$), and perform a regularization with respect to the model and the measurements by solving the minimization problem

\[ \operatorname*{argmin}_{x} \; \|y - Ax\|_2^2 + \lambda^2 d(x, m), \tag{5.20} \]

where $\|\cdot\|_2$ denotes the $l^2$ norm, $\lambda > 0$ is a regularization parameter, and $d(x, m)$ is a distance between the model $m$ and our estimated traffic matrix $x$. Examples of suitable distance metrics are standard or weighted Euclidean distance and the Kullback–Leibler divergence. Approaches of this type, generally called strategies for regularization of ill-posed problems, are more generally described in [29], but have been used in various forms in many works on traffic matrix inference.
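For the particular choice of squared Euclidean distance for $d$, the minimization in (5.20) has a closed form, which the sketch below uses. It is an illustration only: the network, the link loads, and the prior model values are hypothetical, non-negativity of the estimate is not enforced, and the helpers `solve` and `regularized_estimate` are our own names.

```python
def solve(M, b):
    """Solve M w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (aug[r][n] - sum(aug[r][c] * w[c] for c in range(r + 1, n))) / aug[r][r]
    return w

def regularized_estimate(A, y, prior, lam):
    """Minimise ||y - A x||^2 + lam^2 ||x - prior||^2.

    Setting the gradient to zero gives
    x = prior + A^T (A A^T + lam^2 I)^{-1} (y - A prior).
    """
    n_links = len(A)
    residual = [yi - sum(a * p for a, p in zip(row, prior)) for row, yi in zip(A, y)]
    gram = [
        [sum(a * b for a, b in zip(A[i], A[j])) + (lam ** 2 if i == j else 0.0)
         for j in range(n_links)]
        for i in range(n_links)
    ]
    w = solve(gram, residual)
    return [p + sum(A[e][k] * w[e] for e in range(n_links)) for k, p in enumerate(prior)]

# Hypothetical line network 0 - 1 - 2: two measured links, three unknown demands.
A = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]
y = [14.0, 10.0]          # measured link loads
prior = [8.0, 5.0, 7.0]   # crude model of the demands (illustrative values)

x_fit = regularized_estimate(A, y, prior, lam=0.1)      # trusts the measurements
x_prior = regularized_estimate(A, y, prior, lam=100.0)  # trusts the model
print(x_fit, x_prior)
```

A small regularization parameter pulls the estimate toward values that reproduce the link loads, while a large one leaves it close to the prior model.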
The method works because the measurements leave the problem underconstrained, thereby allowing many possible traffic matrices that fit the measurements, but the model allows us to choose one of these as best. Furthermore, the method allows us to trade off our belief about the accuracy of the model against the expected errors in the measurements.

We can utilize TM structure to interpolate missing values by solving a similar optimization problem

\[ \operatorname*{argmin}_{x} \; \|\mathcal{A}(x) - M\|_2^2 + \lambda^2 d(x, m_g), \tag{5.21} \]

where $\mathcal{A}(x) = M$ expresses the available measurements as a function of the traffic matrix (whether these be link measurements or direct measurements of a subset of the TM elements we do not care), and $m_g$ is the gravity model. This regularizes our model with respect to the measurements that are considered valid. Note that the gravity model in this approach will be skewed by missing elements, so this approach is only suitable for interpolation of a few elements of the traffic matrix. If larger numbers of elements are missing, we can use more complicated techniques such as those proposed in [52] to interpolate the missing data.

5.5 Optimal Network Plans

Once we have obtained predictions of the traffic on our network we can commence the actual process of making decisions about where links and routers will be placed, their capacities, and the routing policies that will be used. In this section we discuss how we may optimize these quantities against a set of goals and constraints.

The first problem we consider concerns capacity planning. If this component of our network planning worked as well as desired, we could stop there. However, errors in predictions, coupled with the long planning horizon for making changes to a network, mean that we also need to consider a short-term way of correcting such problems. The solution is typically called traffic engineering or simply load balancing, and is considered in Section 5.5.2.
5.5.1 Network Capacity Planning

There are many good optimization packages available today. Commercial tools such as CPLEX are designed specifically for solving optimization problems, while more general purpose tools such as Matlab often include optimization toolkits that can be used for such problems. Even Excel includes some quite sophisticated optimization tools, and so we shall not consider optimization algorithms in detail here. Instead we will formulate the problem, and provide insight into the practical issues.

There are three main components to any optimization problem: the variables, the objective, and the constraints. The variables here are obviously the locations of links, and their capacities.

The objective function (the function which we aim to minimize) varies depending on business objectives. For instance, it is common to minimize the cost of a network (either its capital or ongoing cost), or packet delays (or some other network performance metric). The many possible objectives in network design result in different problem formulations, but we concentrate here on the most common objective of cost minimization. The cost of a network is a complex function of the number and type of routers used, and the capacities of the links. It is common, however, to break up the problem hierarchically into inter-PoP and intra-PoP design, and we consider the two separately here.

The constraints in the problem fall into several categories:

1. Capacity constraints require that we have "sufficient" link capacity. These are the key constraints for this problem so we consider these in more detail below.
2. Other technological constraints, such as limited port numbers per router.
3. Constraints arising as a result of the difficulties in multiobjective optimization. For example, we may wish to have a network with good performance and low cost. However, multiobjective optimization is difficult, so instead we minimize cost subject to a constraint on network performance.
4. Reliability constraints require that the network functions even under network failures. This issue is so important that other chapters of this book have been devoted to it, but we shall consider some aspects of this problem here as well.

5.5.1.1 Capacity Constraints and Safe-Operating Points

Unsurprisingly, the primary constraints in capacity planning are the capacity constraints. We must have a network with sufficient capacity to carry the offered traffic. The key issue is our definition of "sufficient." There are several factors that go into this decision:

1. Traffic is not constant over the day, so we must design our network to carry loads at all times of day. Often this is encapsulated in "busy hour" traffic measurements, but busy hours may vary across a large network, and between customers, and so it is better to design for the complete cycle.
2. Traffic has observable fluctuations around its average behavior. Capacity planning can explicitly allow for these variations.
3. Traffic also has unobservable fluctuations on shorter times than our measurement interval. Capacity planning must attempt to allow for these variations.
4. There will be measurement and prediction errors in any set of inputs.

Ideally, we would use queueing models to derive an exact relationship between measured traffic loads and their variations, and so determine the required capacities. However, despite many recent advances in data traffic modeling, we are yet to agree on sufficiently precise and general queueing models to determine sufficient capacity from numerical formulae. There is no "Erlang-B" formula for data networks. As a result, most network operators use some kind of engineering rule of thumb, which comes down to an "over-engineering factor" to allow for the above sources of variability.

We adopt the same approach here, but the term "over-engineering factor" is misleading. The factor allows for known variations in the traffic.
The network is not over-engineered; it only appears so if capacity is directly compared to the available but flawed measurements. In fact, if we follow a well-founded process, the network can be quite precisely engineered.⁶ We therefore prefer to use the term Safe Operating Point (SOP).

A SOP is defined statistically with respect to the available traffic measurements on a network. For instance, with 5-min SNMP traffic measurements, we might define our SOP by requiring that the load on the links (as measured by 5-min averages) should not exceed 80% of link capacity more than five times per month. The predicted traffic model could then be used to derive how much capacity is needed to achieve this bound.

Traffic variance depends on the application profile and the scale of aggregation. Moreover, the desired trade-off between cost and performance is a business choice for network operators. So there is no single SOP that will satisfy all operators. Given the lack of precision in current queueing models and measurements, the SOP needs to be determined by each network operator experimentally, preferably starting from conservative estimates. Natural variations in network conditions often allow enough scope to see the impact of variable levels of traffic, and from these determine more accurate SOP specifications, but to do this we need to couple traffic and performance measurements (a topic we consider later).

⁶ It is a common complaint that backbone networks are underutilized. This complaint typically ignores the issues described above. In reality, many of these networks may be quite precisely engineered, but crude average utilization numbers are used to defer required capacity increases.

A secondary set of capacity constraints arises because there is a finite set of available link types, and capacity must be bought in multiples of these links.
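The example SOP above, combined with a finite menu of link sizes, can be checked mechanically. The sketch below is illustrative only: the month of 5-min loads is synthetic, the capacity menu is a hypothetical one, and the function names are ours.

```python
import math

def meets_sop(loads, capacity, utilization=0.8, max_exceedances=5):
    """True if at most max_exceedances samples exceed utilization * capacity."""
    exceedances = sum(1 for load in loads if load > utilization * capacity)
    return exceedances <= max_exceedances

def smallest_safe_capacity(loads, capacities, **sop):
    """Smallest capacity on the menu that satisfies the SOP, or None if none does."""
    for capacity in sorted(capacities):
        if meets_sop(loads, capacity, **sop):
            return capacity
    return None

# Synthetic month of 5-min averages in Mbps (an assumption for illustration):
# a daily cycle between 200 and 600 Mbps plus three short spikes.
SAMPLES_PER_DAY = 288
loads = [
    400 + 200 * math.sin(2 * math.pi * t / SAMPLES_PER_DAY)
    for t in range(30 * SAMPLES_PER_DAY)
]
for t in (100, 2000, 5000):
    loads[t] = 900.0

LINK_MENU_MBPS = [155, 622, 1000, 2488]   # hypothetical menu of link sizes
chosen = smallest_safe_capacity(loads, LINK_MENU_MBPS)
print(f"smallest capacity meeting the SOP: {chosen} Mbps")
```

In planning, `loads` would be the predicted traffic over the planning horizon rather than past measurements, and the thresholds would be those the operator has settled on experimentally.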
For instance, many high-speed networks use either SONET/SDH links (typically giving 155 Mbps times powers of 4) and/or Ethernet link capacities (powers of 10 from 10 Mbps to 10 Gbps). We will denote the set of available link capacities (including zero) by $\mathcal{C}$.

Finally, most high-speed link technologies are duplex, and so we need to allocate capacity in each direction, but we typically do so symmetrically (i.e., a link has the same capacity from $i \to j$ as from $j \to i$ even when the traffic loads in each direction are different).

5.5.1.2 Intra-PoP Design

We divide the network design or capacity planning problem into two components and first consider the design of the network inside a PoP. Typically this involves designing a tree-like network to aggregate traffic up to regional hubs, which then transit the traffic onto a backbone.⁷ The exact design of a PoP is considered in more detail in Chapter 4, but note that in each of the cases considered there we end up with very similar optimization problems at this level.

⁷ In small PoPs, a single router (or redundant pair) may be sufficient for all needs. Little planning is needed in this case beyond selecting the model of router, and so we do not include this simple case in the following discussions.

There are two prime considerations in such planning. Firstly, it is typical that the majority of traffic is nonlocal, i.e., that it will transit to or from the backbone. Local traffic between routers within the PoP in the Internet is often less than 1% of the total. There are exceptions to this rule, but these must be dealt with on an individual basis. Secondly, limitations on the number of ports on most high-speed routers mean that we need at least one layer of aggregation routers to bring traffic onto the backbone: for instance, see Fig. 5.5. For clarity, we show a very simple design (see Chapter 4 for more examples). In our example, Backbone Routers (BRs) and the corresponding links to Aggregation Routers (ARs) are assigned in pairs in order to provide redundancy, but otherwise the topology is a simple tree.

[Fig. 5.5: A typical PoP design. Aggregation Routers (AR) are used to increase the port density in the PoP and bring traffic up to the Backbone Routers (BR).]

There are many variations on this design, for instance, additional BRs may be needed, or multiple layers. However, in our simple model, the design is determined primarily by the limitations on port density. The routers lie within a single PoP, so links are short and their cost has no distance dependence (and they are relatively cheap compared to wide-area links). The number of ARs that can be accommodated depends on the number of ports that can be supported by the BRs, so we shall assume that ARs have a single high-capacity uplink to each BR to allow for a maximum expansion factor in a one-level tree.

As a result, the job of planning a PoP is primarily one of deciding how many ARs are needed. As noted earlier we do not need a TM for this task. The routing in such a network is predetermined, and so current port allocations and the uplink load history are sufficiently invariant for this planning task. We use these to form predictions of future uplink requirements and the loads on each router. When predictions show that a router is reaching capacity (either in terms of uplink capacity, traffic volume, or port usage) we can install additional routers based on our predictions over the planning horizon for router installation.

There is an additional improvement we can make in this type of problem. It is rare for customers to use the entire capacity of their link to our network, and so the uplink capacity between AR and BR in our network need not be the sum of the customers' link capacities.
We can take advantage of this fact through simple measurement-based planning, but with the additional detail that we may allocate customers with different traffic patterns to routers in such a way as to leverage different peak hours and traffic asymmetries (between input and output traffic), so as to further reduce capacity requirements.

The problem resembles the bin packing problem. Given a fixed link capacity $C$ for the uplinks between ARs and BRs, and $K$ customers with peak traffic demands $\{T_i\}_{i=1}^{K}$, the bin packing problem would be as follows: determine the smallest integer $B$, such that we can find a $B$-partition $\{S_k\}_{k=1}^{B}$ of the customers⁸ such that

\[ \sum_{i \in S_k} T_i \le C \quad \text{for all } k = 1, \ldots, B. \tag{5.22} \]

The number of subsets $B$ gives the number of required ARs, and although the problem is NP-hard, there are reasonable approximation algorithms for its solution [18], some of which are online, i.e., they can be implemented without reorganization of existing allocations.

⁸ A $B$-partition of our customers is a group of $B$ non-empty subsets $S_k \subseteq \{1, 2, \ldots, K\}$ that are disjoint, i.e., $S_i \cap S_j = \emptyset$ for all $i \ne j$, and which include all customers, i.e., $\bigcup_{k=1}^{B} S_k = \{1, 2, \ldots, K\}$.

The real problem is more complicated. There are constraints on the number of ports that can be supported by ARs dependent on the model of ARs being deployed, constraints on router capacity, and in addition, we can take advantage of the temporal and directional characteristics of traffic. Customer demands take the form $[I_i(t), O_i(t)]$, where $I_i(t)$ and $O_i(t)$ are incoming and outgoing traffic demands for customer $i$ at time $t$. So the appropriate condition for our problem is to find the minimal number $B$ of ARs such that

\[ \sum_{i \in S_k} I_i(t) \le C \quad \text{and} \quad \sum_{i \in S_k} O_i(t) \le C \quad \text{for all } k, t. \tag{5.23} \]

This is the so-called vector bin packing problem, which has been used to model resource constrained processor scheduling problems, and good approximations have been known for some time [15, 28].
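As a rough illustration of condition (5.23), the sketch below applies a simple first-fit-decreasing heuristic. This is one common greedy heuristic, not necessarily one of the approximation algorithms of [15, 28], and it ignores port and router capacity limits. The customer names, demand values, and the function `first_fit` are hypothetical.

```python
def first_fit(customers, capacity):
    """Greedy first-fit-decreasing packing of customers onto aggregation routers.

    customers maps a name to (inbound, outbound), each a list of demands per
    time slot. A customer fits on a router only if both directions stay
    within capacity in every time slot, as required by (5.23).
    """
    def peak(name):
        inbound, outbound = customers[name]
        return max(max(inbound), max(outbound))

    routers = []  # each router: {"members": [...], "in": [...], "out": [...]}
    for name in sorted(customers, key=peak, reverse=True):
        inbound, outbound = customers[name]
        for router in routers:
            new_in = [a + b for a, b in zip(router["in"], inbound)]
            new_out = [a + b for a, b in zip(router["out"], outbound)]
            if max(new_in) <= capacity and max(new_out) <= capacity:
                router["members"].append(name)
                router["in"], router["out"] = new_in, new_out
                break
        else:
            routers.append({"members": [name], "in": list(inbound), "out": list(outbound)})
    return routers

# Hypothetical customers with two time slots (day, night); demands in Gbps.
customers = {
    "hosting_day":   ([8, 2], [1, 1]),   # mostly inbound, busy by day
    "access_day":    ([1, 1], [8, 2]),   # mostly outbound, busy by day
    "hosting_night": ([2, 8], [1, 1]),   # mostly inbound, busy at night
    "access_night":  ([1, 1], [2, 8]),   # mostly outbound, busy at night
}
routers = first_fit(customers, capacity=10)

# For comparison: packing on the scalar peak alone, as in (5.22).
peaks_only = {
    name: ([max(max(i), max(o))], [max(max(i), max(o))])
    for name, (i, o) in customers.items()
}
routers_scalar = first_fit(peaks_only, capacity=10)
print(len(routers), "routers using (5.23);", len(routers_scalar), "using peaks only")
```

In this made-up example, pairing a mostly inbound customer with a mostly outbound one lets two customers share each uplink, whereas packing on scalar peaks alone places every customer on its own router.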
The major advantage of this type of approach is that customers with different peak traffic periods can be combined onto one AR so that their joint traffic is more evenly distributed over each 24-hour period. Likewise, careful distribution of customers whose primary traffic flows into our network (for instance, hosting centers) together with customers whose traffic flows out of the network (e.g., broadband access companies) can lead to more symmetric traffic on the uplinks, and hence better overall utilization.

In practice, multiplexing gains may improve the situation, so that less capacity is needed when multiple customers' traffic is combined, but this effect only plays a dominant role when large numbers (say hundreds) of small customers are being combined.

5.5.1.3 Inter-PoP Backbone Planning

The inter-PoP backbone design problem is somewhat more complicated. We start by assuming that we know the locations at which we wish to have PoPs. The question of how to optimize these locations does come up, but it is common that these locations are predetermined by other aspects of business planning.

In inter-PoP planning, distance-based costs are important. The cost of a link is usually considered to be proportional to its length, though this is approximate. The real cost of a link has a fixed component (in the equipment used to terminate a line) in addition to distance-dependent terms derived from the cost to install a physical line, e.g., costs of cables, excavation and rights of way. Even where leased lines are used (so there are minimal installation costs) the original capital costs of the lines are usually passed on through some type of distance-sensitive pricing.

In addition, higher speed links generally cost more. The exact model for such costs can vary, but a large component of the bandwidth-dependent costs is in the end equipment (router interface cards, WDM mux/demux equipment, etc.).
In actuality, real costs are often very complicated: vendors may have discounts for bulk purchases, whereas cutting-edge technology may come at a premium cost. However, link costs are often approximated as linear with respect to bandwidth because we could, in principle, obtain a link with capacity $4c$ by combining four links of capacity $c$. In the simple case then, cost per link has the form

$$f(d_e, c_e) = \alpha + \beta d_e + \gamma c_e, \tag{5.24}$$

where $\alpha$ is the fixed cost of link installation, $\beta$ is the link cost per unit distance, and $\gamma$ is the cost per unit bandwidth. As the distance of a link is typically a fixed property of the link, we often rewrite the above cost in the form

$$f_e(c_e) = \alpha_e + \gamma c_e, \tag{5.25}$$

where now the cost function depends on the link index $e$. We further simplify the problem by assuming that BRs are capable of dealing with all traffic demands so that only two (allowing for redundancy) are needed in each PoP, thus removing the costs of the router from the problem.

Finally, we simplify our approach by assuming that routes are chosen to follow the shortest possible geographic path in our network. There are reasons (which we shall discuss in the following section) why this might not be the case; however, a priori, it makes sense to use the shortest geographic path. There are costs that arise from distance. Most obviously, if packets traverse longer paths, they will experience longer delays, and this is rarely desirable. In addition, packets that traverse longer paths use more resources. For instance, a packet that traverses two hops rather than one uses up capacity on two links rather than one.

As noted earlier, we need to specify the problem constraints, the basic set of which are intended to ensure that there is sufficient capacity in the network. When congestion is avoided, queueing delays will be minimal, and hence delays across the network will be dominated by propagation delays (the speed of light cannot be increased).
So ensuring sufficient capacity implicitly serves the purpose of reducing networking delays. As noted, we adopt the approach of specifying an SOP, which we do in the form of a factor $\psi \in (0, 1)$, which specifies the traffic limit with respect to capacity. That is, we shall require that the link capacity $c_e$ be sufficient that traffic takes up only a fraction $\psi$ of the capacity, leaving $1 - \psi$ of the capacity to allow for unexpected variations in the traffic.

The possible variables are now the link locations and their capacities. So, given the (vectorized) traffic matrix $x$, our job is to determine link locations and capacities $c_e$, which implicitly define the network routes (and hence the routing matrix $A$), such that we solve

$$\text{minimize } \sum_{e \in E} \big[ \alpha_e I(c_e > 0) + \gamma c_e \big] \quad \text{such that } A x \le \psi c, \quad c_e \in \mathcal{C}, \tag{5.26}$$

where $Ax = y$ gives the link loads, $c$ is the vector of link capacities, $E$ is the set of possible links, $\alpha_e$ and $\gamma$ are the fixed and per-unit-bandwidth link costs of (5.25), $I(c_e > 0)$ is an indicator function (which is 1 where we build a link, and 0 otherwise), and $\mathcal{C}$ is the set of available link capacities (which includes 0).

Implicit in the above formulation is the routing matrix $A$, which results from the particular choice of links in the network design, so $A$ is in fact a function of the network design. Its construction imposes constraints requiring that all traffic on the network can be routed. The problem can be rewritten in a more explicit form using flow-based constraints, but the above formulation is convenient for explaining the differences and similarities between the range of problems we consider here.

There may be additional constraints in the above-mentioned problem resulting from router limitations, or due to network performance requirements. For instance, if we have a maximum throughput on each router, we introduce a set of constraints of the form $Bx \le r$, where $r$ is the vector of router capacities, and $B$ is similar to a routing matrix in that it maps end-to-end demands to the routers along the chosen path.
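To make the roles of the terms in (5.26) concrete, here is a minimal sketch that evaluates one candidate design. It makes a strong simplifying assumption that the text does not: the routing matrix is held fixed, so the sketch only picks, for each link, the smallest available capacity that respects the SOP constraint and then prices the result. All names and numbers are hypothetical.

```python
def link_loads(A, x):
    """Link loads y = A x for a routing matrix A (rows = links) and demands x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def choose_capacities(A, x, sop, available):
    """For a FIXED routing, pick the smallest capacity in `available` with y_e <= sop * c_e.

    Links carrying no traffic are not built (capacity 0).
    """
    caps = []
    for y in link_loads(A, x):
        if y == 0:
            caps.append(0)
            continue
        feasible = [c for c in sorted(available) if y <= sop * c]
        if not feasible:
            raise ValueError(f"no available capacity can carry load {y}")
        caps.append(feasible[0])
    return caps


def design_cost(caps, alpha, gamma):
    """Objective of (5.26): fixed cost alpha_e for each built link plus gamma per unit capacity."""
    return sum(a + gamma * c for a, c in zip(alpha, caps) if c > 0)


# Hypothetical example: three candidate links, three demands, SOP factor 0.6.
A = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 0]]            # the third candidate link carries no traffic
x = [3, 2, 3]
caps = choose_capacities(A, x, sop=0.6, available=[0, 2.5, 10, 40])
cost = design_cost(caps, alpha=[100, 120, 90], gamma=2)
print(caps, cost)
```

A full design procedure would also search over which links to build, which changes the routing matrix; this sketch covers only the inner evaluation step.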
Port constraints on a router might be expressed by taking constraints of the form $\sum_j I(c_{i,j} > 0) \le p_i$, where $p_i$ is the port limit on router $i$. Port constraints are complicated by the many choices of line cards available for high-speed routers, and so have sometimes been ignored, but they are a key limitation in many networks. The issue is sometimes avoided by separation of inter- and intra-PoP design, so that a high port density on BRs is not needed.

The other complication is that we should aim to optimize the network for 24×7 operations. We can do so simply by including one set of capacity constraints for each time of day and week, i.e., $A x_t \le \psi c$, where $\psi$ is the SOP factor. The resulting constraints are in exactly the same form as in (5.26) but their number increases. However, it is common that many of these constraints are redundant, and so can be removed from the optimization (without effect) by a pre-filtering phase.

The full optimization problem is a linear integer program, and there are many tools available for solution of such programs. However, it is not uncommon to relax the integer constraints to allow any $c_e \ge 0$. In this case, there is no point in having excess capacity, and so we can replace the link capacity constraint by $Ax = \psi c$. We then obtain the actual design by rounding up the capacities. This approach reduces the numerical complexity of the problem, but results in a potentially suboptimal design. Note though, that integer programming problems are often NP-hard, and consequently solved using heuristics, which likewise can lead to suboptimal designs. Relaxation to a linear program is but one of a suite of techniques that can be used to solve problems in this context, often in combination with other methods.

Moreover, it is common in the mathematical community to focus on finding provably optimal designs, but this is not a real issue. In practical network design we know that the input data contains errors, and our cost models are only approximate.
Hence, the mathematically optimal solution may not have the lowest cost of all realizable networks. The mathematical program only needs to provide us with a very good network design.

The components of real networks suffer outages on a regular basis: planned maintenance and accidental fiber cuts are simple examples (for more details see Chapters 3 and 4). The final component of network planning that we discuss here is reliability planning: analyzing the reliability of a network. There are many algorithms aimed at maintaining network connectivity, ranging from simple designs such as rings or meshes, to formal optimization problems including connectivity constraints. Commonly, networks are designed to survive all single link or node outages, though more careful planning would concern all Shared Risk Groups (SRGs), i.e., groups of links and/or nodes that share fate under common failures. For instance, IP links that use wavelengths on the same fiber will all fail simultaneously if the fiber is cut.

However, when a link (or SRG) fails, maintaining connectivity is not the only concern. Rerouted traffic creates new demands on links. If this demand exceeds capacity, then the resulting congestion will negatively impact network performance. Ideally, we would design our network to accommodate such failures, i.e., we would modify our earlier optimization problem (5.26) as follows:

$$\text{minimize } \sum_{e \in E} \big[ \alpha_e I(c_e > 0) + \gamma c_e \big] \quad \text{such that } A x \le \psi c, \text{ and } A_i x \le \psi_F c, \ \forall i \in \mathcal{F}, \tag{5.27}$$

where $\mathcal{F}$ is the set of all failure scenarios considered likely enough to include, $A_i$ is the routing matrix under failure scenario $i$, and $\psi$ and $\psi_F$ are the SOP factors applied under normal and failure conditions, respectively. Naively implemented with $\psi = \psi_F$, this approach has the limitation that the capacity constraints under failures can come to dominate the design of the network so that most links will be heavily underutilized under normal conditions.
Hence, we allow the SOPs with respect to normal loads, $\psi$, and failure loads, $\psi_F$, to be different, with $\psi < \psi_F < 1$, so that the mismatch is somewhat balanced, i.e., under normal conditions links are not completely underutilized, but there is likely to be enough capacity under common failures. For example, we might require that under normal loads, peak utilizations remain at 60%, while under failures, we allow loads of 85%.

Additionally, the number of possible failure scenarios can be quite large, and as each introduces constraints, it may not be practical to consider all failures. We may need to focus on the likely failures, or those that are considered to be most potentially damaging. However, it is noteworthy that only constraints that involve rerouting need be considered. In most failures, a large number of links will be unaffected, and hence the constraints corresponding to those links will be redundant, and may be easily removed from the problem.

The above formulation presumes that we design our network from scratch, but this is the exception. We typically have to grow our network incrementally. This introduces challenges: for instance, it is easy to envisage a series of incremental steps that are each optimal in themselves, but which result in a highly suboptimal network over time. So it is sometimes better to design an optimal network from scratch, particularly when the network is growing very quickly. In the meantime we can include the existing network through a set of constraints in the form $c_e \le l_e + c_e'$, where $l_e$ is the legacy link capacity on link $e$, and $c_e'$ is the additional link capacity.
The real situation is complicated by some additional issues: (i) typical IP router load balancing is not well suited for multiple parallel links of different capacities so we must choose between increasing capacity through additional links (with capacity equal to the legacy links) or paying to replace the old links with a single higher capacity link; and (ii) the costs for putting additional capacity between two routers may be substantially different from the costs for creating an entirely new link. Some work [40] has considered the problem of evolvability of networks, but without all of the additional complexities of IP network management, so determining long-term solutions for optimal network evolution is still an open problem.

5.5.2 Traffic Engineering

In practice, it takes substantial time to build or change a network, despite modern innovations in reconfigurable networks. Typical changes to a link involve physically changing interface cards, wiring, and router configurations. Today these changes are often made manually. They also need to be performed carefully, through a process where the change is documented, carefully considered, acted upon, and then tested. The time to perform these steps can vary wildly between companies, but can easily be 6 months once budget cycles are taken into account.

In the meantime we might find that our traffic predictions are in error. The best predictions in the world cannot cope with the convulsive changes that seem to occur on a regular basis in the Internet. For instance, the introduction of peer-to-peer networking both increased traffic volumes dramatically in a very short time frame, and changed the structure of this traffic (peer-to-peer traffic is more symmetric than the previously dominant client–server model). YouTube again reset providers’ expectations for traffic. The result will be a suboptimal network, in some cases leading to congestion.
As noted, we cannot simply redesign the network, but we can often alleviate congestion by better balancing loads. This process, called traffic engineering (or just load balancing), allows us to adapt the network on shorter time scales than capacity planning. It is quite possible to manually intervene in a network’s traffic engineering on a daily basis. Even finer time scales are possible in principle if traffic engineering is automated, but this is uncommon at present because there is doubt about the desirability of frequent changes in routing. Each change to routing protocols can require a reconvergence, and can lead to dropped packets. More importantly, if such automation is not very carefully controlled it can become unstable, leading to oscillations and very poor performance.

The Traffic Engineering (TE) problem is very similar to the network design problem. The goal or optimization objective is often closely related to that in design. The constraints are usually similar. The major difference is in the planning horizon (typically days to weeks), and as a result the variables over which we have control. The restriction imposed by the planning horizon for TE is that we cannot change the network hardware: the routers and links between them are fixed. However, we can change the way packets are routed through the network, and we can use this to rebalance the traffic across the existing network links.

There are two methods of TE that are most commonly talked about. The most often mentioned uses MultiProtocol Label Switching (MPLS) [54], by which we can arbitrarily tunnel traffic across almost any set of paths in our network. Finding a general routing minimizing max-utilization is an instance of the classical multi-commodity flow problem, which can be formulated as a linear program [6, Chapter 17], and is hence solvable using commonly available tools.
We shall not spend much time on MPLS TE, because there is sufficient literature already (for instance, see [19, 36]). We shall instead concentrate on a simpler, less well known, and yet almost as powerful method for TE.

Remember that we earlier argued that shortest-geographic paths made sense for network routing. In fact, shortest-path routing does not need to be based on geographic distances. Most modern Interior Gateway Protocols allow administratively defined distances (for instance, Open Shortest Path First (OSPF) [42] and Intermediate System-Intermediate System (IS-IS) [14]). By tweaking these distances we can improve network performance. Making a link distance smaller makes the link more “attractive”, and so routes more traffic on it. Making the distance longer can remove traffic. Configurable link weights can be used, for example, to direct traffic away from expensive (e.g., satellite) links.

However, we can formulate the TE problem more systematically. Let us consider a shortest-path protocol with administratively configured link weights (the link distances) $w_e$ on each link $e$. We assume that the network is given (i.e., we know its link locations and capacities), and that the variables that we can control are the link weights. Our objective is to minimize the congestion on our network. Several metrics can be used to describe congestion. Network-wide metrics such as that proposed in [25, 26] can have advantages, but we use the common metric of maximum utilization here for its simplicity.

In many cases, there are additional “human” constraints on the weights we can use in the above optimization. For instance, we may wish that the resulting weights do not change “too much” from our existing weights. Each change requires reconfiguration of a router, and so reducing the number of changes with respect to the existing routing may be important.
Likewise, the existing weights are often chosen not just for the sake of distance, but also to make the network conceptually simpler. For instance, we might choose smaller weights inside a “region” and large weights between regions, where the regions have some administrative (rather than purely geographical) significance. In this case, we may wish to preserve the general features of the routing, while still fine-tuning the routes. We can express these constraints in various ways, but we do so below by setting minimum and maximum values for the weights. Then the optimization problem can be written: choose the weights $w$, such that we

$$\text{minimize } \max_{e \in E} \, y_e / c_e \quad \text{such that } A x = y, \quad w_e^{\min} \le w_e \le w_e^{\max}, \ \forall e \in E, \tag{5.28}$$

where $A$ is the routing matrix generated by shortest-path routing given by link weights $w_e$, and the link utilizations are given by $y_e / c_e$ (the link load divided by its capacity). The $w_e^{\min}$ and $w_e^{\max}$ constrain the weights for each link into a range determined by existing network policies (perhaps within some bound of the existing weights). Additional constraints might specify the maximum number of weights we are allowed to change, or require that link weights be symmetric, i.e., $w_{(i,j)} = w_{(j,i)}$.

The problem is in general NP-hard, so it is nontrivial to find a solution. Over the years, many heuristic methods [12, 20, 25, 26, 37, 41, 53] have been developed for the solution of this problem. The exciting feature of this approach is that it is very simple. It uses standard IP routing protocols, with no enhancements other than the clever choice of weights. One might believe that the catch is that it cannot achieve the same performance as full MPLS TE. However, the performance of the above shortest-path optimization has been shown on real networks to suffer only by a few percent [59, 60], and importantly, it has been shown to be more robust to errors in the input traffic matrices than MPLS optimization [60].
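The objective in (5.28) can be evaluated for any candidate weight setting by routing the demands on shortest paths and taking the worst link utilization. The sketch below does this for a toy network; it is an illustration under stated assumptions, not one of the published heuristics: each demand follows a single shortest path (no equal-cost multipath splitting), links are directed, and the topology, weights, and demands are hypothetical.

```python
import heapq


def shortest_path(links, weights, src, dst):
    """Dijkstra over directed links; returns the list of links on one shortest path."""
    adj = {}
    for (u, v) in links:
        adj.setdefault(u, []).append(v)
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == dst:
            break
        for v in adj.get(u, []):
            nd = d + weights[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        raise ValueError(f"no route from {src} to {dst}")
    path, node = [], dst
    while node != src:
        path.append((prev[node], node))
        node = prev[node]
    return path[::-1]


def max_utilization(links, weights, demands):
    """Objective of (5.28): max over links of load / capacity under shortest-path routing."""
    load = {e: 0.0 for e in links}
    for (s, t), volume in demands.items():
        for e in shortest_path(links, weights, s, t):
            load[e] += volume
    return max(load[e] / links[e] for e in links)


# Hypothetical network: capacities per directed link, and two demands towards D.
links = {("A", "B"): 10, ("B", "D"): 10, ("A", "C"): 10, ("C", "D"): 10}
demands = {("A", "D"): 6, ("B", "D"): 6}

w_before = {("A", "B"): 1, ("B", "D"): 1, ("A", "C"): 2, ("C", "D"): 2}
w_after = dict(w_before)
w_after[("A", "B")] = 5   # make A-B less attractive so A->D moves to the path via C

util_before = max_utilization(links, w_before, demands)
util_after = max_utilization(links, w_after, demands)
print(util_before, util_after)
```

A weight-search heuristic would call an evaluation of this kind repeatedly while perturbing the weights within their allowed range; here a single weight change moves the A-to-D demand off the shared link and brings the maximum utilization from above capacity to well below it.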
This type of robustness is critical to real implementations. Moreover, the approach can be used to generate a set of weights that work well over the whole day (despite variations in the TM over the day) [60], or that can help alleviate congestion in the event of a link failure [44], a problem that we shall consider in more detail in the following section.

5.6 Robust Planning

A common concern in network planning is the consequence of mistakes. Traffic matrices used in our optimizations may contain errors due to measurement artifacts, sampling, inference, or predictions. Furthermore, there may be inconsistencies between our planned network design and the actual implementation, through misconfiguration or last-minute changes in constraints. There may be additional inconsistencies introduced through the failure of invariance in TMs used as inputs, for example, caused by congestion alleviation in the new network.

Robust planning is the process of acknowledging these flaws, and still designing good networks. The key to robustness is the cyclic approach described in Section 5.1: measure → predict → plan → and then measure again. However, with some thought, this process can be made tighter. We have already seen one example of this through TE, where a short-term alteration in routing is used to counter errors in predicted traffic. In this section we shall also consider some useful additions to our kitbag of robust planning tools.

5.6.1 Verification Measurements

One of the most common sources of network problems is misconfiguration. Extreme cases of misconfigurations that cause actual outages are relatively obvious (though still time-consuming to fix). However, misconfigurations can also result in more subtle problems. For instance, a misconfigured link weight can mean that traffic takes unexpected paths, leading to delays or even congestion.

One of the key steps to network planning is to ensure that the network we planned is the one we observe. Various approaches have been used for router configuration validation: these are considered in more detail in Chapter 9. In addition, we recommend that direct measurements of the network routing, link loads, and performance be made at all times.

Routing can be measured through mechanisms such as those discussed in Section 5.2 and in more detail in Chapter 11. When performed from edge node to edge node, we can use such measurements to confirm that traffic is taking the routes we intended it to take in our design.

By themselves, routing measurements only confirm the direction of traffic flows. Our second requirement is to measure link traffic to ensure that it remains within the bounds we set in our network design. Unexpected traffic loads can often be dealt with by TE, but only once we realize that there is a problem.

Finally, we must always measure performance across our network. In principle, the above measurements are sufficient, i.e., we might anticipate that a link is congested only if traffic exceeds the capacity. However, in reality, the typical SNMP measurements used to measure traffic on links are 5-min averages. Congestion can occur on smaller time scales, leading to brief, but nonnegligible packet losses that may not be observable from traffic measurements alone. We aim to reduce these through choice of SOP, but note that this choice is empirical in itself, and an accurate choice relies on feedback from performance measurements.

Moreover, other components of a network have been known to cause performance problems even on a lightly loaded network. For instance, such measurements allowed us to discover and understand delays in routing convergence times [32, 61], and that during these periods bursts of packet loss would occur, from which improvements to Interior Gateway Protocols have been made [27]. The importance of the problem would never have been understood without performance measurements.
Such measurements are discussed in more detail in Chapter 10.

5.6.2 Reliability Analysis

IP networks and the underlying SONET/WDM strata on which they run are often managed by different divisions of a company, or by completely different companies. In our planning stages, we would typically hope for joint design between these components, but the reality is that the underlying physical/optical networks are often multiuse, with IP as one of several customers (either externally or internally) that use the same infrastructure. It is often hard to prescribe exactly which circuits will carry a logical IP link. Therefore, it is hard in some cases to determine, prior to implementation, exactly what SRGs exist. We may insist, in some cases, that links are carried over separate fibers, or even purchase leased lines from separate companies, but even in these cases great care should be taken. For instance, it was only during the Baltimore train tunnel fire (2001) [4] that it was discovered that several providers ran fiber through the same tunnel.

Our earlier network plan can only accommodate planned network failure scenarios. In robust planning, we must somehow accommodate the SRGs that have arisen in the implementation of our planned network. The first step, obviously, is to determine the SRGs. The required data mapping IP links to physical infrastructure is often stored in multiple databases, but with care it is possible to combine them to obtain a list of SRGs.

Once we have a complete list of failure scenarios we could go through the planning cycle again, but as noted, the time horizon for this process would leave our network vulnerable for some time. The first step therefore is to perform a network reliability analysis. This is a simple process of simulating each failure scenario, and assessing whether the network has sufficient capacity, i.e., whether $A_i x \le \psi_F c$, where $A_i$ is the routing matrix under failure scenario $i$ and $\psi_F$ is the SOP factor allowed under failures. If this condition is already satisfied, then no action needs to be taken.
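The reliability analysis just described amounts to one matrix-vector product and one comparison per failure scenario. The following minimal sketch assumes that the routing matrix for each scenario has already been computed (for example, by a routing simulation); the scenario names, matrices, and the 85% failure SOP are hypothetical.

```python
def violated_links(A_i, x, c, sop_failure):
    """Indices of links whose load under routing A_i exceeds sop_failure * capacity."""
    loads = [sum(a * xj for a, xj in zip(row, x)) for row in A_i]
    return [e for e, (y, cap) in enumerate(zip(loads, c)) if y > sop_failure * cap]


def reliability_analysis(scenarios, x, c, sop_failure):
    """Map each failure scenario that violates its capacity constraint to the overloaded links."""
    report = {}
    for name, A_i in scenarios.items():
        bad = violated_links(A_i, x, c, sop_failure)
        if bad:
            report[name] = bad
    return report


# Hypothetical network: three links of capacity 10, two demands.
capacities = [10, 10, 10]
traffic = [5, 4]
scenarios = {
    # Rows are links, columns are demands; a failed link carries nothing.
    "link 0 fails": [[0, 0], [1, 1], [1, 0]],  # demand 0 rerouted over links 1 and 2
    "link 2 fails": [[1, 0], [0, 1], [0, 0]],  # no rerouting needed
}
report = reliability_analysis(scenarios, traffic, capacities, sop_failure=0.85)
print(report)
```

Only the first scenario appears in the report: the rerouted demand pushes link 1 to a load of 9 against a limit of 8.5, which identifies where expedited capacity or a routing change would be needed.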
However, where the condition is violated, we must take one of two actions. The most obvious approach to deal with a specific vulnerability is to expedite an increase in capacity. It is often possible to reduce the planning horizon for network changes at an increased cost. Where small changes are needed, this may be viable, but it is clearly not satisfactory to try to build the whole network in this way.

The second alternative is to once again use traffic engineering. MPLS provides mechanisms to create failover paths; however, it does not tell you where to route these to ensure that congestion does not occur. Some additional optimization and control is needed. However, we cannot do this after the failure, or recovery will take an unacceptable amount of time. Likewise, it is impractical in today’s networks to change link weights in response to failures. However, previous studies have shown that shortest-path link weight optimization can be used to provide a set of weights that will alleviate congestive effects under failures [44], and such techniques have (anecdotally) been used in large networks with success.

5.6.3 Robust Optimization

The fundamental issue we deal with is “Given that I have errors in my data, how should I perform optimization?” Not all the news is bad. For instance, once we acknowledge that our data is not perfect, we realize that finding the mathematically optimal solution for our problem is not needed. Instead, heuristic solutions that find a near-optimal solution will be just as effective. This chapter is not principally concerned with optimization, and so we will not spend a great deal of time on specific algorithms, but note that once we decide that heuristic solutions will be sufficient, several meta-heuristics such as genetic algorithms and simulated annealing become attractive.
They are generally easy to program, and very flexible, and so allow us to use more complex constraints and optimization objective functions than we might otherwise have chosen. For instance, it becomes easy to incorporate the true link costs, and technological constraints on available capacities.

The other key aspect to optimization in network planning directly concerns robustness. We know there are errors in our measurements and predictions. We can save much time and effort in planning if we accommodate some notion of these errors in our optimization. A number of techniques for such optimization have been proposed, including oblivious routing [8] and Valiant network design [69, 70]. These papers present methods to design a network and/or its routing so that it will work well for any arbitrary traffic matrix. However, this is perhaps going too far. In most cases we do have some information about possible traffic whose use is bound to improve our network design.

A simple approach is to generate a series of possible traffic matrices by adding random noise to our predicted matrix, i.e., by taking $x_i = x + e_i$, for $i = 1, 2, \ldots, M$. Where sufficient historical data exist, the noise terms $e_i$ should be generated in such a way as to model the prediction errors. We can then optimize against the set of TMs, i.e.,

$$\text{minimize } \sum_{e \in E} \big[ \alpha_e I(c_e > 0) + \gamma c_e \big] \quad \text{such that } A x_i \le \psi c, \ \forall i = 1, 2, \ldots, M. \tag{5.29}$$

Once again this can increase the number of constraints dramatically, particularly in combination with reliability constraints, unless we realize that again many of these constraints will be redundant, and can be pruned by preprocessing.

The above approach is somewhat naive. The size of the set of TMs to use is not obvious. Also we lack guidance about the choice we should make for the SOP factor $\psi$. In principle, we already accommodate variations explicitly in the above optimization and so we might expect $\psi = 1$.
However, as before we need an SOP factor $\psi < 1$ to accommodate inter-measurement time interval variations in traffic, though the choice should be different than in past problems. Moreover, there may be better robust optimization strategies that can be applied in the future. For instance, robust optimization has been applied to the traffic engineering problem in [65], where the authors introduce the idea of COPE (Common-case Optimization with a Penalty Envelope), where the goal is to find the optimal routing for a predicted TM, and to ensure that the routing will not be “too bad” if there are errors in the prediction.

5.6.4 Sensitivity Analysis

Even where we believe that our optimization approach is robust, we must test this hypothesis. We can do so by performing a sensitivity analysis. The standard approach in such an analysis is to vary the inputs and examine the impact on the outputs. We can vary each possible input to detect robustness to errors in this input, though the most obvious to test is sensitivity to variations in the underlying traffic matrix. We can test such sensitivity by considering the link loads under a set of TMs generated, as before, by adding prediction errors, i.e., $x_i = x + e_i$, for $i = 1, 2, \ldots, M$, and then simply calculating the link loads $y_i = A x_i$.

There is an obvious relationship to robust optimization, in that we should not be testing against the same set of matrices against which we optimized. Moreover, in sensitivity analysis it is common to vary the size of the errors. However, simple linear algebra allows us to reduce the problem to a fixed load component $y = Ax$ and a variable component $w_i = A e_i$, which scales linearly with the size of the errors, and which can be used to see the impact of errors in the TM directly.

5.7 Summary

“Reliability, reliability, reliability” is the mantra of good network operators. Attaining reliability costs money, but few companies can afford to waste millions of dollars on an inefficient network.
This chapter is aimed at demonstrating how we can use robust network planning to attain efficient but reliable networks, despite the imprecision of measurements, uncertainties of predictions, and general vagaries of the Internet.

Reliability should mean more than connectivity. Network performance measured in packet delay or loss rates is becoming an important metric for customers deciding between operators. Network design for reliability has to account for possible congestion caused by link failures. In this chapter we consider methods for designing networks where performance is treated as part of reliability.

The methodology proposed here is built around a cyclic approach to network design exemplified in Fig. 5.1. The process of measure → analyze/predict → control → validate should not end, but rather, validation measurements are fed back into the process so that we can start again. In this way, we attain some measure of robustness to the potential errors in the process. However, the planning horizon for network design is still quite long (typically several months) and so a combination of techniques such as traffic engineering are used at different time scales to ensure robustness to failures in predicted behavior. It is the combination of this range of techniques that provides a truly robust network design methodology.

Acknowledgment

This work was informed by the period M. Roughan was employed at AT&T Research, and the author owes his thanks to researchers there for many valuable discussions on these topics. M. Roughan would also like to thank the Australian Research Council from whom he receives support, in particular through grant DP0665427.

References

1. Python routing toolkit (‘pyrt’). Retrieved from http://ipmon.sprintlabs.com/pyrt/.
2. Ripe NCC: routing information service. Retrieved from http://www.ripe.net/projects/ris/.
3. University of Oregon Route Views Archive Project. Retrieved from www.routeviews.org.
4. CSX train derailment. Nanog mailing list.
Retrieved July 18, 2001 from http://www.merit.edu/mail.archives/nanog/2001-07/msg00351.html.
5. Abilene/Internet2. Retrieved from http://www.internet2.edu/observatory/archive/datacollections.html#netflow.
6. Ahuja, R. K., Magnanti, T. L., & Orlin, J. B. (1993). Network flows: Theory, algorithms, and applications. Upper Saddle River, NJ: Prentice Hall.
7. Alderson, D., Chang, H., Roughan, M., Uhlig, S., & Willinger, W. (2006). The many facets of Internet topology and traffic. Networks and Heterogeneous Media, 1(4), 569–600.
8. Applegate, D., & Cohen, E. (2003). Making intra-domain routing robust to changing and uncertain traffic demands: Understanding fundamental tradeoffs. In ACM SIGCOMM (pp. 313–324), Karlsruhe, Germany.
9. Box, G. E. P., & Draper, N. R. (2007). Response surfaces, mixtures and ridge analysis (2nd ed.). New York: Wiley.
10. Brockwell, P., & Davis, R. (1987). Time series: Theory and methods. New York: Springer.
11. Brutlag, J. D. (2000). Aberrant behavior detection and control in time series for network monitoring. In Proceedings of the 14th Systems Administration Conference (LISA 2000), New Orleans, LA, USA, USENIX.
12. Buriol, L. S., Resende, M. G. C., Ribeiro, C. C., & Thorup, M. (2002). A memetic algorithm for OSPF routing. In Proceedings of the 6th INFORMS Telecom (pp. 187–188).
13. Cahn, R. S. (1998). Wide area network design. Los Altos, CA: Morgan Kaufmann.
14. Callon, R. (1990). Use of OSI IS-IS for routing in TCP/IP and dual environments. Network Working Group, Request for Comments: 1195.
15. Chekuri, C., & Khanna, S. (2004). On multidimensional packing problems. SIAM Journal of Computing, 33(4), 837–851.
16. Duffield, N., & Lund, C. (2003). Predicting resource usage and estimation accuracy in an IP flow measurement collection infrastructure. In ACM SIGCOMM Internet Measurement Conference, Miami Beach, Florida, October 2003.
17. Duffield, N., Lund, C., & Thorup, M. (2004).
Flow sampling under hard resource constraints. SIGMETRICS Performance Evaluation Review, 32(1), 85–96. 18. Coffman, J. E. G., Garey, M. R., & Johnson, D. S. (1997). Approximation algorithms for bin packing: A survey. In D. Hochbaum (Ed.), Approximation algorithms for NP-hard problems. Boston: PWS Publishing. 19. Elwalid, A., Jin, C., Low, S. H., & Widjaja, I. (2001). MATE: MPLS adaptive traffic engineering. In INFOCOM (pp. 1300–1309). 20. Ericsson, M., Resende, M., & Pardalos P. (2002). A genetic algorithm for the weight setting problem in OSPF routing. Journal of Combinatorial Optimization, 6(3), 299–333. 21. Erramilli, V., Crovella, M., & Taft, N. (2006). An independent-connection model for traffic matrices. In ACM SIGCOMM Internet Measurement Conference (IMC06), New York, NY, USA, ACM (pp. 251–256). 22. Feldmann, A., Greenberg, A., Lund, C., Reingold, N., & Rexford, J. (2000). Netscope: Traffic engineering for IP networks. IEEE Network Magazine, 14(2), 11–19. 23. Feldmann, A., Greenberg, A., Lund, C., Reingold, N., Rexford, J., & True, F. (2001). Deriving traffic demands for operational IP networks: Methodology and experience. IEEE/ACM Transactions on Networking, 9, 265–279. 24. Feldmann, A., & Rexford, J. (2001). IP network configuration for intradomain traffic engineering. IEEE Network Magazine, 15(5), 46–57. 25. Fortz, B., & Thorup, M. (2000). Internet traffic engineering by optimizing OSPF weights. In Proceedings of the 19th IEEE Conference on Computer Communications (INFOCOM) (pp. 519–528). 26. Fortz, B., & Thorup, M. (2002). Optimizing OSPF/IS-IS weights in a changing world. IEEE Journal on Selected Areas in Communications, 20(4), 756–767. 27. Francois, P., Filsfils, C., Evans, J., & Bonaventure, O. (2005). Achieving sub-second IGP convergence in large IP networks. SIGCOMM Computer Communication Review, 35(3), 35–44. 28. Garey, M., Graham, R., Johnson, D., & Yao, A. (1976). Resource constrained scheduling as generalized bin packing. 
Journal of Combinatorial Theory A, 21, 257–298. 29. Hansen, P. C. (1997). Rank-deficient and discrete ill-posed problems: Numerical aspects of linear inversion. Philadelphia, PA: SIAM. 30. Iannaccone, G., Chuah, C.-N., Mortier, R., Bhattacharyya, S., & Diot, C. (2002). Analysis of link failures over an IP backbone. In ACM SIGCOMM Internet Measurement Workshop, Marseilles, France, November 2002. 176 M. Roughan 31. Kowalski, J., & Warfield, B. (1995). Modeling traffic demand between nodes in a telecommunications network. In ATNAC’95. 32. Labovitz, C., Ahuja, A., Bose, A., & Jahanian, F. (2000). Delayed Internet routing convergence. In Proceedings of ACM SIGCOMM. 33. Lakhina, A., Crovella, M., & Diot, C. (2004). Characterization of network-wide anomalies in traffic flows. In ACM SIGCOMM Internet Measurement Conference, Taormina, Sicily, Italy. 34. Lakhina, A., Crovella, M., & Diot, C. (2004). Diagnosing network-wide traffic anomalies. In ACM SIGCOMM. 35. Lakhina, A., Papagiannaki, K., Crovella, M., Diot, C., Kolaczyk, E. D., & Taft, N. (2004). Structural analysis of network traffic flows. In ACM SIGMETRICS/Performance. 36. Lakshman, U., & Lobo, L. (2006). MPLS traffic engineering. Cisco Press. Available from http://www.ciscopress.com/articles/article.asp?p=426640, 2006. 37. Lin, F., & Wang, J. (1993). Minimax open shortest path first routing algorithms in networks supporting the SMDS services. In Proceedings of the IEEE International Conference on Communications (ICC), 2, 666–670. 38. Maltz, D., Xie, G., Zhan, J., Zhang, H., Hjalmtysson, G., & Greenberg, A. (2004). Routing design in operational networks: A look from the inside. In ACM SIGCOMM, Portland, OR, USA. 39. Mauro, D. R., & Schmidt, K. J. (2001) Essential SNMP. Sabastopol, CA: O’Reilly. 40. Maxemchuk, N. F., Ouveysi, I., & Zukerman, M. (2000). A quantitative measure for comparison between topologies of modern telecommunications networks. In IEEE Globecom. 41. Mitra, D., & Ramakrishnan, K. G. (1999). 
A case study of multiservice, multipriority traffic engineering design for data networks. In Proceedings of the IEEE GLOBECOM (pp. 1077–1083). 42. Moy, J. T. (1998). OSPF version 2. Network Working Group, Request for comments: 2328, April 1998. 43. Norros, I. (1994). A storage model with self-similar input. Queueing Systems, 16, 387–396. 44. Nucci, A., & Papagiannaki, K. (2009) Design, measurement and management of large-scale IP networks. New York: Cambrigde University Press. 45. Odlyzko, A. M. (2003). Internet traffic growth: Sources and implications. In B. B. Dingel, W. Weiershausen, A. K. Dutta, & K.-I. Sato (Eds.), Optical transmission systems and equipment for WDM networking II (Vol. 5247, pp. 1–15). Proceedings of SPIE. 46. Oetiker, T. MRTG: The multi-router traffic grapher. Available from http://oss.oetiker.ch/mrtg//. 47. Oetiker, T. RRDtool. Available from http://oss.oetiker.ch/rrdtool/. 48. Paxson, V. (2004). Strategies for sound Internet measurement. In ACM Sigcomm Internet Measurement Conference (IMC), Taormina, Sicily, Italy. 49. Potts, R. B., & Oliver, R. M. (1972). Flows in transportation networks. New York: Academic Press. 50. Pyhnen, P. (1963). A tentative model for the volume of trade between countries. Weltwirtschaftliches Archive, 90, 93–100. 51. Qiu, L., Yang, Y. R., Zhang, Y., & Shenker, S. (2003). On selfish routing in internet-like environments. In ACM SIGCOMM (pp. 151–162). 52. Qui, L., Zhang, Y., Roughan, M., & Willinger, W. (2009). Spatio-Temporal Compressive Sensing and Internet Traffic Matrices”, Yin Zhang, Matthew Roughan, Walter Willinger, and Lili Qui, ACM Sigcomm, pp. 267–278, Barcellona, August 2009. 53. Ramakrishnan, K., & Rodrigues, M. (2001). Optimal routing in shortest-path data networks. Lucent Bell Labs Technical Journal, 6(1), 117–138. 54. Rosen, E. C., Viswanathan, A., & Callon, R. (2001). Multiprotocol label switching architecture. Network Working Group, Request for Comments: 3031, 2001. 55. Roughan, M. (2005). 
Simplifying the synthesis of Internet traffic matrices. ACM SIGCOMM Computer Communications Review, 35(5), 93–96. 56. Roughan, M., & Gottlieb, J. (2002). Large-scale measurement and modeling of backbone Internet traffic. In SPIE ITCOM, Boston, MA. 57. Roughan, M., Greenberg, A., Kalmanek, C., Rumsewicz, M., Yates, J., & Zhang, Y. (2003). Experience in measuring Internet backbone traffic variability: Models, metrics, measurements and meaning. In Proceedings of the International Teletraffic Congress (ITC-18) (pp. 221–230). 5 Robust Network Planning 177 58. Roughan, M., Sen, S., Spatscheck, O., & Duffield, N. (2004). Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification. In ACM SIGCOMM Internet Measurement Workshop (pp. 135–148). Taormina, Sicily, Italy. 59. Roughan, M., Thorup, M., & Zhang, Y. (2003). Performance of estimated traffic matrices in traffic engineering. In ACM SIGMETRICS (pp. 326–327). San Diego, CA. 60. Roughan, M., Thorup, M., & Zhang, Y. (2003). Traffic engineering with estimated traffic matrices. In ACM SIGCOMM Internet Measurement Conference (IMC) (pp. 248–258). Miami Beach, FL. 61. Shaikh, A., & Greenberg, A. (2001). Experience in black-box OSPF measurement. In Proceedings of the ACM SIGCOMM Internet Measurement Workshop (pp. 113–125). 62. Shaikh, A., & Greenberg, A. (2004). OSPF monitoring: Architecture, design and deployment experience. In Proceedings of the USENIX Symposium on Networked System Design and Implementation (NSDI). 63. Tinbergen, J. (1962). Shaping the world economy: Suggestions for an international economic policy. The Twentieth Century Fund. 64. Uhlig, S., Quoitin, B., Balon, S., & Lepropre, J. (2006). Providing public intradomain traffic matrices to the research community. ACM SIGCOMM Computer Communication Review, 36(1), 83–86. 65. Wang, H., Xie, H., Qiu, L., Yang, Y. R., Zhang, Y., & Greenberg, A. (2006). COPE: Traffic engineering in dynamic networks. In ACM SIGCOMM (pp. 99–110). 
66. Zhang, Y., Ge, Z., Roughan, M., & Greenberg, A. (2005). Network anomography. In Proceedings of the Internet Measurement Conference (IMC ’05), Berkeley, CA. 67. Zhang, Y., Roughan, M., Duffield, N., & Greenberg, A. (2003). Fast accurate computation of large-scale IP traffic matrices from link loads. In ACM SIGMETRICS (pp. 206–217). San Diego, CA. 68. Zhang, Y., Roughan, M., Lund, C., & Donoho, D. (2003). An information-theoretic approach to traffic matrix estimation. In ACM SIGCOMM (pp. 301–312). Karlsruhe, Germany. 69. Zhang-Shen, R., & McKeown, N. (2004). Designing a predictable Internet backbone. In HotNets III, San Diego, CA, November 2004. 70. Zhang-Shen, R., & McKeown, N. (2005). Designing a predictable Internet backbone with Valiant load-balancing. In Thirteenth International Workshop on Quality of Service (IWQoS), Passau, Germany, June 2005. Part III Interdomain Reliability and Overlay Networks Chapter 6 Interdomain Routing and Reliability Feng Wang and Lixin Gao 6.1 Introduction Routing as the “control plane” of the Internet plays a crucial role on the performance of data plane in the Internet. That is, routing aims to ensure that there are forwarding paths for delivering packets to their intended destinations. Routing protocols are the languages that individual routers speak in order to cooperatively achieve the goal in a distributed manner. The Internet routing architecture is structured in a hierarchical fashion. At the bottom level, an Autonomous System (AS) consists of a network of routers under a single administrative entity. Routing within an AS is achieved via an Interior Gateway Protocol (IGP) such as OSPF or IS-IS. At the top level, an interdomain routing protocol glues thousands of ASes together and plays a crucial role in the delivery of traffic across the global Internet. In this chapter, we provide an overview of the interdomain routing architecture and its reliability in maintaining global reachability. 
Border Gateway Protocol (BGP) is the current de facto standard for interdomain routing. As a path vector routing protocol, BGP requires each router to advertise only its best route for a destination to its neighbors. Each route includes attributes such as the AS path (the sequence of ASes to traverse to reach the destination) and local preference (indicating the preference order in selecting the best route). Rather than simply selecting the route with the shortest AS path, routers can apply complex routing policies (such as setting a higher local preference value for a route through a particular AS) to influence the best route selection, and to decide whether to propagate the selected route to their neighbors.

Although BGP is a simple path vector protocol, configuring BGP routing policies is quite complex. Each AS typically configures its routing policy according to its own goals, such as load-balancing traffic among its links, without coordinating with other networks. However, arbitrary policy configurations might lead to route divergence or persistent oscillation of the routing protocol. That is, although BGP allows flexibility in routing policy configuration, BGP itself does not guarantee routing convergence. Arbitrary policy configurations, whether unintentional mistakes or intentionally malicious configurations, can lead to persistent route oscillation [9, 11].

F. Wang, School of Engineering and Computational Sciences, Liberty University, e-mail: fwang@liberty.edu
L. Gao, Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, Amherst, MA 01002, USA, e-mail: lgao@ecs.umass.edu
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_6, © Springer-Verlag London Limited 2010

Besides being a policy-based routing protocol, BGP has many features that aim to scale to a large network such as the global Internet.
One feature is that BGP sends incremental updates upon routing changes rather than sending complete routing information. BGP speaking routers send new routes only when there are changes. Related to the incremental update feature, BGP uses a timer, referred to as the Minimum Route Advertisement Interval (MRAI) timer, to determine the minimum amount of time that must elapse between routing updates in order to limit the number of updates for each prefix. Therefore, BGP does not react to changes in topology or routing policy configuration immediately. Rather, it controls the frequency with which route changes can be made in order to avoid overloading router CPUs and to reduce route flapping. While MRAI timers can be effective in reducing routing update frequency, the slow reaction to changes can delay route convergence. More importantly, during the delayed route convergence process, routes among neighboring routers might be inconsistent. This can lead to transient routing loops or transient routing outages (referred to as transient routing failures) caused by the delay in discovering alternate routes.

The goal of this chapter is to provide an overview of BGP, to give practical guidelines for configuring BGP routing policy, and to offer a framework for understanding how undesirable routing states such as persistent routing oscillation and transient routing failures or loops can arise. We also present a methodology for measuring the extent to which these undesirable routing states can affect the quality of end-to-end packet delivery. We will further describe proposed solutions for reliable interdomain routing.

Toward this end, we outline this chapter as follows. We begin with an introduction to BGP in Section 6.2. We first describe the interdomain routing architecture, and then illustrate the details of how BGP enables ASes to exchange global reachability information and various BGP route attributes.
We further present routing policy configurations that enable each individual AS to meet its goals for traffic engineering or commercial agreements. In Section 6.3, we introduce multihoming technology. Multihoming allows an AS to have multiple connections to upstream providers in order to survive a single point of failure. We present various multihoming approaches, such as multihoming to multiple upstream providers or to a single upstream provider, to show the redundancy and load-balancing benefits associated with being multihomed. In Section 6.4, we highlight the limitations of BGP. For example, the protocol design does not guarantee that routing will converge to a stable route. We further show how incentive-compatible routing policies can prevent routing oscillation, and how transient routing failures or loops can occur even under incentive-compatible routing configuration or redundant underlying infrastructure.

Having understood the potential transient routing failures and routing loops, we describe in Section 6.5 a measurement methodology and measurement results that quantify the impact of transient routing failures and routing loops on end-to-end path performance. This illustrates how severely routing outages can affect the quality of packet delivery. In Section 6.6, we present a detailed overview of the existing solutions to achieve reliable interdomain routing. We show that both protocol extensions and routing policies can enhance the reliability of interdomain routing. Finally, we conclude the chapter by pointing out possible future research directions in Section 6.7.

6.2 Interdomain Routing

This section introduces the interdomain routing architecture, the interdomain routing protocol (BGP), and BGP routing policy configuration.

6.2.1 Interdomain Routing Architecture

The Internet consists of a large collection of hosts interconnected by networks of links and routers. The Internet is divided into thousands of ASes.
Examples range from college campuses and corporate networks to global Internet Service Providers (ISPs). An AS has its own routers and routing policies, and connects to other ASes to exchange traffic with remote hosts. A router typically has very detailed knowledge of the topology within its AS, and limited reachability information about other ASes. Figure 6.1 shows an example of the Internet topology, where there are large transit ISPs such as MCI or AT&T, and stub ASes, such as the University of Massachusetts’ network, which does not provide transit service to other ASes.

Fig. 6.1 An example topology of interconnection among Internet service providers and stub networks

Note that the topologies of the transit ISPs and stub ASes shown in this example are much simpler than those in reality. Typically, a large transit ISP consists of hundreds or thousands of routers. ASes interconnect at public Internet exchange points (IXPs) such as MAE-EAST or MAE-WEST, or over dedicated point-to-point links. Public exchange points typically consist of a shared medium such as a Gigabit Ethernet, or an ATM switch, that interconnects routers from several different ASes. Physical connectivity at the IXP does not necessarily imply that every pair of ASes exchanges traffic with each other. AS pairs negotiate contractual agreements that control the exchange of traffic. These relationships include provider-to-customer, peer-to-peer, and backup, and are discussed in more detail in Section 6.4.1.

Each AS has responsibility for carrying traffic to and from a set of customer IP addresses. The scalability of the Internet routing infrastructure depends on the aggregation of IP addresses in contiguous blocks, called prefixes, each consisting of a 32-bit IP address and a mask length (e.g., 1.2.3.0/24). An IP address is generally shown as four octets of numbers from 0 to 255 represented in decimal form.
The mask length is used to indicate the number of significant bits in the IP address. That is, a prefix aggregates all IP addresses that match the IP address in the significant bits. For example, prefix 1.2.3.0/24 represents all addresses between 1.2.3.0 and 1.2.3.255. An AS employs an intradomain routing protocol (IGP) such as OSPF or IS-IS to determine how to reach routers and networks within itself, and employs an interdomain routing protocol, i.e., Border Gateway Protocol (BGP) in the current Internet, to advertise the reachability of networks (represented as prefixes) to neighboring ASes.

6.2.2 IGP

Each AS uses an intradomain routing protocol or IGP for routing within the AS. There are two classes of IGP: (1) distance-vector and (2) link-state routing protocols. In distance-vector routing, every routing message propagated by a router to its neighbors contains the length of the shortest path to a destination. In link-state routing, every router learns the entire network topology along with the link costs. Then it computes the shortest path (or the minimum cost path) to each destination. When a network link changes state, a notification, called a link state advertisement (LSA), is flooded throughout the network. All routers note the change and recompute their routes accordingly.

6.2.3 BGP

The interdomain routing protocol, BGP, is the glue that pieces together the various diverse networks or ASes that comprise the global Internet today. It is used among ASes to exchange network reachability information. Each AS has one or more border routers that connect to routers in neighboring ASes, and possibly a number of internal BGP speaking routers. BGP is a path-vector routing protocol in which routers exchange the path used for reaching a destination. By including the path in the route update information, one can avoid loops by eliminating any path that traverses the same node twice.
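Since BGP advertises reachability per prefix, it may help to make the prefix notation of Section 6.2.1 concrete. The following is a minimal sketch using Python's standard `ipaddress` module; the helper names `prefix_range` and `covers` are our own and are chosen only for illustration.

```python
import ipaddress


def prefix_range(prefix: str):
    """Return (first address, last address, address count) of an IPv4 prefix."""
    net = ipaddress.ip_network(prefix)
    return str(net[0]), str(net[-1]), net.num_addresses


def covers(prefix: str, address: str) -> bool:
    """True if the address matches the prefix in its significant bits."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(prefix)


# The example prefix from the text: a mask length of 24 leaves 8 free bits.
print(prefix_range("1.2.3.0/24"))
print(covers("1.2.3.0/24", "1.2.3.77"))
print(covers("1.2.3.0/24", "1.2.4.1"))
```

With a mask length of 24, only the last octet varies, which is why the prefix spans 1.2.3.0 through 1.2.3.255.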
Using a path vector protocol, routers running BGP distribute reachability information about destinations (network prefixes) by sending route updates (containing route announcements or withdrawals) to their neighbors in an incremental manner. BGP constructs paths by successively propagating advertisements between pairs of routers that are configured as BGP peers. Each advertisement concerns a particular prefix and includes the list of ASes along the path (the AS path) to the network containing the prefix. By representing the path as the sequence of ASes to be traversed, BGP hides the details of the topology and routing information inside each AS. Before accepting an advertisement, the receiving router checks for the presence of its own AS number in the AS path to discard routes with loops. Upon receiving an advertisement, a BGP speaking router must decide whether or not to use this path and, if the path is chosen, whether or not to propagate the advertisement to neighboring ASes (after adding its own AS number to the AS path). BGP requires that a router advertise only its best route for each destination to its neighbors. A BGP speaking router withdraws an advertisement when the prefix is no longer reachable with this route, which may lead to a sequence of withdrawals by upstream ASes that are using this path.

When there is an event affecting a router’s best route to a destination, that router will compute a new best route and advertise the routing change to its neighbors. If the router no longer has any route to the destination, it will send a withdrawal message to neighbors for that destination. When an event causes a set of routers to lose their current routing information, the routing change will be propagated to other routers.
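The loop check and the prepending step described above can be sketched in a few lines. This is an illustrative simplification, not a BGP implementation: it ignores routing policy and best-route selection, and the function name `receive` and the tuple representation of an AS path are assumptions made for the example.

```python
from typing import Optional, Tuple

ASPath = Tuple[int, ...]


def receive(own_as: int, as_path: ASPath) -> Optional[ASPath]:
    """Process an advertisement at AS `own_as`.

    Returns the AS path to re-advertise (own AS number prepended), or None
    if the advertisement is discarded because own_as already appears in it.
    """
    if own_as in as_path:
        return None  # loop detected: the path already traverses this AS
    return (own_as,) + as_path


# AS 0 originates the prefix, so its neighbors see the AS path (0,).
via_1 = receive(1, (0,))    # AS 1 accepts and would advertise (1, 0)
via_2 = receive(2, via_1)   # AS 2 accepts and would advertise (2, 1, 0)
looped = receive(1, via_2)  # AS 1 finds its own number and discards
print(via_1, via_2, looped)
```

Because the whole path travels with the advertisement, each AS can detect a loop locally, without knowing anything about the topology inside other ASes.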
To limit the number of updates that a router has to process within a short time period, a rate-limiting timer, called the Minimum Route Advertisement Interval (MRAI) timer, determines the minimum amount of time that must elapse between routing updates to a neighbor [26]. This has the potential to reduce the number of routing updates, as a single routing change might trigger multiple transient routes during the path exploration or route convergence process before the final stable route is determined. If new routes are selected multiple times while waiting for the expiration of the MRAI timer, the latest selected route is advertised when the timer expires. To avoid prolonged loss of connectivity, RFC 4271 [26] specifies that the MRAI timer is applied only to BGP announcements, not to explicit withdrawals. However, some router implementations might apply the MRAI timer to both announcements and withdrawals.

BGP sessions can be established between router pairs in the same AS (referred to as iBGP sessions) or in different ASes (referred to as eBGP sessions). Figure 6.2 illustrates examples of iBGP and eBGP sessions. Each BGP speaking router originates updates for one or more prefixes, and can send the updates to its immediate neighbors via an iBGP or eBGP session.

Fig. 6.2 Internal BGP (iBGP) versus external BGP (eBGP)

iBGP sessions are established between routers in the same AS in order for the routers to exchange routes learned from other ASes. In the simplest case, each router has an iBGP session with every other router (i.e., fully meshed iBGP configuration).
In the fully meshed iBGP configuration, a route received from an iBGP router cannot be sent to another iBGP speaking router, since a route via an iBGP peer should be received directly from that iBGP peer. In practice, an AS with hundreds or thousands of routers may need to improve scalability by using route reflectors to avoid a fully meshed iBGP configuration. This optimization is intended to reduce iBGP traffic without affecting the routing decision. Each route reflector and its clients (i.e., iBGP neighbors that are not route reflectors themselves) form a cluster. Figure 6.3 shows an example of a route reflector cluster, where cluster 1 contains route reflector RR1 and its three clients. Typically, route reflectors and their clients are located in the same facility, e.g., in the same Point of Presence (PoP). Route reflectors themselves are fully meshed. For example, in Fig. 6.3, the three route reflectors RR1, RR2, and RR3 are fully meshed. A route reflector selects the best route among the routes learned via clients in the cluster, and sends the best route to all other clients in the cluster except the one from which the best route is learned, as well as to all other route reflectors. Similarly, it also reflects routes learned from other route reflectors to all of its own clients.

Fig. 6.3 An example of route reflector configuration for scaling iBGP

Fig. 6.4 Import policies, route selection, and export policies

6.2.4 Routing Policy and Route Selection Process

The simplest routing policy is the shortest AS path routing, where each AS selects a route with the shortest AS path. BGP, however, allows much more flexible routing policies than the shortest AS path routing.
An AS can favor a path with a longer AS path length by assigning a higher local preference value. BGP also allows an AS to send a hint to a neighbor on the preference that should be given to a route by using the community attribute. BGP also enables an AS to control how traffic enters its network by assigning a different multiple exit discriminator (MED) value to the advertisements it sends on each link to a neighboring AS. Otherwise, the neighboring AS would select the link based on the link cost within its own intradomain routing protocol. An AS can also discourage traffic from entering its network by performing AS prepending, which inflates the length of the AS path by listing an AS number multiple times.

Processing an incoming BGP update involves three steps, as shown in Fig. 6.4:

1. Import policies that decide which routes to consider
2. Path selection that decides which route to use
3. Export policies that decide whether (and what) to advertise to a neighboring AS

An AS can apply both implicit and explicit import policies. Every eBGP peering session has an implicit import policy that discards a routing update when the receiving BGP speaker’s AS already appears in the AS path; this is essential to avoid introducing a cycle in the AS path. The explicit import policy includes denying or permitting an update, and assigning a local-preference value. For example, an explicit import policy could assign a local preference of 100 if a particular AS appears in the AS path, or deny any update that includes AS 2 in the path. After applying the import policies for a route update from an eBGP session, each BGP speaking router then follows a route selection process that picks the best route for each prefix, which is shown in Table 6.1.

Table 6.1 Steps in the BGP path selection process
1. Highest local preference
2. Shortest AS path
3. Lowest origin type
4. Smallest MED
5. Smallest IGP path cost to egress router
6. Smallest next-hop router id
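The ordering in Table 6.1 behaves like a lexicographic comparison, which can be sketched as a sort key. This is a simplified illustration under stated assumptions: the `Route` record and its field names are hypothetical, router ids are represented as plain integers, and MED is compared across all routes, whereas deployed routers typically compare MED only among routes learned from the same neighboring AS.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Route:
    local_pref: int          # step 1: higher is preferred
    as_path: Tuple[int, ...]  # step 2: shorter is preferred
    origin: int              # step 3: lower is preferred (0 = IGP, 1 = EGP, 2 = incomplete)
    med: int                 # step 4: smaller is preferred
    igp_cost: int            # step 5: smaller cost to the egress router is preferred
    next_hop_id: int         # step 6: smaller router id is preferred


def preference_key(route: Route):
    """Tuple whose natural ordering follows the six steps of Table 6.1."""
    return (-route.local_pref, len(route.as_path), route.origin,
            route.med, route.igp_cost, route.next_hop_id)


def best_route(routes: Iterable[Route]) -> Route:
    """Pick the most preferred route; later fields only break earlier ties."""
    return min(routes, key=preference_key)


# Local preference overrides AS-path length.
long_but_preferred = Route(100, (2, 5, 0), 0, 0, 10, 1)
short_but_less_preferred = Route(90, (1, 0), 0, 0, 5, 2)
print(best_route([long_but_preferred, short_but_less_preferred]))

# With steps 1-4 tied, the smaller IGP cost to the egress router decides.
via_c1 = Route(100, (1, 0), 0, 0, 8, 1)
via_c2 = Route(100, (2, 0), 0, 0, 9, 2)
print(best_route([via_c1, via_c2]))
```

Encoding the steps as one tuple keeps the precedence explicit: a later attribute is consulted only when every earlier attribute is equal.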
The BGP speaking router picks the route with the highest local preference, breaking ties by selecting the route with the shortest AS path. Note that local preference overrides the AS-path length. Among the remaining routes, the BGP speaking router picks the one with the smallest MED, breaking ties by selecting the route with the smallest cost to the BGP speaking router that passes the route via an iBGP session. Note that, since the tie-breaking process draws on intradomain cost information, two BGP speaking routers in the same AS may select different best routes for the same prefix. If a tie still exists, the BGP speaking router picks the route with the smallest next-hop router ID.

Each BGP speaking router sends only its best route (one best route for each prefix) via BGP sessions, including eBGP and iBGP sessions. The BGP speaking router applies implicit and explicit export policies on each eBGP session to a neighboring BGP speaker. Each BGP speaking router applies an implicit policy that sets MED to default values, assigns the next hop to the interface that connects the BGP session, and prepends the AS number of the BGP speaking router to the AS path. Explicit export policies include permitting or denying the route, assigning MED, assigning a community set, and prepending the AS number one or more times to the AS path. For example, an AS could prepend its AS number several times to the AS path for a prefix.

Although the BGP route selection process aims to select routes based mostly on BGP attributes, it is not totally independent of the IGP. In fact, IGP cost can influence route selection when the best path is based on the comparison of the IGP cost to the egress routers. We refer to this tie-breaking step of BGP route selection as hot-potato routing, since with all other BGP attributes being equal, each AS selects the route with the shortest path to exit its network. For example, in Fig. 6.5, AS 3 learns BGP routes to a destination originated by AS 0 at egress routers C1 and C2, from AS 1 and AS 2, respectively. The value on each link within AS 3 represents the corresponding IGP cost. Suppose that the two learned routes to the destination have identical local preferences. We see that the AS path lengths of the two routes are equal. Router C3 learns two routes, from C1 and C2, respectively, and selects the one learned from C1 as the best route because the IGP cost of path (C3 C1) is smaller than that of path (C3 C2). Similarly, router C4 will select the route learned from C2 as the best route because the path (C4 C2) has a smaller IGP cost than path (C4 C1). However, hot-potato routing means that changing an IGP weight can cause BGP speaking routers to select a different best route and, therefore, shift egress routers. For instance, by changing the IGP link cost between routers C1 and C3 from 8 to 10, router C3 will change its egress router from C1 to C2.

Fig. 6.5 An example illustrating hot-potato routing at AS 3. The value around a link represents an IGP weight

BGP routing policy configuration is typically specified in a router configuration file. A BGP routing policy can be assigned based on the destination prefix or the next hop AS. For example, in Fig. 6.6, AS 0 advertises a prefix “10.1.1.0/24” to the Internet. AS 3 connects to AS 1 and AS 2, and will get routing updates about the destination “10.1.1.0/24” from the two ASes. AS 3 decides what path its outbound traffic to the destination is going to take.

Fig. 6.6 Local preference configuration

Suppose that AS 3 prefers to use the connection via AS 1 to reach the destination.
As shown in the following configuration, based on Cisco IOS commands, router RTA at AS 3 sets an explicit import policy that assigns a local preference value of 100 to the route from AS 1:

    router bgp 3
     neighbor 1.1.1.1 remote-as 1
     neighbor 1.1.1.1 route-map AS1-IN in
     neighbor 4.4.4.2 remote-as 3
    access-list 1 permit 0.0.0.0 255.255.255.255
    route-map AS1-IN permit
     match ip address 1
     set local-preference 100

We describe the commands in the above configuration as follows. The first command starts a BGP process with an AS number of 3 at router RTA. The second command sets up an eBGP session with the router at AS 1. The route-map command associated with the neighbor statement applies route map AS1-IN to inbound updates from AS 1. The fourth command, like the first neighbor command, sets up a BGP session, in this case an iBGP session with router RTB. The access-list command creates access list 1, which permits all advertisements. The route-map command creates a route map named AS1-IN that uses access list 1 to identify the routes to be assigned a local preference of 100.

6.2.5 Convergence Process of BGP

In this section, we illustrate how BGP routing processes converge to stable routes. Figure 6.7 shows an example of a routing policy configuration of a simple topology. In this chapter, we simplify the representation of the network using graph-theoretical notations of nodes and edges, where a node represents either an AS or a BGP-speaking router, and an edge represents the link between two nodes. In this example, we use a node to represent an AS. Furthermore, throughout this chapter, we focus on one destination prefix, d, which is always originated from AS 0. The figure indicates the export policy by showing all AS paths that an AS can receive from the adjacent router on the associated interface (referred to as permissible AS paths). The figure also indicates the import policy by ordering the paths in descending order of local preference.

Fig. 6.7 An example of policy configuration that converges. The paths around a node represent its permissible AS paths, ordered in descending order of preference (AS 1: (1 0), (1 2 0); AS 2: (2 3 0), (2 0); AS 3: (3 1 0), (3 0))

The BGP routing process converges as follows.

1. Destination prefix d is announced to ASes 1, 2, and 3 via direct links.
2. ASes 1, 2, and 3 each choose their direct path as the best route, since it is the only route they have received, and announce these direct paths to their neighbors.
3. AS 1 now has two paths, (1 0) and (1 2 0), since these are its only permissible paths. AS 2 now has two paths, (2 0) and (2 3 0). AS 3 now has two paths, (3 0) and (3 1 0). According to the local preference of each AS, AS 1 ends up choosing (1 0) as its best route, AS 3 chooses (3 1 0) as its best route, and AS 2 chooses (2 3 0) as its best route.
4. AS 3 announces its best path (3 1 0) and thereby implicitly withdraws its announcement of (3 0) from AS 2. Now, with (2 0) as its only path, AS 2 chooses (2 0) as its best path.
5. AS 2 announces its best path to both AS 1 and AS 3. However, this announcement does not change the route that AS 1 or AS 3 chooses.

Therefore, all ASes reach a stable route where no router needs to send new update messages, and hence the BGP process converges. Note that during the convergence process, each AS selects and/or announces its best route in an asynchronous manner that is determined by the expiration of MRAI timers. We simplify the process by assuming that route announcements are performed in “lock step”. Nevertheless, it can be proved that in this example, no matter what the exact steps of the convergence process are, the stable route reached by each AS is the same.
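The lock-step process above can be sketched in a few lines of Python. This is a minimal illustration rather than an implementation of BGP: the names PERMISSIBLE, best_path, and converge are hypothetical, MRAI timers are ignored, and each AS is assumed to announce only the path it currently selects.

```python
# Ranked permissible paths per AS, as listed for Fig. 6.7 (most preferred first).
PERMISSIBLE = {
    1: [(1, 0), (1, 2, 0)],
    2: [(2, 3, 0), (2, 0)],
    3: [(3, 1, 0), (3, 0)],
}


def best_path(asn, chosen):
    """Return the most preferred permissible path that is currently available."""
    for path in PERMISSIBLE[asn]:
        tail = path[1:]
        # A path is available if it is the direct route to AS 0, or if the
        # next-hop AS currently selects (and therefore announces) its tail.
        if tail == (0,) or chosen.get(tail[0]) == tail:
            return path
    return None


def converge(max_rounds=20):
    """Run lock-step rounds until no AS changes its choice (assumed model)."""
    chosen = {asn: None for asn in PERMISSIBLE}
    history = []
    for _ in range(max_rounds):
        new = {asn: best_path(asn, chosen) for asn in PERMISSIBLE}
        history.append(new)
        if new == chosen:
            return chosen, history  # stable: nobody has anything new to announce
        chosen = new
    return None, history  # no stable state within max_rounds


if __name__ == "__main__":
    stable, history = converge()
    for round_no, state in enumerate(history, start=1):
        print(round_no, state)
```

Under these assumptions the loop settles on (1 0), (2 0), and (3 1 0) for ASes 1, 2, and 3, after AS 2 temporarily selects (2 3 0), which mirrors steps 2 to 5.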
6.3 Multihoming Technology

In this section, we provide an overview of current multihoming technology, which is widely used to provide redundant connections. Multihoming refers to the technology where an AS connects to the Internet through multiple connections via one or more upstream providers. It is intended to enhance the reliability of Internet connectivity. When one of the connections fails or is under maintenance, the AS can still connect to the Internet via the other connections. Multihoming can be achieved using BGP configuration, static routes, Network Address Translation (NAT), or a combination of these. In this section, we focus on describing multihoming with BGP configuration.

The redundancy provided by multihoming can bring additional complexity to the network configuration. First of all, it is imperative to designate primary and backup connections in such a manner that when the primary connection fails, traffic automatically falls back to the backup connection. Second, it is desirable to distribute traffic across multiple connections. Traffic can be classified into inbound and outbound traffic. Outbound traffic is the traffic originating within the multihomed AS or its customers and destined to other ASes; inbound traffic is the traffic destined to the AS or its customers and coming from other ASes.

A multihomed AS can be multihomed to a single provider or to multiple providers. We describe how each case can be configured in Sections 6.3.1 and 6.3.2, respectively.

6.3.1 Multihoming to a Single Provider

The simplest way for an AS to connect to the Internet is by setting up a single connection with a provider. However, the AS then has only one connection over which to send and receive data. This single-homed configuration is not resilient to a single point of failure such as a link or router failure or maintenance.
To address this issue, the AS can set up multiple connections to the provider. Four types of connections can be established between an AS and its provider. We describe each type of connection as follows:

Fig. 6.8 Four types of multihoming connections: (a) SSA, (b) SMA, (c) MMA, (d) MMB

Multiple Connections Between a Single Customer Router and Single Provider Access Router (SSA). An AS has a single border router connected to its provider’s access router with multiple links. As illustrated in Fig. 6.8a, AS 0 has a single border router BoR1, which connects to AS 1’s access router, AR1, via two links. If one of the links fails, the other link can be used.

Multiple Connections Between a Single Customer Router and Multiple Provider Access Routers (SMA). An AS has a single router connected to its provider’s multiple access routers. For example, in Fig. 6.8b, BoR1 connects to AS 1 at both AR1 and AR2. This configuration can maintain connectivity under a single failure of a link or an access router, but not under a failure of the customer router.

Multiple Connections Between Multiple Customer Routers and Multiple Provider Access Routers (MMA). An AS has multiple routers connected to its provider’s multiple access routers. Note that those multiple access routers at the provider are connected to the same backbone router. For example, in Fig. 6.8c, AS 0 has two routers: BoR1 and BoR2. Each border router connects to an access router (AR) in AS 1. This configuration can maintain connectivity under a single failure of an access router or a border router. However, the two access routers connect to the same backbone router, BaR1. A failure at BaR1 can cause both connections to become unavailable.

Multiple Connections Between Multiple Customer Routers and Multiple Provider Backbone Routers (MMB). An AS has multiple connections between its multiple border routers and multiple backbone routers at its provider.
This configuration can achieve higher reliability than that of MMA. For example, in Fig. 6.8d, AS 0 has two border routers, BoR1 and BoR2, which are connected to geographically separate backbone routers at AS 1. AS 0’s BoR1 connects to AS 1’s access router AR1, and they are at the same geographical location, while the border router BoR2 is connected to another backbone router, BaR1. A private physical connection connects the customer AS’s border router BoR2 and the backbone router BaR1. This method can maintain connectivity even under a failure of a backbone router.

Next, we describe how an AS can control traffic over the primary and backup links. First, we discuss the control of outbound traffic. A multihomed AS can assign different local preference values to the routes learned from its provider to control its outgoing traffic. For example, in Fig. 6.8b, BoR1 will receive two identical routes for each destination prefix. AS 0 can assign a higher local preference value to the routes received through one particular connection so that they are preferred over routes for the same destination received through the other connection. Multihomed configurations of SSA, MMA, or MMB can apply the same method to control outbound traffic over the primary link. In addition, an AS multihomed to a single provider with SSA can use another method: setting the next hop to a virtual address. For example, in Fig. 6.8a, AR1 can be assigned a virtual address on a loopback interface (20.10.10.1 in this example). BoR1 will set up a connection with the loopback address. As a result, all routes that BoR1 receives from AR1 will have the same next hop, 20.10.10.1. Since next hop 20.10.10.1 can be reached via two connections, outbound traffic can be distributed over the two links.

Second, we discuss how an AS multihomed to a single provider can control its inbound traffic.
In this case, the multihomed AS can tweak BGP attribute values, such as AS path length or MED, to influence route selection at the provider’s routers. For example, an AS can prepend its AS number to the AS path of the route update announced via the backup link, or send the route update via the backup link with a higher MED value than that via the primary link. As a result, the primary link is used in normal situations since it has a shorter AS path or lower MED value. When the primary link is down, the backup link will be used.

6.3.2 Multihoming to Multiple Providers

The availability of the Internet connectivity provided by upstream providers is very important for an AS. Multihoming to more than one provider can ensure that the AS maintains global Internet connectivity even if the connection to one of its providers fails [1]. For example, in Fig. 6.9, AS 0 is multihomed to two upstream providers: AS 1 and AS 2. AS 0 may use one of its providers as its primary provider, and the other as a backup provider. When connectivity through the primary provider fails, AS 0 still has its connectivity to the Internet through the backup provider.

Fig. 6.9 An example of an AS multihomed to two upstream providers

A multihomed AS can be configured to direct its outbound traffic through the primary provider. Only when the connection through the primary provider fails does its outbound traffic use the connection through the backup provider. To achieve this goal, a multihomed AS can use the same approach described for an AS multihomed to a single provider. That is, an AS may assign a higher local preference to the route through the primary provider than to that through the backup.

A multihomed AS might also control which provider its inbound traffic uses.
There are several approaches to control the route used for inbound traffic. The simplest approach is to advertise its prefixes only to the primary provider so that inbound traffic uses the primary provider. For example, in Fig. 6.9, AS 0 can advertise its prefix only to its primary provider, say, AS 1. However, such selective advertisement cannot provide the redundancy afforded by multihoming. In the above example, if the link between AS 1 and AS 0 fails, AS 0 becomes unreachable until AS 0 notices the failure and advertises its prefixes to the backup provider, AS 2. In this case, the time it takes to fail over to the backup provider depends on how fast the multihomed AS detects the failure and decides to announce its prefixes to the backup provider, and how fast the announcement propagates to the global Internet.

Alternatively, an AS can control the route taken by the inbound traffic by splitting its prefix into several more specific prefixes, and advertising each more specific prefix to its primary provider. For example, in Fig. 6.10, AS 0 has a prefix, “12.0.0.0/19”. AS 0 splits the prefix into two more specific prefixes: “12.0.0.0/20” and “12.0.16.0/20”. AS 0 can announce “12.0.0.0/20” to AS 1, and “12.0.16.0/20” to AS 2. At the same time, AS 0 can advertise its prefix “12.0.0.0/19” to both providers. As a result, inbound traffic to “12.0.0.0/20” comes from AS 1, while inbound traffic to “12.0.16.0/20” comes from AS 2. This approach can balance the traffic load between the two providers by designating each one as the primary provider for a specific prefix. At the same time, the approach can tolerate failures of links to providers. For example, if the link between AS 0 and AS 1 fails, destinations within prefix “12.0.0.0/20” can still be reached via AS 2 since prefix “12.0.0.0/19” is announced via AS 2.
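The effect of the split announcements can be sketched with Python’s standard ipaddress module. This is a simplified illustration under stated assumptions: a remote router is modeled as choosing purely by longest-prefix match among the announcements it has heard, and the names ANNOUNCEMENTS and inbound_providers are hypothetical.

```python
import ipaddress

# (prefix, provider through which AS 0 announces it), following the Fig. 6.10 example.
ANNOUNCEMENTS = [
    ("12.0.0.0/20", "AS 1"),
    ("12.0.16.0/20", "AS 2"),
    ("12.0.0.0/19", "AS 1"),
    ("12.0.0.0/19", "AS 2"),
]


def inbound_providers(dst, announcements):
    """Return the providers that carry the longest prefix matching dst."""
    addr = ipaddress.ip_address(dst)
    matching = [
        (ipaddress.ip_network(prefix), via)
        for prefix, via in announcements
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matching:
        return set()
    longest = max(net.prefixlen for net, _ in matching)
    return {via for net, via in matching if net.prefixlen == longest}


if __name__ == "__main__":
    print(inbound_providers("12.0.1.1", ANNOUNCEMENTS))   # within 12.0.0.0/20
    print(inbound_providers("12.0.17.1", ANNOUNCEMENTS))  # within 12.0.16.0/20

    # Assumed failure model: the AS 0 - AS 1 link fails, so every
    # announcement made via AS 1 disappears.
    after_failure = [(p, via) for p, via in ANNOUNCEMENTS if via != "AS 1"]
    print(inbound_providers("12.0.1.1", after_failure))
```

In the normal case each /20 attracts traffic through its designated provider; once the announcements via AS 1 are gone, addresses in “12.0.0.0/20” fall back to the covering /19 announced via AS 2.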
Despite the advantages of load balancing and fault tolerance, this approach has the drawback of potentially increasing the number of prefixes announced to the global Internet.

Fig. 6.10 An example of splitting prefixes

Another approach to control the route of inbound traffic is AS path prepending. An AS can prepend its AS number one or several times when announcing to the backup provider. This can “discourage” other ASes from selecting the route via the backup provider. Note that this approach cannot ensure that all inbound traffic will go through the primary provider. It is possible for an AS to use the longer backup path rather than the shorter primary path if the backup path has a higher local preference. In fact, most providers prefer customer routes over provider routes. Consider the example network in Fig. 6.9: AS 2 learns paths to reach prefixes in AS 0 from both the direct and its upstream connections, but AS 2 will prefer the direct connection, although AS 0 intends it to be a backup path.

In summary, multihoming techniques aim to provide redundant connectivity. Nevertheless, the extent to which these multihoming techniques can ensure continuous connectivity hinges on how long it takes for the routing protocol, BGP, to fail over to backup routes. In Section 6.4.2, we will discuss how BGP can recover from a failure and how long it takes BGP to discover alternate routes.

6.4 Challenges in Interdomain Routing

Failures and changes in topology or routing policy are fairly common in the Internet due to various causes such as maintenance, router crashes, fiber cuts, and misconfiguration [4, 17, 18]. Ideally, when such changes occur, routing protocols should be able to react quickly to find alternate paths. However, BGP is a policy-based routing protocol, and is not guaranteed to converge to a stable state, in which all routers agree on a stable set of routes.
Persistent route oscillation can significantly degrade the end-to-end performance of the Internet. Furthermore, even if BGP converges, it is known to be slow to react to and recover from network changes. During routing convergence, there are three potential routing states from the perspective of any given router: path exploration, during which an alternate route instead of the final stable route is used; transient failures, during which there is no route to a destination but a route will eventually be discovered; and transient forwarding loops, in which routes to a destination form a forwarding loop that will eventually disappear. Path exploration does not lead to packet drops, while transient failures or transient loops do. In this section, we describe how persistent route oscillation, routing failures, and routing loops can occur.

6.4.1 Persistent Route Oscillation

The BGP routing protocol provides great flexibility in the routing policies that can be set by each AS. However, arbitrary setting of routing policies can lead to persistent route oscillation. For example, Fig. 6.11 shows the “bad gadget” example used in [9].

Fig. 6.11 An example of BGP routing policy that leads to persistent route oscillation. The AS paths around a node represent a set of permissible paths, which are ordered in the descending order of local preference (AS 1: (1 2 0), (1 0); AS 2: (2 3 0), (2 0); AS 3: (3 1 0), (3 0))

In this example and all of the following examples, we focus on a single destination prefix that originates from AS 0, without loss of generality. In this example, ASes 1, 2, and 3 receive only the direct path to AS 0 and an indirect path via their clockwise neighbor, and prefer to route via their clockwise neighbor over the direct path to AS 0. For example, AS 2 receives only paths (2 3 0) and (2 0) and prefers route (2 3 0) over route (2 0). This routing policy configuration will lead to persistent route oscillation.
In fact, it can be proved [9] that no matter what route an AS chooses initially, it will keep changing its route and never reach a stable route. For example, the following sequence of route changes shows how a persistent route oscillation can occur.

1. Initially, ASes 1, 2, and 3 choose paths (1 2 0), (2 0), and (3 0), respectively.
2. After AS 2 receives path (3 0) from AS 3, it changes from its current path (2 0) to the higher preference path (2 3 0), which in turn forces AS 1 to change its path from (1 2 0) to (1 0) because path (1 2 0) is no longer available.
3. When AS 3 notices that AS 1 uses path (1 0), it changes its path from (3 0) to (3 1 0). This in turn forces AS 2 to change its path to (2 0).
4. After AS 2 sends path (2 0) to AS 1, AS 1 changes its path from (1 0) to (1 2 0), which in turn forces AS 3 to change its path from (3 1 0) to (3 0), and the oscillation begins again.

In practice, however, routing policies are typically set according to commercial contractual agreements between ASes. Typically, there are two types of AS relationship: provider-to-customer and peer-to-peer. In the first case, a customer pays the provider to be connected to the Internet. In the second case, two ASes agree to exchange traffic on behalf of their respective customers free of charge. Note that the contractual agreement between peering ASes typically requires that the traffic in the two directions of the peering link stay within a ratio negotiated between the peering ASes. In addition to these two common types of relationship, an AS may have a backup relationship with a neighboring AS. Having a backup relationship with a neighbor is important when an AS has limited connectivity to the rest of the Internet. For example, two ASes could establish a bilateral backup agreement for providing the connection to the Internet in the case that one AS’s link to its primary provider fails. Typically, provider-to-customer relationships among ASes are hierarchical.
The hierarchical structure arises because an AS typically selects a provider with a network of larger size and scope than its own. An AS serving a metropolitan area is likely to have a regional provider, and a regional AS is likely to have a national provider. It is very unlikely that a nationwide AS would be a customer of a metropolitan-area AS.

It is common for an AS to adopt an import routing policy, referred to as the prefer-customer routing policy, where routes received from an AS’s customers are always preferred over those received from its peers or providers. Such a partial order on the set of routes is compatible with economic incentives. Each AS has an economic incentive to prefer routes via a customer link to those via peer or provider links, since it does not have to pay for the traffic via customer links. On the other hand, the AS has to pay for traffic via provider links, and traffic sent to its peer has to be “balanced out” with traffic from its peer.

It is also common for an AS to adopt an export routing policy, referred to as the no-valley routing policy, where an AS does not announce a route from a provider or peer to another provider or peer. In Fig. 6.12 and the following examples, an arrowed line between two nodes represents a provider-to-customer relationship, with the arrow ending at the customer. A dashed line represents a peer-to-peer relationship. We visualize a sequence of customer-to-provider links as an uphill path; for example, path (1 3 5) is an uphill path. We define a sequence of provider-to-customer links as a downhill path; for example, path (5 4 1) is a downhill path. A peer-to-peer link is defined as a horizontal path. The no-valley routing policy ensures that no path contains a valley, where a downhill path is followed by either a peer-to-peer link or an uphill path, or a peer-to-peer link is followed by an uphill path or a peer-to-peer link.
That is, an AS path may take one of the following forms: (1) an uphill path followed by one or no peer-to-peer link, (2) a downhill path, (3) a peer-to-peer link followed by a downhill path, (4) an uphill path followed by a downhill path, or (5) an uphill path followed by a peer-to-peer link, followed by a downhill path. For example, in Fig. 6.12, paths (3 5 4) and (1 3 5 6 4 2) are no-valley paths while AS paths (3 1 4) and (3 1 2 6) are not no-valley paths. ASes adopt these rules since there is no economic incentive for an AS to transit traffic between its providers and peers. Note that we name it the no-valley routing policy since such an export policy ensures that no route traverses a provider-to-customer link and then a customer-to-provider link, or a provider-to-customer link and then a peer-to-peer link, or a peer-to-peer link and then another peer-to-peer link, or a peer-to-peer link and then a customer-to-provider link, all of which are valley paths if there is a hierarchical structure in provider-to-customer relationships.

Fig. 6.12 Paths (3 5 4) and (1 3 5 6 4 2) are no-valley paths while AS paths (3 1 4) and (3 1 2 6) are not no-valley paths

It has been proved that under hierarchical provider-to-customer relationships, these common routing policies can indeed ensure route convergence [8]. Furthermore, these policies ensure route convergence under router or link failures, and under changes in routing policy. Note that each AS has an economic incentive to follow the prefer-customer routing policy. In addition, it is practical to implement the policy, since each AS can configure its routers without knowing the policies applied in other ASes and without coordinating with them.
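The no-valley condition can be checked mechanically, as in the following sketch. Because Fig. 6.12 is not reproduced here, the relationship sets below are assumptions inferred from the example paths in the text (for instance, that ASes 5 and 6 are peers); the names PROVIDER_OF, PEERS, link_type, and is_valley_free are hypothetical.

```python
# Assumed relationships, inferred from the example paths in the text.
# Each pair is (provider, customer).
PROVIDER_OF = {(3, 1), (5, 3), (5, 4), (4, 1), (6, 4), (4, 2)}
PEERS = {frozenset((5, 6))}


def link_type(a, b):
    """Classify the link traversed from AS a to AS b."""
    if (b, a) in PROVIDER_OF:
        return "up"      # customer-to-provider
    if (a, b) in PROVIDER_OF:
        return "down"    # provider-to-customer
    if frozenset((a, b)) in PEERS:
        return "peer"
    raise ValueError(f"no known relationship between AS {a} and AS {b}")


def is_valley_free(path):
    """True if the path is uphill*, then at most one peer link, then downhill*."""
    descending = False  # set once a peer or downhill link has been traversed
    for a, b in zip(path, path[1:]):
        kind = link_type(a, b)
        if descending and kind != "down":
            return False  # an uphill or peer link after the peak forms a valley
        if kind in ("peer", "down"):
            descending = True
    return True


if __name__ == "__main__":
    for path in [(3, 5, 4), (1, 3, 5, 6, 4, 2), (3, 1, 4)]:
        print(path, is_valley_free(path))
```

The single flag captures all five permitted forms: once a path has crossed a peer-to-peer link or started downhill, only provider-to-customer links may follow.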
In addition to local preference settings, it has been observed that certain iBGP configurations may result in persistent route oscillation [2, 10]. Figure 6.13 shows an example of route reflector and policy configuration that can lead to persistent route oscillation. AS 1 consists of two route reflectors, A and B. A has two clients, C1 and C2, while B has one client, C3. The IGP cost of the link between two nodes is indicated beside the link, and the MED value of the routes is indicated in parentheses. It can be proved that no matter what the initial route is for each router, it is not possible for the routers to reach a stable route. As an example, we show below a possible sequence of route changes that leads to persistent oscillation.

1. Route reflector A selects path p2 and route reflector B selects path p3.
2. Route reflector A receives p3 and selects p1 because p3 has a lower MED than p2 and p1 has a lower IGP metric than p3.
3. Route reflector B receives p1 and selects p1 as the best path (due to a lower IGP cost) and withdraws p3.
4. Route reflector A selects p2 over p1 (due to a lower IGP cost) and withdraws p1.
5. Route reflector B selects p3 over p2 (due to a lower MED). Now both A and B have returned to their initial routes.

Fig. 6.13 An example route reflector configuration that leads to persistent oscillation

One of the reasons that this route reflector configuration can lead to persistent route oscillation is that MED is compared only among routes learned from the same neighboring AS. It is possible to enforce a rule that MED is always compared, even among routes learned from different neighboring ASes. Other guidelines have also been proposed to prevent route reflector configurations from causing persistent oscillation. These guidelines include exploiting the hierarchical structure of route reflector configuration [10], similar to that proposed in [8].
That is, if a route reflector configuration ensures that a route reflector chooses a route from its client over that from another route reflector (e.g., with IGP cost settings), then it can ensure route convergence.

6.4.2 Transient Routing Failures

Even when BGP eventually converges to a set of stable routes, network failures, maintenance events, and router configuration changes can cause BGP to reconverge. Ideally, when such an event occurs, routing protocols should be able to react quickly to find alternate paths. However, BGP is known to be slow in reacting to and recovering from network events. Previous measurement studies have shown that BGP may take tens of minutes to reach a consistent view of the network topology after a failure [17–19]. During the convergence period, a router might contain routing information that lags behind the state of the network. For example, it is possible for a router to eventually discover an alternate path when one of the links in its original path fails. However, during the discovery process, the router might lose all of its paths before an alternate path is discovered. Such a transient loss of reachability is referred to as a transient routing failure.

Figure 6.14 shows an example of a policy configuration and link failure scenario that can lead to a transient routing failure. In this example, AS 1 and AS 2 are providers of AS 3, AS 0 is a customer of AS 1, and AS 1 is a peer of AS 2. Note that the import and export policies are realistic in the sense that they follow the prefer-customer and no-valley routing policies. When the link between AS 3 and AS 0 fails, AS 3 temporarily loses its connection to the destination AS 0. AS 3 has to send a withdrawal message to cause its neighbor AS 1 to select a new best path. Before AS 3 receives the new path from AS 1, it will experience a transient loss of reachability to AS 0. In addition, the timing of sending withdrawal and announcement messages is determined by the expiration of MRAI timers, which can take several seconds to tens of seconds. During this period, all packets destined to AS 0 at AS 3 will be dropped.

Fig. 6.14 An example illustrating routing failure at AS 3. The text around a node represents a set of permissible paths and their ordering in local preference (higher preference first)

Fig. 6.15 Transient routing failures take place in a typical eBGP system. The AS paths around a node represent a set of permissible paths, which are ordered in the descending order of local preference

In a typical eBGP system where the prefer-customer and no-valley routing policies are followed, ASes are quite likely to experience transient failures. In fact, when an event causes an AS to change from a customer route to a provider route and all of its providers use it to reach a destination, the AS will definitely experience a transient failure. This is because the AS has to withdraw the customer route first before its provider can discover an alternate path and send the path to it. Please refer to [30] for a proof. Figure 6.15 shows an example to illustrate this point. Suppose that before the link between AS 1 and AS 0 fails, AS 1, AS 3, and AS 6 all have only one path, via their customers, to reach the destination. When the link failure occurs, these ASes will experience a transient failure before they can learn the route via their providers. AS 2 may experience the failure (depending on whether the withdrawal from AS 6 is suppressed by the MRAI timer), but AS 7 does not experience any transient routing failure.

In the previous section, we have shown that multihoming technology can provide redundant underlying connections.
Here, we use several examples to discuss whether BGP can fully exploit the redundancy to quickly recover from failures. In fact, BGP fails to take advantage of this redundancy to provide a high degree of path diversity. The reason lies in the iBGP configuration. A typical hierarchical iBGP system consists of a core of fully meshed core routers, i.e., route reflectors, and the edge routers, which are the clients of the relevant route reflectors. Transient routing failures can occur within a hierarchical iBGP system.

Figure 6.16 shows an example that illustrates how routing failures can occur due to iBGP configuration. A multihomed AS, AS 0, has two providers: AS 1 and AS 2. AS 1 can reach a destination originated at AS 0 via one of two access routers, AR1 or AR2. According to the prefer-customer routing policy, the path via AR1 is assigned a higher local preference value than the path via AR2. As a result, all routers inside AS 1 except the access router AR2 will use the path via AR1 to reach the destination. Once the link between AR1 and AS 0 fails, all routers except AR2 might experience transient routing failures before they fail over to the path via AR2.

Fig. 6.16 An AS with a hierarchical iBGP configuration can experience transient failures

Fig. 6.17 An AS with multiple connections to a destination prefix can experience transient failures

Our second example, shown in Fig. 6.17, illustrates the reliability issue for an AS with multiple connections to a single provider. In this example, AS 0 has two connections to AS 1. Suppose that AS 0 considers the connection via AS 1’s AR1 as the primary link, and the other connection via AR2 as the backup link. Suppose that AS 0 uses AS path prepending to implement this configuration.
AS 0’s BoR2 advertises its prefix with AS path (0 0 0). As a result, all routers inside AS 1 except router AR2 have only a single route to reach the destination. If the link between AS 0’s BoR1 and AS 1’s AR1 fails, all routers within AS 1 except AR2 will experience transient failures.

Our third example, shown in Fig. 6.18, illustrates the reliability issue for an AS with multiple geographically separate connections to a single provider. In this example, we assume that AS 0 considers the connection via AS 1’s AR2 as the primary link, and the connection via AR1 as the backup link. Just like the previous example, suppose that AS 0 uses AS path prepending to implement this configuration. As a result, all routers inside AS 1 except router AR2 have only a single route to reach the destination. If the link between AS 0’s BoR2 and AS 1’s AR2 fails, all routers within AS 1 except AR2 will experience transient failures.

Fig. 6.18 An AS with geographical connections to a destination prefix can experience transient failures

Our last example shows that load balancing can avoid transient routing failures. In Fig. 6.19, AS 0 distributes its inbound traffic among the two connections by applying a hot-potato routing policy. That is, the backbone routers within AS 1 select the best route according to IGP costs to the egress routers, AR1 and AR2. As a result, all backbone routers have two different routes to reach the destination. This configuration can avoid single points of failure for backbone routers and for link failures between AS 1 and AS 0.

Fig. 6.19 Load balancing configuration can avoid transient failures

So far we have focused on scenarios in which a route is lost. In fact, when gaining a route, it is still possible to experience transient routing failures. For example, Fig. 6.20 shows a scenario where a router can experience a transient routing failure due to iBGP configuration. In this example, AS 1 and AS 2 are providers of AS 0, and AS 1 and AS 2 have a peer-to-peer relationship. When the link between AS 1 and AS 0 is added or recovers from a failure, AS 1 prefers the direct path to destination AS 0. Before the link is recovered, all routers within AS 1 select the path via AS 2 as their best path. After the recovery event, all routers within AS 1 use the path through the recovered link. During the route convergence process, router RT3 first selects the direct path to AS 0 and then sends the new route to router RT2 and router RT1. Once router RT2 receives the direct route from router RT3, it selects the route and withdraws its route through AS 2 from router RT1, since it cannot announce its currently selected route, learned from router RT3, to router RT1 (a router in a fully meshed iBGP configuration does not re-advertise a route learned from one iBGP peer to another). If router RT1 receives the withdrawal message from router RT2 before receiving the announcement message from router RT3, it will experience a transient routing failure.

Fig. 6.20 A transient failure experienced by router RT1 when the link between AS 0 and AS 1 is added or recovered

6.4.3 Transient Routing Loops

During the route convergence process, it is possible to have not only transient routing failures, but also transient routing loops. A topology or routing policy change can lead the routers to recompute their best routes and update forwarding tables. During this process, the routers can be in an inconsistent forwarding state, causing transient routing loops. Measurement studies have shown that the transient loops can last for more than several seconds [13, 29, 31].

Fig. 6.21 An example of transient routing loop between AS 2 and AS 3. The list of AS paths shown beside each node is the set of permissible paths for the node, and the permissible paths are ordered in the descending order of local preference
Figure 6.21 shows a scenario where a transient routing loop can occur. In this example, when the link between AS 1 and AS 0 fails, AS 2 and AS 3 receive a withdrawal message from AS 1. These two ASes will each select the path via the other to reach the destination because the local preference value of a path via a peer is higher than that of a path via a provider. As a result, there is a routing loop. After AS 2 and AS 3 exchange their new routes, AS 2 will remove the path from AS 3 and select the path from AS 4 as the best path. Finally, all ASes will use the path via AS 4.

6.5 Impact of Transient Routing Failures and Loops on End-to-End Performance

In this section, we aim to understand the impact that transient routing failures and loops have on end-to-end path performance. We describe an extensive measurement study that involves both controlled routing updates of a prefix and active probes from a diverse set of end hosts to the prefix.

6.5.1 Controlled Experiments

The infrastructure for the controlled experiments is shown in Fig. 6.22. The infrastructure includes a BGP Beacon prefix from the Beacon routing experiment infrastructure [21]. The BGP Beacon is multihomed to two tier-1 providers, which we refer to as ISP 1 and ISP 2. We control routing events by injecting well-designed routing updates from the BGP Beacon at scheduled times to emulate link failures and recoveries. To understand the impact of routing events on data plane performance, we select geographically and topologically diverse probing locations from the PlanetLab experiment testbed [25] to conduct active probing while routing changes are in effect.

Fig. 6.22 Measurement infrastructure

Fig. 6.23 Time schedule (GMT) for injecting routing events from BGP beacon

Every 2 hours, the BGP Beacon sends a route withdrawal or announcement to one or both providers according to the time schedule shown in Fig. 6.23.
Each circle denotes a state, indicating the providers offering transit service to the Beacon. Each arrow represents a routing event and state transition, marked by the time at which the routing event (either a route announcement or a route withdrawal) occurs. For example, at midnight the Beacon withdraws the route through ISP 1, and at 2:00 a.m., the Beacon announces the route through ISP 1. There are 12 routing events every day. Only eight routing events keep the Beacon connected to the Internet; the other four serve the purpose of resetting the Beacon connectivity. These eight beacon events are classified into two categories: failover beacon events and recovery beacon events. In a failover beacon event, the Beacon changes from the state of using both providers to the state of using only a single provider. In a recovery beacon event, the Beacon changes from the state of using a single provider for connectivity to the state of using both providers. These two classes of routing changes emulate the control plane changes that a multihomed site may experience in terms of losing and restoring a link to one or more of its providers. For example, between midnight and 2:00 a.m., the BGP Beacon is in a state that is only connected to ISP 2; at 2:00 a.m., it announces the Beacon prefix to ISP 1, leading to connectivity to both ISPs. This event emulates a link recovery event. At 4:00 a.m., the Beacon sends a withdrawal to ISP 1 so that the Beacon is in a state that is only connected to ISP 2. This event emulates a failover event.

A set of geographically diverse sites in the PlanetLab infrastructure probe a host within the Beacon prefix by using three probing methods: UDP packet probing, ping, and traceroute. Probing is performed every hour, both during injected routing events and when there are no routing events, so as to calibrate the results.
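The failover and recovery categories described above can be sketched as a small classifier over Beacon states. This is only an illustration: the names `BeaconEvent` and `classify_event`, and the representation of a state as a set of provider labels, are assumptions made here and are not part of the measurement infrastructure.

```python
from enum import Enum


class BeaconEvent(Enum):
    FAILOVER = "failover"   # both providers -> a single provider
    RECOVERY = "recovery"   # a single provider -> both providers
    OTHER = "other"         # e.g., events that only reset connectivity


def classify_event(before: frozenset, after: frozenset) -> BeaconEvent:
    """Classify a transition between sets of providers carrying the Beacon prefix."""
    if len(before) == 2 and len(after) == 1 and after < before:
        return BeaconEvent.FAILOVER
    if len(before) == 1 and len(after) == 2 and before < after:
        return BeaconEvent.RECOVERY
    return BeaconEvent.OTHER


# Hypothetical state labels for the two providers.
BOTH = frozenset({"ISP 1", "ISP 2"})
ONLY_ISP2 = frozenset({"ISP 2"})

print(classify_event(ONLY_ISP2, BOTH))   # the 2:00 a.m. announcement to ISP 1
print(classify_event(BOTH, ONLY_ISP2))   # the 4:00 a.m. withdrawal from ISP 1
```

Under this reading, any transition that leaves the Beacon with no provider, or that starts from such a state, falls into the `OTHER` bucket, which corresponds to the four events used only to reset connectivity.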
At every hour, every probing source sends a UDP packet stream marked by sequence numbers to the BGP Beacon host at 50 ms intervals. The probe starts 10 min before each hour and ends 10 min after that hour (i.e., the probing duration is 20 min for each hour). Upon the arrival of each UDP packet, the Beacon host records the timestamp and sequence number of the UDP packet. In addition, ping and traceroute are sent from the probe hosts toward the Beacon host to measure round-trip time (RTT) and IP-level path information during the same 20 min time period. Each ping or traceroute is run as soon as the previous ping or traceroute probe completes. Thus, their probing frequency is limited by the round-trip delay and the probe response time from routers.

6.5.2 Overall Packet Loss

In this section, we present data plane performance during failover and recovery beacon events. Packet loss and loss burst length are used to measure the impact of routing events on end-to-end path performance. We refer to a series of consecutively lost packets during a routing event as a loss burst. Loss burst length is the maximum number of consecutive lost packets during a routing event. Since several loss bursts can be observed during a routing event, we consider the one with the maximum number of consecutive lost packets, which represents the worst-case scenario during the event.

Figure 6.24a shows the number of loss bursts over all probing hosts during failover beacon events for the entire duration of measurement. The x-axis represents the start time of a loss burst, which is measured (in seconds) relative to the injection of withdrawal messages. We observe that the majority of loss bursts occur right after time 0, i.e., the time when a withdrawal message is advertised. Figure 6.24b shows the number of loss bursts during recovery beacon events across all probe hosts undergoing path changes. We observe that loss bursts occur right after time 0, and can last for 10 s.
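The loss burst definitions used in this section can be made concrete with a short sketch. It assumes that the receiver log has already been reduced to the set of sequence numbers that arrived during one routing event; the function names and the input format are illustrative assumptions, not the authors' tooling.

```python
PROBE_INTERVAL_MS = 50  # UDP probes are sent every 50 ms


def loss_bursts(received_seqs, first_seq, last_seq):
    """Return (start_seq, length) for each run of consecutive missing packets."""
    received = set(received_seqs)
    bursts = []
    run_start = None
    for seq in range(first_seq, last_seq + 1):
        if seq not in received:
            if run_start is None:
                run_start = seq          # a new burst begins here
        elif run_start is not None:
            bursts.append((run_start, seq - run_start))
            run_start = None
    if run_start is not None:            # burst still open at the end of the window
        bursts.append((run_start, last_seq + 1 - run_start))
    return bursts


def loss_burst_length(bursts):
    """Loss burst length of an event: the longest burst, i.e., the worst case."""
    return max((length for _, length in bursts), default=0)


def approx_outage_seconds(length):
    """Rough time span covered by `length` consecutive lost probes."""
    return length * PROBE_INTERVAL_MS / 1000


bursts = loss_bursts([0, 1, 2, 6, 7, 9], first_seq=0, last_seq=9)
print(bursts, loss_burst_length(bursts))
```

Because probes are spaced 50 ms apart, a burst of 100 consecutive lost packets corresponds to roughly 5 s without connectivity, which is how packet counts and durations in this section relate to each other.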
Figure 6.25a shows the distributions of loss burst length before, during, and after a path change for failover beacon events. The x-axis is shown in log scale. We find that a loss burst during a path change can contain as many as 480 consecutive packets. Compared with the loss burst length during a path change, the loss burst lengths before and after a path change are quite short. Figure 6.25b shows the loss burst length during recovery beacon events. We observe that the loss burst length during a routing change does not show a significant difference compared with those before or after the routing change. In addition, loss burst length can be as long as 140 packets for recovery beacon events. Such loss is most likely caused by routing failures.

Fig. 6.24 Number of loss bursts starting at each second: (a) Failover, (b) Recovery. x-axis: starting time (seconds); y-axis: number of loss bursts [31] (Copyright 2006 Association for Computing Machinery, Inc. Reprinted by permission)

Fig. 6.25 The cumulative distribution of loss burst length: (a) Failover, (b) Recovery. Curves: during path change, before path change, after path change [31] (Copyright 2006 Association for Computing Machinery, Inc. Reprinted by permission)

6.5.3 Packet Loss Due to Transient Routing Failures or Loops

From the measurement results, we see that during both types of events, many packet loss bursts occur. Packet loss can be attributed to network congestion or routing failures. In order to identify routing failures, ICMP response messages, as measured by traceroutes and pings, are used.
After deriving the loss bursts, unreachable responses from traceroutes and pings are correlated with the loss bursts. Since hosts in PlanetLab are NTP time synchronized, the loss bursts are correlated with ICMP messages using the time window [-1 s, 1 s]. When a router does not have a route entry for an incoming packet, it will send an ICMP network unreachable error message back to the source to indicate that the destination is unreachable, if it is allowed to do so. Based on the ICMP response message, we can determine when and which router does not have a route entry to the Beacon host. Loss bursts that have corresponding unreachable ICMP messages are attributed to routing failures. In addition, if a packet is trapped in a forwarding loop, its TTL value will decrease until the value reaches 0 at some router. The router will send a “TTL exceeded” message back to the source. Thus, from traceroute data, we can observe forwarding loops.

Table 6.2 shows the number of failover beacon events, the number of loss bursts, and the number of lost packets that can be verified as caused by routing failures or loops. We verify that 23% of the loss bursts, corresponding to 76% of lost packets, are caused by routing failures or loops. We are unable to verify the remaining 77% of loss bursts, which correspond to only 24% of packet loss. These loss bursts may be caused either by congestion or by routing failures for which traceroute or ping is not sufficient for the verification (due to either insufficient probe frequency or lack of ICMP messages).

Similar to our analysis of failover events, we correlate ICMP unreachable messages with loss bursts occurring during recovery events. Table 6.3 shows that 26% of packet loss is verified to be caused by routing failures or loops.
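The correlation step can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the study: ICMP evidence has already been reduced to timestamped observations labeled either "unreachable" or "loop" (for example, a traceroute showing repeated hops), the 1 s window is applied on both sides of the burst, and loop evidence takes precedence when both kinds are present. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossBurst:
    start: float  # time of the first lost packet, in seconds
    end: float    # time of the last lost packet, in seconds


@dataclass(frozen=True)
class IcmpObservation:
    time: float   # time the probe host saw the response, in seconds
    kind: str     # hypothetical labels: "unreachable" or "loop"


def attribute_cause(burst, observations, window=1.0):
    """Attribute a loss burst to a cause using ICMP evidence near the burst."""
    lo, hi = burst.start - window, burst.end + window
    kinds = {o.kind for o in observations if lo <= o.time <= hi}
    if "loop" in kinds:            # assumed precedence: loop evidence first
        return "routing loop"
    if "unreachable" in kinds:
        return "routing failure"
    return "unknown"               # congestion, or ICMP filtered or not sampled


burst = LossBurst(start=100.0, end=103.0)
print(attribute_cause(burst, [IcmpObservation(99.5, "unreachable")]))
```

A burst with no matching observation stays in the unknown bucket, which mirrors the "Unknown" rows of Tables 6.2 and 6.3.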
Since routers in the Internet may filter out ICMP packets, it is possible that some lost packets do not have corresponding ICMP messages even if they were lost because of routing failures or routing loops. As a result, the number of loss bursts caused by routing failures or routing loops might be higher than what can be identified by our methodology.

Table 6.2 Overall packet loss caused by routing failures or loops during failover events

Causes             Failover beacon events   Loss bursts    Lost packets
Routing failures   451 (38%)                607 (16%)      37,751 (42%)
Routing loops      208 (18%)                239 (7%)       30,592 (34%)
Unknown            539 (44%)                2,875 (77%)    21,948 (24%)

Table 6.3 Packet loss caused by routing changes during recovery events

Causes             Recovery beacon events   Loss bursts    Lost packets
Routing failures   17 (5%)                  39 (2%)        480 (11%)
Routing loops      24 (7%)                  37 (2%)        640 (15%)
Unknown            290 (88%)                1,714 (96%)    3,266 (74%)

We measure the duration of a loss burst as the time interval between the latest received packet before the loss and the earliest one after the loss. Figure 6.26a shows the duration of loss bursts that can and cannot be verified as caused by routing failures or routing loops during failover events. Again, we observe that the loss bursts that are verified as caused by routing failures or routing loops last longer than the unverified loss bursts. Figure 6.26b further shows that loss bursts caused by routing loops last longer than those caused by routing failures.

Figure 6.27a shows the cumulative distribution of the duration of loss bursts that are verified and unverified as caused by routing failures or routing loops during recovery events. We observe that verified loss bursts on average are longer than those unverified.
In addition, during recovery events, more than 98% of routing failures or routing loops last less than 5 seconds, while during failover events, about 80% of routing failures or routing loops last less than 5 seconds, as shown in Fig. 6.26. This means that loss bursts caused by routing failures during recovery events are much shorter than those during failover events. We also observe that unverified loss bursts last less than 4 seconds. Figure 6.27b shows the duration of verified loss bursts that are caused by routing failures and loops during recovery events. We observe that 57% of the verified packet loss is due to forwarding loops, which is slightly higher than that for failover events (47%). This implies that forwarding loops are also quite common during recovery events.

Fig. 6.26 Duration for verified vs. unverified loss bursts during failover events: (a) Loss bursts verified vs. unverified, (b) Routing loops vs. routing failures. x-axis: duration (seconds); y-axis: CDF [31] (Copyright 2006 Association for Computing Machinery, Inc. Reprinted by permission.)

Fig. 6.27 Duration of verified loss bursts during recovery events: (a) Loss bursts verified vs. unverified, (b) Routing loops vs. routing failures. x-axis: duration (seconds); y-axis: CDF [31] (Copyright 2006 Association for Computing Machinery, Inc. Reprinted by permission.)

6.6 Research Approaches

We have seen from the measurement study in the previous section that routing failures and routing loops contribute significantly to degraded end-to-end path performance. Several approaches have been proposed to address the problem of routing failures and routing loops.
These approaches can be broadly classified into three categories: convergence-based solutions, path protection-based solutions, and multiple path-based solutions.

Convergence-Based Solutions
These approaches focus on reducing BGP convergence delay. In particular, they aim to reduce convergence delay by eliminating invalid routes quickly. Reducing convergence delay may indirectly shrink the periods of routing failures or routing loops since it takes less time to converge to a stable route.

Path Protection-Based Solutions
These approaches focus on preestablishing recovery paths before potential network events. These preestablished paths supplement the best path selected by BGP. When there is a routing outage, the recovery path is used to route traffic. The recovery path could be a preestablished protection tunnel, or an alternate AS path.

Multipath-Based Solutions
The goal of these approaches is to exploit path diversity to provide fault tolerance. To increase path diversity, multipath routes are discovered. For example, multiple routing trees can be created on the same underlying topology. When one of the routes fails, other routes can be probed and then used, if valid, to route traffic.

6.6.1 Convergence-Based Solutions

BGP is a path vector protocol. Each BGP-speaking router has to rely on its neighbors’ announcements to select its best route. Since a BGP-speaking router does not have the topology information, it is possible that an AS explores many AS paths before eventually reaching the final stable path. Figure 6.28 shows an example of the path exploration process during BGP convergence. Suppose the link between AS 1 and AS 0 fails. This failure event makes the destination unreachable at each AS. We refer to this type of event as a fail-down event. The following potential sequence of route changes shows how path exploration can occur.

1. AS 1 sends a withdrawal message to AS 2 and AS 3, respectively.
2. When AS 2 receives the withdrawal, it removes path (2 1 0) from its routing table, selects path (2 3 1 0) as its new best path, and advertises the new path to all neighbors.
3. After AS 3 receives the withdrawal from AS 1, it will use path (3 2 1 0), and advertise it to its neighbors.
4. When AS 2 and AS 3 learn the new paths (2 3 1 0) and (3 2 1 0) from each other, they will remove their best paths, and use path (2 4 3 1 0) and path (3 4 2 1 0), respectively.
5. Since both AS 2 and AS 3 use the paths from AS 4, they will send AS 4 withdrawal messages to withdraw their previously advertised paths. As a result, AS 4 loses all its paths, and sends a withdrawal message to AS 2 and AS 3, respectively.
6. After AS 2 and AS 3 receive the withdrawals from AS 4, their routing tables do not have any route to the destination.

Fig. 6.28 An example of path exploration during BGP convergence. The list of AS paths shown beside each node is the set of permissible paths for the node, and the permissible paths are ordered in descending order of local preference. Permissible paths shown in the figure: AS 4: (4 3 1 0), (4 2 1 0), (4 2 3 1 0); AS 2: (2 1 0), (2 3 1 0), (2 4 3 1 0); AS 3: (3 1 0), (3 2 1 0), (3 4 2 1 0); AS 1: (1 0)

This example shows that each node literally has to try several AS paths that traverse the failed link or node before it finally chooses the best valid path or determines that there is no valid path. For instance, AS 2 might explore the sequence of AS paths (2 1 0) → (2 3 1 0) → (2 4 3 1 0) before it removes all paths from its routing table. Previous measurement studies have shown that BGP may take tens of minutes to reach a consistent view of the network topology after a failure [17–19]. Note that although this example shows a fail-down scenario, it can be extended to a fail-over scenario in which an AS has to explore many invalid paths before settling on a stable valid path.

Several solutions have been proposed to rapidly identify and remove invalid routes in order to suppress the exploration of obsolete paths [5, 7, 23, 24].
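Path exploration in the fail-down example can be reproduced with a small simulation. The sketch below is a deliberately simplified model, not an implementation of BGP: it assumes synchronous rounds in which every AS whose best path changed updates all neighbors at once, it ignores the MRAI timer and message delays, and it uses the permissible path lists of Fig. 6.28 as the ranking. The function names are illustrative. Because of the synchronous timing, AS 3 and AS 4 may step through fewer paths than in the narrative above, while AS 2 follows the same sequence.

```python
# Permissible paths per AS, most preferred first (from Fig. 6.28).
PERMISSIBLE = {
    1: [(1, 0)],
    2: [(2, 1, 0), (2, 3, 1, 0), (2, 4, 3, 1, 0)],
    3: [(3, 1, 0), (3, 2, 1, 0), (3, 4, 2, 1, 0)],
    4: [(4, 3, 1, 0), (4, 2, 1, 0), (4, 2, 3, 1, 0)],
}
NEIGHBORS = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4], 4: [2, 3]}


def select_best(asn, rib_in):
    """Most preferred permissible path that can be built from neighbors' routes."""
    candidates = {(asn,) + p for p in rib_in[asn].values() if p is not None}
    for path in PERMISSIBLE[asn]:      # looping paths are never permissible
        if path in candidates:
            return path
    return None


def simulate_fail_down():
    """Fail link (1 0) and return the sequence of best paths at each AS."""
    best = {1: (1, 0), 2: (2, 1, 0), 3: (3, 1, 0), 4: (4, 3, 1, 0)}
    rib_in = {a: {n: best[n] for n in NEIGHBORS[a]} for a in NEIGHBORS}
    history = {a: [best[a]] for a in best}

    best[1] = None                     # AS 1 loses its only route
    history[1].append(None)
    changed = {1}
    while changed:
        for a in changed:              # announce or withdraw to all neighbors
            for n in NEIGHBORS[a]:
                rib_in[n][a] = best[a]
        changed = set()
        for a in (2, 3, 4):
            new_best = select_best(a, rib_in)
            if new_best != best[a]:
                best[a] = new_best
                history[a].append(new_best)
                changed.add(a)
    return history


for asn, paths in simulate_fail_down().items():
    print(asn, paths)
```

Even in this idealized setting, AS 2 adopts two invalid paths through the failed link before it ends up with no route, which is the behavior the solutions discussed next try to suppress.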
Consistency Assertions (CA) [24] tries to achieve this goal by examining path consistency based solely on the AS path information carried in BGP announcements. Suppose that an AS has learned two paths to a destination from neighbor N1 and neighbor N2, respectively. N1 advertises path (N1 A B C 0) and N2 advertises path (N2 B X Y 0). CA assumes that each AS can only use one path. Thus, by comparing these two paths, it can detect that the two paths attributed to AS B, (B C 0) and (B X Y 0), are not consistent. We use the example shown in Fig. 6.28 to show how an AS can take advantage of consistency checking to accelerate route convergence. A router can use a withdrawal received directly from a neighbor to check path consistency. When the link between AS 1 and AS 0 fails, AS 1 sends withdrawals to AS 2 and AS 3. Once AS 2 and AS 3 notice that their neighbor AS 1 has withdrawn its path to the destination, they check whether AS 1 appears in any existing path. Since the two paths (2 3 1 0) and (2 4 3 1 0) contain path (1 0), neither can be selected and AS 2 removes them from its routing table. Similarly, AS 3 removes paths (3 2 1 0) and (3 4 2 1 0). Eventually, AS 2 and AS 3 will withdraw their paths to the destination. As a result, CA avoids exploring these invalid paths.

However, AS path consistency might not contain sufficient information about invalid paths. It is hard to accurately detect invalid routes based solely on the AS path information. For example, in Fig. 6.28, after AS 2 and AS 3 receive the withdrawals sent by AS 1 due to the failure of link (1 0), AS 2 and AS 3 send withdrawals to AS 4 since all of their paths go through AS 1. Now suppose that AS 2’s withdrawal reaches AS 4 before AS 3’s does. In this case, AS 4 cannot consider path (4 3 1 0) as an invalid path since the path does not contain the withdrawn path (2 1 0).
AS 4 cannot determine if the withdrawal of path (2 1 0) is due to the failure of link (2 1) or link (1 0).

To accurately identify invalid paths, Ghost Flushing [5] reduces convergence delay by aggressively sending explicit withdrawals to quickly remove invalid paths. Whenever an AS’s current best path is replaced by a less preferred route, Ghost Flushing allows the AS to immediately generate and send explicit withdrawal messages to all its neighbors before sending the new path. The withdrawal messages flush out the path previously advertised by the AS. For example, in Fig. 6.28, after AS 2 receives the withdrawal sent by AS 1 due to the failure of link (1 0), AS 2 will use the less preferred path (2 3 1 0). Before sending the path (2 3 1 0) to its neighbors, AS 2 sends extra withdrawal messages to its neighbors AS 3 and AS 4. Because BGP withdrawal messages are not subject to the MRAI timer, invalid paths can potentially be deleted quickly from the AS’s neighbors. For example, the withdrawal sent by AS 2 will help AS 3 to remove the invalid path (3 2 1 0). This example shows that Ghost Flushing does not really prevent path exploration, but instead attempts to speed up the process.

To further identify invalid routes quickly, additional information can be incorporated into BGP route updates. BGP-RCN and EPIC [7, 23] propose to use location information about failures, or root cause information, to identify invalid routes. When a link failure occurs, the nodes adjacent to the link will detect the change. Such a node, referred to as the root cause node (RCN), will attach its name to the routing update it sends out. The RCN is propagated to other ASes along each impacted path. Thus, an AS can use the RCN to remove all the invalid paths at once. For example, Fig. 6.28 illustrates the basic idea of BGP-RCN. When the link between AS 1 and AS 0 fails, root cause notification is sent with a withdrawal by AS 1.
When AS 2 receives the withdrawal, it uses the root cause notification to find invalid paths that contain AS 1. Thus, path (2 3 1 0) is considered an invalid path and will be removed. Similarly, at AS 3, path (3 2 1 0) is detected as an invalid route. AS 2 and AS 3 send withdrawals to AS 4, and piggyback the root cause in the withdrawals. After it receives the withdrawal messages with the root cause, AS 4 removes all its routes because all paths contain the root cause node AS 1.

EPIC [7] further extends the idea of root cause notification so that it can be applied to a router rather than an AS. In general, a failure can occur at a router or on a link between a pair of routers. A failure on a link between two ASes does not necessarily mean that all links between the two ASes fail. The root cause notification in BGP-RCN can only indicate failures on an AS or on links between a pair of ASes. EPIC further allows routing information to carry failure information about a router or a link between a pair of routers.

Table 6.4 Properties of convergence-based solutions. M is the MRAI timer value, n is the number of ASes in the network, D is the diameter of the network, |E| is the number of AS-level links, and h is the processing delay for a BGP update message to traverse an AS hop

Protocol         Convergence delay   Messages      Modification to   Modification to       eBGP   iBGP
                 (fail-down)         (fail-down)   BGP messages      BGP route selection
Standard BGP     M n                 |E| n         N/A               N/A                   N/A    N/A
CA               M n                 |E|           No                Yes                   Yes    No
Ghost Flushing   h n                 2|E|n Mh      No                Yes                   Yes    Yes
BGP-RCN          h D                 |E| n + 1     Yes               Yes                   Yes    No
EPIC             h D                 |E| 1         Yes               Yes                   Yes    Yes

We summarize important properties of the four approaches in Table 6.4. We consider the upper bound of convergence time and the number of messages during a fail-down event. We also compare those approaches in terms of the modifications needed to standard BGP.
For example, we consider whether an approach needs to modify BGP’s message format or BGP route selection, and whether the approach can be applied to eBGP or iBGP.

6.6.2 Path Protection-Based Solutions

The convergence-based approaches focus on rapidly removing invalid routes to accelerate the BGP convergence process. They are efficient in reducing convergence delay. However, simply applying those methods might not necessarily lead to reliable routing. In fact, accelerating the process of identifying invalid routes might sometimes exacerbate routing outages. Figure 6.29 shows such an example. We first consider the case of running standard BGP. When the link between AS 1 and AS 0 fails, AS 1 sends a withdrawal to AS 2 and AS 3 immediately, and AS 2 sends a withdrawal to AS 3 right after. Upon receiving the withdrawal, AS 3 will quickly switch to the path (3 4 0). At the same time, when AS 2 receives the withdrawal message, it selects path (2 3 1 0). Even though this path is invalid, AS 2 still reroutes traffic to a valid next-hop AS, which has a valid path. Therefore, in this case, AS 2 can reroute traffic to the destination before it receives the valid path (3 4 0).

Fig. 6.29 An example showing transient routing failures at AS 2 when RCN is used. The list of AS paths shown beside each node is the set of permissible paths for the node, and the permissible paths are ordered in descending order of local preference. Permissible paths shown in the figure: AS 4: (4 0), (4 3 1 0), (4 3 2 1 0); AS 3: (3 1 0), (3 2 1 0), (3 4 0); AS 2: (2 1 0), (2 3 1 0), (2 3 4 0); AS 1: (1 0), (1 3 4 0), (1 2 3 4 0)

In contrast, if the root cause information is sent with the withdrawal by AS 1, AS 2 will remove path (2 3 1 0), and temporarily lose its reachability to AS 0 until it receives the new path from AS 3. The duration of the temporary loss of reachability could be longer than that in the case of standard BGP.
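The difference between the two cases can be illustrated with a few lines of code. This is a sketch under stated assumptions rather than the BGP-RCN algorithm itself: the root cause is modeled as the failed AS-level link (1 0), AS paths are tuples taken from Fig. 6.29, and the helper names `contains_link` and `remove_invalid` are hypothetical.

```python
def contains_link(path, link):
    """True if the AS path traverses the given AS-level link in either direction."""
    a, b = link
    return any({x, y} == {a, b} for x, y in zip(path, path[1:]))


def remove_invalid(paths, root_cause_link):
    """Drop every path that traverses the link named by the root cause."""
    return [p for p in paths if not contains_link(p, root_cause_link)]


FAILED_LINK = (1, 0)

# AS 2's remaining route right after AS 1 withdraws (2 1 0), before AS 3
# has announced its new path.
as2_routes = [(2, 3, 1, 0)]

# Standard BGP: AS 2 keeps forwarding to next-hop AS 3, which has already
# switched to the valid path (3 4 0), so packets can still be delivered.
print("standard BGP:", as2_routes[0] if as2_routes else None)

# With root cause information: the stale path is removed at once and AS 2
# has no route until AS 3's announcement arrives.
print("with RCN:", remove_invalid(as2_routes, FAILED_LINK))
```

The filter itself is what makes root cause notification converge quickly in the fail-down case; the same filter is what leaves AS 2 without a next hop in this fail-over case.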
With root cause information, the duration for which AS 2 loses its reachability depends on the delay to get the alternate path from AS 3, which is determined by the time it takes to receive the announcement of path (3 4 0) from AS 3, which is subject to the MRAI timer. Without the root cause information, the duration for which AS 2 loses its reachability depends on the propagation delay of the withdrawal from AS 1 to AS 2, which is not subject to the MRAI timer [26].

The path protection-based solutions are designed specifically for improving the reliability of interdomain routing. The main idea is that local protection paths are identified before failures. When the primary path fails, local protection paths are temporarily used. Many approaches have been proposed for link-state intradomain routing protocols to protect against intradomain link failures [6, 14, 16, 27, 33]. However, BGP-speaking routers do not have knowledge of the global network topology. They have routing information from neighbors only. Therefore, there are two challenges in implementing path protection in BGP: first, one needs to find local preplanned protection paths; second, one needs to decide how and when to use the protection paths. Next, we present several path protection-based approaches. We first focus on how they address the first challenge. We then discuss how they address the second challenge.

Bonaventure et al. [3] have proposed a fast reroute technique, referred to as R-Plink, to protect direct interdomain links. The basic idea is that each router precomputes a recovery path for each of its BGP peering links, which is used to reroute traffic when the protected BGP link fails. In order to discover an appropriate recovery path, each edge router inside an AS advertises its currently active eBGP sessions by using a new type of iBGP update message. After obtaining the other routers’ routing information, an edge router chooses, from all recovery routes, a path to protect its currently active eBGP session.
Figure 6.30 shows an example to illustrate this approach. In this example, AS 2 advertises the same destination to AS 1’s two routers A and C. Suppose that the routing policies on AS 1 are configured to select the path via router A as the best path. However, router A cannot learn any route via router C through BGP because of the local-preference settings on this router. To automatically discover the alternate path, routers A and C advertise their active eBGP sessions. Thus, router A will know an alternate path via routers C and E, and choose the path to protect its current path to the destination. Once the link (A D) fails, router A can forward the packets affected by the failure through the alternate path via the (C E) link.

Fig. 6.30 A precomputed protection path is used to protect the interdomain link between AS 1 and AS 2

In contrast to R-Plink, R-BGP aims to solve the transient routing failure problem for any interdomain link failure, not just for the failure of a direct neighboring interdomain link [15]. R-BGP precomputes an alternate path for each AS to protect interdomain links. In particular, an AS first checks all paths it knows, and then selects the one most disjoint from its current best path, which is defined as the failover path. Finally, the AS advertises the failover path only to the next-hop AS along its best path. Note that in standard BGP, an AS should not advertise its best path to the neighbor currently used to reach that destination, since this path would generate a loop. Advertising a failover path guarantees that, whenever a link goes down, the AS immediately upstream of the down link knows a failover path and can avoid unnecessary packet drops. One limitation of this approach is that it guarantees to avoid routing failures only under hierarchical provider-customer relationships and the common routing policy, i.e., the no-valley and prefer-customer routing policy.
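The failover path selection step of R-BGP can be sketched as follows. The text above only says that the failover path is the known path "most disjoint" from the best path; measuring disjointness by the number of shared AS-level links, and breaking ties in favor of the earlier candidate, are assumptions made for this illustration. The function names are hypothetical, and the example reuses AS 3's permissible paths from Fig. 6.29 purely as sample input.

```python
def links(path):
    """Set of undirected AS-level links traversed by an AS path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def select_failover_path(best, known_paths):
    """Pick the known path that shares the fewest links with the best path."""
    alternatives = [p for p in known_paths if p != best]
    if not alternatives:
        return None
    best_links = links(best)
    # min() keeps the first candidate among equally disjoint paths.
    return min(alternatives, key=lambda p: len(links(p) & best_links))


best = (3, 1, 0)
known = [(3, 1, 0), (3, 2, 1, 0), (3, 4, 0)]
failover = select_failover_path(best, known)
next_hop_as = best[1]   # the failover path is advertised only to this AS
print(failover, "advertised to AS", next_hop_as)
```

In this example the path (3 2 1 0) still depends on link (1 0), so the fully link-disjoint path (3 4 0) is chosen and offered to AS 1, the next-hop AS on the best path.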
Further, R-BGP does not address the routing failures caused by iBGP configuration.

Backup Route Aware Routing Protocol (BRAP) aims to achieve fast transient failure recovery considering both eBGP routing policy and iBGP configurations [28]. To achieve this, BRAP requires that a router be able to advertise an alternate path if its best path is not allowed to be advertised due to loop prevention or routing policies. The general idea of BRAP is as follows: a router should advertise the following policy-compliant paths in addition to the best path: (1) a failover path to the next-hop router along the best path; and (2) a loop-free alternate path, defined as a temporary backup path, to its upstream neighbors. BRAP extends BGP to distribute the alternate routes along eBGP and iBGP sessions.

Table 6.5 Comparing path protection-based solutions. |E| is the number of AS-level links, |Er| is the number of router-level links

Protocol   Messages     Modification to   Modification to      eBGP   iBGP
           (failover)   BGP messages      other part of BGP
R-Plink    N/A          Yes               Yes                  Yes    Yes
R-BGP      |E|          Yes               Yes                  Yes    No
BRAP       |Er|         Yes               Yes                  Yes    Yes

Now, we describe how to use a protection path. When a router needs to use a protection path, the router needs to inform the other routers along the path of the change. Otherwise, redirecting traffic to the protection path could cause forwarding loops. For example, in Fig. 6.30, when router A sends traffic along the alternate path via routers B and C, their routing tables still consider router A as the next hop. Protection tunnels on the data plane are proposed to avoid such forwarding loops [3]. Protection tunnels can be implemented by using encapsulation schemes such as MPLS over IP. With MPLS over IP, only the ingress border router consults its BGP routing table to forward a packet, and encapsulates the packet with an IP header whose destination is set to the IP address of the egress border router.
All the other routers inside the AS will rely on their IGP routing tables or their label forwarding tables to forward the packet. R-BGP utilizes “virtual” connections to avoid forwarding loops. There are two “virtual” connections between each pair of BGP-speaking routers, one for the primary path traffic, and the other for the failover traffic. The virtual connection can be implemented by using virtual interfaces when the two routers are physically connected, or MPLS or IP tunnels if they are not. Similarly, BRAP uses a protection path through MPLS or IP tunnels.

We summarize the features of the three path protection-based solutions in Table 6.5. We consider the upper bound of the number of messages during a failover event, modification to BGP, and whether those approaches can be applied to eBGP or iBGP.

6.6.3 Multiple Path-Based Solution

A straightforward solution to improve route reliability is to discover multiple paths. There are two proposals for multiple path interdomain routing. The first one is MIRO [32], which allows routers to advertise multiple routes to their neighbors instead of only the best one. Thus, MIRO can allow ASes to have more control over the flow of traffic in their networks, as well as enable quick reaction to path failures. The second one is Path Splicing [22], which aims to take advantage of alternate paths in the BGP routing table to discover multiple paths. Instead of using only the best path in the BGP routing table, a packet can select any path in the BGP routing table by indicating which one to use in its header. Clearly, probing has to be deployed before multiple paths can be discovered since arbitrary selection of alternate paths can lead to routing loops.

6.7 Conclusion and Future Directions

Interdomain routing is the glue that binds thousands of networks in the Internet together. Its reliability plays a decisive role in end-to-end path performance.
In this chapter, we have presented the challenges in designing and implementing a reliable interdomain routing protocol. Specifically, through measurement studies, we have presented a clear overview of the impact of transient routing failures and transient routing loops on end-to-end path performance. Finally, we have critically reviewed the existing proposals in this field, highlighting the pros and cons of those approaches.

While certain efforts have been made to enhance interdomain routing reliability, this issue remains open. We believe that the development of new routing infrastructure, for example, multipath routing, is one promising direction for future research. Reliability enhancement through multiple path advertisement is not a new idea. Many efforts have been made to extend BGP to allow the advertisement of multiple paths [12, 20]. However, designing scalable interdomain routing through multiple path advertisement is challenging. One of those challenges is to understand whether the degree of path diversity provided by multiple path advertisement is sufficient to overcome network failures. At the same time, this challenge highlights the need for designing new path diversity metrics. Path diversity metrics such as the number of node-disjoint and link-disjoint paths can be used to compute the inter-AS path diversity. However, new path diversity metrics need to be devised to take into account performance, reliability, and stability.

Acknowledgments

The authors would like to thank the editors, Chuck Kalmanek and Richard Yang, for their comments and encouragement. This work is partially supported by NSF grants CNS-0626617 and CNS-0626618.

References

1. Akella, A., Maggs, B., Seshan, S., Shaikh, A., & Sitaraman, R. (2003). A measurement-based analysis of multihoming. In Proceedings of ACM SIGCOMM, August 2003.
2. Basu, A., Ong, L., Shepherd, B., Rasala, A., & Wilfong, G. (2002). Route oscillations in I-BGP with route reflection. In Proceedings of the ACM SIGCOMM.
3. Bonaventure, O., Filsfils, C., & Francois, P. (2007). Achieving sub-50 milliseconds recovery upon BGP peering link failures. IEEE/ACM Transactions on Networking (TON), 15(5), 1123–1135.
4. Boutremans, C., Iannaccone, G., Bhattacharyya, S. C., Chuah, C., & Diot, C. (2002). Characterization of failures in an IP backbone. In Proceedings of ACM SIGCOMM Internet Measurement Workshop, November 2002.
5. Bremler-Barr, A., Afek, Y., & Schwarz, S. (2003). Improved BGP convergence via ghost flushing. In Proceedings of IEEE INFOCOM 2003, vol. 2, San Francisco, CA, Mar. 30-Apr. 3, 2003, pp. 927–937.
6. Bryant, S., Shand, M., & Previdi, S. (2009). IP fast reroute using not-via addresses. draft-ietf-rtgwg-ipfrr-notvia-addresses-04.
7. Chandrashekar, J., Duan, Z., Zhang, Z. L., & Krasky, J. (2005). Limiting path exploration in BGP. In Proceedings of IEEE INFOCOM 2005, vol. 4, Miami, Florida, March 13–17, 2005, pp. 2337–2348.
8. Gao, L., & Rexford, J. (2001). A stable internet routing without global coordination. IEEE/ACM Transactions on Networking, 9(6), 681–692.
9. Griffin, T. G., & Wilfong, G. (1999). An analysis of BGP convergence properties. In Proceedings of ACM SIGCOMM, pp. 277–288, Boston, MA, September 1999.
10. Griffin, T. G., & Wilfong, G. (2002). On the correctness of IBGP configuration. In Proceedings of ACM SIGCOMM, pp. 17–29, Pittsburgh, PA, August 2002.
11. Griffin, T. G., Shepherd, B. F., & Wilfong, G. (2002). The stable paths problem and interdomain routing. IEEE/ACM Transactions on Networking (TON), 10(2), pp. 232–243.
12. Halpern, J. M., Bhatia, M., & Jakma, P. (2006). Advertising Equal Cost Multipath routes in BGP. draft-bhatia-ecmp-routes-in-bgp-02.txt.
13. Hengartner, U., Moon, S., Mortier, R., & Diot, C. (2002). Detection and analysis of routing loops in packet traces. In Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement, Marseille, France, pp. 107–112.
14. Iselt, A., Kirstädter, A., Pardigon, A., & Schwabe, T. (2004). Resilient routing using ECMP and MPLS. In Proceedings of HPSR 2004, Phoenix, Arizona, USA, April 2004, pp. 345–349.
15. Kushman, N., Kandula, S., Katabi, D., & Maggs, B. (2007). R-BGP: staying connected in a connected world. In 4th USENIX Symposium on Networked Systems Design & Implementation, Cambridge, MA, April 2007, pp. 341–354.
16. Kvalbein, A., Hansen, A. F., Cicic, T., Gjessing, S., & Lysne, O. (2006). Fast IP network recovery using multiple routing configurations. In Proceedings of IEEE INFOCOM, pp. 23–26, Barcelona, Spain, Mar. 2006.
17. Labovitz, C., Malan, G. R., & Jahanian, F. (1998). Internet routing instability. IEEE/ACM Transactions on Networking, 6(5), 515–528.
18. Labovitz, C., Ahuja, A., Bose, A., et al. (2001). Delayed internet routing convergence. IEEE/ACM Transactions on Networking, 9(3), June 2001, pp. 293–306.
19. Labovitz, C., Ahuja, A., Wattenhofer, R., et al. (2001). The impact of internet policy and topology on delayed routing convergence. In Proceedings of IEEE INFOCOM’01, Anchorage, AK, USA, April 2001, pp. 537–546.
20. Mohapatra, P., Fernando, R., Filsfils, C., & Raszuk, R. (2008). Fast connectivity restoration using BGP add-path. draft-pmohapat-idr-fast-conn-restore-00.
21. Morley Mao, Z., Bush, R., Griffin, T., & Roughan, M. (2003). BGP Beacons. In Proceedings of IMC, October 27–29, 2003, Miami Beach, Florida, USA, pp. 1–14.
22. Motiwala, M., Feamster, N., & Vempala, S. (2008). Path splicing. In Proceedings of ACM SIGCOMM 2008, Seattle, WA, August 2008.
23. Pei, D., Azuma, M., Massey, D., & Zhang, L. (2005). BGP-RCN: improving BGP convergence through root cause notification. Computer Networks, 48(2), 175–194.
24. Pei, D., Zhao, X., Wang, L., Massey, D., Mankin, A., Wu, S. F., & Zhang, L. (2002). Improving BGP convergence through consistency assertions. In Proceedings of the IEEE INFOCOM 2002, vol. 2, New York, NY, June 23–27, 2002, pp. 902–911.
25. PlanetLab, http://www.planet-lab.org
26. Rekhter, Y., Li, T., & Hares, S. (2006). A border gateway protocol 4 (BGP-4). RFC 4271.
27. Stamatelakis, D., & Grover, W. D. (2000). IP layer restoration and network planning based on virtual protection cycles. IEEE Journal on Selected Areas in Communications, 18(10), Oct 2000, pp. 1938–1949.
28. Wang, F., & Gao, L. (2008). A backup route aware routing protocol – fast recovery from transient routing failures. In Proceedings of IEEE INFOCOM Mini-Conference, Phoenix, Arizona, April 2008.
29. Wang, F., Gao, L., Spatscheck, O., & Wang, J. (2008). STRID: Scalable trigger-based route incidence diagnosis. In Proceedings of IEEE ICCCN 2008, St. Thomas, U.S. Virgin Islands, August 3–7, 2008, pp. 1–6.
30. Wang, F., Gao, L., Wang, J., & Qiu, J. (2009). On understanding of transient interdomain routing failures. IEEE/ACM Transactions on Networking, 17(3), June 2009, pp. 740–751.
31. Wang, F., Mao, Z. M., Gao, L., Wang, J., & Bush, R. (2006). A measurement study on the impact of routing events on end-to-end internet path performance. In Proceedings of ACM SIGCOMM 2006, Pisa, Italy, September 11–15, 2006, pp. 375–386.
32. Xu, W., & Rexford, J. (2006). MIRO: multi-path interdomain routing. In Proceedings of ACM SIGCOMM 2006, pp. 171–182, Pisa, Italy.
33. Zhong, Z., Nelakuditi, S., Yu, Y., Lee, S., Wang, J., & Chuah, C.-N. (2005). Failure inferencing based fast rerouting for handling transient link and node failures. In Proceedings of IEEE Global Internet, Miami, FL, USA, Mar. 2005, pp. 2859–2863.

Chapter 7
Overlay Networking and Resiliency

Bobby Bhattacharjee and Michael Rabinovich

7.1 Introduction

An “overlay” is a coordinated collection of processes that use the Internet for communication. The overlay uses the connectivity provided by the network to form overlay topologies and information flows that fit its applications, irrespective of the topology of the underlying network infrastructure.
In a broad sense, every distributed system and application forms an overlay. Certainly, routing protocols form overlays, as does the interconnection of NNTP servers that form the Usenet. We use the term “overlay networks” in a narrower sense: an application uses an overlay only if processes on end-hosts are used for routing and relaying messages. The overlay network is layered atop the physical network, which enables additional flexibility. In particular, the overlay topology can be tailored to application requirements (e.g., overlay topologies can be set up to provide low-latency lookup on flat name spaces), overlay routing may choose application-specific policies (e.g., overlay routing meshes can find paths that contradict the policies exported by BGP), and overlay networks can emulate functionality not supported by the underlying network (e.g., overlays can implement application-layer multicast over a unicast network).

The flexibility enabled by overlay networks can be both a blessing and a curse. On the one hand, it gives application developers the control they need to implement sophisticated measures to improve the resilience of their application. On the other hand, overlay networks are built over end-hosts, which are inherently less stable, reliable, and secure than the lower-layer network components comprising the Internet fabric. This presents significant challenges in overlay network design.

B. Bhattacharjee, Department of Computer Science, University of Maryland, College Park, MD 20742, USA, e-mail: bobby@cs.umd.edu
M. Rabinovich, Electrical Engineering and Computer Science, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, Ohio 44106–7071, USA, e-mail: misha@eecs.case.edu
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_7, © Springer-Verlag London Limited 2010

In this chapter, we concentrate on the former aspect of overlay networks and present a survey of overlay applications with a focus on how they are used to increase network resilience. We begin with a high-level overview of some issues that can hamper network operation and how overlay networks can help address these issues. In particular, we consider how overlay networks can make a distributed application more resilient to flash crowds and overload, to component failures and churn, to network failures and congestion, and to denial-of-service attacks.

7.1.1 Resilience to Flash Crowds and Overload

The emergence of the Web has led to a new phenomenon where Internet resources are exposed to potentially unlimited demand. It is difficult (and indeed inefficient) for content providers to provision sufficient capacity for the worst-case load (which is often hard to predict). Inability to predict worst-case load leaves content providers susceptible to flash crowds: rapid surges of demand that exceed the provisioned capacity.

Approaches to address flash crowds differ by resource type. It is useful to distinguish the following types of Internet resources:

• Large files, exemplified by software packages and media files, with file sizes on the order of megabytes for audio tracks, going up to tens or even hundreds of megabytes for software packages and gigabytes for full-length movies.
• Web objects, consisting of typical text and pictures on Web pages, with sizes ranging from one to hundreds of kilobytes.
• Streaming media, where the download (often at bounded bit rates) continues over the duration of content consumption.
• Internet applications, where a significant part of the service demand to process a client request is due to computation at the server rather than delivery of content from the server to the client.

IP multicast is a mechanism at the IP level that could potentially address the flash crowd problem in the first three of these resource types.
At a high level, IP multicast creates a tree with the content source as the root and the content consumers as the leaves. The source sends only one copy of a packet, and routers inside the network forward and duplicate packets as necessary to implement forwarding to all receivers. IP multicast decouples the resource requirements at the source from the number of simultaneous receivers of identical data. However, IP multicast cannot help when different content needs to be sent to different clients, when the same content needs to be sent at different times, or when one needs to scale up an Internet application. Furthermore, although IP multicast is widely implemented, access to the IP multicast service is enabled only within the confines of individual ISPs and only for selected applications.

Overlay networks can help overcome these limitations. Content delivery networks are an overlay-based approach widely used for streaming, large-file, and Web content delivery. A content delivery network (CDN) is a third-party infrastructure that content providers employ to deliver their data. In a sense, it emulates multicast at the application level, with content providers’ sites acting as roots of the multicast trees and servers within the CDN infrastructure as internal multicast tree nodes. What distinguishes a CDN from IP multicast is that, as with any overlay, its deployment does not rely on additional IP services beyond the universal IP unicast service, and that CDN nodes have long-term storage capability, allowing the distribution trees to encompass clients consuming content at different times. A CDN derives economy of scale from the fact that its infrastructure is shared among multiple content providers who subscribe to the CDN’s service.
Indeed, because flash crowds are unlikely to occur at the same time for multiple content providers, a CDN needs much less overprovisioning of its infrastructure than an individual content provider: a CDN can reuse the same capacity slack to satisfy peak demands for different content at different times.

Another overlay approach, called peer-to-peer (P2P) delivery, provides resilience to flash crowds by utilizing client bandwidth in delivering content. By integrating clients into the delivery infrastructure, P2P approaches promise the ability to scale organically with the demand surge: the more clients want to obtain certain content, the more resources are added to the delivery infrastructure. The P2P paradigm has been explored in various contexts, but the most widely used are P2P approaches to large-file downloads and streaming content.

Peer-to-peer or peer-assisted delivery of streaming content is particularly compelling because streaming taxes the capacity of the network and at the same time imposes stringent timing requirements. Consider, for example, a vision for a future Internet TV service (IPTV), where viewers can seamlessly switch between tens of thousands of live broadcast channels from around the world, millions of video-on-demand titles, and tens of millions of videos uploaded by individual users using capabilities similar to those provided by today’s YouTube-type applications. Consider a global carrier providing this service in high definition to 500 million subscribers, with 200 million simultaneous viewers at peak demand watching different streams (either distinct titles or the same titles shifted in time). Assume conservatively that a high-definition stream requires a streaming rate of 6 Mbps (it is currently close to 10 Mbps but is projected to decrease with improvements in coding). The aggregate throughput to deliver these streams to all the viewers is 1.2 Petabits per second.
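The aggregate figure can be checked with a few lines of arithmetic. The sketch below uses the viewer count and streaming rate from the example; the function names, the decimal-prefix convention (1 Pbps = 10^15 bps), and the per-server capacity of 10 Gbps are parameters chosen for illustration.

```python
def aggregate_throughput_bps(viewers: int, stream_rate_bps: int) -> int:
    """Total bit rate when every viewer receives a separate unicast stream."""
    return viewers * stream_rate_bps


def servers_needed(total_bps: int, per_server_bps: int) -> int:
    """Servers required to carry total_bps, rounded up to a whole server."""
    return -(-total_bps // per_server_bps)  # integer ceiling division


VIEWERS = 200_000_000             # simultaneous viewers at peak demand
STREAM_RATE_BPS = 6_000_000       # 6 Mbps per high-definition stream
PER_SERVER_BPS = 10_000_000_000   # illustrative per-server capacity: 10 Gbps

total_bps = aggregate_throughput_bps(VIEWERS, STREAM_RATE_BPS)
total_pbps = total_bps / 10**15
servers = servers_needed(total_bps, PER_SERVER_BPS)

print(f"aggregate: {total_pbps} Pbps, servers: {servers}")
```

Because the load grows linearly with the number of viewers under unicast, halving the per-stream rate or doubling the per-server capacity only halves the server count.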
Even if a video server could deliver 10 Gbps of content, the carrier would need to deploy 120,000 video servers to satisfy this demand through naive unicast. Given these demands on network and server capacity, overlay networks, and peer-to-peer networks in particular, are important technologies for enabling IPTV on a massive scale.

7.1.2 Resilience to Component Failures and Churn

A distributed application needs to be able to operate when some of its components fail. For example, we discussed how P2P networks promise resiliency to flash crowds. However, because they integrate users’ computers into the content delivery infrastructure, they are especially prone to component failures (e.g., when a user kills a process or terminates a program) and to peer churn (as users join and leave the P2P networks). The flexibility afforded by overlay networks can be exploited to incorporate a range of redundancy mechanisms. These mechanisms allow system designers to utilize many failure-prone components (often user processes on end-hosts) to craft highly resilient applications. Existing P2P networks have proven this resiliency by functioning successfully despite constant peer churn. Besides traditional file-sharing P2P networks, other examples of churn-resistant overlay network designs include a peer-to-peer Web caching system [36] and a churn-resistant distributed hash table [52].

7.1.3 Resilience to Network Failures and Congestion

Overlay networks can mitigate the effects of network outages and hotspots. Two end-hosts communicating over an IP network have little control over path selection or quality. The end-to-end path is a product of the IGP routing metrics used within the involved domains, and the BGP policies (set by administrators of these domains) across the domains.
These metrics and policies are often entirely nonresponsive to transient congestion; in some cases, two nodes may fail to find a path (due to BGP policies) even when a path exists. Overlay networks allow end-users finer-grained control over routing and thus can be agile in reacting to the underlying network conditions.

Consider a hypothetical voice-over-IP communication between hosts at the University of Maryland (in College Park, Maryland) and Case Western Reserve University (in Cleveland, Ohio). The default path may traverse an Internet2 router in Pennsylvania. However, if this router is congested, an overlay-based routing system that is sensitive to path latency could try to route around the congestion. For instance, the routing overlay could tunnel the packets through overlay nodes at the University of Virginia and the University of Illinois, which might bypass the temporary congestion on the default path.

Systems such as RON [4], Detour [55, 56], and Peerwise [38] create such routing overlays that route around adverse conditions in the underlying IP network. These systems build meshes for overlay routing and make autonomous routing decisions. RON builds a fully connected mesh and continually monitors all edges. When the direct path between two nodes fails or shows degraded performance, communication is rerouted through the other overlay nodes. Not all systems build a fully connected mesh: Nakao et al. [44] use topology information and geography-based distance prediction to build a mesh that is representative of the underlying physical network. Peerwise creates overlay links only between nodes that can provide shortcuts to each other. Experiments with all of these systems show that it is indeed possible to reduce end-to-end latency and improve connectivity using routing overlays.

7.1.4 Resilience to DoS Attacks

Overlay networks can be used to protect content providers from Distributed Denial-of-Service (DDoS) attacks.
During a DDoS attack, an attacker directs a set of compromised machines to flood the victim’s incoming links. DDoS attacks are effective because (1) the content provider often cannot distinguish an attacking connection from a legitimate client connection, (2) the number of attacking hosts can be large enough that it is difficult for the victim’s network provider to set up static address filters, and (3) the attackers may spoof their source IP addresses. Over the last decade, DDoS attacks have interrupted service to many major Internet destinations, and in some cases, have been the root cause for the termination of service [31].

Networking researchers have developed many elegant approaches to mitigating the effect of and tracing the root of DDoS attacks; unfortunately, almost all of them require changes to the core Internet protocols. Overlay services can be used to provide resiliency without changing protocols or infrastructure. SOS [28] and Mayday [3] are overlay services that “hide” the address of the content-providing server. Instead, the server is “protected” by an overlay, and access to the server may require strong authentication or captchas (that can distinguish attackers from legitimate clients). The protective overlay is large enough that it is not feasible or profitable to attack the entire overlay. The content provider’s ISP blocks all access to the server except by a small set of (periodically changing) trusted nodes who relay legitimate requests to the server.

7.1.5 Chapter Organization

We have discussed various ways in which overlay networks can improve resiliency of networked applications. In the rest of this chapter we discuss some of these applications in more detail. We begin by introducing a foundational concept used in many overlay applications – a distributed hash table – in Section 7.2. We then discuss representative overlay applications including streaming media systems in Section 7.3 and Web content delivery networks in Section 7.4.
Section 7.5 describes an overlay approach to improving the resiliency of Web services against DDoS attacks. We discuss swarming protocols for bulk transfer in Section 7.6, and conclude in Section 7.7.

7.2 A Common Building Block: DHTs

Distributed applications often maintain large sets of identifiers or keys, such as names of files, IDs of game players, or addresses of chat rooms. For scalability, resilience, and load balance, the task of maintaining these keys is divided among the nodes participating in the system. This approach scales since each node deals with only a limited subset of keys; it is resilient since a single key can be replicated onto more than one node; and it balances load since lookups and storage overhead are distributed (relatively) evenly over all the participants.

A node responsible for a key may perform various application-specific actions related to this key: store the corresponding data, act as a control server for a named group, and so forth. A fundamental capability such a system must support is to allow each participating node to identify the node(s) responsible for a given key. Once a seeking node locates the node(s) that store a key, it may initiate corresponding actions.

Distributed Hash Tables (DHTs) are a technique for efficiently distributing keys among nodes. DHTs provide this capability while limiting the knowledge each node must maintain about the other nodes in the system: instead of directly determining a responsible node (as would be the case with regular hashing), a node can only determine some nodes that are “closer” (by some metric) to the responsible node. The node then sends its request to one of the closer nodes, which in turn forwards the request toward a responsible node until the request reaches its target. Good DHTs ensure that requests traverse only a small number of overlay hops en route to a responsible node.
In a system with $n$ nodes, many DHT protocols limit this hop count to $O(\log n)$ while storing only $O(\log n)$ routing state at each node for forwarding requests. Newer designs reduce some of the overheads to constants [23, 41, 50].

DHTs are a common building block for many types of distributed services, including distributed file systems [18], publish–subscribe systems [14, 58], cooperative Web caching [25], and name services [6]. They have even been proposed as a foundation for general Internet infrastructures [58].

DHTs can be built using a structured network, in which the DHT protocol chooses which nodes in the network are linked (and uses the structure inherent in these connections to reduce lookup time), or an unstructured network, in which the node interconnection is either random or specified by an external agent (as can be the case if links are constrained, as in a wireless network, or have specific semantics such as trust). We next describe prototypical DHT systems that are designed for cooperative environments.

7.2.1 Chord: Lookup in Structured Networks

Chord [59] was one of the first DHTs that routed requests in $O(\log n)$ overlay hops while requiring each node to store only $O(\log n)$ routing state. The routing state at each node contains pointers to some other nodes and is called a node’s finger table. Nodes responsible for a key store a data item associated with this key; the DHT can be used to look up data items by key.

Chord assigns an identifier (uniformly at random) to each node from a large ID space ($2^N$ IDs, where $N$ is usually set at 64 or 128). Each item to be stored in the DHT is also assigned an ID from the same space. Chord orders IDs onto a ring modulo $2^N$. An item is mapped to the node with the smallest ID larger than the item’s ID modulo $2^N$. Using this definition, we say that each item is mapped onto the node “closest” to the item in the ID space.
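The mapping of items to nodes can be sketched in a few lines. The example below is illustrative only: it assumes a tiny 6-bit ID space and a hypothetical set of node IDs, and it has global knowledge of all nodes, which a real Chord node does not. It also treats a key equal to a node ID as belonging to that node, which is the usual Chord convention.

```python
from bisect import bisect_left

N_BITS = 6                  # assumed small ID space: 2**6 = 64 IDs
RING_SIZE = 2 ** N_BITS
# Hypothetical node IDs, kept sorted so the ring can be searched by bisection.
NODE_IDS = [2, 4, 8, 15, 21, 22, 28, 31, 34, 43, 46, 52, 55, 62]


def responsible_node(item_id: int) -> int:
    """Return the first node ID at or after item_id, wrapping around the ring."""
    pos = bisect_left(NODE_IDS, item_id % RING_SIZE)
    # pos == len(NODE_IDS) means the item lies past the highest node ID,
    # so it wraps around to the node with the lowest ID.
    return NODE_IDS[pos % len(NODE_IDS)]


for item in (3, 42, 63):
    print(item, "->", responsible_node(item))
```

The modulo on the index is what turns the sorted list into a ring: an item with ID 63 has no node at or after it, so it is assigned to node 2.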
A node with ID $x$ stores a “finger table”, which consists of references to the nodes closest to IDs $x + 2^i$, $i \in \{0, \dots, N-1\}$. The successor of $i$, denoted as $s(i)$, is the node whose ID is immediately greater than $i$’s ID modulo $2^N$. Likewise, the predecessor of $i$, $p(i)$, is the node whose ID is immediately less than $i$’s (Fig. 7.1). Each Chord node is responsible for the half-open interval consisting of its predecessor’s ID (noninclusive) and its own ID (inclusive).

When a new node joins, it finds its “place” on the ring by routing to its own ID (say $x$), and can populate its own routing table by successively querying for nodes with the appropriate IDs ($x + 1, x + 2, x + 4, \dots$). In the worst case, this incurs $O(\log^2 n)$ overhead.

A node returns the data (if any) upon receiving a lookup for a key in the range of IDs it stores. For other lookups, it “routes” (forwards) the query to the node in its finger table with the highest ID (modulo $2^N$) smaller than the key. This process iterates until the item is found or it is determined that there is no item corresponding to the lookup. Figure 7.2 shows two examples of lookups in Chord. In the first case, the data corresponding to key value 3 is looked up (starting from node 52); in the second, 42 is looked up starting from node 46. The figure shows the nodes visited by the queries in each case, and also the interval (part of the Chord space) each node is responsible for.

[Fig. 7.1 Finger table state for Node 2. Ring nodes: 2, 4, 8, 15, 21, 22, 28, 31, 34, 43, 46, 52, 55, 62. Fingers: 2+2^0 = 3 maps to node 4; 2+2^1 = 4 maps to node 4; 2+2^2 = 6 maps to node 8; 2+2^3 = 10 maps to node 15; 2+2^4 = 18 maps to node 21; 2+2^5 = 34 maps to node 34.]

[Fig. 7.2 Two lookups on the Chord ring. Annotations for key 3: interval [61, 5), next hop 62; interval [2, 4), next hop 2. Annotations for key 42: interval [14, 46), next hop 15; interval [31, 47), next hop 31; interval [39, 47), next hop 43.]
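The finger table and the forwarding rule can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: a 6-bit ring, the 14 node IDs of the running example, and a `successor` helper with global knowledge that stands in for state a real node learns through the protocol. The forwarding step implements the textual rule (forward to the finger with the highest ID preceding the key), so the hop sequence it produces need not match Fig. 7.2 hop for hop.

```python
from bisect import bisect_left

N_BITS = 6
RING = 2 ** N_BITS
NODES = [2, 4, 8, 15, 21, 22, 28, 31, 34, 43, 46, 52, 55, 62]  # sorted


def successor(ident):
    """First node whose ID is at or after ident, wrapping around the ring."""
    pos = bisect_left(NODES, ident % RING)
    return NODES[pos % len(NODES)]


def finger_table(x):
    """Nodes responsible for the IDs x + 2**i, for i = 0 .. N_BITS - 1."""
    return [successor(x + 2 ** i) for i in range(N_BITS)]


def between(a, b, v, inclusive_right=False):
    """True if v lies clockwise after a and before b (optionally up to b)."""
    a, b, v = a % RING, b % RING, v % RING
    if v == b:
        return inclusive_right
    span = (b - a) % RING or RING   # a == b denotes the whole ring
    return 0 < (v - a) % RING < span


def lookup(start, key):
    """Return (responsible node, list of nodes visited) for key."""
    node, path = start, [start]
    while True:
        succ = successor(node + 1)
        if between(node, succ, key, inclusive_right=True):
            path.append(succ)       # key falls in (node, succ]: succ owns it
            return succ, path
        # Otherwise forward to the farthest finger that still precedes the key.
        node = next((f for f in reversed(finger_table(node))
                     if between(node, key, f)), succ)
        path.append(node)


print(finger_table(2))
print(lookup(52, 3))
print(lookup(46, 42))
```

Each forwarding step roughly halves the remaining clockwise distance to the key, which is where the logarithmic hop count comes from.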
In practice, Chord nodes inherit most of their routing table from their neighbors (and avoid the $O(\log^2 n)$ work to populate tables). Nodes periodically search the ring for “better” finger table entries. As nodes leave and rejoin, the Chord ring is kept consistent using a stabilize protocol, which ensures eventual consistency of successor pointers. More details about Chord, including the details of the stabilization protocol, can be found in [60].

7.2.2 LMS: Lookup on Given Topologies

As we saw in the previous section, Chord imposes on its nodes an overlay topology that is stipulated by node IDs, and lookup queries traverse routes in this topology. Such networks are often referred to as structured. In contrast, some overlay networks allow participating nodes to form arbitrary topologies, irrespective of their node IDs. These networks are called unstructured. The simplest form of lookup on an unstructured topology is to flood the query. Flooding searches, while adequate for small networks, quickly become infeasible as networks grow larger.

LMS (Local Minima Search [43]) is a protocol designed for unstructured networks that scales better than flooding. In LMS, the owner of each object places replicas of the object on several nodes. As in a DHT, LMS places replicas onto nodes that have IDs “close” to the object. Unlike in a DHT, however, in an unstructured topology there is no deterministic mechanism to route to the node that is closest to an item. Instead, LMS introduces the notion of a local minimum: a node $u$ is a local minimum for an object if and only if the ID of $u$ is the closest to the item’s ID in $u$’s neighborhood (those nodes within $h$ hops of $u$ in the network, where $h$ is a parameter of the protocol, typically 1 or 2). In general, for any object there are many local minima in a graph, and replicas are placed onto a subset of these.
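The definition of a local minimum can be made concrete with a short sketch. Everything specific here is an assumption for illustration: the topology is a hypothetical six-node graph whose labels double as node IDs, closeness is measured as distance on a 64-ID ring, and ties count as minima. The sketch only identifies local minima; it does not model the random walks or replica placement of LMS.

```python
from collections import deque

# Hypothetical unstructured topology: node ID -> list of neighbor IDs.
GRAPH = {
    5: [40, 22],
    40: [5, 33],
    22: [5, 33, 61],
    33: [40, 22, 12],
    61: [22],
    12: [33],
}


def neighborhood(graph, u, h):
    """Nodes within h hops of u (including u), found by breadth-first search."""
    seen = {u}
    frontier = deque([(u, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == h:
            continue                      # do not expand beyond h hops
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return seen


def ring_distance(a, b, ring_size):
    """Shortest distance between two IDs on a ring (assumed closeness metric)."""
    d = abs(a - b) % ring_size
    return min(d, ring_size - d)


def local_minima(graph, item_id, h=1, ring_size=64):
    """Nodes whose ID is at least as close to item_id as any node within h hops."""
    minima = []
    for u in graph:
        mine = ring_distance(u, item_id, ring_size)
        if all(mine <= ring_distance(v, item_id, ring_size)
               for v in neighborhood(graph, u, h)):
            minima.append(u)
    return sorted(minima)


print(local_minima(GRAPH, item_id=10, h=1))
print(local_minima(GRAPH, item_id=10, h=3))
```

Increasing h enlarges each neighborhood, so fewer nodes qualify as local minima; with a large enough h only the globally closest node remains.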
During a search, random walks are used to locate minima for a given object, and a search succeeds when a local minimum holding a replica is located. While DHTs typically provide a worst-case bound of $O(\log n)$ steps for lookups in a network of size $n$, LMS provides a worst-case bound of $O(T(G) + \log n)$, where $T(G)$ is the mixing time of $G$ (the time by which a random walk on the topology $G$ approaches its stationary distribution). $T(G)$ is $O(\log n)$ or polylogarithmic in $n$ for a wide range of randomly-grown topologies. This “$O(T(G) + \log n)$” is typically in the 6–15 range in networks of size up to 100,000. Let $d_h$ be the minimum size of the $h$-hop neighborhood of any node in $G$. LMS achieves its performance by storing $O(\sqrt{n/d_h})$ replicas, and with a message complexity (in its lookups) of $O(\sqrt{n/d_h}\,(T(G) + \log n))$. This is notably worse than DHTs, but is a considerable improvement over other (essentially linear-time) lookup techniques in networks that cannot support a structured protocol, and a vast improvement over flooding-based searches [43].

The use of local minima in LMS provides a high assurance that object replicas are distributed randomly throughout the network. This means that even if the lookup part of the LMS protocol is not used (such as for searches on object attributes that consequently cannot use the virtualized object identifier), flooding searches will succeed with high probability even with relatively small bounded propagation distances. Finally, LMS also provides a high degree of fault-tolerance.

7.2.3 Case Study: OpenDHT

Since many distributed applications can benefit from a lookup facility, a logical step is to develop a DHT substrate. OpenDHT is an example of such a substrate [53]. An application using a DHT may need to execute application-specific actions at each node along DHT routing paths or at the node responsible for a given key.
However, to satisfy a range of applications, OpenDHT takes a minimalist approach: it only allows applications to associate a data item with a given key and store it in the substrate (at a node or nodes that OpenDHT selects to be responsible for this key) as well as retrieve it from the substrate. The DHT routing is done “under the covers” within the substrate and is not exposed to the application. In other words, OpenDHT is an external storage platform for third-party applications. While OpenDHT in itself is a peer-to-peer overlay network, application end-hosts do not participate in it directly. Instead, it runs on PlanetLab [16] nodes; applications that use OpenDHT may or may not use PlanetLab.

OpenDHT provides two simple primitives to applications: put(key, data), which is used to store a data item and an associated key, and get(key), which retrieves previously stored data given its key.¹ Multiple puts with the same key append their data items to the already existing ones, so a subsequent get would retrieve all these data. OpenDHT, therefore, implements an application-agnostic shared storage facility. Due to its open nature, OpenDHT includes special mechanisms to prevent resource hoarding by any given user. It also limits the size of data items to 1 KB and times out deposited data items that are not explicitly renewed by the application. Renewal is done by issuing an identical “put” before the original data item expires.

The shared storage provided by OpenDHT allows end-hosts in a distributed application to conveniently share state, without any administrative overhead. This capability turned out to be powerful enough to support a growing number of applications. In fact, OpenDHT primitives can be used to implement an application that employs its own DHT routing among the application’s end-hosts [53].
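The put/get semantics described above (appending puts, a 1 KB item limit, expiry, and renewal by an identical put) can be mimicked by a small in-memory model. This is not the OpenDHT API: the class name, the `ttl_seconds` parameter, and the injectable clock are hypothetical, and the model runs on a single machine with no DHT routing behind it.

```python
import time

MAX_ITEM_BYTES = 1024   # data items are limited to 1 KB


class SharedStore:
    """Single-process model of an append-on-put store with expiring items."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so expiry can be tested
        self._items = {}         # key -> {data: expiry time}

    def put(self, key, data: bytes, ttl_seconds: float) -> None:
        """Store data under key; an identical put renews the existing item."""
        if len(data) > MAX_ITEM_BYTES:
            raise ValueError("data item exceeds 1 KB")
        self._items.setdefault(key, {})[data] = self._clock() + ttl_seconds

    def get(self, key) -> list:
        """Return all unexpired items stored under key, in insertion order."""
        now = self._clock()
        live = {d: exp for d, exp in self._items.get(key, {}).items() if exp > now}
        if live:
            self._items[key] = live
        else:
            self._items.pop(key, None)
        return list(live)


store = SharedStore()
store.put("chat-room-7", b"alice@host-a", ttl_seconds=60)
store.put("chat-room-7", b"bob@host-b", ttl_seconds=60)
print(store.get("chat-room-7"))
```

In this model two end-hosts that agree on a key can discover each other simply by putting their addresses under it and getting the accumulated list, which is the kind of state sharing the text describes.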
While a great deal of engineering ingenuity ensures that OpenDHT nodes’ resources are shared fairly among competing applications, OpenDHT’s resiliency and scalability come from its overlay network architecture. Besides demonstrating these benefits of overlays, OpenDHT has shown the generality of the DHT concept by using it as a foundation of a substrate that has proved useful for a number of diverse applications.

1 The actual API includes additional primitives and parameters, which are beyond the scope of our discussion.

7.2.4 Securing DHTs

Chord and LMS are only two of many different contemporary lookup protocols. These two protocols assume that nodes are cooperative and altruistic. While these protocols are highly resilient to random component failures, it is more difficult to protect them against malicious attacks. This is especially a concern since DHTs may be built using public, non-centrally administered nodes, some of which may be corrupt or compromised. There are several ways in which adversarial nodes may attempt to subvert a DHT. Malicious nodes may return incorrect results, attempt to route requests to other incorrect nodes, provide incorrect routing updates, prevent new nodes from joining the system, and refuse to store or return items. There are several DHT designs that provide resilience to these types of attacks. We describe one in detail next.

7.2.5 Case Study: NeighborhoodWatch

The NeighborhoodWatch DHT [11] provides security against malicious users that attempt to subvert a DHT instance by misrouting or dropping queries, refusing to store items, preventing new nodes from joining, and similar attacks. NeighborhoodWatch employs the same circular ID space as Chord [59], and also maps its nodes into neighborhoods as in [20]. However, in NeighborhoodWatch, each node has its own neighborhood that consists of itself, its k successors, and its k predecessors, where k is a system parameter.
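As a minimal sketch of the neighborhood definition just given, the following function computes a node's neighborhood on a circular ID space. The function name and the representation of the ring as a plain list of integer IDs are assumptions for illustration; the actual system works with signed certificates rather than a global membership list.

```python
def neighborhood(ring_ids, node_id, k):
    """Return node_id's neighborhood: its k predecessors, itself, its k successors.

    ring_ids is a hypothetical global list of node IDs; IDs wrap around
    the circular space, so the first node's predecessors are the last nodes.
    """
    ring = sorted(ring_ids)
    n = len(ring)
    if 2 * k + 1 > n:
        raise ValueError("ring too small for a neighborhood of 2k + 1 nodes")
    i = ring.index(node_id)
    # Offsets -k..k around position i, taken modulo the ring size.
    return [ring[(i + off) % n] for off in range(-k, k + 1)]
```

With eight nodes whose IDs are 10, 20, ..., 80 and k = 2, the neighborhood of node 10 wraps around the ring and consists of nodes 70, 80, 10, 20, and 30.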
NeighborhoodWatch’s security guarantees hold if and only if for every sequence of $k + 1$ consecutive DHT nodes, at least one is alive and honest. NeighborhoodWatch employs an on-line trusted authority, the Neighborhood Certification Authority (NCA), to attest to the constituents of neighborhoods. The NCA has a globally known public key. The NCA may be replicated, and the state shared between NCA replicas is limited to the NCA private key, a list of malicious nodes, and a list of complaints of non-responsive nodes.

The NCA creates, signs, and distributes neighborhood certificates, or nCerts, to each node. Nodes need a current and valid nCert in order to participate in the system. Upon joining, nodes receive an initial nCert from the NCA. nCerts are not revoked; instead, nodes must renew their nCerts on a regular basis by contacting the NCA. nCerts list the current membership of a neighborhood, accounting for any recent changes in membership that may have occurred. Using signed nCerts, any node can identify the set of nodes that are responsible for storing an item with a given ID. NeighborhoodWatch employs several mechanisms that detect and prove misbehavior (described in detail in [11]). The NCA removes a malicious node from the DHT by refusing to sign a fresh nCert for that node.

Nodes maintain and update their finger tables as in Chord. The join procedure is shown in Fig. 7.3. For each of node n’s successors, predecessors, and finger table entries, n stores a full nCert (instead of only the node ID and IP address as in Chord). When queried as part of a lookup operation, nodes return nCerts rather than information about a single node. Routing is iterative: if a node on the path fails (or does not answer), the querier can contact another node in the most recently obtained nCert.

Fig. 7.3 The join process in the NeighborhoodWatch DHT [11]. Here k = 3. Panels: (1) Node n requests to join by contacting an NCA replica. (2) NCA returns an nCert to n, who uses it to find owner(n.id). (3) Node n returns nCert_owner(n.id) to NCA. (4) NCA requests neighborhood certificates from k predecessors and k successors of n. (5) Nodes return current certificates and the NCA verifies their consistency. (6) NCA issues fresh certificates to all affected nodes.

Recall that NeighborhoodWatch assumes that every sequence of $k + 1$ consecutive nodes in the DHT contains at least one node that is alive and honest. The insight is that if nodes cannot choose where they are placed in the DHT, malicious nodes would have to corrupt a large fraction of the nodes in the DHT in order to obtain a long sequence of consecutive, corrupt nodes. By making routing depend on long sequences of nodes (neighborhoods), nodes are guaranteed to know of at least one other honest node that is “near” a given point in the DHT. In order to protect against a given fraction f of malicious nodes, the system operator chooses a value of k such that this assumption holds with high probability.

Items published to the DHT are self-certifying. In addition, when a node stores an item, it returns a signed receipt to the publisher. This receipt is then stored back in the DHT. This prevents nodes from lying about whether they are storing a given item: if a querier suspects that a node is refusing to return an item, it can look for a receipt. If it finds a receipt, it can petition the NCA to remove the misbehaving node from the DHT.
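The choice of k for a given fraction f of malicious nodes can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from [11]; it assumes that each node is malicious independently with probability f (a simplification of "nodes cannot choose where they are placed") and ignores honest nodes that are merely down. Under that assumption, a particular run of k + 1 consecutive nodes is entirely malicious with probability f^(k+1), and a union bound over the n possible starting positions bounds the probability that any such run exists by n · f^(k+1). The function name and the target failure probability `eps` are hypothetical.

```python
def min_neighborhood_k(n, f, eps):
    """Smallest k with n * f**(k + 1) <= eps.

    n   : number of nodes in the DHT
    f   : assumed independent probability that a node is malicious
    eps : acceptable probability that some run of k + 1 consecutive
          nodes is entirely malicious (union bound, so conservative)
    """
    if n < 1 or not (0 < f < 1) or not (0 < eps < 1):
        raise ValueError("need n >= 1, 0 < f < 1 and 0 < eps < 1")
    k = 0
    while n * f ** (k + 1) > eps:
        k += 1
    return k


# 10,000 nodes, 20% malicious, one-in-a-million target
print(min_neighborhood_k(10_000, 0.2, 1e-6))  # prints 14
```

Because the bound decays geometrically in k, the required neighborhood size grows only logarithmically with the network size under these assumptions, but it grows quickly as f approaches 1.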
7.2.6 Summary and Further Reading

In this section, we have described the basic functionality provided by DHTs, and provided case studies that demonstrate different flavors of DHTs and lookup protocols. We have described how DHTs attain their lookup performance, and also described how DHT protocols can be subverted by attackers. Finally, we have presented a DHT design that is more resilient to noncooperative and malicious behavior. Our review is not comprehensive; there are many other interesting DHT designs. We point the interested reader to [12, 20, 23, 41, 50, 51, 54, 66].

7.3 Resilient Overlay-Based Streaming Media

Overlay-based streaming media systems can be decomposed into three broad categories depending on their data delivery mechanism (Fig. 7.4). Participants in a single-tree system arrange themselves into a tree. By definition, this implies that there is a single, loop-free path between any two tree nodes. The capacity of each tree link must be at least the streaming rate. Content is forwarded (i.e., pushed) along the established tree paths. The source periodically issues a content packet to its children in the tree. Upon receiving a new content packet, each node immediately forwards a copy to its children. The uplink bandwidth of leaf nodes remains unused (except by recovery protocols) in a single-tree system.

Fig. 7.4 Decomposition of streaming media protocols into single-tree, multi-tree, and mesh dataplanes, together with the single-tree-mesh and multi-tree-mesh hybrids

Examples of single-tree systems include ESM [15], Overcast [26], ZIGZAG [61], and NICE [8]. In a multi-tree system, each participating node joins k different trees and the content is partitioned into k stripes. Each stripe is then disseminated in one of the trees, just as in a single-tree system. In a multi-tree protocol, each member node can be an interior node in some tree(s) and a leaf node in other trees.
Further, each stripe requires only 1/k of the full stream bandwidth, enabling multi-trees to utilize forwarding bandwidths that are a fraction of the stream rate. These two properties enable multi-tree systems to utilize available bandwidth better than a single tree. SplitStream [13], CoopNet [45], and Chunkyspread [62] are examples of multi-tree systems.

In mesh-based or swarming overlays, the group members construct a random graph. Often, a node’s degree in the mesh is proportional to the node’s forwarding bandwidth, with a minimum node degree (typically five [69]) sufficient to ensure that the mesh remains connected in the presence of churn. The source periodically makes a new content block available, and each node advertises its available blocks to all its neighbors. A missing block can then be requested from any neighbor that advertises the block. Examples of mesh-based systems are CoolStreaming [69], Chainsaw [46], PRIME [39], and PULSE [47].

As Fig. 7.4 shows, the base dataplanes can be combined to form hybrid dataplanes. Hybrid dataplanes combine tree- and mesh-based systems by employing a tree backbone and an auxiliary mesh structure. Typically, blocks are “pushed” along the tree edges (as in a regular tree protocol) and missing blocks are “pulled” from mesh neighbors (as in a regular mesh protocol). Prototypical examples of single-tree-mesh systems are mTreeBone [65] and Pulsar [37]. Bullet [29] is also a single-tree mesh, but instead of relying on the primary tree backbone to deliver the majority of blocks, random subsets of blocks are pushed along a given tree edge and nodes recover the missing blocks via swarming. PRM [9] is a probabilistic single-tree mesh system. Chunkyspread [62], GridMedia [68], and Coolstreaming+ [33, 34] are multi-tree-mesh systems. CPM [22] is a server-based system that combines server multicast and peer uploads.

7.3.1 Recovery Protocols

Tree-based delivery is fragile, since a single failure disconnects the data delivery until the tree is repaired. Existing protocols have added extra edges to a tree (thus approximating a mesh) for reducing latency [40] and for better failure recovery [9, 67]. These protocols are primarily tree-based, but augment tree delivery (or recovery) using the extra links. Multi-tree protocols are more resilient, since a single failure often affects only one (of k) trees. Mesh delivery is robust by design; single-node or even multiple failures are not of high consequence since the data is simply pulled along surviving mesh paths. We next describe in detail different delivery protocols with a focus on their recovery behavior.

7.3.2 Case Study: Recovery in Trees Using Probabilistic Resilient Multicast (PRM)

PRM [10] introduces three new mechanisms to tree delivery: randomized forwarding, triggered NAKs, and ephemeral guaranteed forwarding. We discuss randomized forwarding in detail. In randomized forwarding, each overlay node, with a small probability, proactively sends a few extra transmissions along randomly chosen overlay edges. Such a construction interconnects the data delivery tree with some cross edges and is responsible for fast data recovery in PRM under high failure rates of overlay nodes.

We explain the details of proactive randomized forwarding [10] using the example shown in Fig. 7.5. In the original data delivery tree (Panel 0), each overlay node forwards data to its children along its tree edges. However, due to network losses on overlay links (e.g., ⟨A, D⟩ and ⟨B, F⟩) or failure of overlay nodes (e.g., C, L, and Q), a subset of existing overlay nodes do not receive the packet (e.g., D, F, G, H, J, K, and M). We remedy this as follows. When any overlay node receives the first copy of a data packet, it forwards the data along all other tree edges (Panel 1).
It also chooses a small number (r) of other overlay nodes and forwards data to each of them with a small probability, β. For example, node E chooses to forward data to two other nodes, F and M, using cross edges. Note that as a consequence of these additional edges some nodes may receive multiple copies of the same packet (e.g., node T in Panel 1 receives the data along the tree edge ⟨B, T⟩ and cross edge ⟨P, T⟩). Therefore, each overlay node needs to detect and suppress such duplicate packets. Each overlay node maintains a small duplicate suppression cache, which temporarily stores the set of data packets received over a small time window. Data packets that miss the latency deadline are dropped. Hence the size of the cache is limited by the latency deadline desired by the application. In practice, the duplicate suppression cache can be implemented using the playback buffer already maintained by streaming media applications. It is easy to see that each node on average sends or receives up to 1 + βr copies of the same packet. The overhead of this scheme is βr, where we choose β to be a small value (e.g., 0.01) and r to be between 1 and 3. In PRM, nodes discover other random nodes by employing periodic random walks.

Fig. 7.5 The basic idea behind PRM. The circles represent the overlay nodes. The crosses indicate link and node failures. The arrows indicate the direction of data flow. The curved edges indicate the chosen cross overlay links for randomized forwarding of data. [10]

It is instructive to understand why such a simple, low-overhead randomized forwarding technique is able to increase packet delivery ratios with high probability, especially when many overlay nodes fail. Consider the example shown in Fig. 7.6, where a large fraction of the nodes have failed in the shaded region. In particular, the root of the subtree, node A, has also failed.
So if no forwarding is performed along cross edges, the entire shaded subtree is partitioned from the data delivery tree. No overlay node in this entire subtree would receive data packets until the partition is repaired. However, using randomized forwarding along cross edges, a number of nodes from the unshaded region will have random edges into the shaded region as shown (⟨M, X⟩, ⟨N, Y⟩, and ⟨P, Z⟩). The overlay nodes that receive data along such randomly chosen cross edges will subsequently forward data along regular tree edges and any chosen random edges. Since the cross edges are chosen uniformly at random, a large subtree will have a higher probability of cross edges being incident on it. Thus as the size of a partition increases, so does its chance of repair using cross edges.

Fig. 7.6 PRM provides successful delivery with high probability because large subtrees affected by a node failure get randomized recovery packets with high probability. The shaded region is an overlay subtree with a large number of node failures. [10]

Triggered NAKs are the reactive components of PRM. An overlay node can detect missing data using gaps in received sequence numbers. This information is used to trigger NAK-based retransmissions. PRM further includes an Ephemeral Guaranteed Forwarding technique, which is useful for providing uninterrupted data service when the overlay construction protocol is detecting and repairing a partition in the data delivery tree. Here, when the tree is being repaired, the root of an affected subtree receives a stream of data from a “random” peer. More details about PRM are available in [10].

7.3.3 Case Study: Multi-Tree Delivery Using Splitstream

In Splitstream, the media is divided into k stripes, using a coding technique such as multiple description coding (MDC).
All of the stripes in aggregate provide perfect quality, but each stripe can be used independently of the others and each received stripe progressively improves the stream quality. Splitstream forms k trees, such that, ideally, each node is an interior node in only one tree. The source multicasts stripes onto different trees, and each node receives all stripes and forwards only one stripe. When a node departs, at most one tree is affected since every node is a leaf in all but one tree. Therefore, node departures do not affect delivery quite as much as in a single-tree system. Further, the forwarding bandwidth of every node is now used, since each node is an interior node in at least one stripe tree. Finally, since each stripe is approximately 1/k of the bandwidth of the original stream, each node can serve more children, which results in a shorter tree (higher average outdegree) and lower latency.

Splitstream is built atop Scribe, which itself is an overlay multicast protocol built using the Pastry DHT. Due to bandwidth constraints on individual nodes, it is not always feasible to form the ideal interior-disjoint trees such that each node is an interior node in only one tree. In particular, a stripe tree may run out of forwarding bandwidth (because all of its leaf nodes are interior nodes in some other tree). To solve this problem, Splitstream maintains a “Spare Capacity Group (SCG),” which contains nodes with extra capacity that can forward onto more than one stripe. In bandwidth-scarce deployments, nodes may have to use the SCG to locate a parent. In extreme cases, it may be impossible to form a proper Splitstream forest; however, this condition is rare and is analysed in detail in [13].

7.3.4 Case Study: Recovery Using a Mesh in CoolStreaming/DONet

In CoolStreaming, a random mesh connects the members of the data overlay, and random blocks are “pulled” from different mesh neighbors.
Each node maintains an mCache, which is a partial list of other active nodes in the overlay. A new node initially contacts the source; the source selects a random “deputy” from its mCache, and the deputy supplies the new node with currently active nodes. Each node periodically percolates a message (announcing itself) onto the overlay using a gossip protocol.

The media stream is divided into fixed-size segments; each segment has a sequence number and each node maintains a bitmap, called the buffer map, to represent the availability of segments. In CoolStreaming, the default buffer map contains 120 bits. Each node maintains a number of neighbors (called partners) proportional to its forwarding bandwidth, while still maintaining a minimum number of partners (typically 5). Nodes periodically (usually every second) exchange their buffer maps with their partners, and use a scheduling heuristic to exchange blocks.

The scheduling algorithm must select a block to request, and an eligible node to request the block from. The block requested is the scarcest block (the one supplied by the fewest nodes). The node from which this block is requested is the eligible node (one that has advertised the scarce block) with the most bandwidth. The origin node serves only as a supplier and publishes a new content block every second.

Partners can be updated from the node’s mCache as needed, and the mCache is updated using the periodic gossip. Individual node failures have very little effect on the delivery since a node can simply select a different partner to receive a block. However, the trade-off is control overhead (bitmap exchange) and latency (which is now proportional to the product of buffer map size and overlay diameter).

7.4 Web Content Delivery Networks

Resource provisioning is a fundamental challenge for Internet content providers.
Provision too much, and the infrastructure will simply depreciate without generating return on investment; provision too little, and the web site may lose business and potentially steer users to competitors. A content delivery network (CDN) offers a service to content providers that helps address this challenge. A typical CDN provider deploys a number of CDN servers around the globe and uses them as a shared resource to deliver content from multiple content providers that subscribe to the CDN’s service. The CDN servers are also known as edge servers because they are often located at the edges of the networks in which they are deployed. Content delivery networks represent a type of overlay network because they route content between the origin sites and the clients through edge servers.

A CDN improves resiliency and performance of subscribing web sites in several ways. As already mentioned in Section 7.1.1, a CDN can reuse capacity slack to absorb demand peaks for different content providers at different times. By sharing a large slack across a diverse pool of content providers, CDNs improve resiliency of the subscribing web sites to flash crowds. A CDN promises a degree of protection against denial of service attacks because of the enormous capacity the attacker would need to saturate to cause any noticeable performance degradation. A CDN improves the performance of content delivery under normal load because it can process client requests from a nearby edge server.

CDNs are used to deliver a variety of content, including static web objects, software packages, multimedia files, and streaming content, both video-on-demand and live. For video-on-demand, edge servers deliver streams to viewers from their cached files; typically, these files are pre-loaded to the edge server caches from origin sites as they become available.
However, if a requested file is not cached, the edge server will typically obtain the stream from the origin and forward it to the viewer, while also storing the content locally for future requests. In the case of live streaming (“Webcasts”), content flows form a distribution tree, with viewers as leaves, edge servers as intermediate nodes, and the origin as the root. Often, however, CDN servers form deeper trees. In either case, Webcast delivery through a CDN can benefit from various tree-based approaches to streaming media systems such as those discussed in Section 7.3. In the rest of this section, we will limit our discussion to how CDNs deliver static files, including static web objects, software packages, multimedia files, etc.

7.4.1 CDN Basics

A CDN must interpose its infrastructure transparently between the content provider and the user. Furthermore, unlike P2P networks where users run specialized peer software, a CDN must serve clients using standard web browsers. Thus, a fundamental building block in a CDN is a mechanism to transparently reroute user requests from the content provider’s site (known as the “origin site” in the CDN parlance) to the CDN platform. The two main techniques that have been used for this purpose are DNS outsourcing and URL rewriting. Both techniques rely on the domain name system (DNS), which maps human-readable names, such as www.firm-x.com, to numeric Internet protocol (IP) addresses. A browser’s HTTP request is preceded by a DNS query to resolve the host name from the URL. The DNS queries are sent by browsers’ local DNS servers (LDNS) and processed by the web sites’ authoritative DNS servers (ADNS).

In URL rewriting, a content provider rewrites its web pages so that embedded links use host names belonging to the CDN domain.
For example, if a page www.firm-x.com contains an image picture.jpg that should be delivered by the CDN, the image URL would be rewritten to a form such as http://images.firm-x.com.cdn-foo.net/real.firm-x.com/picture.jpg. In this case, the DNS query for images.firm-x.com.cdn-foo.net would arrive at the CDN’s DNS server in the normal way, without redirection from firm-x.com’s ADNS. Note that URL rewriting only works for embedded and hyperlinked content. The container pages (i.e., the entry points to the web sites) would have to be delivered from the origin site directly.

Fig. 7.7 A high-level view of a CDN architecture. Labels in the figure: Client, Local DNS, Auth DNS, Firm-x.com (192.15.183.17), CDN_DNS (135.207.25.01), and CDN edge servers 135.207.24.10 through 135.207.24.13. The query “Images.firm-x.com?” is first answered with “Ask 135.207.25.01” and then resolved to 135.207.24.11; the numbers 1 through 6 mark the steps described in the text.

DNS outsourcing refers to techniques that exploit mechanisms in the DNS protocol that allow a query to be redirected from one DNS server to another. Besides responses containing IP addresses, the DNS protocol allows two response types that can be used for redirection. An NS-type response specifies a different DNS server that should be contacted to resolve the query. A CNAME-type response specifies a canonical name, a different host name that should be used instead of the name contained in the original query. Either response type can be used to implement DNS outsourcing.

Figure 7.7 depicts a high-level architecture of a CDN utilizing DNS outsourcing. Consider a content provider (firm-x.com in the example) that subscribes to CDN services to deliver its content from the images.firm-x.com subdomain. (Content from other subdomains, such as www.firm-x.com, might be delivered independently, perhaps by the provider’s origin server itself.) When a client wants to access a URL with this hostname, it first needs to resolve this hostname into the IP address of the server.
To this end, it sends a DNS query to its LDNS (step 1), which ultimately sends it to the ADNS server for firm-x.com (step 2). ADNS now engages the CDN by redirecting LDNS’s query to the DNS server operated by the CDN provider (CDN DNS in the figure). ADNS does so by returning, in the exchange of step 2, an NS record specifying CDN DNS. LDNS now sends the query for images.firm-x.com to CDN DNS, which can now choose an appropriate edge server and return its IP address to LDNS (step 3). The LDNS server forwards the response to the client (step 4), which now downloads the file from the specified server (step 5). When the request arrives at the edge server, the server may or may not have the requested file in its local cache. If it does not, it obtains the file from the origin server (step 6) and sends it to the client; the edge server can also cache this file for future use, depending on the cache-controlling headers that came with the file from the origin server.

With either DNS outsourcing or URL rewriting, when a DNS query arrives at the CDN’s DNS server, the latter has the discretion to select the edge server whose IP address it returns in the DNS response. This provides the CDN with an opportunity to spread the content delivery load among its edge servers (by resolving different DNS queries to different edge servers) and to localize content delivery (by resolving a given DNS query to an edge server that is close to the requesting client, according to some metric). There are a number of sometimes conflicting factors that can affect edge server selection. The mechanisms and policies for server selection are a large part of what distinguishes different CDNs from one another.

The much-simplified architecture described above is fully workable except for one detail: how does the edge server receiving a request know which origin server to contact for the requested file? CDNs use two basic approaches to this issue. In the example of Fig.
7.7, assuming the client uses HTTP 1.1, the client will include an HTTP Host header “Host: images.firm-x.com” with its request to the edge server. This gives the edge server the necessary information. Another approach, which does not rely on the Host header, involves embedding the provider identity into the path portion of the URL. This technique is used in particular with URL rewriting. For example, with the above URL http://images.firm-x.com.cdn-foo.net/real.firm-x.com/picture.jpg, the client’s request to the edge server will be for file “real.firm-x.com/picture.jpg”, providing the edge server with the information about the origin server.

7.4.2 Bag of DNS Tricks

Looking at Fig. 7.7, an immediate concern with this architecture is the CDN DNS server. First, it is a centralized component that can become the bottleneck in the system. Second, it undermines localized data delivery to some degree because all DNS queries must travel to this centralized component no matter where they come from. These issues are exacerbated by the fact that, in order to retain fine-grained control over edge server selection, CDN DNS must limit the amount of time its responses can be cached and reused by clients. It does so by assigning a low time-to-live (TTL) value to its responses, a standard DNS protocol feature for controlling response caching. This increases the volume of DNS queries that CDN DNS must handle.

Moderate-sized CDNs sometimes disregard these concerns because DNS queries usually take little processing, with a single server capable of handling several thousand queries per second. With the additional consideration that DNS server load is easily distributed in a server cluster, the centralized DNS resolution can handle large amounts of load before becoming the bottleneck in practice. Furthermore, the overhead of nonlocalized DNS processing only becomes noticeable in practice
For large file downloads, such as software packages or multimedia files, a few hundred millisecond of initial delay will be negligible compared to several minutes of the download itself. Large CDNs, however, deal with extraordinary total loads and provide content delivery services for all file sizes. Thus, they implement their DNS service as a distributed system in its own right. One approach to implement a distributed DNS service again utilizes DNS redirection mechanisms. For example, the Akamai CDN [1] implements a two-level DNS system. The top-level DNS server is a centralized component and is registered as the authoritative DNS server for the accelerated content. Thus, initial DNS queries arrive at this server. The top-level DNS server responds to queries with an NS-type response, redirecting the requester to a nearby low-level DNS server. Moreover, these redirecting responses are given a long TTL, in effect pinning the requester to the selected low-level DNS server. The actual name resolution occurs at the low-level DNS servers. Because most DNS interactions occur between clients and low-level CDN DNS servers, the DNS load is distributed and the interactions are localized in the network. Another approach uses a flat DNS system, and utilizes IP anycast to spread the load among them. A CDN using this approach deploys a number of CDN DNS servers in different Internet locations but assigns them the same IP address. Then, it relies on the underlying IP routing infrastructure to deliver clients’ DNS queries destined to this IP address to the closest CDN DNS server. In this way, DNS processing load is both distributed and localized among the flat collection of DNS servers. The Limelight CDN [35] utilizes this technique. Beside DNS service scalability, Limelight further leverages the above technique to sidestep the decision about which of the data centers would be the closest to the client. 
In particular, Limelight deploys a DNS server in every data center; then each given request will be delivered by the anycast mechanism to its closest data center. The DNS server receiving a request then simply picks one of the edge servers co-located in the same data center for the subsequent download. This approach, however, is not without drawbacks. One limitation is that it relies exclusively on the proximity notion reflected in Internet routing; there are other considerations, such as network congestion and costs. Another limitation is due to the originator problem discussed in the next subsection.

7.4.3 Issues

The basic idea behind CDNs might seem simple, but many technical challenges lurk. An obvious challenge is server selection, which is an open-ended issue. There are a number of factors that may affect the selection. A basic factor is proximity: one of the key promises of CDN technology is that it can deliver content from a nearby network location. But what does “nearby” mean?

To start with, there are a number of proximity metrics one could use, which differ in how closely they correlate with end-to-end performance and how hard they are to obtain. Geographical distance, autonomous system hops, and router hops could be used as relatively static proximity metrics. Static metrics may incorporate domain knowledge, such as maps of private peering points among network providers, since private peering points can be more reliable than public network access points. Then, one could consider dynamic path characteristics, such as packet loss, network packet travel delay (one-way or round-trip), and available path bandwidth. Obtaining these dynamic metrics and keeping them fresh is much more challenging. Further, a CDN may account for economic factors, such as the preference of utilizing certain network carriers even at the expense of a slight performance degradation.
Once the proximity metrics are figured out, the next question is how to combine them with server load metrics, since in the end we need to pick a certain edge server for a given request. Server loads are inherently dynamic. They raise a number of questions of their own, with their own research literature. How long a history of past data to consider, and which load characteristics to measure? One can consider a variety of characteristics, including CPU usage, network utilization, memory, and disk IO. How frequently to collect load measurements, and how frequently to recompute load metrics? How to avoid a “herd effect” [19], where a CDN sends too much of the demand to an underloaded server, only to overload it in the next cycle?

The next set of questions is architectural in nature. As we discussed earlier, the prevalent mechanism in CDNs for routing requests to a selected edge server is based on DNS. DNS-based routing raises the so-called originator and hidden load problems [49]. The originator problem is due to the fact that CDN proximity-based server selection can only be done relative to the originator of the DNS query, which is the client’s DNS server, and not the actual host that will be downloading the content. Thus, the quality of any proximity-based server selection is limited by how close the actual client is to the LDNS it is using. While there has been some work on determining the distance between clients and their LDNSs [42, 57], the end-to-end effect of this issue on user-perceived performance is not yet fully known.

One way to sidestep the originator problem is to utilize IP anycast for the HTTP interaction [2]. Similar to the anycast-based DNS interactions considered previously, different edge servers in this case would advertise the same IP address. This address would be returned to the clients by CDN DNS, and packets from a given browser machine would be delivered to the closest edge server naturally thanks to IP routing.
Anycast was previously considered unsuitable for HTTP downloads for two reasons. First, unlike DNS, which uses the UDP transport protocol by default, HTTP runs on top of TCP. TCP is a stateful connection-oriented protocol, and if a routing path changes in the middle of an ongoing download, the browser may attempt to continue the download from a different edge server, leading to a broken TCP connection. Second, IP anycast selects among end-points for packet delivery without consideration for the routing path quality or end-point load. However, recent insights into anycast behavior [7] and network traffic engineering [63] alleviate these concerns, especially when a CDN is deployed within one autonomous system. ICDS – a CDN service by AT&T [5] – is currently pursuing a variant of this approach.

The hidden load problem arises because of the drastically different numbers of clients behind different LDNS servers. A large ISP likely has thousands of clients sharing the same LDNS. Then, a single DNS query from this LDNS can result in a large amount of demand for the selected edge server. At the same time, a single query from the LDNS of a small academic department will impose a much smaller load. Because a CDN distributes load at the granularity of DNS queries, potentially drastic and unknown imbalances of load resulting from single queries complicate proper load balancing.

Another architectural issue relates to the large number of edge servers a CDN maintains. When new popular content appears and generates a large number of requests, these requests will initially miss in the edge server caches and will be forwarded to the origin server. These missed requests may overload the origin server at the beginning of a flash crowd, until edge servers populate their caches [27]. CDNs often pre-load new content to the edge servers when the content is known to be popular. However, unpredictable flash crowds present a danger.
Consequently, CDNs sometimes deploy peer-to-peer cooperation among their edge servers, with edge servers forwarding missed requests to each other rather than directly to the origin server. This gives rise to more complex overlay network topologies than the one-hop overlay routing in the basic CDN architecture described here. In fact, the underlying mechanisms can be even more complex: the complex overlay topologies add overhead due to application-level processing at each hop. Thus, one could try to use a simple one-hop topology under normal load and add more complex request routing dynamically once the danger of a “miss storm” is detected. This in turn opens a range of interesting algorithmic questions involved in deciding when to start forming a complex topology and how to form it.

This overview is necessarily brief. Its goal is only to convey the fact that content delivery networks represent an important aspect of Internet infrastructure and a rich environment for research and innovation. We refer the reader to more targeted literature, such as [24, 49, 64].

7.5 Attack-Resilient Services

We have seen that overlay systems provide resilience by design: the lack of centralized entities naturally provides a measure of resilience against component failures. Overlay systems can also form the building block for systems that are resilient to malicious attack. SOS [28] and a subsequent derivative, Mayday [3], are two overlay systems that provide denial-of-service protection for Internet services. We discuss SOS next.

7.5.1 Case Study: Secure Overlay Service (SOS)

Secure Overlay Services (SOS) is an overlay network designed to protect a server (the target) from distributed denial of service attacks. SOS enables a “confirmed” user to communicate with the protected service. Conceptually, the service is protected by a “ring” of SOS overlay nodes, which are able to confirm incoming requests as valid.
Once a request is validated, it is forwarded on to the service. Users, by themselves, are not able to directly communicate with the service (initially); in fact, the protected server’s address may be hidden or changing.

SOS forms a distributed firewall around the target server. The server advertises the SOS overlay nodes (called Secure Overlay Access Points [SOAPs]) as its initial point of contact. Users initiate contact to the server by connecting to one of the SOS overlay nodes. Malicious users may attack overlay nodes, but by assumption are not able to bring down the entire overlay. The server’s ISP filters all packets to the server’s address, except for those from a chosen few nodes (which are allowed to traverse this firewall). These privileged nodes are called “secret servlets”. Secret servlets designate a few SOAP nodes (called Beacons) as the rendezvous point between themselves and incoming connections. Regular SOAP nodes use an overlay routing protocol (such as Chord) to route authenticated requests to the Beacons. Beacons know of and forward requests to the secret servlets. Only secret servlets are allowed through the ISP firewall around the target, and the servlets finally forward the authenticated request to the protected server.

7.6 File-Sharing Peer-to-Peer Networks

Consider the task of distributing a large file (e.g., on the order of hundreds of MB) to a large number of users. We already discussed one overlay approach – CDNs – targeting such an application. However, the CDN approach requires the source of the file to subscribe to CDN services (and pay the resultant service fees). Furthermore, this approach requires a CDN company to be vigilant in provisioning enough resources to keep up with the potential scale of downloads involved. Peer-to-peer networks provide an appealing alternative, which organizes users themselves into an overlay distribution platform. This approach is appealing to content providers because it does not require a CDN subscription.
It also scales naturally with the popularity of a download: the more users are downloading a file, the more resources take part in the overlay distribution network, adding capacity to the delivery platform. Some peer-to-peer networks also provide administrative resiliency, as they have no special centralized administrative component. In fact, the utilization of the client upload bandwidth and CPU capacity in content delivery can also make P2P techniques interesting as an adjunct (rather than an alternative) to a CDN service.

In this section, we will concentrate on unique challenges that arise when the P2P system downloads a large (e.g., on the order of 100s of MB) file. In particular, we will consider the following two challenges:

Block Distribution. Imagine a flash crowd downloading a 100 MB software package. A naive approach (pursued by early P2P networks) would let each peer download the entire file and then make itself available as a source of this file for other peers. This approach, however, would not be able to sustain a flash crowd. Indeed, each peer would take a long time – tens of minutes over a typical residential broadband connection – to download this file, and in the meantime the initial file source would have no help in coping with the demand. The solution is to chop the file into blocks and distribute different blocks to different peers, so that they can start using each other faster for block distribution. But this creates an interesting challenge. Obviously, the system needs to make a diverse set of blocks available as quickly as possible, so that each peer has a better chance of finding another peer from which to obtain missing blocks. But achieving this diversity is difficult when no peer possesses global knowledge about block distribution at any point in time.
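As a rough check of the download-time estimate in the Block Distribution paragraph above, the arithmetic can be worked out under an assumed downstream rate. The text gives no specific rate or block size; the 1 Mbit/s rate and the 0.25 MB block size below are hypothetical values chosen only for illustration.

```latex
% Whole-file download at an assumed 1 Mbit/s:
t_{\text{file}} = \frac{100\ \text{MB} \times 8\ \text{bit/byte}}{1\ \text{Mbit/s}}
                = 800\ \text{s} \approx 13\ \text{min}

% First block, assuming 0.25 MB blocks, at the same rate:
t_{\text{block}} = \frac{0.25\ \text{MB} \times 8\ \text{bit/byte}}{1\ \text{Mbit/s}}
                 = 2\ \text{s}
```

Under these assumptions, a peer that must finish the whole file before serving others is idle as a source for about 13 minutes, whereas a peer exchanging blocks can begin helping after about 2 seconds.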
Free Riders. A particularly widespread phenomenon is that of selfish peers: peers that attempt to make use of the peer content delivery without contributing their own resources. These peers are called “free riders”. More generally, a peer may try to bypass fairness mechanisms in the P2P network and obtain more than its share of resources, thus getting better service at the expense of other users.

We will consider these two challenges in the context of the mesh model of content distribution. Using the terminology of BitTorrent – a popular P2P network – the key components of a mesh P2P network are seeds, trackers, and peers (or leechers). Originally, the file exists at the source server (or servers), called seeds. There is a special tracker node that keeps track of at least some subset of the peers who are in the process of downloading the file. A new peer joins the download (a swarm) by contacting the tracker, obtaining a random subset of existing peers, and establishing P2P connections (i.e., overlay network links) with them. The download makes collective progress by peers exchanging missing blocks along the overlay edges. Having completed the download, a peer may stay in the swarm as a seed, uploading without downloading anything more in return.

7.6.1 Block Distribution Problem

BitTorrent attempts to achieve a uniform distribution of blocks (or “pieces”: a set of blocks in BitTorrent) among the peers through localized decisions. Neighboring peers exchange lists of blocks that they already have. A peer determines which of the blocks it is missing are the rarest in its local neighborhood and requests these blocks first. Because the neighborhoods in the BitTorrent protocol evolve over time, the rarest-first block distribution leads to a more uniform distribution of blocks in the network and to a better chance of a peer finding a useful block without contacting the source.
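The localized rarest-first decision described above can be sketched as follows. This is a simplified illustration rather than BitTorrent's actual implementation: the function name, the dictionary-of-sets representation of neighbors' block lists, and the random tie-breaking are assumptions made for the example.

```python
import random
from collections import Counter
from typing import Dict, Optional, Set, Tuple


def rarest_first(my_blocks: Set[int],
                 neighbor_blocks: Dict[str, Set[int]],
                 rng=random) -> Optional[Tuple[int, str]]:
    """Pick the next block to request and a neighbor to request it from.

    my_blocks: indices of blocks this peer already has.
    neighbor_blocks: for each neighbor, the block indices it has announced.
    Returns (block, neighbor), or None if no neighbor has a missing block.
    """
    # Count, for every block we lack, how many neighbors hold it.
    counts = Counter()
    for blocks in neighbor_blocks.values():
        counts.update(blocks - my_blocks)
    if not counts:
        return None

    # Keep only the blocks held by the fewest neighbors (the locally rarest).
    rarest = min(counts.values())
    candidates = sorted(b for b, c in counts.items() if c == rarest)

    # Break ties at random (an assumption of this sketch).
    block = rng.choice(candidates)
    holders = sorted(p for p, blocks in neighbor_blocks.items() if block in blocks)
    return block, rng.choice(holders)
```

For instance, a peer holding only block 0 whose neighbors announce {0, 1, 2}, {1, 2}, and {1, 3} would request block 3, because exactly one neighbor holds it. Note that the decision uses only the neighborhood's announcements, never global knowledge of the swarm.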
Recently, an ingenious alternative to the BitTorrent protocol has been proposed, which removes the issue of choosing the blocks completely [21]. This new approach, called Avalanche, follows the same mesh model with seeds, trackers, and peers as BitTorrent. However, Avalanche makes virtually every block useful to any peer through network coding, as follows. Peers no longer choose a single, original block to download from their neighbors at a time. Instead, every time a peer uploads a block to a neighbor, it simply computes a linear combination of all the blocks it currently has from a given file using random coefficients, and uploads the result along with auxiliary information, derived from the coefficients it used and those previously received with its own downloaded blocks. Once a peer collects enough encoded blocks (usually the same number as the number of blocks in the file), it can reconstruct the original file by solving a system of linear equations.

A system implementing these ideas has been publicly available as Microsoft Secure Content Downloader since 2007, although the original author of BitTorrent raised questions about the importance of the removal of the block distribution problem in practice and the possible performance overhead involved [17]. These concerns have been reflected in recent empirical studies demonstrating that BitTorrent’s rarest-first piece selection strategy effectively provides block uniformity [30].

7.6.2 Free Riders Problem: Upload Incentives

To improve its resiliency to free riding, BitTorrent utilizes an incentives mechanism. The goal of this mechanism is to ensure that peers who contribute more to content upload receive better download service. As with its approach to the block distribution problem, BitTorrent implements its incentives mechanism largely through localized decisions by each peer, using a round-based unchoking algorithm to decide how much to send to its neighbors.
When a peer learns a set of other peers from the file’s tracker (usually around 30–50), the peer starts by establishing connections to these peers, some of which will agree to send blocks to the peer. At the end of every unchoking round (10 s in most BitTorrent clients), the peer decides which of the peers it should upload blocks to in the next round. To this end, the peer considers the throughput of its download from the peers in the previous round and selects a small number (four in Azureus, a popular BitTorrent client implementation) of peers to which it will upload blocks in the next round. Selecting a peer for uploading is called “unchoking” a peer. In addition to unchoking the top four peers by past contribution, a peer also unchokes another peer at random in each round. This helps the peer to bootstrap new peers, to discover potentially higher-performing peers, and to ensure that every peer, even with poor connectivity, makes some progress; without this “optimistic unchoking,” these impoverished peers would end up choked by everybody. Except for optimistic unchoking, a peer only uploads to other peers if they have blocks that it does not. If two peers have blocks that the other lacks, the peers are said to be interested in one another.

This protocol works because a free rider will end up being choked by most of its neighbors, only relying on random unchokes to make any progress. However, recent work [48] has found that the BitTorrent protocol penalizes high-capacity peers: as the upload performance of a peer increases, its download performance grows, but less than proportionally to the upload contribution. In other words, the protocol is not entirely tit-for-tat in the usual sense of the term. Consequently, a new BitTorrent client called BitTyrant has been implemented that improves the download performance of high-capacity peers [48]. BitTyrant achieves this goal by exploiting the following observation.
Regular BitTorrent peers allocate their upload capacity equally among their unchoked neighbors. Because of this, a strategic peer does not need to upload to regular peers at its maximum capacity: it only needs to upload faster than most of its peers’ other neighbors, so that its peers would keep it unchoked. Thus, the key idea behind the BitTyrant client is to keep, for each neighbor, an estimate of the upload rate that is sufficient to stay in that neighbor’s unchoked set most of the time, and to upload to each neighbor at just that rate. Then, BitTyrant uses the spared upload capacity to unchoke more peers and hence to increase its download performance. Furthermore, the BitTyrant client selects only the peers with the highest return-on-investment: those peers whose data capacity can be obtained “cheaply.” The authors of BitTyrant observed a significant reduction in file download times with their modified client. However, if all clients adopted the selfish BitTyrant behavior, with the cut-off of expensive peers mentioned above, the overall performance for all clients would decrease, especially for low-capacity clients. Thus, while discouraging free riding, BitTorrent still relies on the altruistic contribution of high-capacity peers to achieve its performance.

Although BitTorrent’s unchoking algorithm of giving to the top-four contributors has been broadly described as being tit-for-tat, recent work has shown that it is more accurately represented as an auction [32]. Each unchoking round can be viewed as an auction, where the “bids” are other peers’ uploads in previous rounds, and the “good” being auctioned is the peer’s upload bandwidth. Viewed this way, BitTyrant’s strategy of “coming in the last (winning) place” is easily seen as the clear winning strategy. Also, by reframing BitTorrent as an auction, a solution to strategic attempts like BitTyrant arises: change the way peers “clear” their auction.
A new client has been introduced that replaces BitTorrent’s top-four strategy with a proportional share response. Proportional share is a simple strategy: if a peer has given some fraction, say 10%, of all of the blocks you received in the previous round, then allot to that peer the same fraction, 10%, of your upload bandwidth. Note that this does not necessarily result in peers providing the same number of blocks in return, but rather the same fraction of bandwidth. This results in what turns out to be a very robust form of fairness: the more a peer gives, the more that peer gets. Even highly provisioned peers therefore have an incentive to contribute as much of their bandwidth as possible. The authors of this PropShare client have demonstrated that proportional share is resilient to a wide array of strategic manipulation. Further, PropShare outperforms BitTorrent and BitTyrant, and as more users adopt the PropShare client, the overall performance of the system improves. This work demonstrates the importance of an accurate model of incentives in a complex system such as BitTorrent.

A strategic peer can achieve higher download performance by manipulating the list of blocks it announces to its neighbors [32]. Suppose node p in a BitTorrent swarm possesses some rare blocks. Since p has rare blocks, it is going to be interesting to many of its neighbors, who will all want to upload blocks to p in exchange for these rare blocks. However, once p announces these rare blocks, p’s neighbors will download these blocks from p and exchange them amongst themselves. Node p can sustain interest amongst its neighbors longer by under-reporting its block map, in particular, by strategically revealing the rare blocks one by one. This strategy guarantees that p remains interesting for longer, since p’s neighbors, who all get the same rare block from p, cannot benefit by exchanging amongst themselves. This observation suggests a general under-reporting strategy.
A node can remain interesting to its neighbors longest by announcing only the blocks necessary to maintain interest but no more. Similar to an all-BitTyrant strategy, when all peers strategically under-report their blocks in this manner [32], the overall performance of the system degrades.

In general, BitTorrent’s incentives mechanisms have come under intense scrutiny. Through rich empirical studies and analyses that incorporate various economic principles, BitTorrent continues to grow more robust to cheating clients. Whether a system as complex as BitTorrent can be made fully robust to such users remains open.

7.7 Conclusion

This chapter considers ways by which overlay-based techniques improve application resiliency. We have described how applications can utilize overlay networks to better cope with challenges such as flash crowds, the need to scale to often unpredictable loads, network failures and congestion, and denial of service attacks. We have considered a representative sample of these applications, focusing on their use of overlay network concepts. This sample included distributed hash tables, network storage, large file distribution by peer-to-peer networks, streaming content delivery, content delivery networks, and web services. It is simply not feasible to comprehensively cover overlay applications and research within one chapter. Instead, we hope that this chapter conveys sufficient information to give the reader a sampling of the various application domains where overlays are useful, and a sense of the flexibility that overlay networks provide to an application designer.

Acknowledgments

The authors thank Katrina LaCurts, Dave Levin, and Adam Bender for their comments on this chapter. The authors are grateful to the editors, Chuck Kalmanek and Richard Yang, for their comments and encouragement.

References

1. Akamai Technologies. Retrieved from http://www.akamai.com/html/technology/index.html
2. Alzoubi, H. A., Lee, S., Rabinovich, M., Spatscheck, O., & Van der Merwe, J. (2008). Anycast CDNs revisited. In Proceedings of WWW ’08 (pp. 277–286). New York, NY: ACM. DOI http://doi.acm.org/10.1145/1367497.1367536
3. Andersen, D. G. (2003). Mayday: Distributed filtering for Internet services. In USITS.
4. Andersen, D. G., Balakrishnan, H., Kaashoek, M. F., & Morris, R. (2001). Resilient overlay networks. In Proceedings of 18th ACM SOSP, Banff, Canada.
5. ATT ICDS. Retrieved from http://www.business.att.com/service fam overview.j-sp?serv fam=eb intelligent content distribution
6. Balakrishnan, H., Lakshminarayanan, K., Ratnasamy, S., Shenker, S., Stoica, I., & Walfish, M. (2004). A layered naming architecture for the Internet. In Proceedings of the ACM SIGCOMM, Portland, OR.
7. Ballani, H., Francis, P., & Ratnasamy, S. (2006). A measurement-based deployment proposal for IP anycast. In Proceedings of the ACM IMC, Rio de Janeiro, Brazil.
8. Banerjee, S., Bhattacharjee, B., & Kommareddy, C. (2002). Scalable application layer multicast. In Proceedings of ACM SIGCOMM, Pittsburgh, PA.
9. Banerjee, S., Lee, S., Bhattacharjee, B., & Srinivasan, A. (2003). Resilient multicast using overlays. In Proceedings of the Sigmetrics 2003, Karlsruhe, Germany.
10. Banerjee, S., Lee, S., Bhattacharjee, B., & Srinivasan, A. (2006). Resilient overlays using multicast. IEEE/ACM Transactions on Networking, 14(2), 237–248.
11. Bender, A., Sherwood, R., Monner, D., Goergen, N., Spring, N., & Bhattacharjee, B. (2009). Fighting spam with the NeighborhoodWatch DHT. In INFOCOM.
12. Castro, M., Druschel, P., Ganesh, A. J., Rowstron, A. I. T., & Wallach, D. S. (2002). Secure routing for structured peer-to-peer overlay networks. In OSDI.
13. Castro, M., Druschel, P., Kermarrec, A., Nandi, A., Rowstron, A., & Singh, A. (2003). Splitstream: High-bandwidth multicast in a cooperative environment. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003), Lake Bolton, NY.
14. Castro, M., Druschel, P., Kermarrec, A. M., & Rowstron, A. (2002). Scribe: A large-scale and decentralized application-level multicast infrastructure. IEEE Journal on Selected Areas in Communication, 20(8), 1489–1499. DOI 10.1109/JSAC.2002.803069
15. Chu, Y., Ganjam, A., Ng, T., Rao, S., Sripanidkulchai, K., Zhan, J., & Zhang, H. (2004). Early experience with an Internet broadcast system based on overlay multicast. In Proceedings of USENIX Annual Technical Conference, Boston, MA.
16. Chun, B., Culler, D., Roscoe, T., Bavier, A., Peterson, L., Wawrzoniak, M., & Bowman, M. (2003). Planetlab: An overlay testbed for broad-coverage services. SIGCOMM Computer Communication Review, 33(3), 3–12.
17. Cohen, B. Avalanche. Retrieved from http://bramcohen.livejournal.com/20140.html
18. Dabek, F., Kaashoek, M. F., Karger, D. R., Morris, R., & Stoica, I. (2001). Wide-area cooperative storage with CFS. In SOSP (pp. 202–215).
19. Dahlin, M. (2000). Interpreting stale load information. IEEE Transactions on Parallel and Distributed Systems, 11(10), 1033–1047.
20. Fiat, A., Saia, J., & Young, M. (2005). Making Chord robust to Byzantine attacks. In ESA.
21. Gkantsidis, C., & Rodriguez, P. (2005). Network coding for large scale content distribution. In INFOCOM (pp. 2235–2245).
22. Gopalakrishnan, V., Bhattacharjee, B., Ramakrishnan, K. K., Jana, R., & Srivastava, D. (2009). Cpm: Adaptive video-on-demand with cooperative peer assists and multicast. In Proceedings of INFOCOM, Rio De Janeiro, Brazil.
23. Gupta, I., Birman, K. P., Linga, P., Demers, A. J., & van Renesse, R. (2003). Kelips: Building an efficient and stable P2P DHT through increased memory and background overhead. In IPTPS (pp. 160–169).
24. Hofmann, M., & Beaumont, L. R. (2005). Content networking: Architecture, protocols, and practice. San Francisco, CA: Morgan Kaufmann.
25. Iyer, S., Rowstron, A. I. T., & Druschel, P. (2002). Squirrel: A decentralized peer-to-peer web cache. In PODC (pp. 213–222).
26. Jannotti, J., Gifford, D., Johnson, K. L., Kaashoek, M. F., & O’Toole, J. W., Jr. (2000). Overcast: Reliable multicasting with an overlay network. In Proceedings of the Fourth Symposium on Operating System Design and Implementation (OSDI), San Diego, CA.
27. Jung, J., Krishnamurthy, B., & Rabinovich, M. (2002). Flash crowds and denial of service attacks: Characterization and implications for CDNs and web sites. In WWW (pp. 293–304).
28. Keromytis, A. D., Misra, V., & Rubenstein, D. (2002). SOS: Secure overlay services. In SIGCOMM.
29. Kostic, D., Rodriguez, A., Albrecht, J., & Vahdat, A. (2003). Bullet: High bandwidth data dissemination using an overlay mesh. In Proceedings of SOSP (pp. 282–297), Lake George, NY.
30. Legout, A., Urvoy-Keller, G., & Michiardi, P. (2006). Rarest first and choke algorithms are enough. In IMC.
31. Lemos, R. Blue security folds under spammer’s wrath. Retrieved from http://www.securityfocus.com/news/11392
32. Levin, D., LaCurts, K., Spring, N., & Bhattacharjee, B. (2008). BitTorrent is an auction: Analyzing and improving BitTorrent’s incentives. In SIGCOMM (pp. 243–254).
33. Li, B., Xie, S., Qu, Y., Keung, G., Lin, C., Liu, J., & Zhang, X. (2008). Inside the new coolstreaming: Principles, measurements and performance implications. In Proceedings of the INFOCOM 2008, Phoenix, AZ (pp. 1031–1039).
34. Li, B., Yik, K., Xie, S., Liu, J., Stoica, I., Zhang, H., & Zhang, X. (2007). Empirical study of the coolstreaming system. Proceedings of the IEEE Journal on Selected Areas in Communication (Special Issues on Advance in Peer-to-Peer Streaming Systems), 25(9), 1627–1639.
35. Limelight Networks. Retrieved from http://www.limelightnetworks.com/network.htm
36. Linga, P., Gupta, I., & Birman, K. (2003). A churn-resistant peer-to-peer web caching system. In 2003 ACM Workshop on Survivable and Self-Regenerative Systems (pp. 1–10).
37. Locher, T., Meier, R., Schmid, S., & Wattenhofer, R. (2007). Push-to-pull peer-to-peer live streaming. In Proceedings of the International Symposium of Distributed Computing, Lemesos, Cyprus.
38. Lumezanu, C., Baden, R., Levin, D., Spring, N., & Bhattacharjee, B. (2009). Symbiotic relationships in internet routing overlays. In Proceedings of NSDI, Boston, MA.
39. Magharei, N., & Rejaie, R. (2007). PRIME: Peer-to-peer receiver-drIven MEsh-based streaming. In Proceedings of the INFOCOM 2007, Anchorage, Alaska (pp. 1424–1432).
40. Magharei, N., Rejaie, R., & Guo, Y. (2007). Mesh or multiple-tree: A comparative study of live P2P streaming approaches. In Proceedings of the INFOCOM 2007, Anchorage, Alaska.
41. Malkhi, D., Naor, M., & Ratajczak, D. (2002). Viceroy: A scalable and dynamic emulation of the butterfly. In PODC (pp. 183–192).
42. Mao, Z. M., Cranor, C. D., Douglis, F., Rabinovich, M., Spatscheck, O., & Wang, J. (2002). A precise and efficient evaluation of the proximity between web clients and their local DNS servers. In USENIX Annual Technical Conference (pp. 229–242).
43. Morselli, R., Bhattacharjee, B., Marsh, M. A., & Srinivasan, A. (2007). Efficient lookup on unstructured topologies. IEEE Journal on Selected Areas in Communications, 25(1), 62–72.
44. Nakao, A., Peterson, L., & Bavier, A. (2006). Scalable routing overlay networks. SIGOPS Operating Systems Review, 40(1), 49–61.
45. Padmanabhan, V., Wang, H., Chou, P., & Sripanidkulchai, K. (2002). Distributing streaming media content using cooperative networking. In NOSSDAV, Miami Beach, FL, USA.
46. Pai, V., Kumar, K., Tamilmani, K., Sambamurthy, V., & Mohr, A. (2005). Chainsaw: Eliminating trees from overlay multicast. In IPTPS 2005, Ithaca, NY, USA.
47. Pianese, F., Perino, D., Keller, J., & Biersack, E. (2007). PULSE: An adaptive, incentive-based, unstructured P2P live streaming system. IEEE Trans. on Multimedia, 9(8), 1645–1660.
48. Piatek, M., Isdal, T., Anderson, T. E., Krishnamurthy, A., & Venkataramani, A. (2007). Do incentives build robustness in BitTorrent? (awarded best student paper). In NSDI.
49. Rabinovich, M., & Spatscheck, O. (2001). Web caching and replication. Boston, MA: Addison-Wesley Longman Publishing Co., Inc.
50. Ramasubramanian, V., & Sirer, E. G. (2004). Beehive: O(1) lookup performance for power-law query distributions in peer-to-peer overlays. In NSDI (pp. 99–112).
51. Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (2001). A scalable content-addressable network. In SIGCOMM.
52. Rhea, S., Geels, D., Roscoe, T., & Kubiatowicz, J. (2004). Handling churn in a DHT. In USENIX Annual Technical Conference.
53. Rhea, S. C., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S., Stoica, I., & Yu, H. (2005). OpenDHT: A public DHT service and its uses. In SIGCOMM (pp. 73–84).
54. Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM Middleware 2001, Heidelberg, Germany.
55. Savage, S., Anderson, T., Aggarwal, A., Becker, D., Cardwell, N., Collins, A., Hoffman, E., Snell, J., Vahdat, A., Voelker, G., & Zahorjan, J. (1999). Detour: A case for informed internet routing and transport. IEEE Micro, 19(1), 50–59.
56. Savage, S., Collins, A., Hoffman, E., Snell, J., & Anderson, T. (1999). The end-to-end effects of Internet path selection. In SIGCOMM.
57. Shaikh, A., Tewari, R., & Agrawal, M. (2001). On the effectiveness of DNS-based server selection. In Proceedings of IEEE Infocom, Anchorage, Alaska.
58. Stoica, I., Adkins, D., Zhuang, S., Shenker, S., & Surana, S. (2002). Internet indirection infrastructure. In SIGCOMM (pp. 73–86).
59. Stoica, I., Morris, R., Karger, D. R., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM (pp. 149–160).
60. Stoica, I., Morris, R., Liben-Nowell, D., Karger, D. R., Kaashoek, M. F., Dabek, F., & Balakrishnan, H. (2003). Chord: A scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking, 11(1), 17–32.
61. Tran, D., Hua, K., & Do, T. (2003). ZIGZAG: An efficient peer-to-peer scheme for media streaming. In Proceedings of the INFOCOM 2003, San Francisco, CA.
62. Venkataraman, V., Francis, P., & Calandrino, J. (2006). Chunkyspread: Multi-tree unstructured peer-to-peer multicast. In Proceedings of the 1st International Workshop on Peer-to-Peer Systems (IPTPS ’06), Santa Barbara, CA.
63. Verkaik, P., Pei, D., Scholl, T., Shaikh, A., Snoeren, A., & Van der Merwe, J. (2007). Wresting control from BGP: Scalable fine-grained route control. In 2007 USENIX Annual Technical Conference.
64. Verma, D. C. (2001). Content distribution networks: An engineering approach. New York: Wiley.
65. Wang, F., Xiong, Y., & Liu, J. (2007). mTreebone: A hybrid tree/mesh overlay for application-layer live video multicast. In Proceedings of the ICDCS 2007, Toronto, Canada.
66. Wang, P., Hopper, N., Osipkov, I., & Kim, Y. (2006). Myrmic: Secure and robust DHT routing. Technical Report, University of Minnesota.
67. Yang, M., & Fei, Z. (2004). A proactive approach to reconstructing overlay multicast trees. In Proceedings of the IEEE Infocom 2004, Hong Kong.
68. Zhang, M., Luo, J., Zhao, L., & Yang, S. (2005). A peer-to-peer network for live media streaming – Using a push-pull approach. In Proceedings of the ACM Multimedia, Singapore.
69. Zhang, X., Liu, J., Li, B., & Yum, T. (2005). Donet: A data-driven overlay network for efficient live media streaming. In Proceedings of the INFOCOM 2005, Miami, FL.

Part IV Configuration Management

Chapter 8 Network Configuration Management

Brian D. Freeman

8.1 Introduction

This chapter will discuss network configuration management by presenting a high-level view of the software systems that are involved in managing a large network of routers in support of carrier class services.
It is meant to be an overview, highlighting the major areas that a network operator should assess while designing or buying a configuration management system, and not the final source of all information needed to build such a system.

When a service and its network are small, network configuration management is typically done manually by a knowledgeable technician with some form of workflow to get the data needed to perform their configuration tasks from the sales group. Inventory tracking may be handled by simply inserting comments into the interface description fields on the router and perhaps by maintaining some spreadsheets on a file server. The technician might or might not use an element management system (EMS) to do the configuration changes. If the network is new, for example, supporting the needs of a small company or the network needs of an “Internet startup,” most of the configuration tasks represent a “new order.” Configuration requests occur at low volume and the technician probably has a great deal of flexibility in how he or she goes about meeting the needs of the new network service.

As the number of users of the service grows, the expectations placed on the network operator to meet a certain level of reliability and performance grow accordingly. In time, because of growth in the sheer volume of orders, the single knowledgeable worker becomes a department, and “change orders” that modify the configuration associated with an existing customer of the network become a larger and larger share of the effort. At this point, the network may contain multiple types of routers purchased from different vendors, each of which has different features and resource limits. Changes made to a router configuration to support one customer can now affect another customer. For example, if one customer’s configuration change causes a router resource such as table size to be exceeded, multiple customers might be affected. In addition, other departments or areas within the business now need data on the installed inventory to drive customer reporting, usage-based billing, ticketing, etc. Finally, as the volume grows, there is a need for automation or “flow through provisioning” both to reduce cost and time and to protect against mistakes. The simple, manual approaches no longer work: an end-to-end view is needed for network configuration management so that all the pieces required to support the business can be integrated.

B.D. Freeman, AT&T Labs, Middletown, NJ, USA, e-mail: bdfreeman@att.com
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_8, © Springer-Verlag London Limited 2010

This chapter provides an overview of the elements of a robust network configuration management system. There are many goals for such a system, but the primary goal of any network configuration management system is to protect the network while providing the ordered service for the customers. Since changing the network configuration can cause outages if not done correctly, a key requirement of a network configuration management system is to ensure that the configuration changes do not destabilize the network. The system must provide the ordered service for the customer without affecting other customers, other ports associated with the customer being provisioned, or the network at large. The network configuration management system is also typically the primary source of data (the source of truth) used by many business systems and processes that surround the network.
The functions that depend on configuration data range from the mundane, such as trouble ticketing and spare part tracking, to more sophisticated capabilities such as traffic reporting, for which the association of ports to customers must be obtained so that traffic reports can be properly displayed on the customer service portal. Finally, the network configuration management system is the enforcer of the engineering rules that specify the maximum safe resources to be consumed on the routers for various features. As such, in addition to protecting the network, the system also impacts profitability, since inventory is used either efficiently or inefficiently. This depends on how good the configuration management system is at implementing the engineering rules as well as how good it is at processing service cancellation or disconnect requests in a timely fashion. If the configuration management system does not properly return a port that is no longer in service to the inventory available for new requests, expensive router hardware can be stranded indefinitely.

In summary, the primary goal of a network configuration management system is to manage router configurations to support customer service, subject to three key secondary goals:

• Protect the network
• Be the source of truth about the network
• Enforce the business and engineering rules

To explore this topic further, we will first review in Section 8.2 some key concepts that help structure the types of data items the system must deal with. Section 8.3 describes the subcomponents of the system and the unique requirements of each subcomponent. This section also discusses the two approaches that are commonly used for router configuration, the policy-based and template-based approaches, since this is a key aspect of the problem to be solved.
Section 8.3 also touches on the differences between provider-edge (PE) and customer-edge (CE) router configuration tasks and the differences between consumer and enterprise IP router services in their typical approaches to configuration management. We present a brief overview of provisioning audits, which are discussed in more detail in Chapter 9. Provisioning audits are important to ensure that the network configuration management system remains a good source of truth for the other systems and business processes that need data about the network. Finally, one of the key challenges in a large network is handling changes, ranging from an isolated change to a setting on an individual customer’s interface, to more complex changes such as bulk changes to a large number of routers and interfaces. To illustrate these issues, Section 8.4 discusses the data model and process issues associated with moving a working connection from one configuration to the next. This section also touches on some typical network maintenance activities that impact a system in different ways than customer provisioning does. Section 8.5 shows a complete step-by-step example of provisioning a port order.

8.2 Key Concepts

There are two important types of data that a network configuration management system must handle: physical inventory data and logical inventory data. In addition to these data types, the system has to be designed to appropriately handle and resolve data discords between the state of the network (“What it is”) and the view of the network that is contained in the network inventory database (“What it should be”). This section introduces these concepts.

8.2.1 Physical Inventory

The physical inventory database, as the name implies, records the network hardware that is deployed in the field.
The basic unit is usually a chassis with a set of components, including common elements like route processor cards or power supplies, and line cards with transport interfaces that support one or more customer “ports.” These ports are what carry the customer-facing and backbone-facing traffic. Line cards that support multiple customer ports are often referred to as channelized interfaces (e.g., channelized T3 cards or channelized OC48 cards). The physical inventory database keeps track of whether the subchannels on these line cards are assigned to a customer, with a state for each channel of “assigned” or “unassigned.” The data model for physical inventory often reflects the physical world in which cards are contained in a chassis and a chassis is contained in a cabinet. Each customer port is associated with a subchannel on a physical interface.

8.2.2 Logical Inventory

The logical inventory database includes the inventory data that are not physical. This is a broad and less rigid category of information, since it includes multiple database entities with ephemeral ties to the physical inventory. An IP address is a good example of a database entity with an ephemeral tie. IP addresses exist on an interface, but we can move addresses to ports on another router; hence, an address is not permanently tied to a single piece of physical equipment. Many logical components are inventoried as database entities and assigned as needed by the carrier. IP addresses, VLAN tags, BGP community strings [1], and Autonomous System Numbers (ASNs) [2] are all examples of logical data that need to be tracked and managed. Generally, logical inventory assigned to a customer is associated with a particular piece of physical inventory. However, the association can change over time. A good example of a change in the association between physical and logical inventory occurs when a customer’s connection is upgraded from a T1 to a T3.
The physical inventory will change drastically, but the logical inventory in terms of the IP address, BGP routing, and QoS settings may not change. It is also useful to understand that some logical inventory, like an IP address, is associated with a single piece of equipment, while other logical inventory, like MPLS Route Distinguishers and Route Targets, is “network wide” and is associated with multiple pieces of equipment.

8.2.3 Discords: What It Is Versus What It Should Be?

Data discords are a fact of life in production systems. Through a variety of means, the data in the network and the data in the inventory system get out of sync. In plain language, a situation is created where the inventory view of the world, “what it should be,” does not match the truth, that is, the network view of the world, “what it is.” Both physical and logical inventory can contain discords.

Generally, the physical inventory discords occur because of card replacements and initial installation errors that occur without a corresponding update of the database. For example, a discord would occur if a 4-port Ethernet card was replaced with an 8-port Ethernet card, but the database was not updated. Autodiscovery of hardware components can greatly assist in reducing the data discords in the physical inventory. Many production systems back up the router configuration daily and use commands from the vendor to collect detailed firmware and hardware data from the equipment. The command “show diag” dumps this kind of detailed information and the output can be saved to a file. Very accurate physical inventory information can be obtained by parsing the output of hardware commands run on the router, such as “show diag,” or by issuing various SNMP MIB queries. Automatic discovery of physical inventory can reduce the physical discords to zero.
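The comparison of “what it should be” against “what it is” for physical inventory can be sketched as follows. This is a minimal illustration, not the method of any particular system: the `CardRecord` fields, the card model names, and the serial numbers are all hypothetical, and the discovered records are assumed to have already been parsed from router output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardRecord:
    """One line card, as stored in inventory or as discovered on the router."""
    slot: int
    model: str        # hypothetical model name
    port_count: int
    serial: str       # hypothetical serial number


def find_physical_discords(should_be, what_is):
    """Return (slot, field, inventory_value, network_value) tuples.

    `should_be` is the inventory view, `what_is` is the discovered view.
    Cards are matched by slot number.
    """
    expected = {card.slot: card for card in should_be}
    actual = {card.slot: card for card in what_is}
    discords = []
    for slot in sorted(expected.keys() | actual.keys()):
        exp, act = expected.get(slot), actual.get(slot)
        if exp is None or act is None:
            # The card exists in only one of the two views.
            discords.append((slot, "card",
                             "present" if exp else "absent",
                             "present" if act else "absent"))
            continue
        for name in ("model", "port_count", "serial"):
            if getattr(exp, name) != getattr(act, name):
                discords.append((slot, name, getattr(exp, name), getattr(act, name)))
    return discords


if __name__ == "__main__":
    inventory = [CardRecord(2, "ETH-4", 4, "SN200")]
    discovered = [CardRecord(2, "ETH-8", 8, "SN300")]  # card swapped, database not updated
    for discord in find_physical_discords(inventory, discovered):
        print(discord)
```

A changed serial number with an unchanged model is the kind of signal a spare part tracking process could consume, while a changed port count affects what can be assigned.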
Many spare part tracking processes are dependent on the ability to automatically discover changes in serial numbers on components so that failure rates on cards can be tracked and replacement parts restocked as needed. Maintaining control of “What it is” is part of the physical inventory audit process.

Logical inventory discords also happen frequently but are harder to resolve. As an example, if a customer port that is running in the network has static routing and the inventory database indicates that it should be BGP routing, which is correct? Another example of logical inventory discord is the mismatch between the service that the customer currently has and the ordered service. In general, it is easier to detect logical inventory discords than to resolve them. Given their impacts on the external support processes and billing, detecting, reporting, and correcting these situations is important.

Another key concept that the industry uses is that “the network is the database.” This concept results from a desire by network operators to use the network configuration as ground truth to drive processes. Most equipment has some mechanism for querying for configuration data. However, practical matters require externally accessible views of those data. Fault management, for example, cannot query the network in real time on every SNMP trap that gets generated (this can be thousands per second); so a copy of the configuration data has to reside in a database, and consequently a process or program to audit and synchronize that data with the network has to be part of the overall network configuration management system. With these key concepts in mind, we will discuss the elements of a network configuration management system.

8.3 Elements of a Network Configuration Management System

Figure 8.1 provides a high-level view of the elements that make up a Network Configuration Management System.
The external interfaces are to technicians and Operations Support Systems/Business Support Systems (OSS/BSS) on the top and the Network Elements at the bottom. Each of the major elements inside the system will be addressed in subsequent sections.

[Fig. 8.1 High-level view of network configuration management system. Blocks shown: Technicians; OSS/BSS (Ordering); OSS/BSS (Maintenance/Inventory); GUI; API; Reports and Feeds; Design & Assign; Physical Inventory Management; Logical Inventory Management; Router Audit with Mediation Layer; Router Configuration with Mediation Layer; Inventory Database; Network Elements.]

8.3.1 Inventory Database

A database of the physical and logical inventory is the core of the system. This database will consist of both the real assets purchased and deployed by the corporation (the physical inventory discussed in Section 8.2.1) and the logical assets that need to be tracked (e.g., WAN and LAN IP address assignments, number of QoS connections per router, maximum assigned Virtual Route Forwarding (VRF) tables [3] on the router, etc.). The database entities have parent/child relationships that form a tree as you place items in the schema. For example, a complex is a site with a set of cabinets. A cabinet within a site may have multiple chassis or routers. A router has multiple cards, each in a slot on the chassis. A card can have multiple ports. When viewed graphically, this parent/child relationship is a tree with the single item complex at the top and the ports at the “leaves” of the tree.

A robust inventory database will have a schema with multiple “regions” of data with linkages between them as needed. One major ISP has an inventory database with over 1,000 tables to handle the inventory and the various applications that deal with the inventory. The two main regions are the physical equipment tree of data (e.g., complex/cabinet/router/slot/port) and the logical inventory tree of data (e.g., customer, premise, service, and connection).
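The two regions and the linkage between them can be sketched with a few record types. This is only an illustrative model under simplifying assumptions: the class names, the single `connection_id` link, and the `move_connection` helper are hypothetical and stand in for what would be many tables in a production schema.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Port:
    name: str
    # Link into the logical region; None means the port is unassigned.
    connection_id: Optional[str] = None


@dataclass
class Card:
    slot: int
    ports: List[Port] = field(default_factory=list)


@dataclass
class Router:
    hostname: str
    cards: List[Card] = field(default_factory=list)


@dataclass
class Cabinet:
    label: str
    routers: List[Router] = field(default_factory=list)


@dataclass
class Complex:
    site: str
    cabinets: List[Cabinet] = field(default_factory=list)


def iter_ports(cx: Complex) -> Iterator[Tuple[str, str, str, int, Port]]:
    """Walk the physical tree from the complex down to its leaf ports."""
    for cabinet in cx.cabinets:
        for router in cabinet.routers:
            for card in router.cards:
                for port in card.ports:
                    yield (cx.site, cabinet.label, router.hostname, card.slot, port)


def move_connection(old: Port, new: Port) -> None:
    """Re-point a logical connection at a different physical port."""
    if new.connection_id is not None:
        raise ValueError(f"port {new.name} is already assigned")
    new.connection_id, old.connection_id = old.connection_id, None
```

Because the logical connection is referenced by identifier rather than embedded in the port record, a move changes only the link, which is the property that keeping the two regions separate is meant to provide.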
The service database entity (one node up from the connection entity in the tree) typically contains the linkage to other logical assignments like Serial IP address, VRF labels, Route Distinguishers [3], Route Targets [3], etc. The reason the data are separated into these regions is to permit the movement of logical assets to different ports (i.e., connections) and to support changes in the physical assets associated with a customer as a result of changes in technology or network-grooming activities. Changes in technology (such as a new router with lower port costs) and network grooming (moving connections from one router or circuit to another to improve efficiency) are examples of carrier changes that may also affect the data model. These carrier decisions are sometimes even more complex to handle correctly in the inventory database than the customer-initiated changes. Without separation of the regions, the ongoing life-cycle management of the service is difficult. For example, at points in time, we need to have multiple assets available for testing and move the “active” connection to the new assets only after satisfactory testing has completed. This means that we maintain multiple “services” for the same physical port, both the old service and the future service. The inventory database stores the “What should be” for the corporation and the current and future state of the equipment and connections for a customer.

Many subsystems of a configuration management system are dependent on the inventory database. One of the major dependencies is the audit subsystem. The audit subsystem must store information for the physical “What it is” form of the network in a schema. Typically, since audit or discovery starts with the physical assets, the physical inventory model at the router/component level is reused for the “What it is” model.
It is interesting to note that cabinet and location of equipment data are typically not discoverable, so those are usually inferred through naming conventions like the encoding of the router hostname. For example, a router might have a hostname like “n54ny01zz1” where the “n54” indicates a particular office in New York City and “ny” is for New York State. The “01” indicates that it is the first router in the office and the “zz1” would indicate the type or model of the router. The encoding is not an industry standard, but most carriers use something similar.

The logical “What it is” model is also based on the rich “What it should be” model. It is again interesting to note that the logical discovery does not have the nonnetwork data items like street address of the customer or other business information. A prudent network operation puts processes in place to encode pertinent information in the interface description line so that linkages to business support systems can be maintained and audited. For example, large carriers tend to automatically encode a customer name and pointers to location records to make it easier to manage events pertinent to the interface in customer care and ticketing systems. The example below shows an active port in maintenance (MNX), for a customer, ACME MARKETING, that is located in ANYTOWN, NJ, on circuit DHEC.123456..ATI. Various database keys are also encoded.

    interface Serial4/0.11/8:0
     description MNX | ACME TECHNICAL MARKETING | ANYTOWN | NJ | DHEC.123456..ATI | 19547 | 3933470 | 4151940 | USA | MIS | |

The two main inputs to the inventory database are the physical and logical inventory on the router and the customer order data. The physical and logical router data are typically inserted through the GUI during network setup by the capacity management organization as assets are installed, tested, and made ready for service.
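The hostname convention and the interface description encoding described above lend themselves to simple parsing during an audit. The sketch below is illustrative only. It assumes the hostname layout of the “n54ny01zz1” example (three-character office code, two-letter state, two-digit sequence, model suffix) and assumes the description fields are separated by a vertical bar; the dictionary keys are hypothetical names, and the meaning of the trailing database keys is not specified in the text, so they are kept as an uninterpreted list.

```python
import re

# Assumed layout: office (letter + 2 digits), state (2 letters),
# router sequence (2 digits), model code (remaining characters).
HOSTNAME_RE = re.compile(
    r"(?P<office>[a-z]\d{2})(?P<state>[a-z]{2})(?P<sequence>\d{2})(?P<model>[a-z0-9]+)"
)


def parse_hostname(hostname):
    """Split a router hostname into the location fields it encodes."""
    match = HOSTNAME_RE.fullmatch(hostname.lower())
    if match is None:
        raise ValueError(f"hostname does not follow the assumed convention: {hostname!r}")
    fields = match.groupdict()
    fields["sequence"] = int(fields["sequence"])
    return fields


def parse_description(description):
    """Split an interface description into its leading named fields."""
    parts = [part.strip() for part in description.split("|")]
    if len(parts) < 5:
        raise ValueError("description has fewer fields than expected")
    status, customer, city, state, circuit = parts[:5]
    return {
        "status": status,
        "customer": customer,
        "city": city,
        "state": state,
        "circuit": circuit,
        # Remaining non-empty fields are database keys whose meaning is not given.
        "other_keys": [part for part in parts[5:] if part],
    }


if __name__ == "__main__":
    print(parse_hostname("n54ny01zz1"))
    print(parse_description(
        "MNX | ACME TECHNICAL MARKETING | ANYTOWN | NJ | DHEC.123456..ATI "
        "| 19547 | 3933470 | 4151940 | USA | MIS | |"
    ))
```

Rejecting hostnames and descriptions that do not match the convention is deliberate in this sketch, since a nonconforming value is itself a discord worth reporting.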
Another practice in use is to install the equipment and then use the autodiscovery tools to “learn” the equipment’s physical inventory. Logical assets are entered into the system as appropriate since they are not necessarily tied to the equipment in all cases. The customer order data are usually created through an API from the OSS/BSS during the ordering phase of a customer’s request for service and updated as the order progresses through the business processes to move from an order to an installed and tested connection.

A note of caution: the amount of customer order data that are replicated into the network configuration management system should be minimized. A good design incorporates just enough to make it easier for people to deal with problems encountered in provisioning and with activities that the upstream OSS/BSS may not have the capability to manage, like custom features. The more customer order data stored in the network configuration management system, the more the management of that data alone becomes a problem. Customer contact data are an example of data that should not be in the network configuration management system, since they are volatile and in fact may pertain to broader applications than the network service.

8.3.2 Router Configuration Subsystem

The second subsystem we will discuss is the router configuration subsystem. This subsystem takes the information from the inventory database and creates configuration changes for the installed router. The inventory database typically provides data needed to drive configuration details like the types and versions of commands to use for configuration (these can vary by make and model), the IP addresses/hostnames and passwords for access to the routers, and the customer order data for the specific configuration. The generation of the specific router configuration commands is the more difficult aspect.
There are numerous approaches to the creation of the configuration changes, but the two main ones large carriers use today are policy management and templates.

8.3.2.1 Policy Management Approach

The policy management approach attempts to break down the router configuration into a set of conditions and actions (i.e., policies) and generates the combined configuration on the router by evaluating the conditions and actions in a set of policies. For example, QoS settings fit nicely into the policy management approach, since the router typically has a configuration statement to define the condition and action for applying QoS. The configuration statement can be shared by multiple ports and any interface can be assigned to that policy. A QoS policy that assigns 20% of the bandwidth to a high-priority class (e.g., voice traffic) and the remainder to a best-effort class could be reused by many ports on a router. One condition/action definition (i.e., policy) reused multiple times is easy to implement and maintain.

Some configurations are more difficult to implement in a policy management system since they do not adhere nicely to a condition/action policy format. An example of this is IP addressing (or address management), which typically uses fairly complex rules to determine which address to assign to an interface. Large policy management systems do exist, but the linkage between different policies can be subject to scaling issues when dealing with the application of a large number of network and customer policies, as in a VPN with a large number (e.g., thousands) of end points. Configuration auditing (described later) in particular becomes difficult to manage in a policy management system because the policy view of the data sometimes is not readily apparent to the knowledgeable network engineer when looking at the more detailed CLI commands in the backup configuration file used for audits.
Finally, testing of policy-based systems is complicated, since it is not always clear what the resulting policy-based configuration will be in the CLI. The number of test cases increases to make sure the policy engine generates all the configuration change options that the network certification process has confirmed as working correctly.

8.3.2.2 Template Management Approach

Template management uses a simpler approach. The details of tested sets of configurations are documented in a template and the data to drive a particular template are pulled from the inventory database. The benefit of a template approach is that only the configurations that are known to be valid are put into the network. This approach is a more reliable method of ensuring that the network is always configured to operate in a configuration supported by the testing and certification program. Policy-management systems have a more difficult time ensuring that they are always configuring the router into a condition that matches the certified configurations. The challenge is building the template from the set of features ordered by the customer. Generally, the template languages have a nesting structure so that the range of templates can be kept under control. As the set of templates grows, there is some complication in applying the correct template, but the resulting router configuration tends to be cleaner and more optimized (since each template is a test case) than the policy-based configurations.

Both approaches have merit and a growing set of functions can be handled more readily with policies; so the likely system for a large carrier is a mixture of these techniques, with templates for the basic configurations like basic IP conditions and routing, and policies for the more advanced functions like QoS configuration on CPE routers. Large ISPs will have hybrid approaches to provide the best-fit tool for each problem.
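A hybrid of the two approaches can be sketched in a few lines. The example below is a simplified illustration rather than a production design: it uses Python's standard `string.Template` for the base interface configuration and a list of condition/action pairs for QoS. The order fields, the policy name `VOICE-20PCT`, and the addresses are hypothetical, and the generated lines are illustrative IOS-style text, not a certified configuration.

```python
from string import Template

# Template: a fixed, pre-tested block of configuration with named slots
# that are filled from inventory data.
BASE_TEMPLATE = Template(
    "interface $ifname\n"
    " description $description\n"
    " ip address $ip $mask\n"
)

# Policies: (condition, action) pairs evaluated against the order.
# Each action is one extra configuration line.
QOS_POLICIES = [
    (lambda order: order.get("qos_class") == "voice",
     " service-policy output VOICE-20PCT"),
]


def render_config(order):
    """Build an interface configuration from a template plus matching policies."""
    # substitute() raises KeyError if the order lacks a required field,
    # so an incomplete order never produces a partial configuration.
    config = BASE_TEMPLATE.substitute(order)
    for condition, action_line in QOS_POLICIES:
        if condition(order):
            config += action_line + "\n"
    return config


if __name__ == "__main__":
    example_order = {
        "ifname": "Serial4/0.11/8:0",
        "description": "EXAMPLE CUSTOMER",
        "ip": "192.0.2.1",
        "mask": "255.255.255.252",
        "qos_class": "voice",
    }
    print(render_config(example_order))
```

The template part can only ever emit the block that was written down and tested, while the policy part grows by adding condition/action pairs, which is also why its output is harder to enumerate for certification.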
An important aspect of the router configuration subsystem is the interaction between the users of current inventory (processes like ticketing and fault management) and the need to deal with future changes. Growing from a 512 kb/s link to a full T1 or growing from a single T1 to Multilink PPP (MLPPP) [4] are examples with very different degrees of complexity, but both have the need to track both the current connection data and the future connection data. The router configuration system has to be able to handle modifying the current configuration to move an active connection to the new connection configuration. To handle failure conditions properly, this subsystem has to deal with roll forward and roll back of the configuration. Sometimes, the template approach is cleaner, since the “before” configuration can be captured directly from the router and re-applied even if the original data for it are not readily available.

There are some key differences in managing provider-edge and customer-edge configurations that influence the choice of template-based or policy-based configuration management, which we discuss here.

Provider-edge (PE) routers tend to have a large number of interfaces (100 or more) with many interfaces of the same basic type. Generally, the configurations are relatively simple since the router’s primary role is stability, reliability, and fast packet forwarding. Since large carrier router configurations tend to be less variable, we tend to see template-based configuration management systems on the PE. However, since MPLS VPNs have the added complexity of multiple router configurations being involved to correctly implement the VPN, usage of policy-based configuration management is growing.

Customer-edge (CE) routers tend to have a much smaller number of interfaces (less than 10) with a wide variation in configurations depending on the business/industry of the customer.
For example, the CE router may need advanced traffic-shaping rules to ensure that performance-sensitive traffic has priority on the customer’s internal network over access to the Internet proxy/firewall. Other customers might need to do video streaming for training and thus need QoS settings that give video priority over other data traffic. Some customers may even be running Internet applications that require prioritization of the HTTP/FTP traffic to and from their router to provide service to their own customers. The CE router is closer to the customer and thus gets the burden of handling more customer-specific applications like firewalls, packet shaping, and complicated internal routing policies. Policy-based router configuration management systems are commonly used on CE routers because they are a better fit to the disparate customer needs of the edge environment.

Finally, for the network carrier it is important to understand the different challenges that a mass-market consumer broadband Internet access service places on the configuration management system. Mass-market configuration tends to have a very small set of routing configuration options. The most obvious variable in the configurations is the access speed. While you might think setting up QoS and ACLs would tend to increase the configuration options, it really only adds complexity and not much variation, since the configurations tend to be similar across large sets of connections. Although the number of different configurations is small, the rate of change is large. Not only are initial provisioning rates much larger than in the enterprise space, but the volumes of change orders are large as well. An enterprise Internet access service might typically need to process several thousand orders a week with a similar magnitude of change orders. A mass-market service might need to process thousands of orders per day and tens of thousands of change orders per day.
Mass-market router configuration systems tend toward template-based approaches because of the simplicity of the configuration, the smaller range of features, and the performance advantages of the template approach for large-scale processing.

8.3.2.3 Mediation Layer

Most service providers have multiple vendor platforms in their network, but even a single-vendor network will have multiple models and versions of the router operating system. The router configuration subsystem that writes data to the routers usually has a mediation layer to deal with the router-specific commands. The mediation layer also exists when reading data for the audit layer, to turn the vendor-specific commands and output into a common syntax for use by the audit application. The mediation layer will also handle nuances of the security model for accessing the routers that may vary based on vendor and region of the globe.

8.3.3 GUI/API

The GUI/API subsystem deals with the typical functions of retrieval, display, and data input for the system. The technology of this subsystem is typical of large-scale systems. This subsystem uses HTTP Web server technology with an HTML-based GUI and a SOAP/XML-based API. A critical aspect for a large carrier is that the API becomes the predominant flow into the system. At scale, the API is used to handle the large-volume order flow from the business support systems (BSS), both to electronically transfer the data and to trigger the various automated functions in the router configuration management system. The GUI is used infrequently for customer provisioning and is used primarily for correcting any fallout that might have occurred. Having a robust set of APIs is critical to business success. Obviously, the APIs must also keep pace as new features are added to the router so that the automated processes can trigger them.
The GUI comes into play for manual interaction and maintenance activities and various other tasks that are not economic to automate through APIs. The other important aspect of the GUI is the implementation of a robust authentication/authorization layer, since some user groups should not have access to the router configuration change functions, to prevent unintended changes that could cause a service outage.

One aspect of the GUI that is also worth mentioning is read access to the “What it is” state of the router. Typically, there are sets of read-only CLI commands that the customer care organization depends on for responding to customer-reported problems. Most router platforms support only a limited number of connections, so it is problematic to give a large customer care team direct access to the router CLI. The solution large carriers typically use is to put a web-based GUI in place with a limited set of functions that can be selected by the customer care agents. The GUI then acts as a proxy through the router configuration subsystem to execute these commands on the router. These commands include the various “show” commands as well as options to run limited repair functions like “clear counters” and/or “shutdown”/“no shutdown” on the interface. Exposing these functions through the GUI reduces the impact on the router and provides a mechanism for throttling and audit rules to be applied to prevent a negative impact. The edit checks that occur before commands are executed on the router also help prevent unintended effects.

8.3.4 Design and Assign

This subsystem applies the engineering rules to select a port for a customer’s service and can accept or reject a request for service based on available inventory. The subsystem has an API that takes the service request parameters and other customer network information and generates an assignment to a particular port on a router. That assignment is typically called a Tie Down and the data set is Tie Down Information (TDI).
The API can be called either through the GUI or directly by the BSS. Assignment is nontrivial: the function must satisfy all the engineering rules that help protect the network, such as finding a port on a card with sufficient resources, while also satisfying the business rules, which seek to limit transport costs and latency by picking a router close to the customer. For example, the engineering rules may limit the number of QoS-configured ports on certain card types. As an example of router assignment, a poor assignment would be to pick a router in California for a customer in New York.

The assignment function calculates both an optimal assignment and the current assignment. The optimal assignment is the first-choice router location that minimizes backhaul cost (e.g., ideally a customer in Ohio will be homed to a router in Ohio). However, it could be that the Ohio router complex does not have a router with sufficient capacity (bandwidth, QoS ports, etc.). The design and assign function needs to implement the appropriate business rules in this case. For example, the business rule may be to “home” the customer on an alternate router in a different location. Alternatively, the business rule could be to reject the order. Typically, the “reject the order” business rule applies in mass-market situations. Business rules for enterprise markets usually accept higher backhaul costs rather than reject the order. In the enterprise market, the business rule might select a router in an alternate location like Indiana if no routers in Ohio had sufficient resources. The business would like the flexibility to move the port from the Indiana router to an Ohio router in the future without impacting the customer.
Consequently, the “assign” function will allocate a serial IP address from a logical inventory pool associated with Ohio’s router complex, assign it to the interface on the router in Indiana, and “exception route” that address to Indiana. This assignment permits the CE/PE connection to be re-homed later from Indiana to Ohio without affecting the customer’s router configuration, since the customer’s WAN IP address would not change. The exception route for Indiana can then be removed, yielding a more optimal network routing configuration as well as reduced backhaul. Tracking both the optimal and the current assignment adds complexity to this subsystem, the inventory database, and the router configuration system (for the exception routes), but it is a good example of the types of business decisions that can ripple back into the router configuration management system requirements.

8.3.5 Physical Inventory Management

Physical inventory management deals with entering and tracking data about the router equipment. It covers not only equipment configuration details, such as which cards are installed in the routers, but also where those routers are located for maintenance dispatch. The physical database also contains the parameters for the engineering rules that vary by equipment make and model. These parameters come either from the router vendor documentation or from certification testing. The parameters and the associated rules can range from simple rules like maximum bandwidth per line card to complex rules like the maximum number of VPN routes with QoS on all line cards on the router with version 3 of the line card firmware. As new routers or cards are added to the network, this subsystem tracks all the associated data for these assets, including whether a router or port is “in service” and available for assignment.
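How make- and model-specific parameters can drive an engineering rule is easiest to see in a small sketch. The following Python is illustrative only: the card models, the limit values, and the names `CardLimits`, `Card`, and `can_accept` are assumptions made for this example and do not come from any vendor or from the system described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CardLimits:
    """Reference data for one card model (values invented for illustration)."""
    max_bandwidth_mbps: int
    max_qos_ports: int


# Hypothetical reference data keyed by card model.
REFERENCE_DATA = {
    "CARD-A": CardLimits(max_bandwidth_mbps=1000, max_qos_ports=4),
    "CARD-B": CardLimits(max_bandwidth_mbps=10000, max_qos_ports=16),
}


@dataclass
class Card:
    model: str
    in_service: bool
    used_bandwidth_mbps: int = 0
    used_qos_ports: int = 0


def can_accept(card: Card, bandwidth_mbps: int, needs_qos: bool) -> bool:
    """Return True if adding the connection keeps the card within its limits."""
    if not card.in_service:
        return False
    limits = REFERENCE_DATA[card.model]
    if card.used_bandwidth_mbps + bandwidth_mbps > limits.max_bandwidth_mbps:
        return False
    if needs_qos and card.used_qos_ports + 1 > limits.max_qos_ports:
        return False
    return True
```

Because the limits are looked up by model, recording a card replacement only requires changing `model` on the inventory record; the same rule then applies the new limits.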
As ports are assigned to customers, the physical inventory removes those ports from the assets that are available for assignment. The physical inventory also tracks the serial numbers of cards so that, as cards are replaced or upgraded, the new parameters can be used for the engineering rules. For example, a card with 256 MB of memory could be upgraded to 512 MB and thus be able to support more QoS connections. The physical inventory subsystem keeps track of these engineering parameters (sometimes called reference data) about vendor equipment for use by other subsystems. Here are a few of the typical parameters tracked:

• Maximum logical ports
• Maximum aggregate bandwidth
• Maximum card assignment
• Maximum PVCs
• Logical channel limits
• IDB limit
• VRF limit
• BGP limit
• COS limit
• Routes limit

8.3.6 Logical Inventory Management

Logical inventory management deals with entering and tracking data about the logical assets (IP addresses, ACLs, Route Distinguishers, Route Targets, etc.). This can be a large subsystem depending on the features available, but the hardest item in the category is IP address management. IP address management deals with the efficient assignment of address blocks to their various intended uses. Typically, the engineering rules require different blocks of addresses to be used for infrastructure connections, WAN IP addresses, and customer LAN addresses. This requires not only higher-level IP address block management functions, so that access control lists can be managed efficiently, but also functions to deal with external systems like the ARIN registry. Service providers typically update the ARIN “Whois” database through an API so that LAN IP blocks assigned to enterprise customers appear as being assigned to those customers. This aids the service provider in obtaining additional IP address blocks from the registry if needed.
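The idea of keeping separate address blocks per intended use can be sketched with Python's standard `ipaddress` module. This is a minimal sketch under stated assumptions: the `AddressPool` class, the first-fit policy, and the private address blocks are all hypothetical choices for illustration, not the allocation scheme of any real carrier.

```python
import ipaddress


class AddressPool:
    """Hands out fixed-size subnets from one parent block, first-fit."""

    def __init__(self, block: str, subnet_prefix: int):
        self.block = ipaddress.ip_network(block)
        # Pre-compute every subnet of the requested size inside the block.
        self._free = list(self.block.subnets(new_prefix=subnet_prefix))
        self.assigned = {}  # subnet -> owner label

    def allocate(self, owner: str):
        if not self._free:
            raise RuntimeError(f"pool {self.block} exhausted")
        subnet = self._free.pop(0)
        self.assigned[subnet] = owner
        return subnet

    def release(self, subnet) -> None:
        del self.assigned[subnet]
        self._free.append(subnet)
        self._free.sort()  # keep lowest-address-first ordering


# Separate pools per intended use (blocks taken from private address space).
wan_pool = AddressPool("10.0.0.0/29", subnet_prefix=30)  # serial/WAN links
lan_pool = AddressPool("10.1.0.0/22", subnet_prefix=24)  # customer LANs
```

Keeping each use in its own contiguous block is what lets an access control list refer to, say, all WAN addresses with a single prefix instead of one entry per link.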
Tracking per-router elements like ACL numbers is simpler but has its own nuances and complexity, since the goal is to reuse ACL numbers where possible to reduce the load on the router. Typically, memory is consumed for every ACL on a router. The ACLs for different ports of the same customer tend to be identical, so memory utilization (and ACL processing time) can be reduced by compressing the disparate ACLs into a single ACL that is shared among a customer’s ports. Numerous other items have to be tracked in logical inventory and assigned during the assignment function, depending on the feature or service being provided, and the logical inventory management system grows in complexity as more logical features are added to the service.

8.3.7 Reports and Feeds

The reports and feeds subsystem is responsible for distributing inventory data to the users and systems required to run the business. The main users of this subsystem are the fault/service assurance system and the ticketing system. The fault/service assurance system needs data about the in-service assets so that alarms can be processed correctly. Its source of truth is usually the “What it is” data from the inventory database. The ticketing system is more concerned with data about the customer: it is notified of an event by the fault/service assurance system and needs to determine, for a given port/card/router problem, which customer or customers are affected. Fault and ticketing systems tend to get feeds of the inventory data, since their query volume can be quite high and the load is best managed with a local cache of the data rather than by querying the inventory database directly. Generally, the inventory data do not change rapidly, so a local cache is sufficient, and alarms/tickets do not need these data until after test and turn up of the interface.
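The ticketing lookup described above, from a failed port, card, or router to the affected customers, can be served from a local cache built from the feed. The sketch below assumes a feed of one record per in-service port; the record fields, the sample data, and the function names are hypothetical.

```python
from collections import defaultdict

# A hypothetical inventory feed: one record per in-service port.
FEED = [
    {"router": "r1", "card": 1, "port": 0, "customer": "Acme"},
    {"router": "r1", "card": 1, "port": 1, "customer": "Globex"},
    {"router": "r1", "card": 2, "port": 0, "customer": "Acme"},
    {"router": "r2", "card": 1, "port": 0, "customer": "Initech"},
]


def build_cache(feed):
    """Index customers by router, by card, and by port for fast local lookup."""
    cache = defaultdict(set)
    for rec in feed:
        router, card, port = rec["router"], rec["card"], rec["port"]
        for key in ((router,), (router, card), (router, card, port)):
            cache[key].add(rec["customer"])
    return cache


def affected_customers(cache, router, card=None, port=None):
    """Customers affected by a router-, card-, or port-level fault."""
    key = (router,)
    if card is not None:
        key += (card,)
        if port is not None:
            key += (port,)
    return sorted(cache.get(key, set()))


cache = build_cache(FEED)
```

A card-level alarm on `r1` card 1 would be resolved with `affected_customers(cache, "r1", 1)` without touching the inventory database; the cache is simply rebuilt whenever a new feed arrives.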
Other users need various reports and feeds from the inventory database, and generally these are pulled either as a report from the GUI or through APIs. A GUI-based reporting application can easily be deployed on the inventory database for items like port utilization reports for capacity management. APIs can be created as needed for generating bulk files or responding to simple queries.

8.3.8 Router Audit

The router audit subsystem is responsible both for discovering the “What it is” state of the router and for comparing the “What it is” with the “What it should be” in the inventory database. The audit function described in this section is designed to detect differences with the inventory data. There are other mechanisms that can be applied to look at the larger set of configuration rules. Some of these are covered in Chapter 9.

Discovery is typically done with an engine that parses the router configurations into database attributes. As described before, the parsed router configuration data are stored in the inventory database but in a separate set of tables from the physical and logical inventory. The schemas of the audit tables are similar to those of the physical and logical inventory tables, but they lack some attributes that do not exist in the router configuration; the major attributes are the same so that they can be compared with the “What it should be” tables. After storage, the compare or audit function does an item-by-item comparison, tracking any discords. The audit is CPU- and disk-intensive and typically is run across the entire network data set only once a day. The discovery/audit process is also used to pick up changes like card replacements. It is typical for this audit function to take 4–6 hours to complete across a large network even when high-end servers are employed.
The good news is that the process can typically be run against the backup copies of the router configuration files, so there is no impact on the network and limited impact on the users of the system. Incremental audits can also be done on a port or card basis on demand as part of the router configuration process.

It is worth noting that the tracking of discords requires a historical view: when a discord was first detected and when it was last detected. New discords could correlate with an alarm or customer-reported problem. Old discords might indicate a data integrity error from a manual correction that was implemented to repair a customer problem but not appropriately reflected back into the inventory database.

While perhaps less visible to the overall router configuration management process than other aspects of the configuration workflow, audit is a key step. Real-time validations must be implemented for a change order so that, if there is a discord, the process stops the change order from being applied to prevent a problem. It is important to subsequently find and fix these discords so that future change orders are not affected.

8.4 Dealing with Change

An important aspect of a configuration management system is dealing with changes to an existing service. For example, the initial configuration of an interface can be done in various phases and with little concern for timing until the interface is moved from the shutdown state to the active state. However, an active interface has a different set of rules. Generally, the timing associated with configuration changes is more critical and the set of checks on the data and the configuration is more involved.

First, a robust network configuration management system will validate the current configuration of the interface (“What it is”) against the “What it should be” data, and if there is a mismatch it should stop the change.
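The comparison that gates a change order is the same item-by-item comparison the audit performs. The following is a minimal sketch of that comparison with a discord history (first and last detection times), assuming both data sets have already been parsed into per-port attribute dictionaries; the data layout and the names `Discord`, `audit`, and `change_allowed` are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Discord:
    key: tuple  # (port, attribute)
    expected: object
    actual: object
    first_detected: datetime
    last_detected: datetime


def audit(should_be: dict, is_now: dict, history: dict, now: datetime) -> dict:
    """Compare per-port attributes item by item and update the discord history.

    should_be and is_now map port -> {attribute: value}. Only attributes
    present in should_be are compared. history maps (port, attribute) to a
    Discord; the returned dict keeps first_detected for discords seen before
    and drops discords that no longer appear.
    """
    current = {}
    for port, expected_attrs in should_be.items():
        actual_attrs = is_now.get(port, {})
        for attr, expected in expected_attrs.items():
            actual = actual_attrs.get(attr)
            if actual == expected:
                continue
            key = (port, attr)
            previous = history.get(key)
            first = previous.first_detected if previous else now
            current[key] = Discord(key, expected, actual, first, now)
    return current


def change_allowed(port: str, discords: dict) -> bool:
    """A change order is blocked while the port has any open discord."""
    return not any(p == port for (p, _attr) in discords)
```

Running `audit` on consecutive days and passing each result in as the next day's `history` is enough to distinguish a discord that appeared today from one that has persisted for weeks.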
The reason for stopping is that unless the “What it is” and “What it should be” data sets are in agreement, we run the risk of changing to a configuration that will not work for the customer because of a previous data inconsistency. For instance, if there have been problems with a previous re-home and the ACLs are not the same between the old configuration and the new configuration, it could prevent the customer from accessing their network services.

Second, for the intended change, the configuration management system should validate the data set against the interface data, the global configuration of the router, and, to the extent possible, the larger network for the customer to ensure that the change is consistent with other “What it is” data. This usually consists of a set of rules applied by the configuration management subsystem to ensure that the change will be applied successfully. A good example is again a re-home. If the old port is still advertising its WAN IP address, you cannot bring up the same WAN IP address on a different router, or instabilities can be introduced (duplicate IP address detection is an important validation rule).

8.4.1 Test and Turn Up

Bringing up a new connection involves testing that the connection works correctly as ordered and then turning up the port for full service. Turning up a large connection like a 10 Gb Ethernet connection is done carefully because, if misconfigured, it could either drive large amounts of traffic into a customer’s network before they are prepared for it or remove traffic from a customer’s network by mistake. For most changes against a running configuration, the process of applying the change has to be coordinated with a maintenance window¹, since service could be impacted. Some changes may also require changes on the customer’s side of the connection, so proper scheduling with the customer’s staff is required.
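The duplicate-address rule from the re-home example in Section 8.4 can be sketched as a single validation function. This is an illustration under assumptions: the “What it is” data are taken to be a list of dictionaries with the keys shown, and `validate_rehome` is a hypothetical name; the address used in any call would come from inventory.

```python
import ipaddress


def validate_rehome(new_router: str, wan_ip: str, running_interfaces: list) -> list:
    """Return a list of rule violations for bringing up wan_ip on new_router.

    running_interfaces is "What it is" data: dicts with the keys
    router, interface, ip, and admin_up.
    """
    problems = []
    address = ipaddress.ip_address(wan_ip)
    for intf in running_interfaces:
        if ipaddress.ip_address(intf["ip"]) != address:
            continue
        # The same address on another router is only a problem while that
        # interface is still administratively up.
        if intf["router"] != new_router and intf["admin_up"]:
            problems.append(
                f"{wan_ip} is still active on {intf['router']} {intf['interface']}"
            )
    return problems
```

An empty result means this particular rule passes; in practice it would be one of many rules evaluated before the change is applied, and any non-empty result would stop the change.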
For changes that involve the physical connection (speed changes and re-homes), typically two ports are assigned at the same time, and operations would like to test all or parts of the new port before swinging the customer’s connection over. This “testing phase” creates database complexity, since the new port has to be reserved for the customer but is not the “in-service” port from an alarming/ticketing standpoint. Both the old and new ports have to be tracked until the connection is fully migrated to the new configuration. This requires the concept of “Pending” port assignments/connections and database transactions to move a port from “Pending” to “Active” and from “Active” to “Disconnected,” after which the old record is finally deleted from the database. The router configuration system has to maintain the ability to generate router configurations for each of the interim steps in moving an active connection from one port to another. There are configurations to bring up the new interface on temporary information (e.g., temporary serial IP addresses and/or RD/RT/VRF information for testing), steps to “shutdown” the old interface, steps to “no shutdown” the new interface, and steps to reverse the entire process to roll back to the old interface. All of these need to be drivable through the API for relatively straightforward changes, such as automated PE-side re-homes that do not affect the customer premise router, and via the GUI for more complicated changes that require coordination with the customer. It is in dealing with change that the entire system is stressed the most, both to ensure that the network is protected and to respond fast enough to meet the human- or machine-driven process requirements.

¹ The Maintenance Window is a time period when low traffic is expected; an operator uses it for planned activities that could impact service. Usually it is in the late night/early morning in the time zone of the router, such as 3–6 a.m.

Another attribute of change that is worth mentioning is changes to active interfaces that are infrastructure connections (e.g., two or more backbone links that connect network routers). A routine task is to change the OSPF metric on one link to “cost it out”² of use so that maintenance on the connection can be done. A problem exists if this link is left in the “costed out” state. Failure of the remaining primary link then causes isolation, since one link is hard failed and the other is out of service by being “costed out.” A robust configuration management system also has maintenance functions that permit the operations staff to cost out a link, record that the link is “costed out,” and generate an alarm condition if the link stays “costed out” for a period of time.

Finally, a type of change that is of growing importance in large networks is the ability to apply changes in bulk. The complexity of modern routers leads to situations where a latent bug or security vulnerability is found in a router that can only be repaired by changing the configuration on a large number of ports in the network. This requires special update processes to handle the updates in bulk. Typically, this is a customized application on the router configuration subsystem that is targeted at bulk processing. It gets complicated not only because of the need to track that all the changes are applied (routers sometimes refuse administrative requests under heavy load) but also because updates to specific routers must be throttled so as not to overload them.

8.5 Example of Service Provisioning

This section ties all the pieces together in an example of service provisioning for a simple Internet access service.
Once all the order data are collected and optionally entered into an automated order management system, the provisioning steps can occur, including downloading the configurations to the router. The individual configurations are called configlets, since they are usually incremental changes to an interface or pieces of the global configuration, and not an entire router configuration. The steps are outlined below.

1. Create customer
2. Create premise/site
3. Create service instance
4. Create connection and reserve inventory
5. Download initial configuration
6. Download loopback test configlet
7. Download shutdown configlet
8. Download final configlet with “no shutdown”
9. Run daily audit

² When OSPF costs on a set of links are adjusted to shift traffic off of one link and onto another link, the process is informally called “costing out” the link.

1. Create customer

This task simply groups all the customer data into one high-level account by creating (or using a previously created) customer entity in the database. Sometimes it relates to an enterprise, but often, because of mergers and acquisitions or even departmental billing arrangements, the “customer” at this level does not uniquely identify a corporation. There can even be complicated arrangements with wholesalers that must be reflected in various customer attributes.

2. Create premise/site

This task creates a database entity corresponding to the physical site at which the access circuit terminates. Street address, city, state/province, country, etc. are typical parameters. Corporations can have multiple services at an address, so we track the address partly to make it easier to work with the customer and partly because these data affect the selection of the optimum router to reduce backhaul costs.

3. Create service instance

This task collects the parameters about the intended service on this connection.
It defines the speed, any service options like quality of service, and all the other logical connection parameters. These data directly affect the set of engineering rules that will be applied to find an available port on an optimum router.

4. Create connection and reserve inventory

This task combines the above data into an assignment. The selection of a router complex is done first, using the address parameters to look for a complex with a short backhaul. This is called “Homing.” After a preliminary complex is assigned, the routers in the complex are checked for available port capacity, and if there is port capacity, the engineering rules for this connection on that router are tested. For example, a router may have available ports, but there may be insufficient resources for additional QoS or MPLS VPN routes on the cards. The system will examine each router in the complex in turn to look for an available port that satisfies the engineering rules. If no router is found, the system will examine the next-best complex and repeat the search. This assignment function can take a substantial amount of system resources to complete and is not guaranteed to find a solution because of resource or other business rule constraints.

Once a complex, router, and port have been selected, the logical inventory is tied to the physical inventory, and this Tie Down Information (TDI) is returned to the ordering system so that it can order the Layer 1 connection from the router to the customer premise. It is important to note that at this point the inventory database must set a state on the port so that no other customer can use that router port. If the customer’s order is cancelled, the business process must ensure that the port assignment is deleted as well to avoid stranded inventory. At this point, the inventory database would show the port as “PENDING,” since the inventory has been assigned but it is not in service.
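The homing search in this step can be sketched as a loop over complexes ranked by backhaul cost, with the enterprise versus mass-market business rule reduced to a single flag. Everything here is a simplifying assumption for illustration: real engineering rules are far richer than two counters, and `Router`, `Complex`, `assign`, and `allow_alternate` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    name: str
    free_ports: int
    free_qos_slots: int


@dataclass
class Complex:
    name: str
    backhaul_cost: float  # cost from the customer site; lower is better
    routers: list = field(default_factory=list)


def assign(complexes, needs_qos, allow_alternate=True):
    """Return (complex, router, is_optimal), or None if the order is rejected.

    allow_alternate=False models the "reject the order" rule: only the
    lowest-cost complex is considered.
    """
    ranked = sorted(complexes, key=lambda c: c.backhaul_cost)
    candidates = ranked if allow_alternate else ranked[:1]
    for rank, cx in enumerate(candidates):
        for router in cx.routers:
            if router.free_ports < 1:
                continue
            if needs_qos and router.free_qos_slots < 1:
                continue
            # Reserve the capacity so that no other order can take it; a real
            # system would also record the port as PENDING in inventory.
            router.free_ports -= 1
            if needs_qos:
                router.free_qos_slots -= 1
            return cx.name, router.name, rank == 0
    return None
```

The `is_optimal` flag in the result corresponds to the distinction between the optimal and the current assignment: when it is `False`, the connection has been homed on an alternate complex and is a candidate for a later re-home.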
All the logical data needed to configure the interface are in the database, and any provider inventory items have been assigned (serial IP addresses, ACL numbers, etc.).

5. Download initial configuration

After the inventory has been assigned, an initial configuration of the port is downloaded to the router to define the basic interface. This configlet typically includes only the serial IP address and default routing and defines the interface in a shutdown state. This is also the first real-time audit step. This audit confirms that the assigned port is not used by some other connection. While rare, data discords of this type do occur. This download need not occur in real time, since it will typically be some time before the Layer 1 connection is ready.

6. Download loopback test configlet

This step depends on the Layer 1 connection being installed, so it can occur days, weeks, or months after step 5. In addition, after Layer 1 is installed, this step typically occurs 24 hours before the scheduled turn-up date for a customer. This configlet contains all the routing and configuration data for the connection. Downloading a configlet to do loopback testing on the network side of the connection provides a final check of the provider’s part of the work. Just before the configlet is downloaded, a series of real-time audits is again conducted, since the initial configlet audits could have been months ago. These audits check both the static order data against the running router and attributes on other ports on the router. For example, there is a verification that any new ACL number is not already in use on another port for another customer. This check guards against a port that was manually configured in error. There is also a verification that any new VRF does not already exist on the router, to detect whether another order has been processed in parallel. There are numerous other validations as well.
This real-time audit is more detailed than the audit done for the initial configlet, since it covers all the routing, QoS, and VPN data. If all validations are successful, the configuration is downloaded and activated for testing with Layer 1 in loopback.

7. Download shutdown configlet

After successful pretesting, the router port is left in a shutdown state. It can remain in this configuration for some period of time, but because routing instances may have been defined even though the port is shut down, operators typically do not leave a shutdown interface in the router configuration for more than 48 hours or so. A shutdown interface is still discoverable from an SNMP network management perspective, so a large number of admin-down interfaces simply adds load to the fault management system without adding value. If the port is not successfully turned up, the configuration is rolled back to the initial configuration.

While the Layer 1 circuit is being ordered/installed, many daily audits will likely run. These audits will find the port in the router in the shutdown state. The discord analysis compares the “What it is” configuration and state with the “What it should be” configuration and state and reports any problems. In our example there is no problem, but the audit might instead find that the port is in a “no shutdown” state in the network, indicating that perhaps a test and turn up occurred but was not completed in the inventory database. The daily audit would also detect whether the router card had been replaced for some reason and update tracking data like serial numbers.

8. Download final configlet with “no shutdown”

At activation, the system downloads the final router configuration with “no shutdown” of the interface. Final testing may occur with the customer.
The testing for single-link, statically routed interfaces is usually automated, but for advanced configurations with multiple links or BGP routing, manual testing procedures are typical. It is at this point that the inventory database updates its status on the port to active and marks the port “In service” for downstream systems like the Fault Management and Ticketing systems.

9. Run daily audit

The daily audit will find the new state of the port to be active, so the “What it should be” state of “ACTIVE” matches the “What it is” state in the network.

8.6 Conclusion

We hope to have provided a useful overview of a robust router configuration management system and to have tied the key functions and subsystems back to the business needs that drive complexity. From inventory management to provisioning the customer’s service to handling changes to dealing with bulk security updates, a large carrier cannot provide reliable service without a robust router configuration management system. Here is a summary of some “best practice” principles that will be helpful when designing a Network Configuration Management system.

• Recognize data discords as a fact of life. Separate “What it is” and “What it should be” data in the inventory database
• Configuration management is the source of truth for the business about the current network, using the “What it is” data
• Protect the network through real-time validation and auditing of the running network
• Design for change so that logical data are not permanently tied to physical data
• Separate the schema for physical inventory and logical inventory
• Use templates to make configuration, discord detection, and testing easier
• Track port history, and not just the current state
• Design for multiple configurations of a port to handle the current port configuration and the pending port configuration
• Design the system to support testing a port before it is turned up and rollback to an earlier configuration when tests fail
• Limit the amount of business data in the network-facing system so that you do not create a problem of maintaining consistency

References

1. Chandra, R., Traina, P., & Li, T. IETF Request for Comments 1997, BGP Communities Attribute, August 1996.
2. Hawkinson, J., & Bates, T. IETF Request for Comments 1930, Guidelines for creation, selection, and registration of an Autonomous System (AS), March 1996.
3. Rosen, E., & Rekhter, Y. IETF Request for Comments 4364, BGP/MPLS IP Virtual Private Networks (VPNs), February 2006.
4. Sklower, K., Lloyd, B., McGregor, G., Carr, D., & Coradetti, T. IETF Request for Comments 1990, The PPP Multilink Protocol (MP), August 1996.

Chapter 9
Network Configuration Validation¹

Sanjai Narain, Rajesh Talpade, and Gary Levin

9.1 Introduction

To set up network infrastructure satisfying end-to-end requirements, it is necessary not only to run appropriate protocols on components but also to configure these components correctly. Configuration is the “glue” for logically integrating components at and across multiple protocol layers. Each component has configuration parameters, each of which can be set to a definite value.
However, today, the large conceptual gap between end-to-end requirements and configurations is bridged manually. This causes large numbers of configuration errors whose adverse effects on the security, reliability, and cost of deployment of network infrastructure are well documented. For example:

• “Setting it [security] up is so complicated that it’s hardly ever done right. While we await a catastrophe, simpler setup is the most important step toward better security.” – Turing Award winner Butler Lampson [42].
• “. . . human error is blamed for 50 to 80 percent of network outages.” – Juniper Networks [40].
• “The biggest threat to increasingly complex systems may be systems themselves.” – John Schwartz [61].
• “Things break and complex things break in complex ways.” – Steve Bellovin [61].
• “We don’t need hackers to break systems because they’re falling apart by themselves.” – Peter Neumann [61].

S. Narain (✉), R. Talpade, and G. Levin
Telcordia Technologies, Inc., 1 Telcordia Drive, Piscataway, NJ 08854, USA
e-mail: narain@research.telcordia.com; rrt@research.telcordia.com; glevin@research.telcordia.com

¹ This material is based upon work supported by Telcordia Technologies, and Air Force Research Laboratories under contract FA8750-07-C-0030. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Telcordia Technologies or of Air Force Research Laboratories. Approved for Public Release; distribution unlimited: 88ABW-2009-3797, 27 August 09.

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_9, © Springer-Verlag London Limited 2010

Thus, it is critical to develop validation tools that check whether a given configuration is consistent with the requirements it is intended to implement.
Besides checking consistency, configuration validation has another interesting application, namely, network testing. The usual invasive approach to testing has several limitations. It is not scalable. It consumes resources of the network and network administrators and has the potential to unleash malware into the network. Some properties such as absence of single points of failure are impractical to test as they require failing components in operational networks. A noninvasive alternative that overcomes these limitations is analyzing configurations of network components. This approach is analogous to testing software by analyzing its source code rather than by running it. This approach has been evaluated for a real enterprise. Configuration validation is inherently hard. Requirements can be on connectivity, security, performance, and reliability and span multiple components and protocols. A real infrastructure can have hundreds of components. A component’s configuration file can have a couple of thousand configuration commands, each setting the value of one or more configuration parameters. In general, the correctness of a component’s configuration cannot be checked in isolation. One needs to evaluate global relationships into which components have been logically integrated. Configuration repair is even harder, since changing configurations to make one requirement true may falsify another. The configuration change needs to be holistic in that all requirements must concurrently hold. This chapter motivates the need for configuration validation in the context of a realistic collaboration network, proposes an abstract design of a configuration validation system, surveys current technologies for realizing this design, outlines experience with deploying such a system in a real enterprise, and outlines future research directions. 
Section 9.2 discusses the challenges of configuring a realistic, decentralized collaboration network, the vulnerabilities caused by configuration errors, and the benefits of using a validation system. Requirements on this network are complex to begin with. Their manual implementation can cause a large number of configuration errors. This number is compounded by the lack of a centralized configuration authority.

Section 9.3 proposes a design of a system that can not only validate the above network but also evolve to validate even more complex ones. This design consists of four subsystems. The first is a Configuration Acquisition System for extracting configuration information from components in a vendor-neutral format. The second is a Requirement Library capturing best practices and design patterns that simplify the conceptualization of end-to-end requirements. The third is a Specification Language whose syntax simplifies the specification of requirements. The fourth is an Evaluation System for efficiently evaluating requirements, for suggesting configuration repair when requirements are false, and for creating visualizations of logical relationships.

Section 9.4 discusses the Telcordia® IP Assure product [38] and the choices it has made to realize this design. It uses a parser generator for configuration acquisition. Its Requirement Library consists of requirements on integrity of logical structures, connectivity, security, performance, reliability, and government policy. Its specification language is one of visual templates. Its evaluation system uses algorithms from graph theory and constraint solving. It computes visualizations of several types of logical topologies.

Section 9.5 discusses logic-based techniques for realizing the above validation system design. Their use is particularly important for configuration repair. They simplify configuration acquisition and specification.
They allow firewall subsumption, equivalence, and rule redundancy analysis. These techniques are the languages Prolog, Datalog, and arithmetic quantifier-free forms [51, 53, 67], the Kodkod [41, 69] constraint solver for first-order logic of finite domains, the ZChaff [27, 46, 73] minimum-cost SAT solver for Boolean logic, and Ordered Binary Decision Diagrams (OBDDs) [12].

Section 9.6 outlines related techniques for realizing the above validation system design. These are type inference for configuration acquisition [47], symbolic reachability analysis [72], its implementation [3] with symbolic model checking [48], and finally, validation techniques for the Border Gateway Protocol (BGP), the Internet-wide routing protocol and one of the most complex.

Section 9.7 contains a summary and outlines future research directions.

9.2 Configuration Validation for a Collaboration Network

This section discusses the challenges of configuring a realistic, multi-enterprise collaboration network, the types of its vulnerabilities caused by configuration errors, the reasons why these arise, and the benefits that can be derived from using a configuration validation system.

Multiple communities of interest (COIs) are set up as logically partitioned virtual private networks (VPNs) overlaid on a common IP network backbone. The “nodes” of this VPN are gateway routers at each enterprise that participates in the COI. An enterprise can participate in more than one COI, in which case it would have one gateway router for each COI. For each COI, agreement is reached between participating network administrators on the top-level connectivity, security, performance, and reliability requirements governing the COI. Configuration of routers, firewalls, and other network components to implement these requirements is up to administrators. There is no centralized configuration authority. The administrators at different enterprises in a COI negotiate with each other to ensure configuration consistency.
Such decentralized networks exist in industry, academia, and government and are clear candidates for the application of configuration validation tools.

Typical COI requirements are now described. The connectivity requirement is that every COI site must be reachable from every other COI site. The security requirement is twofold. First, all communication between sites must be encrypted. Second, no packets from one COI can leak into another COI. This requirement is especially important since collaborating enterprises have limited mutual trust. A site can be a part of more than one COI, but the information that site is willing to share with partners in one COI is distinct from that with partners in another COI. The performance requirement specifies the bandwidth, delay, jitter, and packet loss for various types of applications. The reliability requirement specifies that connectivity be maintained in the face of link or node failure.

Since these requirements are complex, large numbers of configuration errors can be made. This number is compounded by the lack of a centralized configuration authority. The complexity has the further consequence that less experienced administrators, especially in an emergency, tend to statically route traffic directly over the IP backbone rather than correctly set up dynamic routing. But, when the emergency passes, static routes are not removed for fear of breaking the routing. Over time, this causes the COIs to become brittle in that routes cannot be automatically recomputed in the face of link or node failure.

While administrators are well aware of configuration errors and their adverse effects on the global network, they lack the tools to identify them, much less remove them. The decentralized nature of the network prevents them from obtaining a picture of the global architecture.
A validation system that could identify configuration errors, recommend repairs, and help administrators understand the global relationships would be of immense value to them.

Figure 9.1 shows the architecture of a typical COI with four collaborating sites A, B, C, D. Each site contains a host, an internal router, and a gateway router. The first two items are shown only for sites A and C. Each gateway router is physically connected to the IP backbone network (WAN). Overlaid on this backbone is a network of IPSec [41] tunnels interconnecting the gateway routers. An IPSec tunnel is used to encrypt packets flowing between its endpoints. Overlaid on the IPSec network is a network of GRE [22] tunnels. A GRE tunnel provides the appearance of two routers being directly connected even though there may be many physical hops between them. The two overlay networks are “glued” together in such a way that all packets through GRE tunnels are encrypted. A routing protocol, e.g., BGP [33, 36], is run over the GRE network to discover routes on this overlay. If a link or node in this network fails, BGP discovers an alternate route if possible.

Fig. 9.1 Community of interest architecture (gateway routers RA, RB, RC, RD attached to the WAN; internal routers IA, IC; hosts HA, HC; legend: physical link, IPSec tunnel, GRE tunnel)

A packet originating at host HA destined to host HC is first directed by its internal router IA to the gateway router RA. RA encrypts the packet, then finds a path to HC on the GRE network. When the packet arrives at RC, it is decrypted, decapsulated, and forwarded to IC. IC then forwards it to HC. All routers also run the internal routing protocol called OSPF [42]. OSPF discovers routes to destinations that are internal to a site. The OSPF process at the gateway router redistributes or injects internal routes into the BGP process. The BGP process then informs its peers at other gateway routers about these routes.
Eventually, all gateway routers come to know how to route packets to any reachable internal destination at any site.

In summary, connectivity, security, and reliability requirements are satisfied by the use, respectively, of GRE, IPSec, and BGP together with OSPF. The security requirement that data from one COI not leak into another is satisfied implicitly. GRE reachability to a different COI is disallowed, static routes to destinations in different COIs are not set up, gateway routers at the same enterprise but belonging to different COIs are not directly connected, and BGP sessions across different COIs are not set up.

The performance requirement is satisfied by ensuring that GRE tunnels are mapped to physical links of the proper bandwidth, delay, jitter, and packet loss properties, although this is not always in control of COI administrators. Avoiding one cause of packet loss is, however, in their control. This is the blocking of Maximum Transmission Unit (MTU) mismatch messages. If a router receives a packet whose size is larger than the router’s configured MTU, and the packet’s Do Not Fragment bit is set, the router will drop the packet. The router will also warn the sender in an ICMP message that it has dropped the packet. Then, the sender can reduce the size of packets it sends. However, since ICMP is the same protocol used to carry ping messages, firewalls at many sites block ICMP. The result is that the sender will continue to send packets without reducing their size and they will all be dropped by the router [68]. Packets increase in size beyond an expected MTU because GRE and IPSec encapsulations add new headers to packets. To avoid such packet loss, the MTU at all routers is set to some fixed value accounting for the encapsulation. Alternatively, ICMP packets carrying MTU mismatch messages are not blocked.

This design is captured by the following requirements:

Connectivity Requirements
1. Each site has a gateway router connected to the WAN.
2. There is a full-mesh of GRE tunnels between gateway routers.
3. Each gateway router is connected to an internal router at the same site.

Security Requirements
1. There is a full-mesh network of IPSec tunnels between all gateway routers.
2. Packets through every GRE tunnel are encrypted with an IPSec tunnel.
3. No gateway router in a COI has a static route to a destination in a different COI.
4. No cross-COI physical, GRE, BGP connectivity, or reachability is permitted.

Reliability Requirements
1. BGP is run on the GRE tunnel network to discover routes to destinations in different sites.
2. OSPF is run within a site to discover routes to internal destinations.
3. OSPF and BGP route redistribution is set up.

Performance Requirements
1. MTU settings on all interfaces are set to be less than the expected packet size after taking into account GRE and IPSec encapsulation.
2. Alternatively, access-control lists at each gateway router permit ICMP packets carrying MTU messages.

Configuration parameters that must be correctly set to implement the above requirements include:
1. IP addresses and mask of physical and GRE interfaces
2. IP address of the local and remote BGP session end points and the autonomous system (AS) number of the remote end point
3. Names of GRE interface and IP address of associated local and remote physical tunnel end points
4. IP addresses of local and remote IPSec tunnel end points, encryption and hash algorithms to apply to protected packets, and the profile of packets to be protected
5. Destination, destination mask, and next hop of static routes
6. Interfaces on which OSPF is enabled and the OSPF areas to which they belong
7. Source and destination address ranges, protocols, and port ranges of packets for access-control lists
8. Maximum transmission units for router interfaces

As can be imagined, a large number of errors can be made in manual computation of configuration parameter values implementing these requirements.
GRE tunnels may be configured in only one direction or not at all. IPSec tunnels may be configured in only one direction or not at all. GRE and IPSec tunnels may not be “glued” together. GRE tunnels or sequences of tunnels may link routers in distinct COIs. A COI gateway router may contain static routes to a different COI, so packets could be routed to that COI via the WAN. BGP sessions may be set up between routers in different COIs, so these routers may come to know about destinations behind each other. BGP sessions may be configured in only one direction or not at all. BGP sessions may not be supported by GRE tunnels, so these sessions will not be established. There may be single points of failure in the GRE and BGP networks. Finally, MTU settings on routers in a COI may be different, leading to the possibility of packet loss.

Such errors can be visualized by mapping various logical topologies. Two of these are shown below. In Fig. 9.2, nodes represent routers and each directed edge represents one direction of a GRE tunnel between routers. These edges have to be set up in both directions for a GRE tunnel to be established. This graph shows three problems. First, the edge labeled “Asymmetric” has no counterpart in the reverse direction. Second, the dotted line indicates a missing tunnel. Third, the hatched router indicates a single point of GRE failure. All GRE packets to destinations to the right of this router pass through this router.

Fig. 9.2 GRE tunnel topology (annotations: Asymmetric, Missing, Single point of failure)

In Fig. 9.3, nodes represent routers and links represent BGP sessions between nodes. This graph shows two problems. First, there is no full-mesh of BGP sessions within COI 1. Second, there is a BGP session between routers in two distinct COIs.

Fig. 9.3 BGP neighbor topology (COI 1, COI 2)
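The tunnel and session errors described above lend themselves to simple set checks once the configured edges are known. The following is a minimal Python sketch, not the method of any particular tool; the function names, the router names, and the `coi_of` mapping are all hypothetical and introduced only for illustration. It flags GRE tunnels configured in one direction only, GRE tunnels missing from the full mesh, and BGP sessions that cross COI boundaries.

```python
from itertools import combinations

def gre_findings(routers, gre_edges):
    """Classify GRE tunnel problems within one COI.

    routers   -- router names in the COI (hypothetical input)
    gre_edges -- set of (src, dst) pairs, one per configured tunnel direction
    """
    asymmetric, missing = [], []
    for a, b in combinations(sorted(routers), 2):
        forward = (a, b) in gre_edges
        backward = (b, a) in gre_edges
        if forward != backward:
            asymmetric.append((a, b))   # configured in one direction only
        elif not forward:
            missing.append((a, b))      # full mesh requires this tunnel
    return asymmetric, missing

def cross_coi_sessions(bgp_sessions, coi_of):
    """Return BGP sessions whose two endpoints belong to different COIs."""
    return [(a, b) for a, b in bgp_sessions if coi_of[a] != coi_of[b]]

# Illustrative data only: four gateway routers of one COI plus one outsider.
routers = ["RA", "RB", "RC", "RD"]
gre_edges = {
    ("RA", "RB"), ("RB", "RA"),
    ("RA", "RC"), ("RC", "RA"),
    ("RA", "RD"), ("RD", "RA"),
    ("RB", "RC"), ("RC", "RB"),
    ("RB", "RD"),                 # reverse direction never configured
}                                 # RC-RD is not configured at all
asymmetric, missing = gre_findings(routers, gre_edges)

bgp_sessions = [("RA", "RB"), ("RB", "X1")]
coi_of = {"RA": 1, "RB": 1, "RC": 1, "RD": 1, "X1": 2}
leaks = cross_coi_sessions(bgp_sessions, coi_of)

print("asymmetric:", asymmetric)
print("missing:", missing)
print("cross-COI BGP sessions:", leaks)
```

The checks are purely structural: they need only the configured edges, not a running network, which is what makes this style of analysis noninvasive.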
9.3 Creating a Configuration Validation System

This section outlines the design of a system that can not only validate the network of the previous section but also evolve to validate even more complex ones. As shown in Fig. 9.4, this consists of a Configuration Acquisition system to acquire configuration information in a vendor-neutral format, a Requirement Library containing fundamental requirements simplifying the task of conceptualizing administrator intent, an easy-to-use Specification Language in which to specify requirements, and an Evaluation System to efficiently evaluate specifications in this language. These subsystems are now described.

Fig. 9.4 Validation system architecture (components: Configuration Files, Configuration Acquisition System, Configuration Database, Administrator, Requirement Library, Specification Language, End-to-End Requirements in Specification Language, Evaluation System; outputs: Root-Cause of Non-Compliance, Visualizations, Suggestions for Repair)

9.3.1 Configuration Acquisition System

Each component has associated with it a configuration file containing commands that define that component’s configuration. These commands are entered by the network administrator. The most reliable method of acquiring a device’s configuration information is to acquire this file, manually or automatically. Other less-reliable methods are accessing the device’s SNMP agent and querying configuration databases. SNMP agents often do not store all of the configuration information one might be interested in. The correctness and completeness of a configuration database varies from enterprise to enterprise.

If configuration information is acquired from files, then these files have to be parsed. Configuration languages have a simple syntax and semantics, since they are intended to be used by network administrators who may not be expert programmers. Different vendors offer syntactically different configuration languages.
However, the abstract configuration information stored in these files is the same, barring nonstandard features that vendors sometimes implement. This information is associated with standardized protocols. Examples of it from the previous section are IP addresses, OSPF area identifiers, BGP neighbors, and IPSec cryptographic algorithms. This information needs to be extracted from files and stored in a vendor-neutral format database. Then, algorithms for evaluating requirements can be written just once against this database, and not once for every vendor configuration language.

However, configuration languages are vast, each with a very large set of features. Their syntax can change from one product release to another. Some vendors do not supply APIs to extract the abstract information. It should be possible to extract configuration information without having to understand all features of a configuration language. Extraction algorithms should be resilient to inevitable changes in configuration language syntax.

9.3.2 Requirement Library

The Requirement Library is analogous to libraries implementing fundamental algorithms in software development. The Library should capture design patterns and best practices for accomplishing fundamental goals in connectivity, security, reliability, and performance. Examples of these for security can be found in [18] and for routing in [33]. These patterns can be expressed as requirements. The administrator should be easily able to conceptualize end-to-end requirements as compositions of Library requirements.

9.3.3 Specification Language

The specification language should provide an easy-to-use syntax for expressing end-to-end requirements. Specifications should be as close as possible in their forms to their natural language counterparts. The syntax can be text-based or visual.
Since requirements are logical concepts, the syntax should allow specification of objects, attributes, and constraints between these, and compositions of constraints via operators such as negation, conjunction, disjunction, and quantification. For example, all of these constructs appear in the Section 9.2 requirement “No gateway router in a COI has a static route to a destination in a different COI.”

9.3.4 Evaluation System

The Evaluation System should contain efficient algorithms to evaluate a requirement against a configuration. These algorithms should output not just a yes/no answer but also explanations or counterexamples to guide configuration repair. Configuration repair is harder than evaluation. A set of requirements can be independently evaluated, but if some are false, they cannot be independently made true. Changing the configuration to make one requirement true may falsify another. To provide further insight into reasons for truth or falsehood of requirements, this system should compute visualizations of logical relationships that are set up via configuration, analogous to visualizations of quantitative data [70].

9.4 IP Assure Validation System

This section describes the Telcordia® IP Assure product and discusses the choices made in it to implement the above abstract design of a validation system. This product aims to improve the security, availability, QoS, and regulatory compliance of IP networks. It uses a parser generator for configuration acquisition. Its Requirement Library consists of well over 100 requirements on integrity of logical structures, connectivity, security, performance, reliability, and government policy. Its specification language is one of visual templates. Its evaluation system uses algorithms from graph theory and constraint solving. It also computes visualizations of several types of logical topologies.
If a requirement is false, IP Assure does compute a root-cause, although its computation is hand-crafted for each requirement. IP Assure does not compute a repair that concurrently satisfies all requirements.

9.4.1 Configuration Acquisition System

Section 9.3 raised three challenges in the design of a configuration acquisition system. The first was the design of a vendor-neutral database schema for storing configuration information. The second was extracting information from configuration files without having to know the entire configuration language for a given vendor. The third was making the extraction algorithms robust to inevitable changes in the configuration language. This section describes IP Assure’s configuration acquisition system and sketches how well it meets these challenges.

IP Assure has defined a schema loosely modeled after DMTF [20] schemas. It uses the ANTLR [5] system to define a grammar for configuration files. The parser generated by ANTLR reads the configuration file and if successful returns an abstract syntax tree exposing the structure of the file. This tree is then analyzed by algorithms implemented in Java to create and populate tables in its schema. Often, information in a table is assembled from information scattered in different parts of the file. The system is illustrated in the context of a configuration file containing the following commands in Cisco’s IOS configuration language:

    hostname router1
    !
    interface Ethernet0
     ip address 1.1.1.1 255.255.255.0
     crypto map mapx
    !
    crypto map mapx 6 ipsec-isakmp
     set peer 3.3.3.3
     set transform-set transx
     match address aclx
    !
    crypto ipsec transform-set transx esp-3des hmac
    !
    ip access-list extended aclx
     permit gre host 3.3.3.3 host 4.4.4.4

A configuration file is a sequence of command blocks consisting of a main command followed by zero or more indented subcommands. The first command specifies the name router1 of the router. It has no subcommands.
Any line beginning with ! is a comment line. The second command specifies an interface Ethernet0. It has two subcommands. The first specifies the IP address and mask of this interface. The second specifies the name mapx of an IPSec tunnel originating from this interface. The parameters of the IPSec tunnel are specified in the next command block. The main command specifies the name of the tunnel, mapx. The subcommands specify the address of the remote endpoint of the IPSec tunnel, the set transx of cryptographic algorithms to be used, and the profile aclx of the traffic that will be secured by this tunnel. The next command block defines the set transx as consisting of the encryption algorithm esp-3des and the hash algorithm hmac. The last command block defines the traffic profile aclx as any packet with protocol, source address, and destination address equal to gre, 3.3.3.3, and 4.4.4.4, respectively.

Part of an ANTLR grammar for recognizing the above file is:

    commands: command NL (rest=commands | EOF)
              -> ^(COMMAND command $rest?);

    command: ('interface') => interface_cmd
           | ('crypto') => crypto_cmd
           | ('ip') => ip_cmd
           | unparsed_cmd;

    interface_cmd: 'interface' ID (LEADINGWS interface_subcmd)*
                   -> ^('interface' ID interface_subcmd*);

    interface_subcmd: 'ip' 'address' a1=ADDR a2=ADDR -> ^('address' $a1 $a2)
                    | 'crypto' 'map' ID -> ^(CRYPTO_MAP ID)
                    | unparsed_subcmd;

The first grammar rule states that commands is a sequence of one or more command blocks. The ^ symbol is a directive to construct the abstract syntax tree whose root is the symbol COMMAND, whose first child is the command block just read, and whose second child is the tree representing the sequence of subsequent command blocks. The next rule states that a command block begins with one of the keywords interface, crypto, or ip. The symbol => introduces a syntactic predicate: the alternative is selected by looking ahead for the keyword in parentheses, so no backtracking is needed. The last line in this rule states that if a command block does not begin with any of these keywords, it is skipped. Skipping is done via the unparsed_cmd symbol.
Grammar rules defining it skip all tokens till the beginning of the next command block. The last two rules define the structure of an interface command block.

ANTLR produces a parser that processes the above file and outputs an abstract syntax tree. This tree is then analyzed to create the tables below. Note that the ipsec table assembles information from the interface, crypto map, crypto ipsec, and ip access-list command blocks.

    ipAddress Table
    Host     Interface  Address  Mask
    router1  Ethernet0  1.1.1.1  255.255.255.0

    ipsec Table
    Host     SrcAddr  DstAddr  EncryptAlg  HashAlg  Filter
    router1  1.1.1.1  3.3.3.3  esp-3des    hmac     aclx

    acl Table
    Host     Filter  Protocol  SrcAddr  DstAddr  Perm
    router1  aclx    gre       3.3.3.3  4.4.4.4  permit

IP Assure’s vendor-neutral schema captures much of the configuration information for protocols it covers. Its skipping idea allows one to parse a file without recognizing the structure of all possible commands and command blocks. However, the idea is quite hard to get right in the ANTLR framework. One is trying to avoid writing a grammar for the skipped part of the language, yet the only method one can use is to write rules defining unparsed_cmd.

9.4.2 Requirement Library

9.4.2.1 Requirements on Integrity of Logical Structures

A very useful class of requirements is on the integrity of logical structures associated with different protocols. Before a group of components executing a protocol can accomplish an intended joint goal, various logical structures spanning these components must be set up. These structures are set up by making component configurations satisfy definite constraints. For example, before packets flowing between two interfaces can be secured via IPSec, the IPSec tunnel logical structure must be set up. This is done by setting IPSec configuration parameters at the two interfaces and ensuring that their values satisfy definite constraints.
For example, the two interfaces must use the same hash and encryption algorithms, and the remote tunnel endpoint at each interface must equal the IP address of its counterpart.

A Hot Standby Router Protocol (HSRP) [44] router cluster is another example of a logical structure. It allows two or more routers to behave as a single router by offering a single virtual IP address to the outside world, on a given subnet. This address is mapped to the real address of an interface on the primary router. If this router fails, another router takes over the virtual address. Before the cluster correctly functions, however, the same virtual address and HSRP group identifier must be configured on all interfaces, and the virtual and all physical addresses must belong to the same subnet.

Much more complex logical structures are set up for BGP. Different routers in an autonomous system (AS) connect to different neighboring ASes, giving each router only a partial view of BGP routes. To allow all routers in an AS to construct a complete view of routes, routers exchange information between themselves via iBGP (internal BGP) sessions. The simplest logical structure for accomplishing this exchange is a full-mesh of iBGP sessions, one for each pair of routers. But a full-mesh is impractical for a large AS, since the number of sessions grows quadratically with the number of routers. Linear growth is accomplished with a hub-and-spoke structure. All routers exchange routes with a hub called a route reflector. If these structures are incorrectly set up, protocol oscillations, forwarding loops, traffic blackholes, and violation of business contracts can arise [6, 31, 74]. See Section 9.6.4 for more discussion of BGP validation.

IP Assure evaluates requirements on integrity of logical structures associated with all common protocols.
These structures include IP subnets, GRE tunnels, IPSec tunnels, MPLS [60] tunnels, BGP full-mesh or hub-and-spoke structures, OSPF subnets and areas, and HSRP router clusters.

9.4.2.2 Connectivity Requirements

Connectivity (also called reachability) is a fundamental requirement of a network. It means the existence of a path between two nodes in the network. The most obvious network is an IP network whose nodes represent subnets and routers and whose links represent direct connections between these. But as noted in Section 9.2, connectivity requirements are also meaningful for many other types of networks such as GRE, IPSec, and BGP. IP Assure evaluates connectivity for IP, VLANs, GRE, IPSec, BGP, and MPLS networks.

IP Assure also evaluates reachability in the presence of access-control policies, or lists, configured on routers or firewalls. An access-control list is a collection of rules specifying the IP packets that are permitted or denied based on their source and destination address, protocol, and source and destination ports. These rules are order-dependent. Given a packet, the rules are scanned from the top down and the permit or deny action associated with the first matching rule is taken. Even if a path exists, a given packet may fail to reach a destination because an access-control list denies that packet.

9.4.2.3 Reliability Requirements

Reliability in a network means the ability to maintain connectivity in the presence of failures of nodes or links. A single point of failure for connectivity between two nodes in a network is said to exist if a single failure causes connectivity between the two nodes to be lost. Reliability is achieved by provisioning backup resources and setting up a reliability protocol. This protocol monitors for failures and when one occurs, finds backup resources and attempts to restore connectivity using those. Configuration errors may prevent backup resources from being provisioned.
For example, in Section 9.2, some GRE tunnels were only configured in one direction, not in the other, so they were unavailable for rerouting. Even if backup resources have been provisioned, configuration errors in the routing protocol can prevent these resources from being found. For example, in Section 9.2, BGP was simply not configured to run over some GRE tunnels, so it would not find these links to reroute over.

The architecture of the fault-tolerance protocol itself can introduce a single point of failure. For example, a nonzero OSPF area may be connected to OSPF area zero by a single area-border-router. If that router fails, then OSPF will fail to discover alternate routes to another area [36] even if these exist. Similarly, unless BGP route reflectors are replicated, they can become single points of failure [7].

Furthermore, redundant resources at one layer must be mapped to redundant resources at lower layers. For example, if all GRE tunnels originate at the same physical interface on a router, then if that interface fails, all tunnels would simultaneously fail. Ideally, all GRE tunnels originating at a router must originate at distinct interfaces on that router.

Single points of failure can also arise out of the dependence between security and reliability. As shown in Fig. 9.5, routers R1 and R2 together constitute an HSRP cluster with R1 as the primary router. This cluster forms the gateway between an enterprise’s internal network on the right and the WAN on the left. For security, an IPSec tunnel is configured from R1 to the gateway router C of a collaborating site. However, this tunnel is not replicated on R2. Consequently, if R1 fails, then R2 would take over the cluster’s virtual address; however, IPSec connectivity to C would be lost.
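The HSRP example can be phrased as a replication check over configuration tuples. The sketch below is only an illustration under stated assumptions: the row layout `(host, local address, remote peer)` is a simplified, hypothetical projection of an ipsec table, and the function name and sample addresses are invented for the example. It reports every cluster member that lacks a tunnel to a remote peer that some other member of the same cluster does have.

```python
def unreplicated_tunnels(cluster_members, ipsec_rows):
    """List (router, remote_peer) pairs where the tunnel is not replicated.

    cluster_members -- names of the routers forming one HSRP cluster
    ipsec_rows      -- iterable of (host, local_addr, remote_peer) tuples
                       (hypothetical simplified projection of an ipsec table)
    """
    peers_by_router = {router: set() for router in cluster_members}
    for host, _local_addr, remote_peer in ipsec_rows:
        if host in peers_by_router:
            peers_by_router[host].add(remote_peer)

    # Every peer reached by any cluster member should be reached by all.
    all_peers = set().union(*peers_by_router.values())
    return sorted(
        (router, peer)
        for router, peers in peers_by_router.items()
        for peer in all_peers - peers
    )

# Illustrative data: R1 has a tunnel to the collaborating gateway, R2 does not.
rows = [("R1", "10.0.0.1", "3.3.3.3")]
gaps = unreplicated_tunnels(["R1", "R2"], rows)
print(gaps)
```

A real check would also compare the cryptographic parameters and traffic profiles of the replicated tunnels; this sketch compares remote endpoints only.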
Reliability requirements that IP Assure evaluates include absence of single points of failure in IP networks, with and without access-control policies; absence of single OSPF area-border-routers; and replication of IPSec tunnels in an HSRP cluster.

Fig. 9.5 HSRP cluster (routers R1 and R2 between the WAN and the internal network; IPSec tunnel 1 from R1 to C; IPSec tunnel 2, marked X, absent on R2)

9.4.2.4 Security Requirements

Typical network security requirements are about data confidentiality, data integrity, authentication, and access-control. IPSec is commonly used to satisfy the first three requirements and access-control lists are used to satisfy the last one. Access-control lists were discussed in Section 9.4.2.2. Components dedicated just to processing access-control lists are called firewalls. IP Assure evaluates requirements for both these technologies. For IPSec, it evaluates the tunnel integrity requirements in Section 9.4.2.1.

For access-control lists, IP Assure evaluates two fundamental requirements. First, an access-control list subsumes another in that any packet permitted by the second is also permitted by the first. A related requirement is that one list is equivalent to another in that any packet permitted by one is permitted by the other. Two lists are equivalent if each subsumes the other. An enterprise may have multiple egress firewalls. Access-control lists on these may have been set up by different administrators over different periods of time. It is useful to check that the policies governing packets that leave the enterprise are equivalent.

The second requirement that IP Assure evaluates on access-control lists is that a firewall has no redundant rules. A rule is redundant if deleting it will not change the set of packets a firewall permits. Deleting redundant rules makes lists compact and easier to understand and maintain.
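The definitions of first-match evaluation, subsumption, equivalence, and redundancy can be made concrete on a toy example. The Python sketch below is an illustration only and makes several assumptions that are not in the text: rules match only on source and destination address ranges, a packet matching no rule is denied, and addresses are small integers so that every packet can be enumerated. Such enumeration does not scale to real packet headers.

```python
from itertools import product

def permits(acl, packet):
    """First-match semantics: the first matching rule decides.

    acl    -- list of (src_lo, src_hi, dst_lo, dst_hi, action) rules
    packet -- (src, dst) pair of integer addresses
    Assumption: a packet matching no rule is denied.
    """
    src, dst = packet
    for src_lo, src_hi, dst_lo, dst_hi, action in acl:
        if src_lo <= src <= src_hi and dst_lo <= dst <= dst_hi:
            return action == "permit"
    return False

def permitted_set(acl, domain):
    """Enumerate all permitted packets over a tiny address domain."""
    return {p for p in product(domain, repeat=2) if permits(acl, p)}

def subsumes(first, second, domain):
    """True if every packet permitted by `second` is permitted by `first`."""
    return permitted_set(second, domain) <= permitted_set(first, domain)

def equivalent(first, second, domain):
    """Two lists are equivalent if each subsumes the other."""
    return subsumes(first, second, domain) and subsumes(second, first, domain)

def redundant_rules(acl, domain):
    """Indices of rules whose deletion leaves the permitted set unchanged."""
    base = permitted_set(acl, domain)
    return [i for i in range(len(acl))
            if permitted_set(acl[:i] + acl[i + 1:], domain) == base]

domain = range(8)                    # toy address space 0..7
acl = [
    (1, 2, 3, 4, "deny"),
    (0, 5, 0, 5, "permit"),
    (1, 1, 1, 1, "permit"),          # shadowed by the rule above it
]
wide = [(0, 5, 0, 5, "permit")]

print(redundant_rules(acl, domain))
print(subsumes(wide, acl, domain), subsumes(acl, wide, domain))
```

Here the third rule is redundant because every packet it matches is already permitted by the second rule, while the deny rule is not redundant because deleting it would admit packets that the second rule permits.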
9.4.2.5 Performance Requirements

The DiffServ [19] protocol allows one to specify policies for partitioning packets into different classes, and then for according them differentiated performance treatment. For example, a packet with a higher DiffServ class is given transmission priority over one with a lower class. Typically, voice packets are given highest priority because of the high sensitivity of voice quality to end-to-end delays. Performance requirements that IP Assure evaluates are that all DiffServ policies on all routers are identical, and that any policy that is defined is actually used by being associated with an interface.

IP Assure also evaluates the requirement that ICMP packets are not blocked. This is a sufficient condition for avoiding packet loss due to mismatched MTU sizes and setting of Do Not Fragment bits discussed in Section 9.2.

9.4.2.6 Government Regulatory Requirements

Government regulatory requirements represent “best practices” that have evolved over a period of time. Compliance with these is deemed essential for connectivity, reliability, security, and performance of an organization’s network. Compliance with certain regulations such as the Federal Information Security Management Act (FISMA) [26] is mandatory for government organizations. Two examples of a FISMA requirement are (a) alternate communications services do not share a single point of failure with primary communication services, and (b) all access between nodes internal to an enterprise and those external to it is mediated by a proxy server. IP Assure allows specification of a large number of FISMA requirements.

9.4.3 Specification Language

IP Assure’s specification language is that of graphical templates. It offers a menu of more than 100 requirements in different categories. A user can select one or more of these to be evaluated. For each requirement, one can specify its parameters. For example, for a reachability requirement, one can specify the source and destination.
For an access-control list equivalence requirement, one can specify the two lists. One cannot apply disjunction or quantification operators to requirements. The only way to define new requirements is to program in Java and SQL. Figure 9.6 shows a few requirement classes that can be evaluated. These are QoS (DiffServ), HSRP, OSPF, BGP, and MPLS.

Fig. 9.6 IP Assure requirement specification screen

9.4.4 Evaluation System

Structural integrity requirements are evaluated with algorithms specialized to each requirement. In IP Assure, these algorithms are implemented with SQL and Java. The relevant tuples from the configuration database are extracted with SQL and analyzed by Java programs. For example, to evaluate whether an IPSec tunnel between two addresses local1 and local2 is set up, one checks that there are tuples ipsec(h1, local1, remote1, ea1, ha1, filter1) and ipsec(h2, local2, remote2, ea2, ha2, filter2) in the configuration database, and that local1 = remote2, remote1 = local2, ea1 = ea2, ha1 = ha2 and filter1 is a mirror image of filter2.

Reachability and reliability requirements for a network are evaluated by extracting the relevant graph information from the configuration database with SQL queries, then applying graph algorithms [63]. For example, given the tuple ipAddress(host, interface, address, mask), one creates two nodes, the router host and the subnet whose address is the bitwise-and of address and mask, and then creates directed edges linking these in both directions. This step is repeated for all such tuples to compute an IP network graph. To evaluate whether a node or a link is a single point of failure between two given nodes, one removes it from the graph and checks whether those two nodes can still reach each other. If not, then the deleted node or link is a single point of failure.
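The graph construction and the single-point-of-failure test described above can be sketched as follows. This Python is a minimal illustration, not IP Assure's SQL and Java implementation. The helper names, the node labels, and the sample addresses are hypothetical, and only node removal is shown (link removal would be handled the same way by skipping an edge instead of a node).

```python
def ip_to_int(dotted):
    """Convert a dotted-quad IPv4 string into a 32-bit integer."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def build_graph(ip_address_tuples):
    """One router node and one subnet node per tuple, linked in both directions."""
    graph = {}
    for host, _interface, address, mask in ip_address_tuples:
        router = ("router", host)
        subnet = ("subnet", ip_to_int(address) & ip_to_int(mask))
        graph.setdefault(router, set()).add(subnet)
        graph.setdefault(subnet, set()).add(router)
    return graph


def reachable(graph, src, dst, removed=None):
    """Depth-first search from src to dst that never visits the removed node."""
    if removed in (src, dst):
        return False
    seen, stack = {src}, [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for neighbor in graph.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return False


def single_points_of_failure(graph, src, dst):
    """Intermediate nodes whose removal disconnects src from dst."""
    return sorted(node for node in graph
                  if node not in (src, dst)
                  and not reachable(graph, src, dst, removed=node))


# Hypothetical configuration database rows: ipAddress(host, interface, address, mask)
rows = [
    ("r1", "e0", "10.0.1.1", "255.255.255.0"),
    ("r2", "e0", "10.0.1.2", "255.255.255.0"),
    ("r2", "e1", "10.0.2.2", "255.255.255.0"),
    ("r3", "e0", "10.0.2.3", "255.255.255.0"),
]
spofs = single_points_of_failure(build_graph(rows), ("router", "r1"), ("router", "r3"))
```

With these rows, router r2 and both subnets lie on the only path between r1 and r3, so all three are reported. Adding a second router attached to both subnets removes r2 from the result but leaves the two subnets.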
To check reachability in the presence of access-control lists, all edges at which these lists block a given packet are deleted, and then reachability analysis is repeated for the remaining graph.

Firewall requirements cannot be evaluated by enumerating all possible packets and checking for subsumption, equivalence, or redundancy. The total number of combinations of IPv4 source and destination addresses, source and destination ports, and protocols is astronomical: 2^104 (32 + 32 + 16 + 16 + 8 bits). Instead, symbolic techniques are used. Each policy is represented as a constraint on the following fields of a packet: source and destination address, protocol, and source and destination ports. The constraint is true precisely for those packets that are permitted by the firewall, taking rule ordering into account. Such a constraint can be constructed in time linear in the number of rules. Let P1 and P2 be two policies and C1 and C2 be, respectively, the constraints representing them. Then, P1 is subsumed by P2 if there is no solution to the constraint C1 ∧ ¬C2. To check that a rule in P1 is redundant, delete it from P1 and check that the resulting policy is equivalent to P1. For example, let a firewall contain the following rules that, for simplicity, only check whether the source and destination addresses are in definite ranges:

    1, 2, 3, 4, deny
    5, 6, 7, 8, permit
    10, 15, 15, 20, permit

The first rule states that any packet with source address between 1 and 2 and destination address between 3 and 4 is denied. The second and third rules are read similarly. These are represented by the following constraint C1 on the variables src and dst. : (1=, >D, and bitwise logic operators. This QFF is then efficiently solved by Kodkod. If ConfigAssure is unable to find a solution, it outputs a proof of unsolvability, inherited from Kodkod.
This proof is interpreted as a root-cause and guides configuration repair. Arithmetic quantifier-free forms constitute a good intermediate language between Boolean logic and first-order logic. Not only is it easy to express requirements in it, but it can also be efficiently compiled into Boolean logic. ConfigAssure was designed to avoid, where possible, the generation of very large intermediate constraints in Kodkod’s transformation of first-order logic into Boolean.

If the fields that are responsible for making a requirement false are known, then one way to repair these is as follows: replace these fields with variables and use ConfigAssure to find new values of these variables that make the requirement true. Two approaches can be used to narrow down these fields. The first exploits the proof of unsolvability of the falsified requirement to compute a type of root-cause. The second exploits properties of Datalog proofs and ZChaff to compute that set of fields whose cost of change is minimal. The second approach has been developed in the MulVAL [35,55,56] system. More generally, MulVAL is a system for enterprise security analysis using attack graphs.

Ordered Binary Decision Diagrams are an alternative to SAT solvers for evaluating firewall policy subsumption and rule redundancy with a method conceptually similar to that in Section 9.4.4.

The use of these techniques for building different parts of a validation system is now illustrated with concrete examples based on the case study in Section 9.2.

9.5.1 Configuration Acquisition by Querying

When the structure of a configuration file is simple, as it is for Cisco’s IOS, then it is not necessary to write a grammar with ANTLR or PADS/ML [47]. Instead, the structure can be put into a command database and then queried to construct the configuration database. The query needs to refer only to that part of the command database necessary to construct a given table. All other parts are ignored.
This idea provides substantial resilience to insertion of new command blocks, insertion of new subcommands in a known command block, and insertion of new keywords in subcommands. This idea is illustrated using Prolog, although any database engine could be used. Each command block is transformed into an ios_cmd tuple or Prolog fact, with the structure

    ios_cmd(FileName, MainCommand, ListOfSubCommands)

where MainCommand and each item in ListOfSubCommands is of the form [NestingLevel | ListOfTokens]. [A|B] means the list with head A and tail B. For example, the IOS file of Section 9.4.1, named f here, is transformed into the following Prolog tuples:

    ios_cmd(f, [0, hostname, router1], []).
    ios_cmd(f, [0, interface, 'Ethernet0'],
            [ [1, ip, address, '1.1.1.1', '255.255.255.0'],
              [1, crypto, map, mapx] ]).
    ios_cmd(f, [0, crypto, map, mapx, 6, 'ipsec-isakmp'],
            [ [1, set, peer, '3.3.3.3'],
              [1, set, 'transform-set', transx],
              [1, match, address, aclx] ]).
    ios_cmd(f, [0, crypto, ipsec, 'transform-set', transx, 'esp-3des', hmac], []).
    ios_cmd(f, [0, ip, 'access-list', extended, aclx],
            [ [1, permit, gre, host, '3.3.3.3', host, '4.4.4.4'] ]).

Note the close correspondence between the structure of command blocks in the IOS file and associated ios_cmd tuples. One can now write Prolog rules to construct the configuration database. For instance, to construct rows for the ipAddress table, one can use:

    ipAddress(H, I, A, M) :-
        ios_cmd(File, [0, hostname, H|_], _),
        ios_cmd(File, [0, interface, I|_], Args),
        member(SubCmd, Args),
        subsequence([ip, address, A, M], SubCmd).

The syntactic convention followed in Prolog is that identifiers beginning with capital letters are variables, otherwise they are constants. The :- symbol is a shorthand for if. All variables are universally quantified.
The rule states that ipAddress of an interface I on host H is A with mask M if there is a File containing a hostname command declaring host H, an interface command declaring interface I, and a subcommand of that command declaring its address and mask to be A and M, respectively.

Note that this definition is unaffected by subcommands of the interface command that are not of interest for computing ipAddress, or that are defined in a subsequent IOS release. It only tries to find a subcommand containing the sequence [ip, address, A, M]. It does not require that the subcommand be in a definite position in the block, or that the sequence address A, M appear in a definite position in the ip subcommand. Now, where H, I, A, M are variables, the query ipAddress(H, I, A, M) will succeed with the solution H = router1, I = 'Ethernet0', A = '1.1.1.1' and M = '255.255.255.0'. Here router1 is the host declared in file f, I is an interface on this host, and A and M are its address and mask, respectively.

The ipsec table is more complex, but querying simplifies the assembly of information from different parts of a configuration file. For each interface, one finds the name of a crypto map Map applied to that interface, and then finds the corresponding crypto map command, from which one can extract the peer address Peer, the filter Filter, and transform-set Transform. These values are used to select the crypto ipsec command from which the Encrypt and Hash values are extracted. Thus, the tuple ipsec(H, Address, Peer, Encrypt, Hash, Filter) is constructed.
    ipsec(H, Address, Peer, Encrypt, Hash, Filter) :-
        ios_cmd(File, [0, interface, I|_], Args),
        member([_, crypto, map, Map|_], Args),
        ios_cmd(File, [0, hostname, H|_], _),
        ipAddress(H, I, Address, _),
        ios_cmd(File, [0, crypto, map, Map|_], CArgs),
        member([_, set, peer, Peer|_], CArgs),
        member([_, match, address, Filter|_], CArgs),
        member([_, set, 'transform-set', Transform|_], CArgs),
        ios_cmd(File, [0, crypto, ipsec, 'transform-set', Transform, Encrypt, Hash], _).

The ipAddress and ipsec tuples are constructed in all possible ways via Prolog backtracking. Together, these form the configuration database for these protocols.

9.5.2 Specification Language

This section shows how Prolog can be used to specify the types of requirements in the case study of Section 9.2. It has already been used to validate VPN and BGP requirements [50, 58]. As shown in Fig. 9.9, routers RA and RB are in the same COI but RX is in a different COI. RA’s configuration violates two security requirements and one connectivity requirement. First, RA has a GRE tunnel into RX. Second, RA has a default static route using which it can forward packets destined to RX, to the WAN. Third, RA does not have a GRE tunnel into RB. All these violations need to be detected and configurations repaired.

Fig. 9.9 Network violating security and connectivity requirements (RA: COI1, eth_0 address = 100; RB: COI1, eth_0 address = 200; RX: COI2, eth_0 address = 300; tunnel_0 on RA, routers joined by the WAN)

A configuration database for the above network is represented by the following Prolog tuples:

    static_route(ra, 0, 32, 400).
    gre(ra, tunnel_0, 100, 300).
    ipAddress(ra, eth_0, 100, 0).
    ipAddress(rb, eth_0, 200, 0).
    ipAddress(rx, eth_0, 300, 0).
    coi([ra-coi1, rb-coi1, rx-coi2]).

The first tuple states that router ra has a default static route with a next hop of address 400. Normally, a mask is a sequence of 32 bits containing a sequence of ones followed by a sequence of zeros.
In the ipAddress tuple, a mask is represented implicitly as the number of zeros at the end of the sequence. This simplifies the computations we need. The route is called “default” because any address matches it. The second tuple states that router ra has a GRE tunnel originating from GRE interface tunnel_0 with local physical address 100 and remote physical address 300. The third tuple states that router ra has a physical interface eth_0 with address 100 and mask 0. The fourth and fifth tuples are read similarly. The last tuple lists the community of interest of each router.

Requirements are defined with Prolog clauses, e.g.:

    good :- gre_connectivity(ra, rb).

    gre_connectivity(RX, RY) :-
        gre_tunnel(RX, RY),
        route_available(RX, RY).

    gre_tunnel(RX, RY) :-
        gre(RX, _, _, RemoteAddr),
        ipAddress(RY, _, RemoteAddr, _).

    route_available(RX, RY) :-
        static_route(RX, Dest, Mask, _),
        ipAddress(RY, _, RemotePhysical, 0),
        contained(Dest, Mask, RemotePhysical, 0).

    contained(Dest, Mask, Addr, M) :-
        Mask >= M,
        N is ((2^32-1) << Mask) /\ Dest,
        N is ((2^32-1) << Mask) /\ Addr.

    bad :- gre_tunnel(ra, rx).
    bad :- route_available(ra, rx).

The first clause states that good is true provided there is GRE connectivity between routers ra and rb since they are in the same COI. The second clause states that there is GRE connectivity between any two routers RX and RY provided RX has a GRE tunnel configured to RY and a route available to RY. The third clause states that a GRE tunnel to RY is configured on RX provided there is a GRE tuple on RX whose remote address is that of an interface on RY. The fourth clause states that a route to RY is available on RX provided an address RemotePhysical on RY is contained within the address range of a static route on RX. The fifth clause checks this containment. << is the left-shift operator and /\ is the bitwise-and operator, not to be confused with the conjunction operator.
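The containment check of the fifth clause may be easier to follow outside Prolog. The sketch below mirrors it in Python using the same convention that a mask is the number of trailing zero bits. The list-based database and the function names are illustrative assumptions, not part of any system described here, and addresses are assumed to fit in 32 bits.

```python
FULL_MASK = 2**32 - 1  # 32 one-bits, as in the Prolog clause


def contained(dest, mask, addr, m):
    """Mirror of contained/4: addr (with mask m) falls inside the range dest/mask.

    A mask is the number of trailing zero bits, so mask 32 matches every
    32-bit address and mask 0 matches exactly one address.
    """
    if mask < m:
        return False
    prefix_bits = FULL_MASK << mask  # low `mask` bits are zero
    return (prefix_bits & dest) == (prefix_bits & addr)


# The example configuration database from the text, as plain Python tuples.
static_routes = [("ra", 0, 32, 400)]
ip_addresses = [("ra", "eth_0", 100, 0),
                ("rb", "eth_0", 200, 0),
                ("rx", "eth_0", 300, 0)]


def route_available(rx, ry):
    """True if some static route on rx covers an interface address on ry."""
    return any(contained(dest, mask, addr, 0)
               for host, dest, mask, _next_hop in static_routes if host == rx
               for owner, _iface, addr, m in ip_addresses if owner == ry and m == 0)
```

Because the static route on ra has mask 32, `route_available("ra", "rx")` and `route_available("ra", "rb")` both hold, which is why the default route violates the security requirement. A route with destination 200 and mask 0 would cover rb only.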
The sixth clause states that bad is true provided there is a GRE tunnel between ra and rx since ra and rx are not in the same COI. The last clause states that bad is also true provided a route on ra is available for packets with a destination on rx.

We now show how to capture requirements containing quantifiers. To capture the requirement all_good that between every pair of routers in a COI there is GRE connectivity, we can write:

    all_good :- not(same_coi_no_gre).
    same_coi_no_gre :- same_coi(X, Y), not(gre_connectivity(X, Y)).
    same_coi(X, Y) :- coi(L), member(X-C, L), member(Y-C, L).

The first rule states all_good is true provided same_coi_no_gre is false. The second rule states that same_coi_no_gre is true provided there exist X and Y that are in the same COI but for which gre_connectivity(X, Y) is false. The last rule states that X and Y are in the same COI provided there is some COI C such that X-C and Y-C are in the COI association list L. Similarly, we can capture the requirement no_bad that no router contains a route to a router in a different COI.

As previously mentioned, the MulVAL system has proposed the use of Datalog for specification and analysis of attack graphs. Datalog is a restriction of Prolog in which arguments to relations are just variables or atomic terms, i.e., no complex terms and data structures. This restriction means, in particular, that predicates such as all_good and all_pairs_gre cannot be specified and neither can subnet_id since it needs bitwise operations. However, the first five Prolog tuples above and the first three rules can be specified. This restriction, however, permits MulVAL to perform fine-grained analysis of root-causes of configuration errors and to compute strategies for their repair. This is discussed in the next section.

9.5.3 Evaluation for Repair

If a configuration database and requirements are expressed in Prolog, then its query capability can be used to evaluate whether requirements are true.
For example, the query route_available(ra, rb) is evaluated to be true by clauses for route_available, static_route, and contained. The query bad succeeds for two reasons. First, the static route on ra is a default route. It forwards packets to any destination, including to destinations in a different COI. Second, a GRE tunnel to router rx is configured on ra even though rx is in a different COI. On the other hand, the query good fails. This is because the predicate gre_tunnel(ra, rb) fails. The only GRE tunnel configured on ra is to rx, not to rb.

If requirement evaluation against a configuration database is the only goal, then a Prolog-based validation system is practical on a realistic scale. However, if a requirement is false for a configuration database and the goal is to change some fields in some tuples so that the requirement becomes true, then Prolog is not adequate. The Prolog query (good,not(bad)), representing the conjunction of good and not(bad), will simply fail. Prolog will not return new values of these fields that make the query true. In order to efficiently compute new values of these fields, a constraint solver with the capability to compute a proof of unsolvability is needed. Such a capability is provided by the ConfigAssure system.

ConfigAssure allows one to replace some fields in some tuples in a configuration database with configuration variables. These variables are unrelated to Prolog variables. ConfigAssure also allows one to specify a requirement R as an equivalent QFF RC on these configuration variables. Solving RC would compute new values of these fields, in effect repairing the fields. For example, suppose we suspect that the query (good,not(bad)) fails because addresses and the static route mask are incorrect. We can replace all these with configuration variables to obtain the following database:

    static_route(ra, dest(0), mask(0), 400).
    gre(ra, tunnel_0, gre_a_local(0), gre_a_remote(0)).
    ipAddress(ra, eth_0, ra_addr(0), 0).
    ipAddress(rb, eth_0, rb_addr(0), 0).
    ipAddress(rx, eth_0, rx_addr(0), 0).
    coi([ra-coi1, rb-coi1, rx-coi2]).

Here, dest(0), mask(0), gre_a_local(0), gre_a_remote(0), ra_addr(0), rb_addr(0), rx_addr(0) are all configuration variables. In order that this database satisfy (good ∧ not(bad)), these configuration variables must satisfy the following constraint RC:

    ¬gre_a_remote(0)=rx_addr(0) ∧
    ¬contained(dest(0), mask(0), rx_addr(0), 0) ∧
    gre_a_remote(0)=rb_addr(0) ∧
    contained(dest(0), mask(0), rb_addr(0), 0) ∧
    ¬ra_addr(0)=rb_addr(0) ∧ ¬rb_addr(0)=rx_addr(0) ∧ ¬rx_addr(0)=ra_addr(0)

The constraint on the first two lines is equivalent to not(bad). It states that ra should neither have a GRE tunnel nor a static route to rx. The constraint on the next two lines is equivalent to good. It states that ra should have both a GRE tunnel and a static route to rb. The constraint on the last line states that all interface addresses are unique.

Solving this constraint would indeed find new values of configuration variables and hence repair the fields. However, one may change fields, such as ra_addr(0), unrelated to the failure of (good,not(bad)). To change fields only related to failure, one can exploit the proof of unsolvability that ConfigAssure automatically computes when it fails to solve a requirement. This proof is a typically small and unsolvable part of the requirement, and can be taken to be a root-cause of unsolvability. The idea is to generate a new constraint InitVal that is a conjunction of equations of the form x = c where x is a configuration variable that replaced a field and c is the initial value of that field. Now try to solve RC ∧ InitVal. Since R is false for the database without variables, ConfigAssure will find RC ∧ InitVal to be unsolvable and return a proof of unsolvability. If, in this proof, there is an equation x = c that is also in InitVal, then relax the value of x by deleting x = c from InitVal to create InitVal’.
Reattempt a solution to RC ∧ InitVal’ to find a new value of x. More than one such equation can be deleted in a single step. For example, the definition of InitVal for the above configuration variables is:

    dest(0)=0 ∧ mask(0)=32 ∧ gre_a_local(0)=100 ∧ gre_a_remote(0)=300 ∧
    ra_addr(0)=100 ∧ rb_addr(0)=200 ∧ rx_addr(0)=300

Submitting RC ∧ InitVal to ConfigAssure generates a proof of unsolvability that ra should have a tunnel to rb but instead has one to rx:

    gre_a_remote(0)=rb_addr(0) ∧ gre_a_remote(0)=300 ∧ rb_addr(0)=200

Deleting the second equation from InitVal to obtain InitVal’ and solving RC ∧ InitVal’ we obtain another proof of unsolvability that ra has a static route to rx:

    rx_addr(0)=300 ∧ dest(0)=0 ∧ mask(0)=32 ∧
    ¬contained(dest(0), mask(0), rx_addr(0), 0)

Deleting the second and third equations and solving, we obtain a solution that fixes both the GRE tunnel and the static route on ra:

    dest(0)=200
    mask(0)=0
    gre_a_remote(0)=200
    gre_a_local(0)=100
    ra_addr(0)=100
    rb_addr(0)=200
    rx_addr(0)=300

Values of just the first three variables needed to be recomputed. Values of the others did not. Note that ra_addr(0) never appeared in a proof of unsolvability even though it did in RC. Thus, its value definitely does not need to be recomputed. This is not obvious from RC. Note also that repair is holistic in that it satisfies both good and not(bad).

The remaining task is generation of the constraint RC. It is accomplished by thinking about specification as a method of computing an equivalent quantifier-free formula, i.e., defining the predicate eval(Req, RC) where Req is the name of a requirement and RC is a QFF equivalent to Req. The original Prolog specification of Req in Section 9.5.2 is no longer needed. It is replaced by a metalevel version as follows:

    eval(bad, or(C1, C2)) :-
        eval(gre_tunnel(ra, rx), C1),
        eval(route_available(ra, rx), C2).

    eval(gre_tunnel(RX, RY), RemoteAddr=Addr) :-
        gre(RX, _, _, RemoteAddr),
        ipAddress(RY, _, Addr, _).

    eval(route_available(RX, RY), C) :-
        static_route(RX, Dest, Mask, _),
        ipAddress(RY, _, RemotePhysical, _),
        C=contained(Dest, Mask, RemotePhysical, 0).

    eval(addr_unique, C) :-
        andEach([not(ra_addr(0)=rb_addr(0)),
                 not(rb_addr(0)=rx_addr(0)),
                 not(rx_addr(0)=ra_addr(0))], C).

    eval(topReq, C) :-
        eval(good, G),
        eval(bad, B),
        eval(addr_unique, AU),
        andEach([G, not(B), AU], C).

These rules capture the semantics of the Prolog rules. The first states that a QFF equivalent to bad is the disjunction of C1 and C2 where C1 is the QFF equivalent to gre_tunnel(ra, rx) and C2 is the QFF equivalent to route_available(ra, rx). The second rule states that the QFF equivalent to gre_tunnel(RX, RY) is RemoteAddr=Addr where RemoteAddr is the remote physical address of a GRE tunnel on RX and Addr is the address of an interface on RY. The third rule states that the QFF equivalent to route_available(RX, RY) is C provided C is the constraint that RX contains a static route for an address on RY. The fourth rule computes the QFF for all interface addresses being unique. The last rule computes the QFF for the top-level constraint topReq, which conjoins good, the negation of bad, and address uniqueness. Now, the Prolog query eval(topReq, RC) computes RC as above. As has been shown in [51], QFFs are much more expressive than Boolean logic, so it is not hard to write requirements using the eval predicate.

9.5.4 Repair with MulVAL

The MulVAL system proposes an alternative, precise method of computing the fields that cause the success of an undesirable requirement provided that requirement is expressed in Datalog. A requirement, such as bad, is said to be undesirable if it enables adversary success. This method is based on the observation that any tuple in a proof of an undesirable requirement is responsible for the truth of that requirement. These tuples contain all the fields that need to be replaced by configuration variables.
For example, one proof of bad with the original Prolog specification in Section 9.5.2 is:

    bad
    gre_tunnel(ra, rx)
    gre(ra, tunnel_0, 100, 300) ∧ ipAddress(rx, eth_0, 300, 0)

Here, each condition is implied by its successor by the use of a rule in the Prolog specification. The second proof of bad is:

    bad
    route_available(ra, rx)
    static_route(ra, 0, 32, 400) ∧ ipAddress(rx, eth_0, 300, 0) ∧ contained(0, 32, 300, 0)

The tuples that contribute to the proof of bad are:

    gre(ra, tunnel_0, 100, 300)      -- from the first proof
    ipAddress(rx, eth_0, 300, 0)     -- from the first proof
    static_route(ra, 0, 32, 400)     -- from the second proof

The following tuples do not contribute to the proof of bad:

    ipAddress(ra, eth_0, 100, 0).
    ipAddress(rb, eth_0, 200, 0).

The three tuples in the proof of bad contain all the fields that need to be replaced by configuration variables. Note that the addresses of interfaces at ra and rb do not need to be replaced.

The MulVAL system does not actually compute new values of fields. It only computes the set of tuples that should be disabled to disable all proofs of the undesirable property. A tuple can be disabled by changing its fields to different values or deleting it. However, MulVAL computes the set in an optimal way. It first derives a Boolean formula representing all the ways in which tuples should be disabled, then solves this with a minimum-cost SAT solver. A solution represents a set of tuples to disable. For example, the Boolean formula for the above two proofs is:

    (¬gre(ra, tunnel_0, 100, 300) ∨ ¬ipAddress(rx, eth_0, 300, 0)) ∧
    (¬ipAddress(rx, eth_0, 300, 0) ∨ ¬static_route(ra, 0, 32, 400))

The first conjunct states that to disable the first proof, either the gre tuple or the ipAddress tuple must be disabled. The second conjunct states that to disable the second proof, either the ipAddress or the static_route tuple must be disabled. Costs are associated with disabling each tuple.
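The selection of a cheapest set of tuples to disable can be illustrated with a brute-force search over the two proofs above. This Python sketch stands in for the minimum-cost SAT solver only conceptually and is not MulVAL's implementation. The cost values are hypothetical, and exhaustive search is practical only for a handful of tuples.

```python
from itertools import combinations

GRE = "gre(ra, tunnel_0, 100, 300)"
IP_RX = "ipAddress(rx, eth_0, 300, 0)"
ROUTE = "static_route(ra, 0, 32, 400)"

# Each proof of `bad` is the set of database tuples it depends on.
proofs = [{GRE, IP_RX}, {ROUTE, IP_RX}]

# Hypothetical costs: disabling the interface address is assumed expensive
# because other requirements depend on it.
cost = {GRE: 1, ROUTE: 1, IP_RX: 5}


def min_cost_disable(proofs, cost):
    """Cheapest set of tuples that removes at least one tuple from every proof."""
    candidates = sorted(set().union(*proofs))
    best, best_cost = None, None
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            # A proof is disabled once any of its tuples is disabled.
            if all(chosen & proof for proof in proofs):
                total = sum(cost[t] for t in chosen)
                if best_cost is None or total < best_cost:
                    best, best_cost = chosen, total
    return best, best_cost


to_disable, total_cost = min_cost_disable(proofs, cost)
```

Under these assumed costs the search selects the gre and static_route tuples at a total cost of 2, rather than the single shared ipAddress tuple at cost 5. If the ipAddress tuple were cheap, disabling it alone would break both proofs.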
The minimum-cost SAT solver computes that set of tuples whose cost of disabling is a minimum. For example, the cost of disabling the ipAddress tuple may be high because many requirements depend on this tuple. The cost of disabling the static_route and gre tuples may be a lot lower.

It is not, in general, simple to assign cost to disabling a tuple. Furthermore, this approach only computes how to disable an undesirable requirement. It does not guarantee that disabling these tuples will not also disable desirable requirements, unless these latter requirements are also expressed in Boolean logic and the combined constraint is solved.

9.5.5 Evaluating Firewall Requirements with Binary Decision Diagrams

Hamed et al. [34] evaluate firewall subsumption and rule redundancy using Ordered Binary Decision Diagrams [12]. Their algorithm is conceptually the same as in Section 9.4.4. It first transforms firewall policies into Boolean constraints upon source and destination addresses, source and destination ports, and the protocol. These constraints are true only for those packets that are permitted by the firewall. These fields are represented as sequences of Boolean variables, e.g., an address field as a sequence of 32 variables and a port field as a sequence of 16 variables. The algorithm then checks whether combinations of constraints for evaluating subsumption and redundancy have a solution. Since constraints are represented as Ordered Binary Decision Diagrams, this check is straightforward. By contrast, ConfigAssure represents the above fields as integer variables and represents a policy as an arithmetic quantifier-free form constraint. It lets Kodkod transform this into a Boolean constraint and use a SAT solver to check satisfiability.

9.6 Related Work

9.6.1 Configuration Acquisition by Type Inference

Another approach to parsing configuration files is to use the PADS/ML system [47].
Based on the functional language ML, PADS/ML describes the accepted language as if it were a type definition. PADS/ML supports the generation of a parser, a printer, a data structure representation, and a generic interface to this representation. The generated code is in the OCaml [43] language, and additional tools, written in OCaml, then manipulate the internal data structure. This internal data structure is traversed to populate the relational database in the same way that the ANTLR abstract syntax tree is traversed.

Adaptive parsers are reported in [17]. These can modify the language they recognize when given examples of legal input. The inference system recognizes commands that are only handled in the abstract, much as the ANTLR grammar of IP Assure skips over some commands. Repeated instances of commands are used to generate new PADS/ML types, which are then further refined to provide access to fields in the commands. This means that as the IOS language evolves, the parser can evolve to provide an ever richer internal representation.

9.6.2 Symbolic Reachability Analysis

Xie et al. [72] describe a system that performs reachability analysis for sets of packets rather than for each packet individually. This makes it possible to evaluate a requirement such as “a change in static routes at one or more routers does not change the set of packets that can flow between two nodes.” It is not feasible to evaluate such a requirement by enumerating all packets and checking reachability. In this system, the reachability upper bound is defined to be the union of all packets permitted by each possible forwarding path from the source to the destination. This bound models a security policy that denies some packets (i.e., those outside the upper bound) under all conceivable operational conditions. The reachability lower bound is defined to be the common set of packets allowed by every feasible forwarding path from the source to the destination.
This bound models a resilience policy that assures the delivery of some packets despite network faults, as long as a backup forwarding path exists. Algorithms are created for estimating the reachability upper and lower bounds from a network’s packet filter configurations. Moreover, the work shows that it is possible to jointly reason about how packet filters, routing, and packet transformations affect reachability.

An interesting implementation of reachability analysis for sets of packets is found in the ConfigChecker [3] system. It represents the network’s packet forwarding behavior as a giant state machine in which a state defines what packets are at what routers. However, the state-transition relation is not represented explicitly but rather symbolically as a constraint that must be satisfied by two states for the network to transition between these. This constraint itself is represented as an Ordered Binary Decision Diagram and input to a symbolic model checker [48]. Reachability requirements such as that above are expressed in Computational Tree Logic [48] and the symbolic model checker is used to evaluate these. The transition-relation also takes into account features such as IPSec tunnels, multicast, and network address translation.

9.6.3 Alloy Specification Language

Alloy [2, 39] is a first-order relational logic system. It lets one specify object types and their attributes. It also lets one specify first-order logic constraints on these attributes. These are more expressive than Prolog constraints. Alloy solves constraints by compiling these into Kodkod and using Kodkod’s constraint solver. The use of Alloy for network configuration management was explored in [49]. Alloy’s specification language is very appropriate for specifying requirements. All the requirements in Section 9.2 can be compactly expressed in Alloy. However, its constraint solver is inappropriate for evaluating requirements.
This is because the compilation of first-order logic into Boolean logic leads to very large intermediate constraints. Kodkod addresses this problem by its partial-model optimization that exploits knowledge about parts of the solution. If the value of a variable is already known, it does not appear in the constraint that is submitted to the SAT solver.

ConfigAssure follows a related approach but at a higher layer. The intuition is that given a requirement, many parts of it can be efficiently solved with non-SAT methods. Solving these parts and simplifying can yield a requirement that truly requires the power of a SAT solver. This plan is carried out by transforming a requirement into an equivalent quantifier-free form by defining the eval predicate for that requirement. QFFs have the property that not only is it easy to write eval rules, but also that QFFs are efficiently compiled and solved by Kodkod. Evaluation of parts of requirements and simplification are accomplished in the definition of eval.

9.6.4 BGP Validation

The Internet is, by definition, a “network of networks,” and the responsibility for gluing together the tens of thousands of independently administered networks falls to the Border Gateway Protocol (BGP) [59, 64]. A network, or Autonomous System (AS), uses BGP to tell neighboring networks about each block of IP addresses it can reach; in turn, neighboring ASes propagate this information to their neighbors, allowing the entire Internet to learn how to direct packets toward their ultimate destinations. On the surface, BGP is a relatively simple path-vector routing protocol, where each router selects a single best route among those learned from its neighbors, adds its own AS number to the front of the path, and propagates the updated routing information to its neighbors for their consideration; packets flow in the reverse direction, with each router directing traffic along the chosen path in a hop-by-hop fashion.
Yet, BGP is a highly configurable protocol, giving network operators significant control over how each router selects a “best” route and whether that route is disseminated to its neighbors. The configuration of BGP across the many routers in an AS collectively expresses a routing policy that is based on potentially complex business objectives [15]. For example, a large Internet Service Provider (ISP) uses BGP policies to direct traffic on revenue-generating paths through its own downstream customers, rather than using paths through its upstream providers. A small AS like a university campus or corporate network typically does not propagate a BGP route learned from one upstream provider to another, to avoid carrying data traffic between the two larger networks. In addition, network operators may configure BGP to filter unexpected routes that arise from configuration mistakes and malicious attacks in other ASes [14, 52]. BGP configuration also affects the scalability of the AS, where network operators choose not to propagate routes for their customers’ small address blocks to reduce the size of BGP routing tables in the rest of the Internet. Finally, network operators tune their BGP configuration to direct traffic away from congested paths to balance load and improve user-perceived performance [25].

The routing policy is configured as a “route map” that consists of a sequence of clauses that match on some attributes in the BGP route and take a specific action, such as discarding the route or modifying its attributes with the goal of influencing the route-selection process. BGP defines many different attributes, and the route-selection process compares the routes one attribute at a time to ultimately identify one “best” route. This somewhat indirect mechanism for selecting and propagating routes, coupled with the large number of route attributes and route-selection steps, makes configuring BGP routing policy immensely complicated and error-prone.
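The route-map mechanism described above can be illustrated with a small sketch. It is not drawn from any vendor's implementation: the Route and Clause types, the apply_route_map function, and the AS number 65001 are hypothetical names chosen for illustration, and the sketch assumes first-match-wins semantics with an implicit discard when no clause matches, which is common but not universal.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Route:
    """A simplified BGP route (hypothetical model, only a few attributes)."""
    prefix: str
    as_path: Tuple[int, ...]   # AS numbers, most recently added first
    local_pref: int = 100


@dataclass(frozen=True)
class Clause:
    """One route-map clause: a match condition plus an action."""
    matches: Callable[[Route], bool]
    permit: bool                          # False means "discard the route"
    set_local_pref: Optional[int] = None  # optional attribute modification


def apply_route_map(clauses: Sequence[Clause], route: Route) -> Optional[Route]:
    """Return the (possibly modified) route, or None if it is discarded.

    Assumption: the first matching clause decides, and a route that matches
    no clause is discarded.
    """
    for clause in clauses:
        if clause.matches(route):
            if not clause.permit:
                return None
            if clause.set_local_pref is not None:
                return replace(route, local_pref=clause.set_local_pref)
            return route
    return None


CUSTOMER_AS = 65001  # hypothetical customer AS number

route_map = [
    # Discard routes with implausibly long AS paths.
    Clause(matches=lambda r: len(r.as_path) > 10, permit=False),
    # Prefer routes learned directly from the customer.
    Clause(matches=lambda r: r.as_path[:1] == (CUSTOMER_AS,),
           permit=True, set_local_pref=200),
    # Accept everything else unchanged.
    Clause(matches=lambda r: True, permit=True),
]
```

Even in this toy form, the result depends on clause order: moving the catch-all clause to the front would silently disable the other two, which is the kind of mistake static checkers aim to catch.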
Network operators often use tools for automatically configuring their BGP-speaking routers [11, 21, 29]. These tools typically consist of a template that specifies the sequence of vendor-specific commands to send to the router, with parameters unique to each BGP session populated from a database; for example, these parameters might indicate a customer’s name, AS number, address block(s), and the appropriate route-maps to use. When automated tools are not used, the network operators typically have configuration-checking tools to ensure that the sessions are configured correctly, and that different sessions are configured in a consistent manner [16, 24].

Configuring the BGP sessions with neighboring ASes, while important, is not the only challenge in BGP configuration. In practice, an AS consists of multiple routers in different locations; in fact, a large ISP may easily have hundreds if not thousands of routers connected by numerous links into a backbone topology. Different routers connect to different neighbor ASes, giving each router only a partial view of the candidate BGP routes. As such, large ISPs typically run BGP inside their networks to allow the routers to construct a more complete view of the available routes. These internal BGP (iBGP) sessions must be configured correctly to ensure that each router has all the information it needs to select routes that satisfy the AS’s policy. The simplest solution is to have a “full-mesh” configuration, with an iBGP session between each pair of routers. However, this approach does not scale, forcing large ISPs to introduce hierarchy by configuring route reflectors or confederations that limit the number of iBGP sessions and constrain the dissemination of routes. Each route reflector, for instance, selects a single “best route” that it disseminates to its clients; as such, the route-reflector clients do not learn all the candidate routes they would have learned in a full-mesh configuration.
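A back-of-the-envelope calculation illustrates why the full mesh does not scale. The sketch below is an illustration under stated assumptions rather than anything taken from the cited work: the function names are hypothetical, and the hierarchy is assumed to have a single level of route reflectors that are fully meshed among themselves, with every client peering with a fixed number of reflectors.

```python
def full_mesh_sessions(n_routers: int) -> int:
    """Number of iBGP sessions when every pair of routers peers directly."""
    return n_routers * (n_routers - 1) // 2


def route_reflector_sessions(n_reflectors: int, n_clients: int,
                             parents_per_client: int = 2) -> int:
    """Sessions in an assumed one-level hierarchy: reflectors fully meshed,
    each client peering with `parents_per_client` reflectors."""
    return full_mesh_sessions(n_reflectors) + n_clients * parents_per_client


# Hypothetical AS with 500 routers.
print(full_mesh_sessions(500))            # 124750
print(route_reflector_sessions(10, 490))  # 1025
```

The session count drops by roughly two orders of magnitude in this example, at the price of each client seeing only the routes its reflectors chose to pass on.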
When the “topology” formed by these iBGP sessions violates certain properties, routing anomalies like protocol oscillations, forwarding loops, traffic blackholes, and violations of business contracts can arise [6, 31, 74]. Fortunately, static analysis of the iBGP topology, spread over the configuration of the routers inside the AS, can detect when these problems might arise [24]. Such tools check, for instance, that the top-level route reflectors are fully connected by a “full-mesh” of iBGP sessions. This prevents “signaling partitions” that could prevent some routers from learning any route for a destination. Static analysis can also check that route reflectors are “close” to their clients in the underlying network topology, to ensure that the route reflectors make the same routing decisions that their clients would have made with full information about the alternate routes. Finally, these tools can validate an ISP’s own local rules for ensuring reliability in the face of router failures. For instance, static analysis can verify that each router is configured with at least two route-reflector parents. Collectively, these kinds of checks on the static configuration of the network can prevent a wide variety of routing anomalies.

For the most part, configuration validation tools operate on the vendor-specific configuration commands applied to individual routers. Configuration languages vary from one vendor to another; for example, Cisco and Juniper routers have very different syntax and commands, even for relatively similar configuration tasks. Even within a single company, different router products and different generations of the router operating system have different commands and options. This makes configuration validation an immensely challenging task, where the configuration-checking tools must support a wide range of languages and commands.
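Once vendor-specific configurations have been parsed into a common model, two of the iBGP checks described in this section (a full mesh among top-level route reflectors, and at least two route-reflector parents per router) reduce to simple set operations. The following is a minimal sketch, not the algorithm of the tools in [24]: the data model, the function names, and the router names are hypothetical, and a "top-level" reflector is assumed to be one that is not itself a client of any other reflector.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def top_level_reflectors(clients_of: Dict[str, Set[str]]) -> Set[str]:
    """Reflectors that are not clients of any other reflector (assumption)."""
    all_clients = set().union(*clients_of.values())
    return {rr for rr in clients_of if rr not in all_clients}


def check_ibgp(routers: Set[str],
               sessions: Set[FrozenSet[str]],
               clients_of: Dict[str, Set[str]],
               min_parents: int = 2) -> List[str]:
    """Return a list of human-readable violations (empty if none found)."""
    problems = []
    top = top_level_reflectors(clients_of)

    # Check 1: top-level reflectors must form a full mesh of iBGP sessions.
    for a, b in combinations(sorted(top), 2):
        if frozenset((a, b)) not in sessions:
            problems.append(
                f"missing iBGP session between top-level reflectors {a} and {b}")

    # Check 2: every other router needs enough route-reflector parents.
    for r in sorted(routers - top):
        parents = [rr for rr, clients in clients_of.items() if r in clients]
        if len(parents) < min_parents:
            problems.append(
                f"{r} has {len(parents)} route-reflector parent(s), "
                f"expected at least {min_parents}")
    return problems


# Hypothetical five-router AS: rr2-rr3 are not meshed, c2 has one parent.
routers = {"rr1", "rr2", "rr3", "c1", "c2"}
clients_of = {"rr1": {"c1", "c2"}, "rr2": {"c1"}, "rr3": set()}
sessions = {frozenset(p) for p in [("rr1", "rr2"), ("rr1", "rr3"),
                                   ("rr1", "c1"), ("rr1", "c2"), ("rr2", "c1")]}
problems = check_ibgp(routers, sessions, clients_of)
```

In this example the checker reports two violations. The harder part in practice is the step this sketch skips: extracting `sessions` and `clients_of` reliably from many configuration dialects.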
To address these challenges, research and standards activities have led to new BGP configuration languages that are independent of the vendor-specific command syntax [1, 71], particularly in the area of BGP routing policy. In addition to abstracting vendor-specific details, these frameworks provide some support for configuring entire networks rather than individual routers. For example, the Routing Policy Specification Language (RPSL) [1] is object-oriented, where objects contain AS-wide policy and administrative information that can be published in Internet Routing Registries (IRRs) [37]. Routing policy can be expressed in terms of user-friendly keywords for defining actions and groups of address blocks or AS numbers. Configuration-generation tools can read these specifications to generate vendor-specific commands to apply to the individual routers [37]. However, while RPSL is used for publishing information in the IRRs, many ISPs still use their own configuration tools (or manual processes) for configuring their underlying routers.

In summary, the configuration of BGP takes place at many levels: within a single router (to specify a single end point of a BGP session with the appropriate route-maps and addresses), between pairs of routers (to ensure consistent configuration of the two ends of a BGP session), across different sessions to the same neighboring AS (to ensure consistent application of the routing policy at each connection point), and across an entire AS (to ensure that the iBGP topology is configured correctly). In recent years, tools have emerged for static analysis of router-configuration data to identify potential configuration mistakes, and for automated generation of the configuration commands that are sent to the routers.
Still, many interesting challenges remain in raising the level of abstraction for configuring BGP, to move from the low-level focus on configuring individual routers and BGP sessions toward configuring an entire network, and from the specific details of the BGP route attributes and route-selection process to a high-level specification of an AS’s routing policy. As the Internet continues to grow, and the business relationships between ASes become increasingly complex, these issues will only become more important in the years ahead.

9.6.5 Other Validation Systems

Netsys was an early software product for configuration validation. It was first acquired by Cisco Systems and then by WANDL Corporation. It contained about 100 requirements that were evaluated against router configurations. OPNET offers validation products NetDoctor and NetMapper. These are not standalone but rather modules that need to be plugged into the base IT Sentinel system [54]. For more description of these, see [23]. None of these products offer configuration repair, reasoning about firewalls, or symbolic reachability analysis. The Smart Firewalls work [13] was an early attempt at Telcordia to develop a network configuration validation system. A survey of system, not network, configuration is found in [4]. Formal methods for jointly reasoning about IPSec and firewall policies are described in [32]. A high-level configuration language is described in [45].

9.7 Summary and Directions for Future Research

To set up network infrastructure satisfying end-to-end requirements, it is not only necessary to run appropriate protocols on components but also to correctly configure these components. Configuration is the “glue” for logically integrating components at and across multiple protocol layers. Each component has a finite number of configuration parameters, each of which can be set to a definite value.
However, today, the large conceptual gap between end-to-end requirements and configurations is manually bridged. This causes large numbers of configuration errors whose adverse effects on the security, reliability, and deployment cost of network infrastructure are well documented. See also [57, 62]. Thus, it is critical to develop validation tools that check whether a given configuration is consistent with the requirements it is intended to implement.

Besides checking consistency, configuration validation has another interesting application, namely network testing. The usual invasive approach to testing has several limitations. It is not scalable. It consumes resources of the network and network administrators and has the potential to unleash malware into the network. Some properties such as absence of single points of failure are impractical to test as they require failing components in operational networks. A noninvasive alternative that overcomes these limitations is analyzing configurations of network components. This approach is analogous to testing software by analyzing its source code rather than by running it. This approach has been evaluated for a real enterprise.

Configuration validation is inherently hard. Whether a component is correctly configured cannot be evaluated in isolation. Rather, the global relationships into which the component has been logically integrated with other components have to be evaluated. Configuration repair is even harder since changing configurations to make one requirement true may falsify another. The configuration change should be holistic in that it should ensure that all requirements concurrently hold.

This chapter described the challenges of configuring a typical collaboration network and the benefits of using a validation system. It then presented an abstract design of a configuration validation system.
It consists of four subsystems: configuration acquisition system, requirement library, specification language, and evaluation system. The chapter then surveyed technologies for realizing this design. Configuration acquisition systems have been built using three approaches: parser generator, type inference, and database query. Classes of requirements in their Requirements Library are logical structure integrity, connectivity, security, reliability, performance, and government regulatory. Specification languages include visual templates, Prolog, Datalog, arithmetic quantifier-free forms, and Computational Tree Logic. Evaluation systems have used graph algorithms, the Kodkod constraint solver for first-order logic constraints, the ZChaff SAT solver for Boolean constraints, Binary Decision Diagrams, and symbolic model checkers. Visualization of not just the IP topology but also of various other logical topologies provides useful insights into network architecture. Logic-based languages are very useful for creating a validation system, particularly for solving the hard problems of configuration repair and symbolic reasoning about requirements. Future research needs to focus on all four components of a validation system. Robust configuration acquisition systems are critical to automated validation. The accumulated experience of building large networks is vast but largely unformalized. Formalizing these in a Requirement Library would not only raise the level of abstraction at which network requirements are written but also improve their precision. New classes of requirements, one on VLAN optimization and another on configuration complexity, are reported in [28, 65] and in [9], respectively. Specification languages that are easy to use by network administrators are also critical for broad adoption of validation systems. Logic-based languages are a good candidate despite the perception that these are too complex for administrators. 
These are closest in form to the natural language requirements in network design documents. The configuration languages administrators use are already declarative in that they do not contain side-effects and the ordering of commands is unimportant. Introducing logical operators, data structures, and quantifiers into these is a natural step toward making these much more expressive. See [71] for a recent example of using the Haskell functional language for specifying BGP policies. High-level descriptions of component configurations could then again be composed by logical operators to describe network-wide requirements. In the nearer term, even making an implementation of the Requirement Library available as APIs in system administration languages like Perl or Python should vastly improve configuration debugging.

Much greater understanding is needed of useful ways to visualize logical structures and relationships in networks. One might derive inspiration from works such as [70]. Finally, a good framework for repairing configurations was described in Section 9.5.3, but it needs to be further explored. For example, one needs to understand how the convergence of the repair procedure is affected by choice of configuration variable to relax, and how ideas of MulVAL can be generalized and combined with those of ConfigAssure. Building enough trust among network administrators that they will allow automated repair of their component configurations is an open problem.

Acknowledgments

We are very grateful to Jennifer Rexford, Andreas Voellmy, Richard Yang, Chuck Kalmanek, Simon Ou, Geoffrey Xie, Yitzhak Mandelbaum, Ehab Al-Shaer, Sanjay Rao, Adel El-Atawy, and Paul Anderson for their contributions and comments.

References

1. Alaettinoglu, C., Villamizar, C., Gerich, E., Kessens, D., Meyer, D., Bates, T., et al. (1999). Routing Policy Specification Language. RFC 2622.
2. Alloy. http://alloy.mit.edu/
3. Al-Shaer, E., Marrero, W., El-Atawy, A., & ElBadawy, K. (2008). Towards global verification and analysis of network access control configuration. Technical Report, TR-08008, DePaul University, from http://www.mnlab.cs.depaul.edu/projects/ConfigChecker/TR08-008/paper.pdf
4. Anderson, P. (2006). System configuration. In R. Farrow (Ed.), Short Topics in System Administration. USENIX Association.
5. ANTLR v3. http://www.antlr.org/
6. Basu, A., Ong, C.H., Rasala, A., Shepherd, F.B., & Wilfong, G. (2002). Route oscillations in I-BGP with route reflection. ACM SIGCOMM.
7. Bates, T., Chandra, R., & Chen, E. (2000). BGP route reflection – an alternative to full mesh IBGP. RFC 2796. http://www.faqs.org/rfcs/rfc2796
8. Bellovin, R., & Bush, R. (2009). Configuration management and security. IEEE Journal on Selected Areas in Communications [special issue on Network Infrastructure Configuration], 27(Suppl. 3).
9. Benson, T., Akella, A., & Maltz, D. (2009). Unraveling the complexity of network management. USENIX Symposium on Network Systems Design and Implementation.
10. Berkowitz, H. (2000). Techniques in OSPF-Based Network. http://tools.ietf.org/html/draft-ietf-ospf-deploy-00
11. Bohm, H., Feldmann, A., Maennel, O., Reiser, C., & Volk, R. (2005). Network-wide interdomain routing policies: Design and realization. Unpublished report, http://www.net.t-labs.tu-berlin.de/papers/BFMRV-NIRP-05.pdf.
12. Bryant, R. (1986). Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, C-35(Suppl. 8), 677–691.
13. Burns, J., Cheng, A., Gurung, P., Martin, D., Rajagopalan, S., Rao, P., et al. (2001). Automatic management of network security policy. Proceedings of DARPA Information Survivability Conference and Exposition (DISCEX II’01), volume 2, Anaheim, CA.
14. Butler, K., Farley, T., McDaniel, P., & Rexford, J. (2008). A survey of BGP security issues and solutions. Unpublished manuscript.
15. Caesar, M., & Rexford, J. (2005). BGP routing policies in ISP networks. IEEE Network Magazine [Special issue on Interdomain Routing], 19, 5–11.
16. Caldwell, D., Gilbert, A., Gottlieb, J., Greenberg, A., Hjalmtysson, G., & Rexford, J. (2003). The cutting EDGE of IP router configuration. ACM SIGCOMM HotNets Workshop.
17. Caldwell, D., Lee, S., & Mandelbaum, Y. (2008). Adaptive parsing of router configuration languages. Proceedings of the Internet Management Workshop.
18. Cheswick, W., Bellovin, S., & Rubin, A. (2003). Firewalls and Internet security: Repelling the Wily Hacker. Reading, MA: Addison-Wesley.
19. Cisco Systems. (2005). DiffServ – The Scalable End-to-End QoS Model.
20. Distributed Management Task Force, from http://www.dmtf.org/home
21. Enck, W., Moyer, T., McDaniel, P., Sen, S., Sebos, P., Spoerel, S., et al. (2009). Configuration management at massive scale: System design and experience. IEEE Journal on Selected Areas in Communications, 27(Suppl. 3), 323–335.
22. Farinacci, D., Li, T., Hanks, S., Meyer, D., & Traina, P. (2000). Generic routing and encapsulation. RFC 2784.
23. Feamster, N. (2006). Proactive techniques for correct and predictable Internet routing. Doctoral dissertation, Massachusetts Institute of Technology, Boston, MA.
24. Feamster, N., & Balakrishnan, H. (2005). Detecting BGP configuration faults with static analysis. Symposium on Networked Systems Design and Implementation.
25. Feamster, N., & Rexford, J. (2007). Network-wide prediction of BGP routes. IEEE/ACM Transactions on Networking, 15(2), 253–266.
26. Federal Information Security Management Act. (2002). National Institute of Standards and Technology.
27. Fu, Z., & Malik, S. (2006). Solving the minimum-cost satisfiability problem using branch and bound search. Proceedings of IEEE/ACM International Conference on Computer-Aided Design ICCAD.
28. Garimella, P., Sung Y.W., Zhang, N., & Rao, S. (2007). Characterizing VLAN usage in an Operational Network. ACM SIGCOMM Workshop on Internet Network Management.
29. Gottlieb, J., Greenberg, A., Rexford, J., & Wang, J. (2003). Automated provisioning of BGP customers. IEEE Network Magazine.
30. Graphviz. http://www.graphviz.org/
31. Griffin, T.G., & Wilfong, G. (2002). On the correctness of IBGP configuration. Proceedings of ACM SIGCOMM.
32. Guttman, J. (1997). Filtering postures: local enforcement for global policies. Proceedings of the 1997 IEEE Symposium on Security and Privacy.
33. Halabi, B. (1997). Internet routing architectures. Indianapolis, IN: New Riders Publishing.
34. Hamed, H., Al-Shaer, E., & Marrero, W. (2005). Modeling and verification of IPSec and VPN security policies. Proceedings of IEEE International Conference on Network Protocols.
35. Homer, J., & Ou, X. (2009). SAT-solving approaches to context-aware enterprise network security management. IEEE JSAC [Special Issue on Network Infrastructure Configuration].
36. Huitema, C. (1999). Routing in the Internet. Upper Saddle River, NJ: Prentice Hall.
37. Internet Routing Registry Toolset Project, from https://www.isc.org/software/IRRtoolset
38. IP Assure. Telcordia Technologies, Inc., from http://www.telcordia.com/products/ip-assure/
39. Jackson, D. (2006). Software abstractions: Logic, language, and analysis. Cambridge, MA: MIT Press.
40. Juniper Networks. (2008). What is behind network downtime? Proactive steps to reduce human error and improve availability of networks, from http://www.juniper.net/solutions/literature/white papers/200249.pdf
41. Kodkod, from http://web.mit.edu/emina/www/kodkod.html
42. Lampson, B. (2000). Computer security in real world. Annual computer security applications conference, from http://research.microsoft.com/en-us/um/people/blampson/64securityinrealworld/acrobat.pdf
43. Leroy, X., Doligez, D., Garrigue, J., Rémy, D., & Vouillon, J. (2007). The objective caml system, release 3.10, documentation and user’s manual.
44. Li, T., Cole, B., Morton, P., & Li, D. (1998). Cisco Hot Standby Router Protocol. RFC 2281.
45. Lobo, J., & Pappas, V. (2008). C2: The case for network configuration checking language. Proceedings of IEEE Workshop on Policies for Distributed Systems and Networks.
46. Mahajan, Y., Fu, Z., & Malik, S. (2004). Zchaff2004, An Efficient SAT Solver. Proceedings of 7th International Conference on Theory and Applications of Satisfiability Testing.
47. Mandelbaum, Y., Fisher, K., Walker, D., Fernandez, M., & Gleyzer, A. (2007). PADS/ML: A functional data description language. ACM Symposium on Principles of Programming Language.
48. McMillan, K. (1992). Symbolic model checking. Doctoral dissertation, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA.
49. Narain, S. (2005). Network configuration management via model-finding. Proceedings of USENIX Large Installation System Administration (LISA) Conference.
50. Narain, S., Kaul, V., & Parmeswaran, K. (2003). Building autonomic systems via configuration. Proceedings of AMS Autonomic Computing Workshop.
51. Narain, S., Levin, G., Kaul, V., & Malik, S. (2008). Declarative infrastructure configuration synthesis and debugging. In E. Al-Shaer, C. Kalmanek, F. Wu (Eds), Journal of Network Systems and Management [Special issue on Security Configuration].
52. Nordstrom, O., & Dovrolis, C. (2004). Beware of BGP attacks. ACM SIGCOMM Computer Communications Review, 34(Suppl. 2), 1–8.
53. O’Keefe, R. (1990). The craft of prolog. Reading, MA: Addison Wesley.
54. OPNET IT Sentinel, from http://www.opnet.com/solutions/network planning operations/it sentinel.html
55. Ou, X., Boyer, W., & McQueen, M. (2006). A scalable approach to attack graph generation. 13th ACM Conference on Computer and Communications Security (CCS).
56. Ou, X., Govindavajhala, S., & Appel, A. (2005). MulVAL: A logic-based network security analyzer. 14th USENIX Security Symposium, Baltimore, MD.
57. Pappas, V., Wessels, D., Massey, D., Terzis, A., Lu, S., & Zhang, L. (2009). Impact of configuration errors on DNS robustness. IEEE Journal on Selected Areas in Communication, 27(Suppl. 1), 275–290.
58. Qie, X., & Narain, S. (2003). Using service grammar to diagnose configuration errors in BGP-4. Proceedings of USENIX Systems Administrators Conference.
59. Rekhter, Y., Li, T., & Hares, S. (2006). A Border Gateway Protocol 4 (BGP-4), RFC 4271.
60. Rosen, E., Viswanathan, A., & Callon, R. (2001). Multiprotocol Label Switching Architecture. RFC 3031.
61. Schwartz, J. (2007). Who Needs Hackers? New York Times http://www.nytimes.com/2007/09/12/technology/techspecial/12threat.html
62. Securing Cyberspace for the 44th Presidency. (2008). CSIS Commission On Cybersecurity.
63. Sedgewick, R. (2003). Algorithms in Java. Reading, MA: Addison Wesley.
64. Stewart, J. (1999). BGP4: Inter-Domain Routing in the Internet. Reading, MA: Addison-Wesley.
65. Sung, E.Y., Rao, S., Xie, G., & Maltz, D. (2008). Towards systematic design of enterprise networks. ACM CoNEXT Conference.
66. SWI-Prolog Semantic Web Library, from http://www.swi-prolog.org/pldoc/package/semweb.html
67. SWI-Prolog, from http://www.swi-prolog.org/
68. TCP Problems with Path MTU discovery. RFC 2923.
69. Torlak, E., & Jackson, D. (2007). Kodkod: A Relational Model Finder. Tools and Algorithms for Construction and Analysis of Systems (TACAS ’07).
70. Tufte, E. (2001). The visual display of quantitative information. Cheshire, CT: Graphics Press.
71. Voellmy, A., & Hudak, P. Nettle: A domain-specific language for routing configuration, from http://www.haskell.org/YaleHaskellGroupWiki/Nettle
72. Xie, G., Zhan, J., Maltz, D., Zhang, H., Greenberg, A., Hjalmtysson, G., et al. (2005). On static reachability analysis of IP networks. IEEE INFOCOM.
73. ZChaff, from http://www.princeton.edu/chaff/
74. Zhang-Shen, R., Wang, Y., & Rexford, J. (2008). Atomic routing theory: Making an AS route like a single node. Princeton University Computer Science technical report TR-827-08.
Part V Network Measurement

Chapter 10 Measurements of Data Plane Reliability and Performance

Nick Duffield and Al Morton

N. Duffield, AT&T Labs, 180 Park Avenue, Florham Park, NJ 07901, USA; e-mail: duffield@research.att.com
Al Morton, AT&T Labs, 200 S Laurel Ave, Middletown, NJ 07748, USA; e-mail: acmorton@att.com

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_10, © Springer-Verlag London Limited 2010

10.1 Introduction

10.1.1 Service Without Measurement: A Brief History

Measurement was not a priority in the original design of the Internet, principally because it was not needed in order to provide Best Effort service, and because the institutions using the Internet were also the providers of this network. A technical strength of the Internet has been that endpoints have not needed visibility into the details of the underlying network that connects them in order to transmit traffic between one another. Rather, the functionality required for data to reach one host from another is separated into layers that interact through standardized interfaces. The transport layer provides a host with the appearance of a conduit through which traffic is transferred to another host; lower layers deal with routing the traffic through the network, and the actual transmission of the data over physical links. The Best Effort service model offers no hard performance guarantees to which conformance needs to be measured. Basic robustness of connectivity – the detection of link failures and rerouting traffic around them – was a task of the network layer, and so need not concern the endpoints.

The situation described above has changed over the intervening years; the complexity of networks, traffic, and the protocols that mediate them, the separation of network users from network providers, coupled with customer needs for service guarantees beyond Best Effort now require detailed traffic measurements to manage and engineer traffic, to verify that performance meets required goals, and to diagnose performance degradations when they occur. In the absence of detailed network monitoring capabilities integrated with the network, many researchers, developers, and vendors jumped into the void to provide solutions. As measurement methodologies become increasingly mature, the challenge for service providers becomes how to deploy and manage measurement infrastructure scalably. Indeed, to meet this need, sophisticated measurement capabilities are increasingly being found on network routers. Furthermore, all parties concerned with the provenance and interpretation of measurements – vendors of measurement systems, software and services, service providers and enterprises, network users and customers – need a consistent way to specify how measurements are to be conducted, collected, transmitted, and interpreted. Many of these aspects for both passive and active measurement are now codified by standards bodies.

We continue this introduction by briefly setting out the types of passive and active measurements that are the subject of this chapter, then previewing the broader challenges that face service providers in realizing them in their networks.

10.1.2 Passive and Active Measurement Methods

This chapter is concerned with two forms of dataplane measurement: passive and active measurements. These two types of measurement have generally focused on different aspects of network behavior, support different applications, and are accomplished by different technical means.

Passive measurement comprises recording information concerning traffic as it passes observation points in the network. We consider three categories of passive measurement:

– Link utilization statistics as provided by router interface counters; these are retrieved from a managed device by a network management station using the SNMP protocol.
– Flow-level measurements comprising summaries of flows of packets with common network and transport header properties. These are commonly compiled by routers, then exported to a collector for storage and analysis. These statistics enable detailed breakdown of traffic volumes according to network and transport header fields, e.g., IP addresses and TCP/UDP ports.
– Inspection of packet payloads in order to provide application-level flow measurements, or to support other payload-dependent applications such as network security and troubleshooting.

In active measurement, probe traffic is inserted into the network, and the probe traffic, or the response of the network to it, is subsequently measured. Comparing the probe and response traffic provides a measure of network performance, as experienced by the probes. Active probing has been conducted by standalone tools such as ping and traceroute [53] that utilize or coerce IP protocols for measurement functionality. These and other methods are used for active measurement between hosts in special purpose measurement infrastructures, or between network routers, or from these to other endpoints such as application or other servers.

Although the correspondence between methods and applications – passive measurement for traffic analysis and active measurement for performance – has been the norm, it is not firm: passive measurement is used to observe probe packets, and there are purely passive approaches to performance measurement.

10.1.3 Challenges for Measurement Infrastructure and Applications

We now describe challenges facing design and deployment of active and passive measurement infrastructure by service providers and enterprises. As we discuss passive and active measurement methodologies in the following sections, we shall discuss their strengths and weaknesses in meeting these challenges.
As one would expect, weaknesses in some of the more mature methods that we discuss have often provided the motivation for subsequent methods.

Speed: Increasingly fast line rates challenge the ability of routers to perform complex per-packet processing, including updating flow statistics and packet content inspection.

Scale: The product of network speed and the large number of devices producing measurements gives rise to an immense amount of measurement data (e.g., flow statistics). In addition to consuming resources at the observation points, these data require transmission, storage, and processing in the measurement infrastructure and back-end systems.

Granularity: Service providers and their customers increasingly require a detailed picture of network usage and performance. This is both to support individualized routine reporting, and also to support detailed retrospective studies of network behavior. These requirements reduce the utility of aggregate usage measurements, such as link-level counters, and simple performance measurement tools, such as ping and traceroute.

Scope: For passive measurement: not all routers support granular measurement functionality, e.g., reporting flow statistics; or the functionality may not be enabled due to resource constraints at the observation point or in the measurement collection infrastructure. When measurements are performed, information about protocol layers below IP (such as MPLS), or optical layer attributes (such as the physical link of an IP composite link) may be incompletely reported or even absent. Information above the network layer may be hidden as a result of endpoint encryption. For active measurement: not all network paths or links may be directly measured because of cost or other limitations in the deployment of active measurement hosts.

Timeliness: Measurement applications increasingly require short temporal granularity of measurements, either because it is desirable to measure events of short duration, such as traffic microbursts and sub-second timescale routing events, or because the reporting latency must be short, e.g., in real-time anomaly detection for security applications. The concomitant increase in measurement reporting or polling frequency increases load on measurement devices and increases the number of measurement data points.

Accuracy: In passive measurement, reduction of data volumes through sampling, in order to meet the challenges of speed and scale, introduces statistical uncertainty into measurements. In active measurement, bandwidth and scale constraints place a limit on active probing frequency, and hence measurement accuracy is inherently dependent on the duration of the measurement period.

Management: There are several challenges for the management and administration of measurement infrastructure.
– Reliability: Measurement infrastructure components are subject to failure or outage, resulting in loss or corruption of measurements. The effects of component failure can be mitigated (i) at the infrastructure level (providing redundant capacity with fast detection of failure resulting in failover to backup subsystems), (ii) by employing reporting paradigms (e.g., sequence numbers) that facilitate automated checking, flagging, or workarounds for missing data, and (iii) by reporting measurement uncertainty due to missing data or sampling to the consumer of the measurements.
– Correlation: Measurement applications may require correlation of measurements generated by different measurement subsystems, for example, passive and active traffic measurements, logs from application servers, and authentication, authorization, and accounting subsystems.
A common case is when measurements are to be attributed to an entity such as an end host, but the mapping between measurement identifier (such as source IP address) and entity is dynamic (e.g., DHCP address assignments). Correlation of multiple data sets presents challenges for data management, e.g., due to data size, diverse provenance, physical locations, and access policies. The measurement infrastructure must facilitate correlation by measures including the synchronization of timestamps set by different measurement subsystems.
– Consistency: The methodologies, reporting, and interpretation of measurements must be consistent across different equipment and network management software vendors, service providers, and their customers.

In this chapter, Sections 10.2–10.6 cover passive measurement, including link-level aggregates, flow measurement, sampling, packet selection, and deep packet inspection (DPI). Sections 10.7–10.10 cover active measurements, including standardization of performance metrics, service level agreements, and deployment issues for measurement infrastructures. We conclude with an outlook on future challenges in Section 10.11. We shall make use of and refer to other chapters in this book that deal with specific applications of measurements, principally Chapter 5 on Network Planning and Chapter 13 on Network Security.

10.2 Passive Traffic Measurement

As previewed in Section 10.1.2, we consider three broad types of passive measurement: link statistics, flow measurements, and DPI. We cover methods that are currently employed in provider networks, and also describe some newer approaches that have been proposed or may be deployed in the medium term. We now motivate and outline in more detail the material on passive measurement.
Section 10.3 describes SNMP measurements, or, more precisely, interface packet counters maintained in a router’s Management Information Base (MIB) that are retrieved using the Simple Network Management Protocol (SNMP). The remote monitoring capabilities supported by the RMON MIB are also discussed. SNMP measurements provide an undifferentiated view of traffic on a link. By contrast, measurement applications often need to classify traffic according to the values occurring in protocol header fields that occur at different levels of the protocol stack. They must determine the aggregate traffic volumes attributable to each such value, for example, to each combination of the network layer IP addresses and transport layer TCP/UDP ports. This information, and that relating to encapsulating protocols such as MPLS, has come to be known as “packet header” information. This is contrasted with “packet payload” or “packet content” information, which includes higher layer application and protocol information. This information may be spread across multiple network level packets. The major development in passive traffic measurement over the last roughly 20 years, that serves these needs, has been traffic flow measurement. Traffic flows are sets of packets with common network/transport header values observed locally in time. Routers commonly compile summary statistics of flows (total packets, bytes, timing information) and report them, together with the common header values and some associated router state – but without any payload information – in a flow record that is exported to a collector. Cisco’s NetFlow is the prime example. Flow records provide a relatively detailed representation of network traffic that supports many applications. 
Several of these are covered in detail in other chapters of this book: generation of traffic matrices and their use in network planning is described in Chapter 5; analysis of traffic patterns and anomalies for network security is described in Chapter 13. Related applications are the routine reporting of traffic matrices and trending of traffic volumes and application mix for customers and for service providers’ network and business development organizations (see, e.g., [5]).

Section 10.4 describes traffic flow measurement: the operational formation of flow statistics; protocols for the standardization of flow measurement; flow measurement collection infrastructure; the use of sampling, of both packets and flow records themselves, to meet the challenges of speed and scale, and its impact on measurement accuracy; and some recent proposals for traffic flow measurement and aggregation. It concludes with some applications of flow measurements.

Uniform packet sampling is one member of a more general class of packet selection primitives that also includes filtering and more general sampling operations. In Section 10.5, we describe standardization of packet selection operations, their realization in routers, and applications of combined selection primitives for network management. We describe in detail the hash-based selection primitive, which allows for consistent selection of the same packet at different observation points, and discuss new measurement applications that this enables.

Packet header-based flow measurements provide little visibility into properties of the packet payload, and network- and transport-level packet headers provide only a partial indication of traffic properties for the purposes of application characterization, security monitoring and attack mitigation, and software and protocol debugging.
Section 10.6 reviews technologies for DPI of packet payload beyond the network- and transport-level headers, and shows how it serves these applications.

10.3 SNMP, MIBs, and RMON

In this section, we discuss traffic statistics that are maintained within routers and the methods and protocols for their recovery. A comprehensive treatment of these protocols and their realization can be found in [25].

10.3.1 Router Measurement Databases: MIBs

A MIB is a type of hierarchical database maintained by devices such as routers. MIBs have been defined by equipment vendors and standardized by the IETF. Currently, over 10,000 MIBs are defined. The MIB most relevant for traffic measurement purposes is MIB-II [60], which maintains counters for the total bytes and numbers of unicast and multicast packets received on an interface, along with discarded and errored packets. The Interface-MIB [59] further provides counts of multicast packets per multicast address. Protocol-specific MIBs, e.g., for MPLS [76], also provide counts of inbound and outbound packets per interface that use those protocols.

10.3.2 Retrieval of Measurements: SNMP

SNMP [77] is the Internet protocol used to manage MIBs. An SNMP agent in the managed device is used to access the MIB and communicate object values to or from a network management station. SNMP has a small number of basic command types. Read commands are used to retrieve objects from the MIB. Write commands are used to write object values to the MIB. Notify commands are used to set conditions under which the managed device will autonomously generate a report. The most recent version of SNMP, SNMPv3, offers security functionality, including encryption and authentication, that was weaker or absent in earlier versions. For traffic measurement applications, the MIB interface-level packet and byte counters are retrieved by periodic SNMP polling from the management station; a polling interval of 5 min is common.
The total packets and bytes transmitted between successive polls are then obtained by subtraction.

10.3.3 Remote Monitoring: RMON

The RMON MIB [81] supports a more detailed capability for remote monitoring than MIB-II, enabling aggregation and notification over relatively complex events, e.g., those involving multiple packets. The original focus of RMON was in remote monitoring of LANs; resource limitations make RMON generally unsuitable for monitoring high rate packet streams in the WAN context, e.g., to supply greater detail than presented by SNMP/MIB-II measurements. Indeed, the limitations of RMON motivate the alternate flow and packet measurement paradigm in which samples or aggregates of packet header information are exported from the router to a collector which supports reporting, analysis, and alarming functionality, rather than the router performing these functions itself. We explore this paradigm in more detail in the following sections.

10.3.4 Properties and Applications of SNMP/MIB

We now review how SNMP/MIB measurements align with the general measurement challenges described in Section 10.1.3.

Scope: The major strength of SNMP measurements is their ubiquitous availability from router MIBs.

Scale: From the data management point of view, SNMP statistics have the advantage of being relatively compact, routinely comprising fixed-length data collected per interface at each polling instant, commonly every 5 min.

Granularity: The main limitation of SNMP measurements is that they maintain packet and byte counters per interface only.

Timeliness: The externally chosen and relatively infrequent polling times for SNMP measurements limit their utility for real-time or event-driven measurement applications.

Historically, SNMP measurements have been a powerful tool in the management of networks with undifferentiated service classes.
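As a concrete illustration of the polling arithmetic in Section 10.3.2, the following minimal sketch differences two successive readings of an interface byte counter and converts the result to a link utilization. The function names and the example figures are hypothetical. The handling of counter wraparound is our addition and is not discussed in the text: it assumes fixed-width counters (32 or 64 bits) that wrap at most once between polls.

```python
def counter_delta(prev: int, curr: int, bits: int = 32) -> int:
    """Increase of a fixed-width, wrapping counter between two polls.

    Assumes (our assumption, not from the text) that the counter wrapped
    at most once between the two readings.
    """
    modulus = 1 << bits
    return (curr - prev) % modulus


def utilization(prev_octets: int, curr_octets: int, interval_s: float,
                link_bps: float, bits: int = 32) -> float:
    """Average link utilization (0..1) over one polling interval."""
    delta_bits = 8 * counter_delta(prev_octets, curr_octets, bits)
    return delta_bits / (interval_s * link_bps)


if __name__ == "__main__":
    # A 32-bit counter that wrapped once between two polls.
    print(counter_delta(4_294_967_000, 500))
    # A 10 Gb/s link polled every 300 s using a 64-bit counter.
    print(utilization(0, 187_500_000_000, 300, 10_000_000_000, bits=64))
```

With a 5 min polling interval, a 32-bit byte counter on a fast link can wrap more than once between polls, in which case the difference is ambiguous; this is one practical reason to prefer wider counters or shorter intervals where they are available.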
SNMP statistics have been used to trend link utilization, and network administrators have used these trends to plan and prioritize link deployment and upgrades, on the basis of heuristics that relate link utilization to acceptable levels of performance. Active performance measurements using the ping and traceroute tools can also inform these decisions.

Although SNMP measurements do not directly report any constituent details within link aggregates, network topology and routing in practice constrain the set of possible edge-to-edge traffic flows that can give rise to the collection of measured traffic rates over all network links. This leads to the formulation of an inverse problem to recover the edge-to-edge traffic matrices from the link aggregates. A number of approaches have been proposed and some are sufficiently accurate to be of operational use; for further detail see Chapter 5. Knowledge of the traffic matrices provides powerful new information beyond simple trending, because it allows the prediction of link utilization under different scenarios for routing, topology, and spatially heterogeneous changes in demand.

10.4 Traffic Flow Measurement

This section describes traffic flow measurement: the operational formation of flow statistics; protocols for the standardization of flow measurement; flow measurement collection infrastructure; the use of sampling, of both packets and flow records themselves, to meet the challenges of speed and scale, and its impact on measurement accuracy; and some recent proposals for traffic flow measurement and aggregation. It concludes with some applications of flow measurements.

10.4.1 Flows and Flow Records

10.4.1.1 Flow and Flow Keys

A flow of traffic is a set of packets with a common property, known as the flow key, observed within a period of time. A set of interleaved flows is depicted in Fig. 10.1.
Many routers construct and export summary statistics on flows of packets that pass through them. A flow record can be thought of as summarizing a set of packets arising in the network through some higher-level transaction, e.g., a remote terminal session, or a web-page download. In practice, the set of packets that are included in a flow depends on the algorithm used by the router to assign packets to flows. The flow key is usually specified by fields from the packet header, such as the IP source and destination address and TCP/UDP port numbers, and may also include information from the packet’s treatment at the observation point, such as router interface(s) traversed. Flows in which the key is specified by individual values of these fields are often called raw flows, as opposed to aggregate flows in which the key is specified by a range of these quantities. As we discuss further in Section 10.4.3.2, routers commonly create flow records from a sampled substream of packets.

Fig. 10.1 Flows of observed packets, key indicated by shading

10.4.1.2 Operational Construction of Flow Records

Flow statistics are created as follows. A router maintains a cache comprising entries for each active flow, i.e., those flows currently under measurement. Each entry includes the key and summary statistics for the flow such as total packets and bytes, and times of observation of the first and last packets. When the router observes a packet, it performs a cache lookup on the key to determine if the corresponding flow is active. If not, it instantiates a new entry for that key. The flow statistics are then updated accordingly. A router terminates the recording of a flow according to criteria described below; then the flow’s statistics are exported in a flow record, and the associated cache memory released for use by new flows.
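The cache operation just described can be sketched as follows. This is a simplified illustration, not a description of any vendor's implementation: the class and field names are hypothetical, the timeout values are arbitrary, and termination is restricted to an inactive (interpacket) timeout and an active timeout. Expiry is checked on every packet arrival purely for brevity; a real device would more plausibly scan its cache in the background.

```python
from dataclasses import dataclass


@dataclass
class FlowEntry:
    key: tuple          # e.g., (src IP, dst IP, src port, dst port, protocol)
    packet_count: int
    byte_count: int
    first_seen: float   # time of observation of the first packet
    last_seen: float    # time of observation of the last packet


class FlowCache:
    """Toy flow cache with inactive and active timeouts (values assumed)."""

    def __init__(self, inactive_timeout: float = 15.0,
                 active_timeout: float = 1800.0):
        self.inactive_timeout = inactive_timeout
        self.active_timeout = active_timeout
        self.entries = {}    # active flows, indexed by flow key
        self.exported = []   # stands in for export of flow records

    def expire(self, now: float) -> None:
        """Terminate flows that have timed out and export their records."""
        for key, entry in list(self.entries.items()):
            idle = now - entry.last_seen > self.inactive_timeout
            stale = now - entry.first_seen > self.active_timeout
            if idle or stale:
                self.exported.append(self.entries.pop(key))

    def observe(self, key: tuple, size: int, now: float) -> None:
        """Look up the packet's key; create or update the flow entry."""
        self.expire(now)
        entry = self.entries.get(key)
        if entry is None:
            self.entries[key] = FlowEntry(key, 1, size, now, now)
        else:
            entry.packet_count += 1
            entry.byte_count += size
            entry.last_seen = now


if __name__ == "__main__":
    cache = FlowCache(inactive_timeout=10.0, active_timeout=100.0)
    key = ("10.0.0.1", "10.0.0.2", 1234, 80, "tcp")
    cache.observe(key, 100, now=0.0)
    cache.observe(key, 200, now=5.0)
    cache.observe(("10.0.0.3", "10.0.0.2", 999, 53, "udp"), 50, now=20.0)
    print(cache.exported)   # the first flow, exported after 15 s of inactivity
```

Indexing the cache by the full key tuple corresponds to raw flows; keying on a coarser quantity, such as an address prefix, would give aggregate flows.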
Flow termination criteria include: (i) inactive flow or interpacket timeout: the time since the last packet observed for the flow exceeds some threshold; (ii) protocol-level information, e.g., a TCP FIN packet that terminates a TCP connection; (iii) memory management: termination to release memory for new flows; and (iv) active flow timeout: to prevent data staleness, flows are terminated after a given elapsed time since the arrival of the first packet of the flow.

The summary information in the flow record may include, as well as the flow key and summary statistics of packet timing and size, other information relating to the packet treatment in the router, such as interfaces traversed, next hop router, and routing state information. Additionally, lower layer protocol information from the packet header may be included. For example, Cisco’s NetFlow has a partial ability to report the MPLS label stack: it can report up to three labels from the MPLS label stack, with position in stack configurable. NetFlow can in some cases report the loopback address of certain tunnel endpoints.

10.4.1.3 Commercial and Standardized Flow Reporting

The idea of modeling traffic as packets grouped by a common property seems first to have appeared in [54]; the idea was taken up in support of Internet accounting in [62], and systematized as a general measurement methodology in [22]. Early standardization efforts within the Real Time Flow Measurement working group of the Internet Engineering Task Force (IETF) have now been supplanted by the work of the IP Flow Information eXport working group (IPFIX) [49]. In practice flow measurement has become largely identified with Cisco’s NetFlow [18] due to (i) the large installed base; (ii) its emulation in other vendors’ products; and (iii) its effective standardization by the use of NetFlow version 9 [23] as the starting point for the IPFIX protocol.
NetFlow v9 allows administrators to define and configure flow keys, aggregation schemes, and the information reported in flow records. An alternative reporting paradigm is provided by sFlow [71], in which header-level information from a subset of sampled packets is exported directly without aggregating information from packets bearing the same key. sFlow reports include a position count of the sampled packet within the original traffic stream; this facilitates estimating traffic rates.

10.4.2 Flow Measurement Infrastructure

10.4.2.1 Generation and Export of Flow Records

Cisco originated NetFlow as a by-product of IP route caching [17], but it has subsequently evolved as a measurement and reporting subsystem in its own right. Other router vendors now support the compilation of flow statistics, e.g., Juniper’s JFlow [55], with the flow information being exported using the NetFlow version 9 format or according to the IPFIX standard. Note that implementation differences may lead to different information being reported across different routers. Standalone monitoring devices as discussed in Section 10.6.2 may also compile and export flow records.

Cisco Flexible NetFlow [14] provides the ability to instantiate and separately configure multiple flow compilers that operate concurrently. This allows a single router to serve different measurement applications that may have different requirements: traffic can be selected by first filtering on header fields; parameters such as sampling granularity, spatial and temporal aggregation granularity, reporting detail and frequency, and collector destination can be specified for each instantiation. We discuss packet selection operations more generally in Section 10.5.

10.4.2.2 Collection and Mediation of Flow Records

Flow records are exported from the observation point, either directly to a collector, or through a mediation device.
NetFlow collection systems are available commercially [15] or as freeware [10], either in a basic form that receives and writes flow records to storage, or as part of a larger traffic analysis system to support network management functions [5, 69], or focused on specific applications such as security [68]. Although export of flow records may take place directly to the ultimate collector, there are two architectural reasons that favor inserting mediation devices in the export path: scalability and reliability.

The primary reason is scalability. Even with the compression of information that summarizes a set of packets in a fixed-length flow record, the volumes of flow records produced by large-scale network infrastructure are enormous. As a rough example, a network comprising 100 links of 10 Gb/s each that are 50% loaded in each direction, and in which each flow traverses ten routers, each of which compiles flow statistics after packet sampling at a rate of 1 in several hundred (see Section 10.4.3.2), would produce 1 Gb/s of flow records, i.e., roughly 10 terabytes per day.

A secondary reason for using mediation boxes has been transmission reliability. Until recently, NetFlow has exclusively used UDP for export, in part to avoid the need to buffer flow records at the exporter, as would be required by a reliable transport protocol. But the use of UDP exposes flow records to potential loss in transit, particularly over long WAN paths. Due to skew in flow length distributions (see Section 10.4.3.3), uncontrolled loss of the records of long flows could severely reduce measurement accuracy.

Fig. 10.2 Flow measurement collection infrastructure: hardware elements, their resources, and sampling and aggregation operations that act on the measurements

Mediation devices can address these issues and provide additional benefits:

Data Reduction: By aggregating and sampling flow records, then exporting the reduced data to a central collector.

Reliable Staging: The mediator can receive flow records over a LAN with controlled loss characteristics, then export flow records (or samples or aggregates) to the ultimate collector using a reliable transport protocol such as TCP. NetFlow v9 and the IPFIX protocol both support SCTP [78] for export, which gives administrators flexibility to select a desired trade-off between reliability and buffer resource usage at the exporter.

Distributed Query: The mediation devices may also support queries on the flow records that traverse them, and thus together constitute a distributed query system.

Selective Export: Multiple streams of flow records selected according to specified criteria may be exported to collectors serving different applications.

An example of such an architecture is illustrated in Fig. 10.2; see also [39]. In each of a number of geographically distributed router centers, a mediation device receives flow records from its colocated routers; aggregates and samples are then exported to the ultimate collector. Protocols for flow mediators are currently under standardization in the IPFIX working group of the IETF [49].

10.4.2.3 Collection and Warehousing of Flow Records

The final component of the collection infrastructure is the repository that serves to receive and store the flow records, and to serve as a database for reporting and query functions. Concerning the attributes of a data store:

Capacity: Must be extensive; even with packet and flow sampling, a large service provider network may generate many GB of flow records per day.

Database Management System: Must be well matched to the challenges of large datasets, including rapid ingestion and indexing, managing large tables, a high-level query language to support complex queries, transaction logging, and data recovery. The Daytona DBMS is an example of such a system in current use; see [44].

Data Sources: Interpretation of flow data typically requires joining with other datasets, which should also be present in the management system, including but not limited to, topology and configuration data, control plane measurements (see Chapter 11 for a description of routing state monitoring), MIB variables acquired by SNMP polling, network element logs from authentication, authorization, and accounting servers, and logs from DHCP and other network servers.

Data Quality: Data may be corrupt or missing due to failures in the collection and reporting systems. The complexity and volume of measured data necessitate automated mechanisms to detect, mark, and mitigate unclean data; see, e.g., [30].

Data Security and Customer Privacy: Flow measurements and other data listed should be considered as sensitive customer information. Service provider policies must specify practices to maintain the integrity of the data, including controlled and auditable access restricted to individuals needing to work with the data, encryption, anonymization, and data retention policies.

10.4.3 Sampling in Flow Measurement and Collection

10.4.3.1 Sampling as a Data Reduction Method

In the previous sections, we have touched on the fact that the speed of communications links provides a challenge for the formation of flow records at the router, and both speed and the scale of networks (the large number of interfaces that can produce flow records) provide a challenge for the collection and storage of flow records. Figure 10.2 illustrates the relevant resources at the router, mediator, and collector. To meet these challenges, data reduction must be performed.
The reduction method must be well matched to the uses to which the reduced data is put. Three reduction methods are usually considered:

Aggregation: Summarizing measurements that share common properties. In the context of traffic flow measurement, header-level information on packets with the same key is aggregated into flows. Subsequent aggregation of flow records into predefined aggregates (e.g., aggregate traffic to each routing prefix) is a powerful tool for routine reporting.

Filtering: Selection of a subset of measurements that matches a specified criterion. Filtering is useful for drill down (e.g., to a traffic subset of interest).

Sampling: Selection of data points according to some nondeterministic criterion.

A limitation of aggregation and filtering as general data reduction methods is the manner in which they lose visibility into the data: traffic not matching a filter is discarded; detail within an aggregate is lost (while flow records aggregate packets over time, they need not aggregate spatially, i.e., over packet header values). Of the three methods, only sampling retains the spatial granularity of the original data, and thus retains the ability to support arbitrary aggregations of the data, including those formulated after the measurements were made. This is important to support exploratory, forensic, and troubleshooting functions, where the traffic aggregates of interest are typically not known in advance. The downside of sampling is the statistical uncertainty in the resulting measurements; we address this further in Section 10.4.3.4.

We now discuss sampling operations used during the construction and recovery of flow measurements. As illustrated in Fig. 10.2, packet sampling (see Section 10.4.3.2) is used in routers in order to reduce the rate of the stream of packet header information from which flow records are aggregated.
The complete flow records are then subjected to further sampling (see Section 10.4.3.3) and aggregation within the collection infrastructure, at the mediator to reduce data volumes, or in the collector, for example, dynamically sampling from a flow record database in order to reduce query execution times, or permanently in order to select a representative set of flow records (or their aggregates) for archiving. We discuss the ramifications of sampling for measurement accuracy in Section 10.4.3.4, and some more recent developments in stateful sampling and aggregation that straddle the packet and flow levels in Section 10.4.3.5. Finally, we look ahead to Section 10.5, which sets random packet sampling in the broader context of packet selection operations and their applications, including filtering, both in the sense understood above, and also consistent packet selection as exemplified by hash-based sampling.

10.4.3.2 Random Packet Sampled Flows

The main resource constraint for forming flow records is at the router flow cache in which the keys of active flows are maintained. To look up packet keys at the full line rate of the router interfaces would require the cache to operate in fast, expensive memory (SRAM). Moreover, routers carry increasingly large numbers of flows concurrently, necessitating a large cache. By sampling the packet stream in advance of the construction of flow records, the cache lookup rate is reduced, enabling the cache to be implemented in slower, less expensive memory (DRAM).

A number of different sampling methods are available. Cisco’s Sampled NetFlow samples every Nth packet systematically, where N is a configurable parameter. The Random Sampled NetFlow feature [21] employs stratified sampling based on arrival count: one packet is selected at random out of every window of N consecutive arrivals.
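The two selection rules can be sketched side by side. The code below is an illustration of the rules as described in the text, not of Cisco's implementation; the function names are hypothetical, and the decision to skip a trailing window of fewer than N packets is our simplification.

```python
import random


def systematic_sample(packets, n):
    """Select every n-th packet by arrival count (systematic sampling)."""
    return [p for i, p in enumerate(packets, start=1) if i % n == 0]


def stratified_sample(packets, n, rng=None):
    """Select one packet uniformly at random from each window of n arrivals.

    A trailing partial window is skipped (a simplifying assumption).
    """
    rng = rng or random.Random()
    selected = []
    for start in range(0, len(packets), n):
        window = packets[start:start + n]
        if len(window) == n:
            selected.append(rng.choice(window))
    return selected


if __name__ == "__main__":
    arrivals = list(range(1, 101))   # packets labeled by arrival order
    print(systematic_sample(arrivals, 10))
    print(stratified_sample(arrivals, 10, random.Random(0)))
```

Both functions return one packet per N arrivals. In the stratified variant, the last packet of one window and the first packet of the next can both be chosen, so consecutive packets may be selected; the systematic variant never does this for N greater than 1.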
Although these two methods have the same average sampling rate, there are higher-order differences in the way multiple packets are sampled; for example, consecutive packets are never selected in Sampled NetFlow, while they can be in Random Sampled NetFlow. However, the effect of such differences on flow statistics is expected to be small except possibly for flows that represent a noticeable proportion (greater than $1/N$) of the load, since the position of a given flow’s packets in the packet arrival order at an interface is otherwise effectively randomized by the remaining traffic. In distinction, Juniper’s J-flow [55] offers the ability to sample runs of consecutive packets.

Sampling and other packet selection methods have been standardized in the PSAMP working group of the IETF [24, 32, 33, 82]. We review these in greater detail in Section 10.5. PSAMP is positioned as a protocol to select packets for reporting at an observation point, with IPFIX as the export protocol. For example, selected packets could be reported on as single packet flow records, using zero active timeout for immediate reporting.

If sampling 1 out of N packets on average, then from a flow with far fewer than N packets, if any packets are sampled, typically only one packet will be sampled. In this case one might just as well sample packets without constructing flow records; this would save resources at the router since there would be no need to cache the single packet flows until expiration of the interpacket timeout. Indeed, there are many short flows: web traffic is a large component of Internet traffic, in which the average flow length is quite short, around 16 packets in one study [42]. However, there are several reasons to expect that longer flows will continue to account for much traffic. First, several prevalent applications and application classes predominantly generate long-lived flows, for example, multimedia downloads and streaming, and VoIP.
Secondly, tunneling protocols such as IPsec [56] may aggregate flows between multiple endpoints into a packet stream in which the endpoint identities are not visible in the network core; from the measurement standpoint, the stream will thus appear as a single longer flow. For these reasons, unless packet sampling periods become comparable with or larger than the number of packets in these flows, flow statistics will still afford useful compression of information.

10.4.3.3 Flow Record Sampling

Sampling flow records presents a challenge because of the highly skewed distribution of flow sizes found in network traffic. Experimental studies have shown that the distribution of flow lengths is heavy tailed; in particular, a large proportion of the total bytes and packets in the traffic stream occur in a small proportion of the flows; see, e.g., [42]. This makes the requirements for flow record sampling fundamentally different from those for packet sampling. Because packets have a bounded size whereas flows do not, uniform sampling and uncontrolled loss in transmission are far more problematic for flow records than for sampled packets, since omission of a single flow record can have a huge impact on measured traffic volumes. This motivates sampling dependent on the size of the flow reported on.

A simple approach would be to discard flow records whose byte size falls below a threshold. This gives a conservative, and hence biased, measure of the total bytes, and is susceptible to subversion: an application or user that splits its traffic up into small flows could evade measurement altogether. This would be a weakness for accounting and security applications.

Smart Sampling can be used to avoid the problems associated with uniform sampling of flow records.
Smart Sampling is designed with the specific aim of achieving the optimal trade-off between the number of flow records actually sampled, and the accuracy of estimates of underlying traffic volumes derived from those samples. In the simplest form of Smart Sampling, called Threshold Sampling [36], each flow record is sampled independently with a probability that depends on the reported flow bytes: all records that report flow bytes greater than a certain threshold $z$ are selected; those below threshold are selected with a probability proportional to the flow bytes. Thus, the probability to sample a flow record representing $x$ bytes is

$$p_z(x) = \min\{1, x/z\}.$$

The desired optimality property described above holds in the following sense. Suppose $X$ bytes are distributed over some number $m$ of flows of sizes $x_1, \ldots, x_m$, so that $X = \sum_{i=1}^{m} x_i$. We consider unbiased estimates $\hat{X}$ of $X$, i.e., $\hat{X}$ is a random quantity whose average value is $X$. Suppose $\hat{X}$ is an unbiased estimate of $X$ obtained from a random selection of a subset of $n < m$ of the original flows, having sizes $x_1, \ldots, x_n$, where selection is independent according to some size-dependent probability $p(x)$. A standard procedure to obtain unbiased estimates is to divide the measured value by the probability that it was sampled [47]. Thus in our case each sampled flow size is normalized by its sampling rate, so that

$$\hat{X} = \sum_{i=1}^{n} \frac{x_i}{p(x_i)}$$

is an unbiased estimate of $X$. We express the optimal trade-off as trying to minimize a total "cost"

$$C_z = z^2\, E[n] + \mathrm{Var}[\hat{X}]$$

that is a linear combination of the average number of samples and the estimation variance, where $z$ is a parameter that expresses the relative importance we attach to making the number of samples small versus making the variance small. For example, when $z$ is large, making $E[n]$ small has a larger effect on reducing $C_z$.
It is proved in [36] that the cost $C_z$ is minimized for any set of flow sizes $x_1, \ldots, x_m$ by using the sampling probabilities $p(x) = p_z(x)$. With the probabilities $p_z$, each selected flow $x_i$ gives rise to an estimate $x_i / p_z(x_i) = \max\{x_i, z\}$.

Although optimal as stated, Threshold Sampling does not control the exact number of samples taken. For example, if the number of flows doubles during a burst, then on average the number of samples also doubles (assuming the same flow size distribution). However, exact control may be required in some applications, e.g., when storage for samples has a fixed size constraint, or for sampling a specified number of representative records for archiving. A variant of Smart Sampling, called Priority Sampling [37], is able to achieve a fixed sample size $k < m$, as follows. Each flow of size $x_i$ is assigned a random priority $w_i = x_i / a_i$, where $a_i$ is a uniformly distributed random number in $(0, 1]$. Then the $k$ flows of highest priority are selected for sampling, and each of them contributes an estimate $\max\{x_i, z_0\}$, where $z_0$ is now a data-dependent threshold set to the $(k+1)$st largest priority. It is shown in [37] that this estimate is unbiased.

Priority Sampling is well suited for back-end database applications serving queries that require estimation of total bytes in an arbitrary selection of flows (e.g., all those in a specific matrix element) over a specified time period. A random priority is generated once for each flow, and the records are stored in descending order of priority. Then an estimate based on $k$ flows proceeds by reading the $k + 1$ flow records of highest priority that match the selection criterion, forming an unbiased estimate as above. Because the flow records are already in priority-sorted order, selection is very fast (see [4]).
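The two Smart Sampling variants described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation from [36] or [37]: the function names, the representation of flow records as a plain list of byte counts, and the explicit `rng` argument are all assumptions made for the example.

```python
import random

def threshold_sample(flow_bytes, z, rng):
    """Threshold Sampling: keep a flow of x bytes with probability min(1, x/z).

    Returns the per-flow estimates x / p_z(x) = max(x, z) of the kept flows;
    their sum is an unbiased estimate of the total bytes.
    """
    estimates = []
    for x in flow_bytes:
        if rng.random() < min(1.0, x / z):
            estimates.append(max(x, z))
    return estimates

def priority_sample(flow_bytes, k, rng):
    """Priority Sampling: keep exactly k flows (or all, if there are at most k).

    Each flow gets priority x / a with a uniform in (0, 1]; the k flows of
    highest priority are kept and each contributes max(x, z0), where z0 is
    the (k+1)st largest priority.
    """
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1].
    prioritized = [(x / (1.0 - rng.random()), x) for x in flow_bytes]
    prioritized.sort(reverse=True)
    if len(prioritized) <= k:
        return [x for _, x in prioritized]  # nothing discarded: sizes are exact
    z0 = prioritized[k][0]                  # (k+1)st largest priority
    return [max(x, z0) for _, x in prioritized[:k]]

if __name__ == "__main__":
    rng = random.Random(0)
    flows = [100] * 50 + [10_000] * 5 + [1_000_000]   # heavy-tailed toy data
    print("true total      :", sum(flows))
    print("threshold (z=5k):", sum(threshold_sample(flows, 5_000, rng)))
    print("priority (k=10) :", sum(priority_sample(flows, 10, rng)))
```

A single run gives a noisy estimate; averaging the summed estimates over many independent runs should approach the true total, which is what unbiasedness means here. Note that the number of records kept by `threshold_sample` varies from run to run, whereas `priority_sample` always keeps `k`.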
10.4.3.4 Estimation and the Statistical Impact of Sampling

Whether sampling packets or flow records, the measured numbers of packets, bytes, or flows must be normalized in order to give an unbiased estimate of the actual traffic from which they were derived; we saw how this was done for Threshold Sampling in Section 10.4.3.3. For 1 in $N$ packet sampling, byte estimates from selected packets are multiplied by $N$.

The use of sampling for measuring traffic raises the question of how accurate estimates of traffic volumes will be. The statistical nature of estimates might be thought to preclude their use for some purposes. However, for many sampling schemes, including those described above, the frequency of estimation errors of a given size can be computed or approximated. This can help answer questions such as "if no packets matching a given key were sampled, then how likely is it that there were $X$ or more bytes in packets with this key that were missed?"

A rough indication of estimation error is the relative standard deviation (RSD), i.e., the standard deviation of the estimator $\hat{X}$ divided by the true value $X$. The RSD for estimating an aggregate of $X$ bytes of traffic using independent 1 in $N$ packet sampling is bounded above by $\sqrt{N x_{\max} / X}$, where $x_{\max}$ is the maximum packet size. For flow sampling with threshold $z$, the RSD is bounded above by $\sqrt{z / X}$. Observe that the RSD decreases as the aggregate size increases. In cases where multiple stages of sampling and aggregation are employed (for example, packet sampled NetFlow followed by Threshold Sampling of flow records), the sampling variance is additive. In this example, the RSD becomes

$$\sqrt{(z + N x_{\max}) / X}.$$

As an example, consider 1 in $N = 1{,}000$ sampling of packets of maximum size $x_{\max} = 1{,}500$ bytes with a flow sampling threshold of $z = 50$ MB. In this case $z \gg N x_{\max} = 1.5$ MB, and so Smart Sampling contributes most of the estimation error.
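The RSD bounds above are simple to evaluate numerically. The following sketch assumes that 1 MB means $10^6$ bytes and uses an illustrative aggregate of $X = 75$ GB (roughly ten minutes of traffic at 1 Gb/s); the function name and its keyword arguments are hypothetical.

```python
import math

def rsd_bound(total_bytes, packet_n=None, max_packet_bytes=None, flow_threshold=None):
    """Upper bound on the relative standard deviation of a byte estimate.

    Sampling variances add, so each enabled stage contributes one term:
    N * x_max for 1-in-N packet sampling, z for flow threshold sampling.
    """
    variance_term = 0.0
    if packet_n is not None:
        variance_term += packet_n * max_packet_bytes
    if flow_threshold is not None:
        variance_term += flow_threshold
    return math.sqrt(variance_term / total_bytes)

if __name__ == "__main__":
    X = 75e9  # assumed aggregate: 1 Gb/s for 600 s = 7.5e10 bytes
    print("packet sampling only:", rsd_bound(X, packet_n=1000, max_packet_bytes=1500))
    print("flow sampling only  :", rsd_bound(X, flow_threshold=50e6))
    print("both stages         :", rsd_bound(X, packet_n=1000, max_packet_bytes=1500,
                                             flow_threshold=50e6))
```

With these assumed inputs the combined bound is about 2.6%, of which flow sampling alone accounts for nearly all (about 2.6% versus about 0.4% for packet sampling alone).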
With these sampling parameters, estimating the 10 min average rate of a 1 Gb/s traffic stream on a backbone would incur a typical relative error of about 3%. In fact, rigorous confidence intervals for the true bytes in terms of the estimated values can be derived (see [26, 79]), including for some cases of multistage sampling.

Using an analysis of the sampling errors, the impact of flow sampling on usage-based charging, and ways to avoid or ameliorate estimation error, are described in [35]. The key idea is that a combination of (i) systematic undercounting of customer traffic by a small amount, and (ii) using sufficiently long billing periods, can reduce the likelihood of over-billing customers to an arbitrarily small probability.

10.4.3.5 Stateful Packet Sampling and Aggregation

The dichotomy between packet sampling on a router and flow sampling in the measurement infrastructure, while architecturally simple, does not necessarily result in the best trade-off between resource usage and measurement accuracy. We briefly review some recent research that proposes maintaining various degrees of router state in order to select and maintain flow records for subsets of packets.

Sample and Hold [41]: All packets arriving at the router whose keys are not currently in the flow cache are subjected to sampling; packets that are selected in this manner have a corresponding flow cache entry created, and all subsequent packets with the same key are selected (subject to timeout). Thus, long flows are preferentially sampled over short flows, since the flow cache tends to be populated only by the longer flows. This achieves similar aims to Smart Sampling but in a purely packet-based solution. While the cache can be made smaller than would be required to measure all flows, a cache lookup is still required for each packet.
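The Sample and Hold behavior just described can be illustrated with a short sketch. It is a simplification under stated assumptions: packets are `(key, size)` tuples, the sampling decision is made per packet with a fixed probability `p`, and flow timeouts are omitted entirely. The function name is hypothetical and the code is not taken from [41].

```python
import random

def sample_and_hold(packets, p, rng):
    """Return {flow key: bytes counted} for flows that entered the cache.

    A packet whose key is not yet cached is sampled with probability p;
    once a key is cached, every later packet with that key is counted.
    """
    cache = {}
    for key, size in packets:          # one cache lookup per packet
        if key in cache:
            cache[key] += size         # held: count all subsequent packets
        elif rng.random() < p:
            cache[key] = size          # sampled: create the cache entry
    return cache

if __name__ == "__main__":
    rng = random.Random(0)
    # One long flow of 1,000 packets mixed with 100 single-packet flows.
    packets = [("long", 100)] * 1000 + [(f"short-{i}", 100) for i in range(100)]
    rng.shuffle(packets)
    cache = sample_and_hold(packets, p=0.05, rng=rng)
    print("flows cached:", len(cache), "of 101")
    print("bytes counted for the long flow:", cache.get("long"), "of 100000")
```

The long flow is almost certain to enter the cache early and then has most of its bytes counted, while most single-packet flows never create an entry. The packets of a flow that arrive before its entry is created are missed, so the raw counts underestimate the true flow size.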
Adaptive Sampling Methods: Both NetFlow and Sample and Hold can be made adaptive by adjusting their underlying sampling rate and flow termination criteria in response to resource usage, e.g., to control cache occupancy and flow record export rate. Now recall from Section 10.4.3.3 that construction of unbiased estimators required normalization of sampled byte and packet counts by dividing by the sampling rate. Adjustment of the sampling rate requires matching renormalization in estimators in order to maintain unbiasedness. Partial flow records may be resampled (and further renormalized) and may be discarded in some cases (see [40]). In one variant of this approach the router maintains and exports a strictly bounded number of flow records, providing unbiased estimates of the original traffic bytes.

Stepping Methods: Stepping is an extension of the adaptive method in which, when downward adjustments of the sampling rate occur, estimates of the total bytes in packets of a given key that arrived since the previous such adjustment (the steps) are sampled and exported from the flow cache. Such exports can take place from the flow cache into DRAM, where the steps can be aggregated. The payoff is higher estimation accuracy, because once exported, the steps are not subject to loss (see [27]).

Run-Based Estimation: In its simplest form, run-based estimation involves caching in SRAM only the key of the last observed packet. If the current packet matches the key, the run event is registered in a cache in DRAM. Using a time-series model, the statistics of the original traffic are estimated from those of the runs. A generalization of the approach can additionally utilize longer runs [45].

10.5 Packet Selection Methods for Traffic Flow Measurement

10.5.1 Packet Selection Primitives and Standards

In Section 10.4.3.2 random packet sampling was presented as a necessity for reducing packet rates prior to the formation of flow statistics; moreover, random sampling has significant advantages over filtering and aggregation as a continuously operating general data reduction method. In this section we shift the emphasis somewhat and consider a set of packet selection primitives, and their ability to serve a variety of specific measurement applications. Following [33] we classify selection primitives as follows:

Filtering: Selection of packets based deterministically on their content. There are two important subcases:
– Property Match Filtering: Selection of a packet if a field or fields match a predefined value.
– Hash-Based Selection: A hash of the packet is calculated and the packet is selected if the hash value falls in a certain range.

Sampling: Selection of packets nondeterministically.

Some primitives of this type are provided by Cisco Flexible NetFlow [14], which allows combinations of certain random sampling and property match filters. The framework above was standardized in the Packet Sampling (PSAMP) working group of the IETF [33]. A collection of sampling primitives is described in [82], including but not limited to the fixed rate sampling from Section 10.4.3.2. Property match filtering can be based on packet header fields (such as IP address and port) and the packet treatment by the router, including interfaces traversed, and the routing state in operation during the packet's transit of the router. Hash-based selection, including specific hash functions, is also standardized in [82]. We describe the operation and applications of hash-based selection in Section 10.5.2.
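The three selection primitives classified above can be sketched as small predicates. Everything in this example is an assumption made for illustration: packets are modeled as dictionaries with hypothetical field names, and BLAKE2b is used only because it ships with Python's standard library; it is not presented here as one of the hash functions specified in [82].

```python
import hashlib
import random

def property_match(packet, **required):
    """Property match filtering: every named field must equal the given value."""
    return all(packet.get(field) == value for field, value in required.items())

def hash_select(invariant_bytes, fraction, key=b""):
    """Hash-based selection: select if the hash falls in the lowest `fraction`
    of the 64-bit hash range. The decision is deterministic in the input, so
    any device using the same function, key, and range agrees on it."""
    digest = hashlib.blake2b(invariant_bytes, digest_size=8, key=key).digest()
    return int.from_bytes(digest, "big") < fraction * 2**64

def random_select(n, rng):
    """Sampling: select nondeterministically, 1 in n packets on average."""
    return rng.random() < 1.0 / n

if __name__ == "__main__":
    rng = random.Random(0)
    packet = {"dst_port": 53, "proto": "udp", "invariant": b"example-packet-bytes"}
    # Primitives can be combined, e.g. filter on a header field, then sample.
    print("filter then sample:", property_match(packet, dst_port=53) and random_select(10, rng))
    print("hash-based        :", hash_select(packet["invariant"], 0.1))
```

Passing a secret `key` to `hash_select` is one way to make selection outcomes hard for an outside party to predict, at the cost of more computation than a simple non-cryptographic hash.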
From both the implementation and standards viewpoints, packet selection is positioned as a front-end process that passes selected packets to a process that compiles and exports flow statistics. Thus, a PSAMP packet selector passes packets to an IPFIX flow reporting process. A flow record can report on single selected packets by setting the inactive flow timeout to zero.

A key development in support of network management is the ability of routers and other measurement devices to support simultaneous operation of multiple independent measurements, each of which is composed of combinations of packet selection primitives. This type of capability is already present in Cisco Flexible NetFlow [14] and standardized in PSAMP/IPFIX. Each packet selection process can, in principle, be associated with its own independently configurable flow reporting process. The ability to dynamically configure or reconfigure packet selection provides a powerful tool for a variety of applications, from low-rate sampling of all traffic to supply routine reporting for Network Operation Center (NOC) wallboard displays, to targeted high-rate sampling that drills down on an anomaly in real time (see Fig. 10.3).

Fig. 10.3 Concurrent combinations of sampling and filtering packet selection primitives

10.5.2 Consistent Packet Sampling and Hash-Based Selection

The aim of consistent packet sampling (also called Trajectory Sampling) is to sample a subset of packets at some or all routers that they traverse. The motivation is the new measurement applications that this enables or enhances; see below. Consistent packet sampling can be implemented through hash-based selection. Routers calculate a hash of packet content that is invariant along the packet path, and the packet is selected for reporting if the hash value falls in a specified range.
When all routers use the same hash function and range, the sampling decisions for each packet are identical at all points along its path. Thus, each packet signals implicitly to the router whether it should be sampled. Information on the sampled packet can be reported in flow records, potentially one per sampled packet. In order to aid association of different reports on the same packet by the collector, the report can include not only packet header fields, but also a packet label or digest, taking the form of a hash (distinct from that used for selection) whose input includes part of the packet payload.

An ideal hash function would provide the appearance of uniform random sampling over the possible hash input values. This is important both for accurate traffic estimation and for integrity: network attackers should not be able to predict packet sampling outcomes. Use of a cryptographic hash function with a private parameter provides the strongest conformance to the ideal. In practice, implementation constraints on computational resources may require weaker hash functions to be used. Hash-based packet selection was proposed in [38], with further work on its application to passive performance monitoring in [34, 83]. Security ramifications of different hash function choices are discussed in [43]. Hash-based sampling has been standardized as part of the PSAMP standard in the IETF [82].

Applications of consistent sampling include:

Route Troubleshooting: Direct measurements of packet paths can be used to detect routing loops and measure transient behavior of traffic paths under routing changes. This detailed view is not provided by monitoring routing protocols alone. Independent packet sampling at different locations does not provide such a fine timescale view in general, since a given packet is typically not sampled at multiple locations.
Passive Performance Measurement: Correlating packet samples at two or more points on a path enables direct measurement of the performance experienced by traffic on the path, such as loss (as indicated by packets present at one point on the path that are missing downstream) and latency (if reports on sampled packets include measurement timestamps from synchronized clocks). This is an attractive application for service providers since it can flag performance degradation at the level of individual customers, reflecting the same packet transit performance that customers themselves experience.

10.6 Deep Packet Inspection

Sections 10.4 and 10.5 are concerned with the measurement and characterization of traffic at the granularity of a flow key that depends on the packet only through header fields. However, there are important network management tasks that depend on knowledge of packet payloads, and hence for which traffic flow monitoring is insufficient. The term deep packet inspection (DPI) denotes measurement and possible treatment of packets based on their payload. We describe some broad design and policy issues associated with the deployment of DPI in Section 10.6.1; specific technologies for DPI devices are described in Section 10.6.2, and three applications of DPI for network management in Section 10.6.3: application-specific bandwidth management, network security monitoring, and troubleshooting.

10.6.1 Design and Policy Issues for DPI Deployment

DPI functions are not uniformly featured in routers, and hence some uses will require additional infrastructure deployment. DPI is extremely resource intensive due to the need to access and process packet payload at line rate. This makes DPI expensive compared with flow measurement, which hinders its widespread deployment. A limited deployment may be restricted to important functional sites, or to a representative subset of different site types, e.g., a backbone link, an aggregation router, or in front of a datacenter.
Like all traffic measurements, DPI must maintain privacy and confidentiality of customer information throughout the measurement collection and analysis process. Although flow measurements already encode patterns of communications through source and destination IP addresses, DPI of packet payload may also encompass the content of the communications. Service provider policies must specify practices to maintain the privacy of the data, including controlled and auditable access restricted to individuals needing to work with the data, encryption, anonymization, and data retention policies. See also the discussion specific to DPI for security monitoring in Section 13.4. Furthermore, any use of DPI data must be conducted in accordance with the legal regulations in force. Similar issues exist for providers of host-based services as opposed to communications services, where servers intrinsically have access to user-specific data that may be presented by the customer in the course of using those services, e.g., email, search, or e-commerce transactions.

10.6.2 Technologies for DPI

DPI functionality is realized in dedicated general-purpose traffic monitors [28], and within vendor equipment targeted at specific applications such as security monitoring [68] and application-specific bandwidth management [19]. As the value of DPI-based applications for service providers grows, DPI functionality has also appeared in some routers and switches [16]. General-purpose computing platforms have been used for DPI, e.g., using Snort [74], an open-source intrusion detection system. Some DPI devices operate in line where they perform network management functions directly, such as security-based filtering or application bandwidth management. Others act purely as monitors and require a copy of the packet stream to be presented at an interface.
There are several ways by which this can be accomplished: (i) by copying the physical signal that carries the packets, e.g., with an optical splitter; (ii) by attaching the monitor to a shared medium carrying the traffic; or (iii) by having a router or switch copy packets to an interface on the monitor.

The architectural challenges for all DPI platforms are: (i) the high incoming packet rate; (ii) the large number of distinct signatures against which each packet is to be matched (Snort has several hundred); and (iii) signatures that match over multiple packets, and hence require flow-level state to be maintained in the measurement device. These factors have in the past tended to favor the use of dedicated DPI devices over router-based integration. They also drive architectural designs for DPI devices in which aggregation and analysis are pushed down as close to the data stream as possible.

tcpdump [52] is freely available software that runs on general-purpose computing platforms and captures packets at an interface of the host on which it executes. Tcpdump has been widely used both as a diagnostic tool and to capture packet header traces in order to conduct reproducible exploratory studies. However, the enormous byte rates of network data in comparison with storage and transmission resources generally preclude collecting packet header traces longer than a few minutes or perhaps hours. A number of anonymized packet header traces have been made available by researchers; see, e.g., [9]. Software for removal of confidential information from packet traces, including anonymization, is available (see [63]).

10.6.3 Applications of DPI

In this section, we motivate the importance of DPI by describing network management applications that require detail from packet payload: application characterization and management, network security, and network debugging.
10.6.3.1 Application Demand Characterization and Bandwidth Management

Applications place diverse service requirements on the network. For example, real-time applications such as VoIP require relatively small bandwidth but have stringent latency requirements. Video downloads require high throughput but are elastic in terms of latency. Service providers can differentiate resources among the different service classes according to the size of the demands in each class. Hence a crucial task for network planning is to characterize and track changes in the traffic mix across application classes.

In the past, application and application class could be inferred reasonably well from TCP/UDP port numbers on the basis of IANA well-known port assignments [50]. However, purely port-based identification is becoming less reliable due to factors including (i) lack of adherence to port conventions by application designers; (ii) piggybacking of applications on well-known ports, such as HTTP port 80, in order to facilitate firewall traversal; and (iii) separation of control and data channels with dynamic allocation of the data port during control level handshaking (see Chapter 5 for further details).

On the other hand, knowledge of application operation can be used to develop packet content-level signatures. In some cases, this would involve matching strings of an application-level protocol across one or more network packets. For applications that use separate data and control channels, this could entail (a) matching a signature of the control channel in the manner just described, then, with further inspection, (b) identifying the data channel port communicated in the control channel, and (c) using the identified data channel port to classify further packet or flow level measurements taken (see [80]).

Application-based classification can be used purely passively. Knowledge of the mix and relative growth between different application classes is necessary for network planning.
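Steps (a) to (c) above can be made concrete with a small sketch. FTP passive mode is used purely as an illustrative protocol chosen for this example, not one named by the text: the server's `227` reply on the control channel carries the address and port of the data channel as six decimal numbers, the port being `p1 * 256 + p2`. The class and method names are hypothetical.

```python
import re

# (a) Signature of the control-channel message that announces the data channel.
PASV_REPLY = re.compile(rb"^227 .*\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)")

class FlowClassifier:
    def __init__(self):
        self.expected_data = set()   # (server_ip, port) pairs learned so far

    def observe_control(self, payload):
        """Inspect a control-channel payload; remember any announced data endpoint."""
        match = PASV_REPLY.match(payload)
        if match is None:
            return None
        h1, h2, h3, h4, p1, p2 = (int(g) for g in match.groups())
        # (b) Extract the data channel address and port from the payload.
        endpoint = (f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2)
        self.expected_data.add(endpoint)
        return endpoint

    def classify(self, dst_ip, dst_port):
        """(c) Label a flow using the learned endpoints, then well-known ports."""
        if (dst_ip, dst_port) in self.expected_data:
            return "ftp-data"
        if dst_port == 21:
            return "ftp-control"
        return "unknown"

if __name__ == "__main__":
    classifier = FlowClassifier()
    classifier.observe_control(b"227 Entering Passive Mode (192,0,2,10,195,80)\r\n")
    print(classifier.classify("192.0.2.10", 50000))
```

Without the payload inspection in `observe_control`, the data flow to port 50000 would be indistinguishable from any other flow to an unassigned port. A production classifier would also need to expire learned endpoints and to reassemble control messages split across packets, both of which are omitted here.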
Application-based classification can also be used actively to apply differentiated resource allocation policies to different application classes, concerning traffic shaping, dropping of out-of-profile packets, or restoration priority after failures. As an example, access to a customer access channel can be prioritized so that the performance of delay-sensitive VoIP traffic is not impaired by other traffic. A number of vendors supply equipment with such capabilities (see, e.g., [19, 75]).

10.6.3.2 Network Security

While some network attacks can be identified based on header-level information, this is not true in general. As an example of the former, the well-known Slammer worm [64] was evident due to (i) its rapid growth, leading to sharp increases in traffic volume; (ii) the association of this increase with particular values of a packet header field; and (iii) contextual information that the application exploited predominantly exchanges traffic across LANs or intranets rather than across the WAN. This combination of factors made it relatively easy to identify the worm and block its spread by instantiating header-level packet filters, without significantly impacting legitimate traffic.

However, these conditions do not hold in general. Many network attacks exploit vulnerabilities in common applications such as email, chat, p2p, and web-browsing mediated by network communications that, in contrast with the Slammer example [64], (i) are relatively stealthy, not exhibiting large changes in network traffic volume at least during the acquisition phase; (ii) are not distinguished from legitimate traffic by specific header field values; and hence (iii) blend into the background of legitimate traffic at the flow level. Examples include installation of malware such as keystroke loggers, or the acquisition and subsequent control of zombie hosts in botnets.
To detect and mitigate these and other attacks, packet inspection is a powerful tool to enable matching against known signatures of malware, including viruses, worms, trojans, and botnets. Indeed, a sizable proportion of the attack detection signatures commonly used in the open-source Snort packet inspection system [74] match only on the packet payload rather than the header. As in Section 10.6.3.1, a network security tool may operate purely passively in order to gain information about unwanted traffic, or may be coupled to filtering functions that block specific flows of traffic (see Chapter 13 for further details).

10.6.3.3 Debugging for Software, Protocols, and Customer Support

Both networking hardware and software that implement services can contain subtle dependencies and display unexpected behavior that, despite pre-deployment testing, only becomes evident in the live network. DPI permits network operators to monitor, evaluate, and correct such problems. To troubleshoot specific network or service layer issues, DPI devices could be deployed at a concentration point where specific protocol exchanges or application-layer transactions can be monitored for correctness. Operators might also use portable DPI devices, which would allow them to deploy devices in specific locations to investigate suspected hardware or software bugs. Similarly, DPI enables technicians to assist customers in debugging customer equipment, and software installations and configurations. This can enable technicians to rapidly determine the nature of problems related to network transmissions, rather than rely on potentially incomplete knowledge derived from customer dialogs.

10.7 Active Performance Measurement

This section is concerned with the challenges and design aspects of providing active performance measurement infrastructures for service providers.
The four metric areas of common interest are:

Connectivity: Can a given host be reached from some set of hosts?

Loss: What proportion of a set of packets are lost on a path (or paths) between two hosts? Loss may be considered in an average sense (over all packets in some period) or granular in time (burst loss properties) or space (broken down, e.g., by customer or application).

Delay: The network latency over a path (or paths) between two hosts, viewed at the same granularity as for loss measurements.

Throughput: Bytes or packets successfully transmitted between two hosts, potentially broken down by application or protocol (e.g., TCP vs. UDP).

Active measurement tools such as ping and traceroute have long been used to baseline round-trip loss and delay and to map IP paths, either as standalone tools or integrated into performance measurement systems. Bulk throughput has been estimated using the treno tool [58], which creates a probe stream that conforms to the dynamics of TCP. There is a large body of more recent research work proposing improved measurement methods and analysis (see, e.g., [29]). However, the remainder of this chapter focuses on the design and deployment issues for the components of an active measurement and reporting infrastructure of the type increasingly deployed by service providers and enterprise customers. Specifically:

Performance Metric Standardization: This is required in order for all parties involved in the measurement, dissemination, and interpretation of results to agree on the methods of acquiring performance measurements, and their meaning. Such parties include network service providers, their customers, third-party measurement service providers, and measurement system vendors. Performance metric standardization is described in Section 10.8.

Service Level Agreements: Service providers must offer specific performance targets to their customers, based upon agreed metrics.
Section 10.9 describes processes for establishing SLAs between service providers and customers.

Deployment of Active Measurement Infrastructures: Deployment issues for large-scale active measurement infrastructures are discussed in Section 10.10, together with some examples of different deployment modes.

10.8 Standardization of IP Performance Metrics

In this section, we give an overview of standardization activities on IP performance metrics. Not one but two standards bodies provide authoritative views of IP network performance and of packet performance metrics in general. They are the IETF (primarily the IP Performance Metrics (IPPM) working group), and the International Telecommunication Union, Telecommunication Standardization Sector, Study Group 12 (ITU-T SG 12, specifically the Packet Network Performance Question 17). Although there are some differences in the approaches and the metric specifications between these two bodies, they are relatively minor.

The critical advantage of using standardized metrics is the same as for any good standard: the metrics can be implemented from unambiguous specifications, which ensure that two measurement devices will work the same way. They will assign timestamps at the same defined instants when a packet appears at the measurement point (such as first bit in, or the last bit out). They will use a waiting time to distinguish between packets with long delays and packets that do not arrive (because one cannot wait forever to report results, and for many applications a packet with extremely long delay is as good as lost). They will perform statistical summary calculations the same way, and when presented with identical network conditions to measure, they will produce the same results.

The ITU-T has defined its IP performance metrics in one primary Recommendation, Y.1540.
The general approach is to define basic sections bounded by measurement points, which are

– Hosts at the source and destination(s)
– Network Sections (composed of routers and links, and usually defined by administrative boundaries)
– Exchange Links (between the other entities)

The next step is to define packet transfer reference events at the various section boundaries. There are two main types of reference events:

– Entry event to a host, exchange link, or network section
– Exit event from a host, exchange link, or network section

Then, the fundamental outcomes of successful packet transfer and lost packet are defined, followed by performance parameters that can be calculated on a flow of packets (referred to using the convention "population of interest"). ITU-T's metrics are useful in either active or passive measurement, and do not specify sampling methods.

The IETF began work on network performance metrics in the mid-1990s, by first developing a comprehensive framework for active measurement [70]. The framework RFC established many important conventions and notions, including:

– The expanded use of the metric definition template developed in earlier IETF work on benchmarking network devices [6].
– The general concept of "packets of Type-P" to reflect the possibility that packets of different types would experience different treatment, and hence different performance, as they traverse the path. A complete specification of Type-P and the source and destination addresses is usually equivalent to the ITU-T's "population of interest".
– The notion of "wiretime", which recognizes that physical devices are needed to observe packets at the IP layer, and these devices may contribute to the observed performance as a source of error. Other important time-related considerations are detailed, too.
– The hierarchy of singletons ("atomic" results), samples (sets of singletons), and statistics (calculations on samples).
A series of RFCs followed over the next decade, one for each fundamental metric that was identified. The IETF wisely put the various metric RFCs (RFC 2679 [2] and RFC 2680 [3]) on the Standards Track, so that the implementations could be compared with the specifications and used to improve their quality (and narrow down some of the flexibility) over time. RFC 2330 [70] and RFC 3432 [72] specify Poisson and Periodic sampling, respectively. Throughput-related definitions are in RFC 5136 [12].

One area in which the IETF was extremely flexible was its specification for delay variation, in RFC 3393 [31]. This specification applies to almost any form of delay variation imaginable, and was endowed with this flexibility after considerable discussion and comparisons between the ITU-T preferred form and other methods (some of which were adopted in other IETF RFCs). This flexibility was achieved using the “selection function” concept, which allows the metric designer to compare any pair of packets (as long as each is unambiguously defined from a stream of packets). Thus, this version of the delay variation specification encouraged practitioners to gain experience with different metric formulations on IP networks, and facilitated comparison between different forms by establishing a common framework for their definition. A common selection function uses adjacent packets in the stream, and this is called “Inter-Packet Delay Variation”.

In contrast, the ITU-T Recommendations of the early 1990s (for ATM networks) used essentially the same form of delay variation metric as in Y.1540 and as used today in Recommendations for the latest networking technologies. It is called the “2-point Packet Delay Variation” metric. This metric defines delay variation as the difference between a packet’s one-way delay and the delay for a single reference packet.
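The two forms of delay variation can both be written as a selection function that picks pairs of packets, followed by a subtraction of their one-way delays. The sketch below is illustrative only: the function names and delay values are hypothetical, the input is assumed to contain only delivered packets, and the reference packet for the 2-point form is assumed to be the one with the minimum delay in the sample.

```python
from typing import Callable, List, Tuple

# A selection function maps a list of delays to (first, second) index pairs.
Selection = Callable[[List[float]], List[Tuple[int, int]]]


def adjacent_pairs(delays: List[float]) -> List[Tuple[int, int]]:
    """Select each packet together with its predecessor in the stream."""
    return [(i - 1, i) for i in range(1, len(delays))]


def min_reference_pairs(delays: List[float]) -> List[Tuple[int, int]]:
    """Select each packet together with the minimum-delay reference packet."""
    ref = delays.index(min(delays))
    return [(ref, i) for i in range(len(delays))]


def delay_variation(delays: List[float], select: Selection) -> List[float]:
    """Delay of the second packet minus delay of the first, for each pair."""
    return [delays[j] - delays[i] for i, j in select(delays)]


# Hypothetical one-way delays in milliseconds for five delivered packets.
delays_ms = [20, 22, 21, 35, 20]

ipdv = delay_variation(delays_ms, adjacent_pairs)       # inter-packet form
pdv = delay_variation(delays_ms, min_reference_pairs)   # 2-point form

print(ipdv)  # [2, -1, 14, -15]
print(pdv)   # [0, 2, 1, 15, 0]
```

The inter-packet values take both signs and depend on the order of the packets, while the minimum-referenced values are never negative and describe the spread of the delay distribution above its floor.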
The recommended reference is the packet with the minimum delay in the test sample, removing propagation from the delay distribution and emphasizing only the variation. This definition differs significantly from the inter-packet delay variation definition.

Fortunately, an IETF project has rather completely investigated the two main forms of delay variation metrics, and its results provide guidance on the appropriate form of metric for various tasks [66]. The comparison approach was to define the key tasks (such as de-jitter buffer size and queuing time estimation) and challenging measurement circumstances for delay variation measurements (such as path instability and packet loss), and to examine relevant literature. In summary, the ITU-T definition of “2-point Packet Delay Variation” was the best match to all tasks and most circumstances, its only weakness being a requirement for more stable timing.

10.9 Performance Metrics in Service-Level Agreements

In this section, we discuss Service-Level Agreements (SLAs), and how the key metrics defined above contribute to a successful relationship between customers and their service providers.

10.9.1 Definition of a Service-Level Agreement (SLA)

For our purposes, we define a Service-Level Agreement as:

A binding contract between Customer and Service Provider that identifies all important aspects of the service being delivered, constrains those aspects to a satisfactory performance level which can be objectively verified, and describes the method and format of the verification report.

This definition makes the SLA-supporting role and design of active measurement systems quite clear. The measurement system must assess the service on each of the agreed aspects (metrics) according to the agreed reporting schedule and determine whether the performance thresholds have been met.
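As a minimal sketch of that verification step, the following Python fragment compares one reporting interval's measured statistics against agreed thresholds. The metric names, threshold values, and measured values are hypothetical, and the sketch ignores confidence intervals, which a real SLA may also specify.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    """One agreed performance aspect and its limit (illustrative fields)."""
    metric: str
    limit: float
    higher_is_better: bool = False  # e.g. availability; delay and loss are not


def evaluate(thresholds: List[Threshold], measured: Dict[str, float]) -> Dict[str, bool]:
    """Return, per metric, whether the measured value meets its threshold."""
    report = {}
    for t in thresholds:
        value = measured[t.metric]
        if t.higher_is_better:
            report[t.metric] = value >= t.limit
        else:
            report[t.metric] = value <= t.limit
    return report


# Hypothetical SLA terms and one month of summarized measurements.
sla = [
    Threshold("mean_delay_ms", 40.0),
    Threshold("loss_ratio", 0.001),
    Threshold("availability", 0.999, higher_is_better=True),
]
monthly = {"mean_delay_ms": 31.5, "loss_ratio": 0.0004, "availability": 0.9985}

report = evaluate(sla, monthly)
print(report)
```

In this example the delay and loss thresholds are met and the availability threshold is not, so the interval as a whole would be reported as non-compliant.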
The details of the SLA may even specify the points where the active measurement system will be connected to the network, the sending characteristics of the synthetic packets dedicated for verification testing, and the confidence interval beyond which the results conclusively indicate that the threshold was met/not met.

10.9.2 Process to Develop the Elements of an SLA

This section describes a process to develop the critical performance aspects of an SLA. Typically, a network operator establishes a standard set of SLAs for a network service by conducting this process internally, using a surrogate for the customer. The specific details of the SLA may differ for different services, e.g., an enterprise Internet access service might have a different SLA from a premium VPN service. An SLA might specify performance metrics such as data delivery (the inverse of packet loss), site-to-site latency by region or location, delay variation or jitter, and availability, as well as a number of nonperformance metrics such as provisioning intervals.

There are also cases in which a network operator may develop a customized SLA for a particular customer (e.g., because the size of their network or other special circumstances demand it). The process that a service provider and the customer would go through to develop a customized SLA illustrates the issues that need to be addressed when developing an SLA. We present an example of such a process here.

In principle, the SLA represents a common language between the customer and service provider. The process involves collection of requirements and a meeting of peers to compare the view from each side of the network boundaries. One set of steps to create agreeable requirements is given below.

1. The customer identifies the locations where connectivity to the communications service is required (Customer–Service Interfaces), and the service provider compares the location list with available services.
2.
The customer and service provider agree on the performance metrics that will be the basis for the SLA. For example, a managed IP network provides a very basic service: packet transfer from source to destination. The SLA is based on packet transfer performance metrics, such as delay, delay variation, and loss ratio. If higher-layer functions are also provided (e.g., domain name to address resolution), then additional metrics can be included.
3. The customer must determine exactly how they plan to use a communications network to conduct business, and express the needs of their applications in terms of the packet performance metrics. The performance requirements may be derived from analysis of the component protocols of each customer application, from tests with simulated packet transfer impairments, or from prior experience. Sometimes, the service provider will consult on the application modeling.
4. In parallel, the service provider collects (or estimates) the levels of packet transfer performance that can be delivered between geographically dispersed service interfaces. Active measurements often serve this aspect of the process, by revealing the network performance possible under current conditions.
5. When the customer and service provider meet again, the requested and feasible performance levels for all of the performance metrics are compared. Where the requested performance levels cannot be met, revised network designs or a plan to achieve interim and long-term objectives in combination with deployment of new infrastructure may be developed, or the customer may relax specific requirements, or a combination of the two.
6. Once the performance levels of the SLA are agreed upon, it remains to decide on the formal reporting intervals and how the customer might access the ongoing measurement results. This aspect is important because formal reporting intervals are often quite long, on the order of a month.
7.
If the customer needs up-to-date performance status to aid in their troubleshooting process, then monthly reports might be augmented with the ability to view a customized report of recent measurements. The active measurement system would communicate measured results on a frequent basis to support this monitoring function, as well as longer-term SLA reports.

There are several process complexities worth mentioning. First, the customer may be able to easily determine the performance requirements for a single application flow, but the service provider’s measurements will likely be based on a test flow, which experiences the same treatment as the rest of the flows. The test packet flow may not have the same sending characteristics as customer flows, and will certainly represent only a small fraction of the aggregate traffic. Thus, the active test flow performance will represent the customer flow performance only on a long-term basis. Second, active measurements of throughput may have a negative effect on live traffic while they are in progress. As a result, the throughput metric may be specified through other means, such as the information rate of the access link on each service interface, and not formally verified through active measurement.

10.10 Deployment of Active Measurement Infrastructures

In this section, we describe several ways in which active measurement systems can be realized. One of the key design distinctions is the measurement device topology. We describe and contrast several of the topologies that have seen deployment, as this will be an important consideration for any system the reader might devise. We categorize the topologies according to where the devices conducting measurements are physically located.
10.10.1 Geographic Deployment at Customer–Service Interfaces

In this topology, measurement devices (or measurement processes in multipurpose devices) are located as close as possible to the service interfaces. Figure 10.4a depicts this topology for a point-to-point service, with a Measurement Point (MP) at each end of the path.

Fig. 10.4 Deployment scenarios for active measurement infrastructure. MP = measurement point. (a) MP at ends of path in point-to-point service. (b) MP at network edge; no coverage of access links. (c) MP at central location with connectivity to remote locations

The Cisco Systems IP SLA™ product embeds an active measurement system at routers and switches that often reside in close proximity to the Customer–Service Interfaces. The measurement results can be collected by accessing specific MIB modules using SNMP. The utility of IP SLA™ capabilities was recognized for multi-vendor scenarios, and the Two-way Active Measurement Protocol (TWAMP) [46] standardizes a fundamental test control and operation capability.

The primary advantage of this topology is that the measurement path covers the entire service in a single measurement, so the active test packets will experience conditions very similar to customer traffic. However, the measurement device/process must be located at a remote (customer) site to provide such coverage, so its cost is not shared across multiple services and it must be managed (and have results collected) remotely. The scale of the measurement system is also an issue. The number of paths in a full mesh of two-way active measurements grows quadratically with the number of nodes, N, according to N(N−1)/2.

10.10.2 Geographic Deployment at Network Edges

In Fig. 10.4b, the MPs move to intermediate nodes along the point-to-point path, the edge of the network providing service.
In this scenario, the measurement devices/processes are located at the edge of the network providing service and the access links may not be covered by the measurements or the SLAs. We also show a third MP within the network cloud, which can be used to divide the path into segments. This topology makes it possible to share the measurement devices and the measurements they produce with overlapping paths that support different services, different customers, or parts of other point-to-point paths for the same customer. Of course, a process is needed to combine the results of segment measurements to estimate the edge-to-edge performance, and this problem has been successfully solved [51, 65, 67]. The key points to note are the following:
• The interesting cases are those where impairments are time-varying, thus we expect to estimate features of time distributions, and not specific values (singletons) at particular times.
• Some performance metric statistics lend themselves to combination, such as means and ratios, so these should be selected for measurement and SLAs. For example, measurements of the minimum delay of path segments can usually be taken as additive when estimating the complete path performance. Average one-way delay is also additive, but somewhat more prone to estimation errors when the segment distributions are bimodal or have wide variance (a long tail).
• There must be a reasonable case made that (for each metric used) performance on one path segment will be independent of the other, because correlation causes the estimation methods to fail. An obvious correlation example is any metric that evaluates packet spacing differences: the measurement is dependent on the original spacing, and that spacing will change when there is any delay variation present on the path segments.
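The combination of segment statistics described in these points can be sketched as follows. The additive treatment of minimum and mean delay follows the text above. The product rule for loss ratio is an assumption added here for illustration, and it holds only if losses on different segments are independent. The class name, function name, and all numbers are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentStats:
    """Summary statistics for one path segment over a measurement interval."""
    min_delay_ms: float
    mean_delay_ms: float
    loss_ratio: float


def compose(segments: List[SegmentStats]) -> SegmentStats:
    """Estimate edge-to-edge statistics from per-segment statistics.

    Assumes segment performance is independent, as the text requires.
    """
    # Probability that a packet survives every segment in turn.
    delivered = math.prod(1.0 - s.loss_ratio for s in segments)
    return SegmentStats(
        min_delay_ms=sum(s.min_delay_ms for s in segments),    # additive
        mean_delay_ms=sum(s.mean_delay_ms for s in segments),  # additive
        loss_ratio=1.0 - delivered,
    )


# Three hypothetical segments: edge to core, core, core to edge.
segments = [
    SegmentStats(min_delay_ms=5.0, mean_delay_ms=6.5, loss_ratio=0.01),
    SegmentStats(min_delay_ms=12.0, mean_delay_ms=15.0, loss_ratio=0.0),
    SegmentStats(min_delay_ms=3.0, mean_delay_ms=4.5, loss_ratio=0.02),
]
path = compose(segments)
print(path)
```

Statistics such as percentiles or packet-spacing measures do not compose this simply, which is why the choice of metric matters when measurements are to be shared across overlapping paths.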
We note that it is also possible to obtain complete path coverage using this topology, with assistance from low-cost test reflector devices/processes located at the service interfaces, such as those described in RFC 5357 [46] (see [13] for more details).

10.10.3 Centralized Deployment with Remote Connectivity

As an alternative to remote deployment of measurement devices/processes, Fig. 10.4c shows all MPs moved to a central location with connectivity to strategic locations in the network (such as the network edges in key cities). This topology offers the advantage of easy access to the measurement devices at the central location, thus affording rapid reconfiguration and upgrade. However, reliable remote access links are needed between this single location and every network node that requires testing. Also, even if the remote access links are transparent from a packet loss perspective, they will still introduce delay that is not present on the customer’s path through the network. The mere cost of the remote access links may make remote device deployment in Fig. 10.4b more attractive. Thus, topologies like this have been deployed for remote connectivity monitors when the devices implementing a network technology do not have sufficient native support for remote device deployment (e.g., Frame Relay networks).

A system exploiting this approach is described in [8], where tunneling is used to steer measurement packets on round-trip paths from a central host, via the access links. In this sense, virtual measurements are conducted between different pairs of hosts in the network core. A related approach for multicast VPN monitoring is described in [7].

10.10.4 Collection for Infrastructure Measurements

When measurement devices are geographically dispersed, there must be a means to collect the results of measurements and make them available for monitoring, reporting, and SLA compliance verification.
This requires some form of protocol to fetch either the per-packet measurements, or the processed and summarized results for each intermediate measurement interval (e.g., 5–15 min). Once the measurement results have been collected at a central point, they should be stored in a database system and made available for ongoing display, detailed analysis, and SLA verification/reporting.

10.10.5 Other Types of Infrastructure Measurements

10.10.5.1 Independent Measurement Networks

Measurement service vendors, such as Keynote [57], station measurement devices at ISP locations that represent, e.g., typical customer access points, and conduct a variety of measurements between measurement devices or between them and service hosts, including web and other server response times, access bandwidth, VoIP, and other access performance. Comparative performance measures are published and detailed results are made available through subscription.

10.10.5.2 Cross Provider and Network-Wide Measurements

End-to-end paths commonly traverse multiple service providers. Thus, it is natural to measure the inter-provider components of performance. The most prominent example is the RIPE network [73], which has stationed measurement devices in a number of participating ISPs, conducts performance measurements between them, and disseminates selected views to the participants. Novel active measurement infrastructure is being deployed in advanced research and development networks (e.g., MeasurementLab/PlanetLab [61]), including work in developing architectures for managing access to and data recovery from measurement infrastructures.

10.10.5.3 Performance Measurement and Route Selection

Router measurement capabilities may also be coupled to the operation of routing protocols themselves. Cisco Performance Routing [20] enables routers in a multiply-homed domain to conduct performance measurements to external networks.
The measurements are then compared in order to determine the best egress to that network and adjust route parameters accordingly.

10.11 Outlook

The challenges described in Section 10.1.3 will grow with network size and complexity. The fundamental challenge for passive measurement, that of large data volumes caused by network scale and speed, is usually addressed by sampling. Going forward, there are three related trade-offs for the measurement infrastructure. Unless the capacity of the measurement infrastructure grows commensurately with network speed and scale, sampling rates must decrease in order to fit the measurements within the current infrastructure. But decreasing sampling rates reduces the ability to provide an accurate fine-grained view of the traffic. Although loss of detail and accuracy can be ameliorated by aggregation, that would go against the increasing demand for detailed measurements differentiated by customer, application, and service class. On the other hand, growing the infrastructure and retaining current sampling rates presents its own challenges, and not just in equipment and administration costs.

Distributed measurement architectures are an attractive way to manage scale, enabling local analysis and aggregation rather than requiring recovery of data to a single central point. Then, the challenge becomes the design of distributed analysis and efficient communication methods between components of measurement infrastructure. This is particularly challenging for network security applications, which need a network-wide view in order to identify stealthy unwanted traffic.

Active measurement presents analogous challenges in viewing network performance differentiated by, e.g., customer, application, traffic path, and network element. Aggregate performance measurements are no longer sufficient.
There are a number of approaches to targeting measurements at particular paths: (i) the probe packet may be crafted so that network elements forward it on the desired path (this approach was taken in [7, 8]), or (ii) customer traffic may be measured passively and directly, e.g., by comparing timestamps between different points on the path to determine latency (see Section 10.5.2). Both these approaches require knowledge of the mapping between the entity to be measured (customer, service class) and the observable parts of the packets. A challenge is that this mapping may be difficult to elucidate, or may depend on network state that may become unstable precisely at the time a performance problem needs to be diagnosed.

Tomographic methods have been proposed to infer performance on links from performance on sets of measured paths that traverse them (see [1, 11]), typically under simplifying independence assumptions concerning packet loss, latency, and link failure. These approaches aim to supply indirectly performance measurements that are not available directly. It remains a challenge to bring the early promise of these methods to fruition in production-level tools under general network conditions (see, e.g., [48]). The relative utility of performance tomographic approaches will depend on the extent to which the detailed network performance measurements can be provided directly by router-based measurements in the future.

This outlook stands in contrast to the state described in the opening section, where little measurement functionality was provided in the network infrastructure. As the best ideas in measurement research and development mature into standard equipment features, the challenge will be to manage the complexity and scale of the infrastructure and the data itself.

References

1. Adams, A., Bu, T., Caceres, R., Duffield, N., Friedman, T., Horowitz, J., Lo Presti, F., Moon, S. B., Paxson, V., & Towsley, D. (2000).
The use of end-to-end multicast measurements for characterizing internal network behavior. IEEE Communications Magazine, 38(5), 152–159, May 2000.
2. Almes, G., Kalidindi, S., & Zekauskas, M. (1999). A one-way delay metric for IPPM. RFC 2679, September 1999.
3. Almes, G., Kalidindi, S., & Zekauskas, M. (1999). A one-way packet loss metric for IPPM. RFC 2680, September 1999.
4. Alon, N., Duffield, N., Lund, C., & Thorup, M. (2005). Estimating arbitrary subset sums with few probes. In Proceedings of 24th ACM Symposium on Principles of Database Systems (PODS) (pp. 317–325). Baltimore, MD, June 13–16, 2005.
5. AT&T Labs. Application traffic analyzer. http://www.research.att.com/viewProject.cfm?prjID=125.
6. Bradner, S. (1991). Benchmarking terminology for network interconnection devices. RFC 1242, July 1991.
7. Breslau, L., Chase, C., Duffield, N., Fenner, B., Mao, Y., & Sen, S. (2006). Vmscope: a virtual multicast vpn performance monitor. In INM ’06: Proceedings of the 2006 SIGCOMM Workshop on Internet Network Management (pp. 59–64). New York, NY, USA: ACM.
8. Burch, H., & Chase, C. (2005). Monitoring link delays with one measurement host. SIGMETRICS Performance Evaluation Review, 33(3), 10–17.
9. CAIDA. The CAIDA anonymized 2009 internet traces dataset. http://www.caida.org/data/passive/passive_2009_dataset.xml.
10. CAIDA. cflowd: Traffic flow analysis tool. http://www.caida.org/tools/measurement/cflowd/.
11. Castro, R., Coates, M., Liang, G., Nowak, R., & Yu, B. (2004). Network tomography: recent developments. Statistical Science, 19, 499–517.
12. Chimento, P., & Ishac, J. (2008). Defining network capacity. RFC 5136, February 2008.
13. Ciavattone, L., Morton, A., & Ramachandran, G. (2003). Standardized active measurements on a tier 1 IP backbone. IEEE Communications Magazine, pp. 90–97, June 2003.
14. Cisco Systems. Cisco IOS Flexible NetFlow. http://www.cisco.com/web/go/fnf.
15. Cisco Systems. Cisco NetFlow Collector Engine.
http://www.cisco.com/en/US/products/sw/netmgtsw/ps1964/.
16. Cisco Systems. Delivering the next generation data center. http://www.cisco.com/en/US/products/ps9402/.
17. Cisco Systems. IOS switching services configuration guide. http://www.cisco.com/en/US/docs/ios/12_1/switch/configuration/guide/xcdipsp.html.
18. Cisco Systems. NetFlow. http://www.cisco.com/warp/public/732/netflow/index.html.
19. Cisco Systems. Optimizing application traffic with cisco service control technology. http://www.cisco.com/go/servicecontrol.
20. Cisco Systems. Performance Routing. http://www.cisco.com/web/go/pfr/.
21. Cisco Systems. Random Sampled NetFlow. http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/nfstatsa.html.
22. Claffy, K. C., Braun, H.-W., & Polyzos, G. C. (1995). Parameterizable methodology for internet traffic flow profiling. IEEE Journal on Selected Areas in Communications, 13(8), 1481–1494, October 1995.
23. Claise, B. (2004). Cisco Systems NetFlow Services Export Version 9. RFC 3954, October 2004.
24. Claise, B., Johnson, A., & Quittek, J. (2009). Packet sampling (psamp) protocol specifications. RFC 5476, March 2009.
25. Claise, B., & Wolter, R. (2007). Network management: accounting and performance strategies. Cisco.
26. Cohen, E., Duffield, N., Lund, C., & Thorup, M. (2008). Confident estimation for multistage measurement sampling and aggregation. In ACM SIGMETRICS. Annapolis, Maryland, USA, June 2–6, 2008.
27. Cohen, E., Duffield, N. G., Kaplan, H., Lund, C., & Thorup, M. (2007). Algorithms and estimators for accurate summarization of internet traffic. In IMC ’07: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (pp. 265–278). New York, NY, USA: ACM.
28. Cranor, C., Johnson, T., Spataschek, O., & Shkapenyuk, V. (2003). Gigascope: a stream database for network applications. In SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (pp. 647–651). New York, NY, USA: ACM.
29.
Crovella, M., & Krishnamurthy, B. (2006). Internet measurement: infrastructure, traffic and applications. New York, NY: Wiley.
30. Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. New York, NY, USA: Wiley.
31. Demichelis, C., & Chimento, P. (2002). IP packet delay variation metric for IP performance metrics (IPPM). RFC 3393, November 2002.
32. Dietz, T., Claise, B., Aitken, P., Dressler, F., & Carle, G. (2009). Information model for packet sampling export. RFC 5477, March 2009.
33. Duffield, N.G., Claise, B., Chiou, D., Greenberg, A., Grossglauser, M., & Rexford, J. (2009). A framework for packet selection and reporting. RFC 5474, March 2009.
34. Duffield, N.G., Gerber, A., & Grossglauser, M. (2002). Trajectory engine: A backend for trajectory sampling. In IEEE Network Operations and Management Symposium (NOMS) 2002. Florence, Italy, 15–19 April 2002.
35. Duffield, N.G., Lund, C., & Thorup, M. (2001). Charging from sampled network usage. In Proceedings of 1st ACM SIGCOMM Internet Measurement Workshop (IMW) (pp. 245–256). San Francisco, CA, November 1–2, 2001.
36. Duffield, N.G., Lund, C., & Thorup, M. (2005). Learn more, sample less: control of volume and variance in network measurements. IEEE Transactions on Information Theory, 51(5), 1756–1775.
37. Duffield, N.G., Lund, C., & Thorup, M. (2007). Priority sampling for estimation of arbitrary subset sums. Journal of ACM, 54(6), Article 32, December 2007. Announced at SIGMETRICS’04.
38. Duffield, N., & Grossglauser, M. (2001). Trajectory sampling for direct traffic observation. IEEE/ACM Transactions on Networking, 9(3), 280–292, June 2001.
39. Duffield, N., & Lund, C. (2003). Predicting resource usage and estimation accuracy in an IP flow measurement collection infrastructure. In Proceedings of Internet Measurement Conference. Miami, FL, October 27–29, 2003.
40. Estan, C., Keys, K., Moore, D., & Varghese, G. (2004).
Building a better netflow. In Proceedings of the ACM SIGCOMM 04. New York, NY, 12–16 June 2004.
41. Estan, C., & Varghese, G. (2002). New directions in traffic measurement and accounting. In Proceedings of ACM SIGCOMM ’2002. Pittsburgh, PA, August 2002.
42. Feldmann, A., Rexford, J., & Cáceres, R. (1998). Efficient policies for carrying web traffic over flow-switched networks. IEEE/ACM Transactions on Networking, 6(6), 673–685, December 1998.
43. Goldberg, S., & Rexford, J. (2007). Security vulnerabilities and solutions for packet sampling. In IEEE Sarnoff Symposium. Princeton, NJ, May 2007.
44. Greer, R. (1999). Daytona and the fourth-generation language cymbal. In SIGMOD ’99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (pp. 525–526). New York, NY, USA: ACM.
45. Hao, F., Kodialam, M., & Lakshman, T.V. (2004). Accel-rate: a faster mechanism for memory efficient per-flow traffic estimation. In SIGMETRICS ’04/Performance ’04: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (pp. 155–166). New York, NY, USA: ACM.
46. Hedayat, K., Krzanowski, R., Morton, A., Yum, K., & Babiarz, J. (2008). A two-way active measurement protocol (TWAMP). RFC 5357, October 2008.
47. Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260), 663–685.
48. Huang, Y., Feamster, N., & Teixeira, R. (2008). Practical issues with using network tomography for fault diagnosis. SIGCOMM Computer Communication Review, 38(5), 53–58.
49. IETF. IP Flow Information Export (ipfix) charter. http://www.ietf.org/html.charters/ipfix-charter.html. Version of 16 December 2008.
50. Internet Assigned Numbers Authority. Port numbers. http://www.iana.org/assignments/port-numbers.
51. ITU-T Recommendation Y.1540. Network performance objectives for IP-based services, February 2006.
52. Jacobson, V., Leres, C., & McCanne, S.
tcpdump.
53. Jacobson, V. Traceroute. ftp://ftp.ee.lbl.gov/traceroute.tar.gz.
54. Jain, R., & Routhier, S. (1986). Packet trains – measurements and a new model for computer network traffic. IEEE Journal on Selected Areas in Communications, 4(6), 986–995, September 1986.
55. Juniper Networks. Junose 8.2.x ip services configuration guide: Configuring j-flow statistics. http://www.juniper.net/techpubs/software/erx/junose82/swconfig-ip-services/html/ip-jflow-stats-config.html.
56. Kent, S., & Atkinson, R. (1998). Security architecture for the Internet Protocol. RFC 2401, November 1998.
57. Keynote Systems. http://www.keynote.com.
58. Mathis, M., & Mahdavi, J. (1996). Diagnosing internet congestion with a transport layer performance tool. In Proceedings of INET 96. Montreal, Quebec, 24–28 June 1996.
59. McCloghrie, K., & Kastenholz, F. The interfaces group mib. RFC 2863, June 2000.
60. McCloghrie, K., & Rose, M. (1991). Management Information Base for Network Management of TCP/IP-based internets: MIB-II. RFC 1213, available from http://www.ietf.org/rfc, March 1991.
61. MeasurementLab. http://www.measurementlab.net/.
62. Mills, C., Hirsh, D., & Ruth, D. (1991). Internet accounting: background. RFC 1272, November 1991.
63. Minshall, G. tcpdpriv. http://ita.ee.lbl.gov/html/contrib/tcpdpriv.html.
64. Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., & Weaver, N. (2003). Inside the slammer worm. IEEE Security and Privacy, 1(4), 33–39.
65. Morton, A. (2008). Framework for metric composition, June 2009. draft-ietf-ippm-framework-compagg-08 (work in progress).
66. Morton, A., & Claise, B. (2009). Packet delay variation applicability statement. RFC 5481, March 2009.
67. Morton, A., & Stephan, E. (2008). Spatial composition of metrics, October 2009. draft-ietf-ippm-spatial-composition-10 (work in progress).
68. Narus, Inc. Narusinsight secure suite. http://www.narus.com/products/security.html.
69. Packetdesign. Traffic explorer.
http://www.packetdesign.com/products/tex.htm.
70. Paxson, V., Almes, G., Mahdavi, J., & Mathis, M. (1998). Framework for IP performance metrics. RFC 2330, May 1998.
71. Phaal, P., Panchen, S., & McKee, N. (2001). Inmon corporation’s sflow: A method for monitoring traffic in switched and routed networks. RFC 3176, September 2001. http://www.ietf.org/rfc/rfc3176.txt.
72. Raisanen, V., Grotefeld, G., & Morton, A. (2002). Network performance measurement with periodic streams. RFC 3432, November 2002.
73. RIPE. http://www.ripe.net.
74. Roesch, M. (1999). Snort – Lightweight Intrusion Detection for Networks. In Proceedings of USENIX Lisa ’99, Seattle, WA, November 1999.
75. Sandvine. http://www.sandvine.com/.
76. Srinivasan, C., Viswanathan, A., & Nadeau, T. (2004). Multiprotocol label switching (MPLS) label switching router (LSR) management information base (MIB). RFC 3813, June 2004.
77. Stallings, W. (1999). SNMP, SNMP v2, SNMP v3, and RMON 1 and 2 (Third Edition). Reading, MA: Addison-Wesley.
78. Stewart, R., Ramalho, M., Xie, Q., Tuexen, M., & Conrad, P. (2004). Stream control transmission protocol (SCTP) partial reliability extension. RFC 3758, May 2004.
79. Thorup, M. (2006). Confidence intervals for priority sampling. In Proceedings of ACM SIGMETRICS/Performance 2006 (pp. 252–263). Saint-Malo, France, 26–30 June 2006.
80. van der Merwe, J., Cáceres, R., Chu, Y.-H., & Sreenan, C. (2000). mmdump: a tool for monitoring internet multimedia traffic. SIGCOMM Computer Communication Review, 30(5), 48–59.
81. Waldbusser, S. (2000). Remote network monitoring management information base. RFC 2819, available from http://www.ietf.org/rfc, May 2000.
82. Zseby, T., Molina, M., Duffield, N.G., Niccolini, S., & Raspall, F. (2009). Sampling and filtering techniques for IP packet selection. RFC 5475, March 2009.
83. Zseby, T., Zander, S., & Carle, G. (2001).
Evaluation of building blocks for passive one-waydelay measurements. In Proceedings of Passive and Active Measurement Workshop (PAM 2001). Amsterdam, The Netherlands, 23–24 April 2001. Chapter 11 Measurements of Control Plane Reliability and Performance Lee Breslau and Aman Shaikh 11.1 Introduction The control plane determines how traffic flows through an IP network. It consists of routers interconnected by links and routing protocols implemented as software processes running on them. Routers (or more specifically routing protocols) communicate with one another to determine the path that packets take from a source to a destination. As a result, the reliability and performance of the control plane is critical to the overall performance of applications and services running on the network. This chapter focuses on how to measure and monitor the reliability and performance of the control plane of a network. The original Internet service model supported only unicast delivery. That is, a packet injected into the network by a source host was intended to be delivered to a single destination. Multicast, in which a packet is replicated inside the network and delivered to multiple hosts was subsequently introduced as a service. While certain multicast routing protocols leverage unicast routing information, unicast and multicast have very distinct control planes. They are each governed by a different set of routing protocols, and measurement and monitoring of these protocols consequently take different forms. Therefore, we cover unicast and multicast control plane monitoring separately in Sections 11.2 and 11.3, respectively. We start Section 11.2 with a brief overview of how unicast forwarding works, describing different routing protocols and how they work to determine paths between a source and a destination. 
We then look at two key components of performance monitoring: instrumentation of the network for data collection in Section 11.2.2, and strategies and tools for data analysis in Section 11.2.3. More specifically, the instrumentation section describes what data we need to collect for route monitoring along with mechanisms for collecting the data needed. The analysis section focuses on various techniques and tools that show how the data is used for monitoring the performance of the control plane. While the focus of the section is on management and operational aspects, we also describe some of the research enabled by this data that has played a vital role in enhancing our understanding of the control plane behavior and performance in real life. We follow this up with a description of the AT&T OSPF Monitor [1] in Section 11.2.4 as a case study of a route monitor in real life. In Section 11.2.5, we describe control plane monitoring of MPLS, which has been deployed in service provider networks in the last few years and is a key enabler of Traffic Engineering (TE) and Fast Re-route (FRR) capabilities, as well as new services such as VPN and VPLS.

L. Breslau and A. Shaikh, AT&T Labs – Research, Florham Park, NJ, USA; e-mail: breslau@research.att.com; ashaikh@research.att.com
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_11, © Springer-Verlag London Limited 2010

Section 11.3 follows a similar approach in its treatment of multicast. We begin with a motivation for and historical perspective of the development and deployment of multicast. In Section 11.3.1, we provide a brief overview of the multicast routing protocols commonly in use today, PIM and MSDP. We then outline some of the challenges specific to monitoring the multicast control plane in Section 11.3.2. Section 11.3.3 provides detailed information about multicast monitoring.
This includes an overview of early multicast monitoring efforts, a discussion of the information sources available for multicast monitoring, and a discussion of specific approaches and tools used in multicast monitoring. At the end of the chapter, in Section 11.4, we provide a brief summary and avenues for future work.

11.2 Unicast

In this section, we focus on monitoring of unicast routing protocols. We begin by providing a brief overview of how routers forward unicast packets and the routing protocols used for determining the forwarding paths before delving into details of how to monitor these protocols.

11.2.1 Unicast Routing Overview

Let us start with a description of how routing protocols enable the forwarding of unicast packets in IP networks. With unicast, each packet contains the address of the destination. When the packet arrives at a router, a table called the Forwarding Information Base (FIB), also known as the forwarding table, is consulted. This table allows the router to determine the next-hop router for the packet, based on its destination address. Packets are thus forwarded in a hop-by-hop fashion, requiring look-ups in the forwarding table of each router along the way to the destination.

The forwarding table typically consists of a set of prefixes. Each prefix is represented by an IP address and a mask that specifies how many significant bits of a destination address need to match the address of the prefix. For example, a prefix represented as 10.0.0.0/16 would match a destination address whose first 16 bits are the same as the first 16 bits of 10.0.0.0 (i.e., 10.0). Thus, the address 10.0.0.1 matches this prefix, as do 10.0.0.2 and 10.0.1.1. It is possible, and often the case, that more than one prefix in a FIB matches a given (destination) address. In such a case, the prefix with the longest mask length is used for determining the next-hop router.
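The longest-match rule just stated can be illustrated with a minimal Python sketch. It uses the standard `ipaddress` module and a linear scan over a small dictionary; the function name `longest_prefix_match` and the next-hop labels are hypothetical, and real routers typically use specialized data structures or hardware rather than a scan like this.

```python
import ipaddress

def longest_prefix_match(fib, destination):
    """Return (prefix, next_hop) for the most specific matching FIB entry, or None."""
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in fib.items():
        network = ipaddress.ip_network(prefix)
        # Keep the entry only if it covers the address and is more specific
        # (longer mask) than the best match found so far.
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    if best is None:
        return None
    return (str(best[0]), best[1])

# Hypothetical FIB: prefix -> next-hop label (labels are illustrative only).
fib = {
    "10.0.0.0/16": "router-A",
    "10.0.0.0/24": "router-B",
}

print(longest_prefix_match(fib, "10.0.0.1"))     # both prefixes match; the /24 wins
print(longest_prefix_match(fib, "10.0.1.1"))     # only the /16 matches
print(longest_prefix_match(fib, "192.168.1.1"))  # no match
```

The scan keeps the sketch short; its only purpose is to show that the mask length, not the order of entries, decides which prefix is used.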
For example, if a FIB contains 10.0.0.0/16 and 10.0.0.0/24, and the destination is 10.0.0.1, prefix 10.0.0.0/24 is used for forwarding the packet even though both prefixes match the address. For this reason, IP forwarding is said to use longest-prefix matching.

Routers run one or more routing protocols to construct their FIBs. Every routing protocol allows a router to learn the network topology (or some part of it) by exchanging messages with other routers. The topology information is then used by a router to determine next hops for various prefixes, i.e., the FIB.

Learning Topology Information

Depending on how much topology information each router learns, routing protocols can be divided into two main classes: distance-vector and link-state. In a distance-vector routing protocol, at each step every router learns the distance of each adjacent router to every prefix. Every prefix is connected to one or more routers in the network. The distance from a router to a prefix is the sum of the weights of the individual links on the path, where the weight of every link is assigned in the configuration file of the associated router. A router, upon learning distances from its neighbors, chooses as its next hop for a given prefix the neighbor that yields the smallest total distance (the neighbor's distance plus the weight of the router's link to that neighbor), and subsequently propagates this distance to all other neighbors.

When a router comes up, it only knows about its directly connected prefixes (e.g., prefixes associated with point-to-point or broadcast links). The router propagates information about these prefixes to its neighbors, allowing them to determine their routes to them. The information then spreads further, and ultimately all routers in the network end up with next hops for these prefixes. In a similar vein, the newly booted router also learns about other prefixes from its neighbors, and builds its entire FIB.
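The distance-vector exchange described above can be sketched as a small synchronous simulation. This is an illustration under simplifying assumptions, not the behavior of any particular protocol: all routers exchange distances in lock-step rounds, link weights are positive, and there is no handling of failures, withdrawals, or split horizon. The function name `run_distance_vector` and the three-router topology are hypothetical.

```python
INF = float("inf")

def run_distance_vector(links, connected):
    """Simulate rounds of distance exchange until no table changes.

    links:     {router: {neighbor: weight of router's link to neighbor}}
    connected: {router: iterable of directly connected prefixes}
    Returns    {router: {prefix: (distance, next_hop)}}; next_hop is None
               for directly connected prefixes.
    """
    table = {r: {p: (0, None) for p in connected.get(r, ())} for r in links}
    changed = True
    while changed:
        changed = False
        # Snapshot of the distances every router advertises in this round.
        adverts = {r: {p: d for p, (d, _) in t.items()} for r, t in table.items()}
        for router, neighbors in links.items():
            for neighbor, weight in neighbors.items():
                for prefix, dist in adverts[neighbor].items():
                    candidate = dist + weight  # neighbor's distance plus link weight
                    if candidate < table[router].get(prefix, (INF, None))[0]:
                        table[router][prefix] = (candidate, neighbor)
                        changed = True
    return table

# Hypothetical topology: A-B (weight 1), B-C (weight 2), A-C (weight 5);
# prefix "p" is directly connected to router C.
links = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 2},
    "C": {"A": 5, "B": 2},
}
tables = run_distance_vector(links, {"C": ["p"]})
print(tables["A"]["p"])  # A reaches p via B at distance 3, not directly via C at 5
```

In the first round A learns the direct route through C (distance 5); only in the second round, after B has learned its own route, does A switch to B. This is the gradual spreading of information that the text describes.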
The distance-vector protocols essentially implement a distributed version of the Bellman-Ford shortest-path algorithm [2]. RIP [3] is an example of a distance-vector protocol. EIGRP, a Cisco-proprietary protocol, is another example. It contains mechanisms (an algorithm called DUAL [4]) to prevent the forwarding loops that can form during network changes, when routers can become inconsistent in their views of the topology. A subclass of distance-vector protocols, called path-vector protocols, includes the actual path to the destination along with the distance in the updates sent to neighbors. The inclusion of the path helps in identifying potential loops and preventing them from forming during convergence. BGP [5] is an example of a path-vector protocol.

With link-state routing protocols, each router learns the entire network topology. The topology is conceptually a directed graph: each router corresponds to a node in this graph, and each link between neighboring routers corresponds to a unidirectional edge. Just as in distance-vector protocols, each link also has an administratively assigned weight associated with it. Using the weighted topology graph, each router computes a shortest-path tree with itself as the root, and applies the results to compute next hops for all possible destinations. Routing remains consistent as long as all the routers have the same view of the topology. The view of the topology is built in a distributed fashion, with each router describing its local connectivity (i.e., the set of links incident on it along with their weights) in a message, and flooding this message to all routers in the network. OSPF [6] and IS-IS [7] are examples of link-state protocols.

Autonomous Systems (ASes) and Hierarchical Routing

The Internet is an inter-network of networks. By design, these networks are envisioned to be administered by independent entities. In other words, the Internet is a collection of independently administered networks.
Roughly speaking, such networks are known as Autonomous Systems (ASes). Each autonomous system consists of a set of routers and links that are usually managed by a single administrative authority. Every autonomous system can run one or more routing protocols of its choice to route packets within the system. RIP, EIGRP, OSPF and IS-IS are typically used for routing packets within an AS and are, therefore, known as intradomain or Interior Gateway Protocols (IGPs). In addition, a routing protocol is needed to forward packets between ASes. BGP is used for this purpose and is known as an interdomain or an Exterior Gateway Protocol (EGP). Next, we present an overview of BGP and OSPF, as they come up frequently in the subsequent discussions. For details on other routing protocols, please refer to [8].

11.2.1.1 BGP Overview

As mentioned in Section 11.2.1, BGP is the de facto routing protocol used to exchange routing information between ASes. BGP is a path-vector protocol (a subclass of distance-vector protocols). In path-vector protocols, a router receives routes from its neighbors that describe their distance to prefixes, as well as the path used to reach the prefix in question. Since BGP is used to route packets between ASes, the path is described as a sequence of ASes traversed along the way to the prefix, the sequence being known as an ASPath. Thus, every route update received at a router contains the prefix and the ASPath indicating the path used by the neighbor to reach the prefix. The distance is not explicitly included; rather, it implicitly equals the number of ASes in the ASPath.

Apart from ASPath, BGP routes also contain other attributes. These attributes are used by a router to determine the most preferred route from all received routes to a destination prefix. Figure 11.1 shows the steps of a decision process that a BGP-speaking router follows to select its most preferred route. The process is run independently for each prefix, and starts with all the available routes for the prefix in question. At every step, relevant attributes of the routes are compared. Routes with the most preferred values pass on to the next step while other routes are dropped from further consideration. At the end of the decision process, a router ends up with a single route for every prefix, and uses it to forward data traffic.

Fig. 11.1 The decision process used by BGP to select the best route to every prefix. Vendor-dependent steps are not included
1. Highest Local Preference
2. Shortest ASPath Length
3. Lowest Origin Type
4. Lowest MED
5. Prefer Closest Egress (based on IGP distance)
6. Arbitrary Tie Breaking

Note that the second step of the decision process compares the length of the ASPath of the routes that survived the first step, and keeps the ones with the shortest ASPaths while discarding the rest. We will not go into details of other steps except to point out that if faced with more than one route in step 5, the router selects the route(s) which minimize the IGP distance a packet will have to travel to exit its AS. This process of preferring the closest egress is known as hot-potato or closest-egress routing.

A router forms BGP sessions with other routers to exchange route updates. The two ends of a session can belong either to the same AS or to different ASes. When the session is formed between routers in the same AS, it is known as an internal BGP (IBGP) session. In contrast, when the routers are in different ASes, the session is known as an external BGP (EBGP) session. For example, in Fig. 11.2, which shows multiple interconnected ASes and routers in them, solid lines depict IBGP sessions, whereas dashed lines represent EBGP sessions. The EBGP sessions set up between routers in neighboring ASes allow them to exchange routes to various prefixes. The routes learned over EBGP sessions are then distributed using IBGP sessions within an AS. For example, AS 2 in Fig.
11.2 learns routes from ASes 1, 3, and 4 over EBGP sessions, which are then distributed among its routers over IBGP sessions. In order to disseminate all routes learned via EBGP to every router, routers inside an AS like AS 1 need to form a full mesh of IBGP sessions. A router receiving a route update over an EBGP session propagates it to all other routers in the mesh; however, route updates received over IBGP sessions are not forwarded back to the routers in the mesh (see [9] for full details). An IBGP full mesh does not scale for ASes with a large number of routers. To improve scalability, large ASes use an IBGP hierarchy such as route reflection [10]. Route reflection allows the re-announcement of some routes learned over IBGP sessions. However, it sacrifices the number of candidate routes learned at each router for improved scalability. For example, AS 2 in Fig. 11.2 employs a route reflector hierarchy.

Fig. 11.2 Example topology with multiple ASes and BGP sessions (Figure labels: AS 1, AS 2, AS 3, AS 4; legend: IBGP Session, EBGP Session, BGP Router, BGP Route Reflector.)

11.2.1.2 OSPF Overview

As noted in Section 11.2.1, OSPF is a link-state protocol, which is widely used to control routing within an Autonomous System (AS) (see footnote 1). With link-state routing protocols, each router learns the entire view of the network topology represented as a weighted graph, uses it to compute a shortest-path tree with itself as the root, and applies the results to construct its forwarding table. This assures that packets are forwarded along the shortest paths in terms of link weights to their destinations [11]. We will refer to the computation of the shortest-path tree as an SPF computation, and the resultant tree as an SPF tree.

For scalability, an OSPF network may be divided into areas determining a two-level hierarchy as shown in Fig. 11.3.
Area 0, known as the backbone area, resides at the top level of the hierarchy and provides connectivity to the non-backbone areas (numbered 1, 2, etc.). OSPF assigns each link to one or more areas (see footnote 2). The routers that have links to multiple areas are called border routers. For example, routers C, D, and G are border routers in Fig. 11.3.

Fig. 11.3 An example OSPF topology, the view of the topology from router G, and the shortest-path tree calculated at G. Although we show the OSPF topology as an undirected graph here for simplicity, the graph is directed in reality (Panels: OSPF Network Topology; Topology View of Router G; Shortest Path Tree at G. Legend: Border router, AS border router.)

Footnote 1: Even though an IGP like OSPF is used for routing within an AS, the boundary of an IGP domain and an AS do not have to coincide. An AS may consist of multiple IGP domains; conversely, a single IGP domain may span multiple ASes.
Footnote 2: The original OSPF specification [6] required each link to be assigned to exactly one area, but a recent extension [12] allows a single link to be assigned to multiple areas.

Every router maintains a separate copy of the topology graph for each area to which it is connected. The router performs the SPF computation on each such topology graph and thereby learns how to reach nodes in all adjacent areas. A router does not learn the entire topology of remote areas. Instead, it learns the total weight of the shortest paths from one or more border routers to each prefix in remote areas. Thus, after computing the SPF tree for each area, the router learns which border router to use as an intermediate node for reaching each remote node.
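The SPF computation that a router performs on the topology graph of an area can be sketched with Dijkstra's algorithm. The sketch below is illustrative only: the function name `spf` and the four-router graph are hypothetical (they are not the topology of Fig. 11.3), and the code keeps a single first hop per destination, so it ignores equal-cost multipath.

```python
import heapq

def spf(graph, root):
    """Compute shortest-path distances and first hops from root.

    graph: {node: {neighbor: link weight}} (directed edges)
    Returns {node: (distance, first_hop)}; first_hop is None for the root.
    """
    result = {}
    heap = [(0, root, None)]
    while heap:
        dist, node, first_hop = heapq.heappop(heap)
        if node in result:
            continue  # already settled with a shorter or equal distance
        result[node] = (dist, first_hop)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in result:
                # When leaving the root, the first hop is the neighbor itself;
                # further down the tree it is inherited from the parent.
                hop = neighbor if node == root else first_hop
                heapq.heappush(heap, (dist + weight, neighbor, hop))
    return result

# Hypothetical area topology; each link is listed in both directions.
graph = {
    "G": {"E": 1, "F": 2},
    "E": {"G": 1, "H": 1},
    "F": {"G": 2, "H": 3},
    "H": {"E": 1, "F": 3},
}
for node, (distance, first_hop) in sorted(spf(graph, "G").items()):
    print(node, distance, first_hop)
```

The first hop recorded for each destination is what a router would install as the next hop in its forwarding table, which is why the sketch tracks it alongside the distance.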
In addition, the reachability of external IP prefixes (associated with nodes outside the OSPF domain) can be injected into OSPF (e.g., X and Y in Fig. 11.3). Roughly, reachability to an external prefix is determined as if the prefix was a node linked to the router that injects the prefix into OSPF. The router that injects the prefix into OSPF is called an AS Border Router (ASBR). For example, router A is an ASBR in Fig. 11.3.

Routers running OSPF describe their local connectivity in Link State Advertisements (LSAs). These LSAs are flooded reliably to other routers in the network. The routers use LSAs to build a consistent view of the topology as described earlier. Flooding is made reliable by mandating that a router acknowledge the receipt of every LSA it receives from every neighbor. The flooding is hop-by-hop and hence does not itself depend on routing. The set of LSAs in a router’s memory is called the link-state database and conceptually forms the topology graph for the router.

Two routers are neighbor routers if they have interfaces to a common network (i.e., they have a direct path between them that does not go through any other router). Neighbor routers form an adjacency so that they can exchange LSAs with each other. OSPF allows a link between the neighbor routers to be used for forwarding only if these routers have the same view of the topology, i.e., the same link-state database for the area the link belongs to. This ensures that forwarding data packets over the link does not create loops. Thus, two neighbor routers make sure that their link-state databases are in sync by exchanging out-of-sync parts of their link-state databases when they establish an adjacency.

11.2.2 Instrumentation for Route Monitoring

As mentioned, routers exchange information about the topology with other routers in the network to build their forwarding tables.
As a result, understanding control plane dynamics requires collecting these messages and analyzing them. In this section, we focus on the collection aspect, leaving analysis for the next section. We first focus on how to instrument a single router, before turning our attention to the network-wide collection of messages.

11.2.2.1 Collecting Data from a Single Router

Even though the kind of information exchanged in routing messages varies from protocol to protocol, the flow of messages through individual routers can be modeled in the same manner, as depicted in Fig. 11.4. Every router receives messages from its neighbors from time to time. These messages are sent by neighbors in response to events occurring in the network or the expiration of timers; again, the exact reasons are protocol specific. As described in Section 11.2.1, each message describes some aspect of the network topology or reachability to a prefix along with a set of attributes. Upon receiving the message, the router runs its route selection procedure, taking the newly received message into account. The procedure can change the best route to one or more prefixes in the FIB. A router also sends messages to neighbors as network topology and/or reachability to prefixes change; the trigger and contents of the messages depend on the protocol.

Given this, understanding the routing dynamics of a router requires instrumenting the router to collect (i) incoming messages into the router over all its links, (ii) the changes induced to the FIB, and (iii) outgoing messages to all the neighbors. Some protocols such as BGP allow routers to apply import policies to incoming messages; applying these policies results in either dropping of messages or modifications to the attributes. In such a scenario, it might be beneficial to collect incoming messages before and after application of import policies.
In a similar vein, BGP applies export policies to outgoing messages before they are sent to neighbors, in which case messages can be collected before and after the application of export policies.

Fig. 11.4 Message flow through a router (Figure labels: Incoming Routing Message, Route Selection Process, Best Route, FIB, Outgoing Routing Message.)

Ideally, one would like the router to “copy” every incoming and outgoing message, as well as changes to the FIB, to a management station. In reality, no standardized way of achieving this exists, and as a result no current router implementations support it. Despite this, one could get an approximate version of the required information in several different ways. One such way is to use splitters to read messages directly off a link. Unfortunately, this option is often impractical and expensive, and does not scale beyond a few routers and links. For this reason, it is rarely used in practice.

Another option is to log into the router through its CLI (Command Line Interface) or query SNMP MIBs [13] to extract the required information. Routers (and the routing protocols running on them) often store a copy of the most recently received and transmitted messages in memory and allow them to be queried via CLI or SNMP MIBs. Thus, a network management station can periodically pull the information out of a router. Unfortunately, it is almost impossible to capture every incoming/outgoing message this way since even the most frequent polling supportable by routers falls far short of the highest frequency at which routing messages are exchanged. Even so, this option is used in practice at times since it provides a fairly inexpensive and practical way of getting some information about the routing state of a router.
For example, the Peer Dragnet [14] tool uses information captured via the CLI to analyze inconsistent routes sent by EBGP peers of an AS.

A third option to collect routing messages is to establish a routing session with a router just like any other router. This forces the router to send messages as it would to any other router (see footnote 3). Obviously, this approach does not give information about incoming messages and changes to the FIB. Even for an outgoing message, the management station does not receive the message at the time a router sends it to other neighbors. Despite this, the approach provides valuable information about route dynamics. For distance-vector protocols, the outgoing message is usually the route selected by the router, and for link-state protocols, these messages describe updates to the topology view of the router. As a result, this approach is used quite extensively in practice. For example, RouteViews [15] and RIPE [16] collect BGP updates from several ASes and their routers, as does the OSPF Monitor described in [1], and later in Section 11.2.4. One serious practical issue with this approach is the potential injection of routing messages from the management station, which could disrupt the functioning of the control plane. For protocols that allow import policies (e.g., BGP), one could apply a policy to drop any incoming messages from the management station, but for other protocols (e.g., OSPF, IS-IS) the only way to protect against injection of messages is to rely on the correctness of the software running on the management station.

Footnote 3: A router running a distance-vector protocol sends its selected route for a given prefix to all its neighbors, except the next hop of the route when split horizon [8] is implemented. It is this selected route that we are interested in, and will receive, at the management station.

11.2.2.2 Collecting Network-Wide Data

In Section 11.2.2.1, we discussed ways in which routing messages can be collected from a single router. In this section, we expand our focus to the entire network. The key question we focus on is: how many routers does one need to collect routing messages from? The naive answer is: from all routers of the network. Indeed, if the aim is to learn about each and every message flowing between routers and the exact state of routers at every instant of time, then there is no choice but to collect messages from all routers. In reality, collecting messages from all routers is extremely challenging due to scale issues. Thus, in practice the answer depends on the kind of routing protocol and the analysis requirement. Let’s go into some details.

The kind of routing protocol, whether link-state or distance-vector, plays a major role in deciding how many routers one needs to collect data from. In a link-state protocol, every router learns the entire view of the network topology, and so collecting messages from even a single router is enough to determine the overall state of the network topology. As we will see later in Sections 11.2.3 and 11.2.4, even this seemingly “limited” data enables a rich set of management applications. Some examples are (and we will talk about these in more detail in subsequent sections): (i) the ability to track network topology and its integrity (against design rules) in real time, (ii) the ability to detect events such as routers and links going up or down and link weight changes as they unfold, (iii) the ability to determine how forwarding paths evolve in response to network events, and (iv) the ability to determine the workload imposed by the routing messages. We should emphasize here that for all these applications, the data provides the “view” from the router from which the data is being collected at that point of time. Other routers’ views can be somewhat different due to message propagation and processing delays.
The exact nature of these delays, how they are affected by other events in the network, and their implications for the analysis/application at hand are poorly understood. Our belief is that these delays are small (on the order of milliseconds) in most cases, and thus can be safely ignored for all practical purposes.

The story is different for distance-vector protocols since every router gets only a partial view of the topology: the distance of prefixes from its neighbors. As a result, one often needs views from multiple, if not all, routers. The exact set depends on the network configuration and on the kind of analysis being performed. For example, if one wants to learn external routes coming into an AS, it suffices to monitor BGP routes from the routers at the edge of the network. In fact, numerous studies on BGP dynamics, inter-AS topology and relationships between ASes have been carried out based on BGP data collected from a fairly small set of ASes at RouteViews and RIPE. Although the completeness and representativeness of these studies are debatable, there is no doubt that such studies have tremendously increased awareness about BGP and its workings in the Internet. Furthermore, by combining routing data collected from a subset of routers with other network data, one can often determine the routing state of other routers, at least in steady state once routing has converged after a change. For example, a paper by Feamster and Rexford [17] describes a methodology to determine BGP routes at every router inside an AS based on routes learned at the edge of the network and the configuration of IBGP sessions.

11.2.3 Applications of Route Monitoring

In this section, we demonstrate the utility of the data collected by route monitors. We first describe the basic functionality enabled by the data. We then describe how this basic functionality can be used in various network management tasks.
Finally, we describe how the data has been used in advancing the understanding of the behavior of routing protocols in real life.

11.2.3.1 Information Provided by Route Monitors

Routing State and Dynamics

Route monitors capture routing messages, and so they naturally provide information about the current state of routing and how it evolves over time. This information is useful for a variety of network management tasks such as troubleshooting and forensics, capacity planning, trending, and traffic engineering to name a few. For link-state protocols, the routing messages provide information about the topology (i.e., set of routers, links and link weights), whereas for distance-vector protocols, the information consists of route tables (i.e., set of destinations and the next-hop and distance from the router in question). Both pieces of information are useful. Furthermore, calculating routing tables from topology is straightforward: one just needs to emulate route calculation for every router in the topology. Going in the other direction from routing tables to topology is easy if information from all routers (running the distance-vector protocol) is available. In practice though, information is often collected from a subset of routers, in which case, deriving a complete topology view may not be possible.

End-to-End Paths

Knowing what path traffic takes in the network (from one router to another) is crucial for network management tasks such as fault localization and troubleshooting. For example, a link failure can affect performance of all paths traversing the link. If the only way of detecting such failures is through end-to-end active probing, then knowing paths would allow operators to quickly localize the problem to the common link. Routing messages collected by route monitors allow one to determine these paths and how they evolve in response to routing events. Note that active probes (e.g., traceroute) also allow one to determine end-to-end paths in the network.
However, tracking path changes in response to network events using active probing suffers from major scalability problems. First of all, the number of router pairs in a large network can be in the range of hundreds of thousands to millions. This makes probing every path at a fine time scale prohibitively expensive. A second problem arises due to the use of multiple equal-cost paths (known as ECMP) between router pairs. ECMP arises when more than one path with the smallest weight exists between a pair of routers. Most intradomain protocols such as OSPF use all these paths by spreading data traffic across them (see footnote 4). Since service providers often have redundant links in their networks, router pairs are more likely to have multiple paths than not. ECMP unfortunately exacerbates the scalability problem for active probing. Furthermore, engineering probes so that all ECMPs are covered is next to impossible since how routers would spread traffic across multiple paths is almost impossible to determine a priori.

11.2.3.2 Utility of Route Monitors in Network Management

The data provided by route monitors and the basic information gleaned from them aid several network management tasks such as troubleshooting and forensics, network auditing, and capacity planning. Below we provide a detailed account of how this is done for each of these three tasks.

Network Troubleshooting and Forensics

Route monitors provide a view into routing events as they unfold. This view can be in the form of topology, routing tables, or end-to-end paths as mentioned in the previous sections; which form proves useful often depends on the specific troubleshooting task at hand. For example, if a customer complains about loss of reachability to certain parts of the Internet, looking at BGP routes and their history can provide clues about causes of the problems.
Similarly, if performance issues are seen in some parts of the network, knowing what routing events are happening and how they are affecting paths can provide an explanation for the issues. Note that the route monitors’ utility stems not only from the current view of routing they provide (after all, operators can always determine the current view by logging into routers), but also from the historical data they provide, which allows operators to piece together the sequence of events leading to a problem. Routers do not store historical state, and so cannot provide such information. Going back to the customer complaining about lost reachability, it is rarely enough to determine the current state of the route, especially if no route exists to the prefix. To effectively pinpoint the problem, the operator might also need to know the history of route announcements and withdrawals for the prefix, and that data can only be provided by route monitors. Figure 11.5 shows a snapshot of a tool that allows operators to view the sequence of BGP route updates captured by a monitor deployed in a tier-1 ISP.

⁴ The exact algorithm for spreading traffic across ECMPs is implemented in the forwarding engine of routers.

Network Auditing and Protocol Conformance

Another use of route monitors is for auditing the integrity of the networks and the conformance of routing protocols to their specifications. To audit the integrity of the network, one needs to devise rules against which the actual routing behavior can be checked. For example, network administrators often have conventions and rules about weights assigned to links.
Fig. 11.5 Screen-shot of a tool to view BGP route announcement/withdrawals. The tool shows the BGP route history for 0.0.0.0/0 and its subnets, one update per row, with the columns Count, Time (GMT), Router, Event, Prefix, ASPath, Local Pref, Origin, MED, and Next-hop.

It then becomes necessary to monitor the network for potential deviations (whether intentional or due to mistakes) from these rules. Since (intradomain) routing messages provide current information about link weights, they are a perfect source for checking whether the network’s actual state conforms to the design rules. Checking that the network state matches the design rules is especially crucial during maintenance windows, when a network undergoes significant change. Similar to network auditing, routing messages can also be used to verify that protocol implementations conform to the specifications. At the very least, one could check whether the message format is correct as per the specifications.
Another check is to compare the rate and sequence of messages against the expected behavior. The “Refresh LSA bug” caught by the OSPF Monitor [1], where OSPF LSAs were being refreshed much faster than the recommended value [6], is an instance of this.

Capacity Planning

Capacity planning, where network administrators determine how to expand their network to accommodate growth, is another task where routing data is extremely useful. In particular, the data allows planners to see how routing traffic is growing over time, which can then be used to predict the resources required in the future. The growth of two parameters is especially important: the number of routes in the routing table, and the rate at which routing messages are disseminated. The former has significant bearing on the memory required on the routers, whereas the latter affects the CPU (and sometimes bandwidth) requirements for routers. For service providers, accurately knowing how long the current CPU/memory configuration on routers can last, and when upgrades will be needed, is extremely important for operational and financial planning. The growth patterns revealed by routing data play a key role in forming these estimates. These estimates also allow service providers to devise optimization techniques to reduce resource consumption. For example, consider layer-3 MPLS VPN [18] service, which allows enterprise customers to interconnect their (geographically distributed) sites via secure, dedicated tunnels over a provider network. Over the last few years, this service has witnessed widespread deployment. This has led to tremendous growth in the number of BGP routes a VPN service provider has to keep track of, resulting in heavy memory usage on its routers. Realizing this scalability problem, Kim et al.
[19] have proposed a solution that allows a service provider to trade off direct connectivity between sites (e.g., from any-to-any to a more restricted hub-and-spoke where traffic between two sites now has to go through one or more hub sites) against the number of routes that need to be stored. The data collected by the route monitors was crucial in this work: first, to realize that there is a problem, and next, to evaluate the efficacy of the scheme in realistic settings. In particular, Kim et al. show a 90% reduction in memory usage while limiting path stretch between sites to only a few hundred miles and extra bandwidth usage to less than 10%.

11.2.3.3 Performance Assessment of Routing Protocols

Routing data is key to understanding how routing protocols behave and perform in real life. We have already talked about one aspect of this behavior above, namely conformance to the specifications. Here we would like to talk about other aspects of performance, such as stability and convergence, which are key to quantifying the overall performance of the routing infrastructure. For example, numerous studies detailing BGP's behavior in the Internet have been enabled by the data collected by RouteViews [15], RIPE [16], and other BGP monitors. We briefly describe some studies to illustrate the point.

Route updates collected by BGP monitors have led to several studies analyzing the stability (or lack thereof) of BGP routing in the Internet.⁵ Govindan and Reddy [20] were the first to study the stability of BGP routes back in 1997, a couple of years after commercialization of the Internet started. Their study analyzed BGP route updates collected from a large ISP and a popular Internet exchange point (where several service providers are interconnected to exchange routes and traffic).
The study found clear evidence of deteriorating stability of BGP routes, which it attributed to the rapid growth of the Internet (a doubling of the number of ASes and prefixes in about 2 years). Subsequently, Labovitz et al. [21] observed a higher than expected number of BGP updates in the data collected at five US public Internet exchange points. The truly surprising aspect of their study was the finding that about 99% of these updates did not indicate real topological changes, and had no reason to be there. The authors found that some of these updates were due to bugs in the BGP software of a router vendor at that time. Fixing of these bugs by the vendor led to an order of magnitude reduction in the volume of BGP route updates [22].

⁵ The term stability refers to the stability of BGP routes, which roughly corresponds to how frequently they undergo changes.

Convergence, the time taken by a routing protocol to recalculate new paths after a network change, is another critical performance metric. Labovitz et al. [23] were the first to systematically study this metric for BGP in the Internet. They found that BGP often took tens of seconds to converge, an order of magnitude more than what was thought at that time. The problem, as they showed, stems from the inclusion of ASPath in BGP route announcements (i.e., the very thing that makes BGP a path-vector protocol). The purpose of including the ASPath is to prevent loops and the “count-to-infinity” problem⁶ that BGP’s distance-vector brethren (e.g., RIP) suffer from. However, this leads to “path exploration” as shown by Labovitz et al., where routers might cycle through multiple (often transient) routes with different ASPaths before settling on the final (stable) routes, thereby lengthening convergence times.
Several ways of mitigating this problem have been proposed since then, essentially by including more information in BGP routes [24–28], but none of them have seen deployment to date. Mao et al. [29] tied together the hitherto independently explored stability and convergence aspects of BGP by showing how route flap damping (RFD) [30], used for improving the stability of BGP, could interact with path exploration to adversely impact convergence. RFD is a mechanism that limits propagation of unstable routes, thereby mitigating the adverse impact of persistent flapping of network elements and misconfigurations. It improves the overall stability of BGP, and was a recommended practice [31] in the early 2000s. Unfortunately, as Mao et al. showed, RFD can also suppress relatively stable routes by treating route announcements received during path exploration as evidence of instability of a route. Specifically, the study showed that a route needs to be withdrawn only once and then re-announced for RFD to suppress it for up to an hour in certain circumstances. This work, coupled with a manifold increase in router CPU processing capability, resulted in a recommendation by RIPE [32] to disable RFD.

Routing data is valuable not only for analyzing the performance of each protocol separately, but also for understanding how protocols interact with one another, as Teixeira et al. [33] did by focusing on how OSPF distance changes in a tier-1 ISP affected BGP routing. Their study showed that despite the apparent separation between intradomain and interdomain routing protocols, OSPF distance changes do affect BGP routes due to what is known as “hot-potato routing”.⁷ The extent of the impact depended on several factors including the location and timing of a distance change. Even more surprisingly, BGP route updates resulting from such changes could lag by as much as a minute in some cases, resulting in large delays in convergence.
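To make the interaction between RFD and path exploration discussed above more concrete, the following minimal Python sketch models a damping penalty that grows with each update and decays exponentially. It is an illustration only, not the model used by Mao et al. or by any router implementation: the function names are hypothetical, every update is assumed to add the same penalty, and the penalty, threshold, and half-life values are assumed for the sake of the example.

```python
import math

# All numeric values below are illustrative assumptions, not values from the chapter.
PENALTY_PER_UPDATE = 1000.0   # penalty added for each update counted as a flap
SUPPRESS_THRESHOLD = 2000.0   # route is suppressed once the penalty exceeds this
REUSE_THRESHOLD = 750.0       # route becomes usable again below this penalty
HALF_LIFE_S = 15 * 60.0       # half-life of the exponential decay, in seconds


def decay(penalty, elapsed_s):
    """Exponentially decay a penalty over elapsed_s seconds."""
    return penalty * 0.5 ** (elapsed_s / HALF_LIFE_S)


def simulate(update_times_s):
    """Return (suppressed, penalty, seconds_until_reuse) after the last update."""
    penalty, last, suppressed = 0.0, None, False
    for t in update_times_s:
        if last is not None:
            penalty = decay(penalty, t - last)
            # A suppressed route is released once its penalty has decayed enough.
            if suppressed and penalty < REUSE_THRESHOLD:
                suppressed = False
        penalty += PENALTY_PER_UPDATE
        last = t
        if penalty > SUPPRESS_THRESHOLD:
            suppressed = True
    # Time for the penalty to decay from its current value to the reuse threshold.
    wait = HALF_LIFE_S * math.log2(penalty / REUSE_THRESHOLD) if suppressed else 0.0
    return suppressed, penalty, wait


if __name__ == "__main__":
    print(simulate([0.0]))                    # a single update
    print(simulate([0.0, 10.0, 20.0, 30.0]))  # a burst, as during path exploration
```

Under these assumed values, a single update leaves the route usable, whereas a burst of four updates ten seconds apart pushes the penalty past the suppress threshold and keeps the route suppressed for more than half an hour, even though the underlying route changed only briefly.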
In closing, these and numerous other studies have not only enhanced our knowledge of how routing protocols behave in the Internet, but have also led to improvements in their performance (such as reduction in unwarranted BGP updates or disabling of RFD as mentioned earlier).

⁶ With distance-vector protocols, two or more routers can get locked into a cyclical dependency where each router in the cycle uses the previous router as a next-hop for reaching a destination. The routers then increment their distance to the destination in a step-wise fashion until all of them reach infinity, which is termed “counting to infinity”. For more details, refer to [8].

⁷ As explained in Section 11.2.1.1, hot-potato routing refers to BGP’s propensity to select the shortest way out of its local AS to a prefix when presented with multiple equally good routes (i.e., ways out of the AS). This allows an AS to hand off data packets as quickly as possible to its neighboring AS, much like a hot potato.

11.2.4 Case Study of a Route Monitor: The AT&T OSPF Monitor

Several route monitoring systems are available, both as academic/research endeavors and as commercial products. RouteViews [15] and RIPE [16] collect BGP route updates from several ISPs and backbones around the world. The data is used extensively for both troubleshooting and academic studies of the interdomain routing system. The corresponding web sites also list several tools for analysis of the data. On the intradomain side, a paper by Shaikh and Greenberg [1] describes an OSPF monitor. The paper provides a detailed description of the architecture and design of the system and follows it up with a performance evaluation and deployment experience. On the commercial side, Packet Design’s Route Explorer [34] and Packet Storm’s Route Analyzer [35] are route monitoring products.
The Route Explorer provides monitoring capability for several routing protocols including OSPF, IS-IS, EIGRP and BGP, whereas Route Analyzer provides similar functionality for OSPF.

Of the various route monitoring systems mentioned above, we focus on the OSPF Monitor described by Shaikh and Greenberg [1] as a case study in this section, since the paper provides extensive details about system architecture, design, functionality, and deployment. This is something not readily available for other route monitoring systems, especially the architecture and design aspects, which are key to understanding how control plane monitoring is realized in practice. From here on, we will refer to the OSPF Monitor described in [1] as the AT&T OSPF Monitor, and go into details of the system in terms of data collection and analysis aspects next.

The AT&T OSPF Monitor separates data (specifically, LSA) collection from data analysis. The main reasoning behind this is to keep data collection as passive and simple as possible due to the collector’s proximity to the network. The component used for LSA collection is called an LSA Reflector (LSAR). The data analysis, on the other hand, is divided into two components: LSA aGgregator (LSAG) and OSPFScan. The LSAG deals with LSA streams in real time, whereas OSPFScan provides capabilities for off-line analysis of the LSA archives. This three-component architecture is illustrated in Fig. 11.6. We briefly describe these three components now.

The LSAR supports three modes for capturing LSAs: the host mode, the full adjacency mode, and the partial adjacency mode. With the host mode, the LSAR subscribes to a multicast group to receive LSAs being disseminated. This is a completely passive way of capturing LSAs, but it suffers from reliability issues and slow initialization of the link-state database, and it only works on broadcast media such as an Ethernet LAN.
With the full adjacency mode, the LSAR establishes an OSPF adjacency with a router to receive LSAs. This allows the LSAR to leverage OSPF’s reliable flooding mechanism, thereby overcoming the disadvantages of the host mode. However, the main drawback of this approach is that instability of the LSAR or its link to the router can trigger SPF calculations in the entire network, potentially destabilizing the network. The SPF calculations stem from the fact that with a full adjacency, the router includes a link to the LSAR in its LSA sent to the network. The partial adjacency mode of collecting LSAs provides a way to circumvent this problem while retaining all the benefits of having an adjacency. In this mode, the LSAR establishes an adjacency with a router but only allows it to proceed to a stage where LSAs can be received over it from the router, without the adjacency being included in the LSA sent by the router to the network. To keep the LSAR-router adjacency in the intermediate state, the LSAR describes its own Router-LSA⁸ to the router during the link-state database synchronization process but never actually sends it out to the router. As a result, the database is never synchronized, the adjacency stays in OSPF’s loading state [6], and is never fully established. Keeping the adjacency in the loading state protects the network from the instability of the LSAR or its link to the router.

Fig. 11.6 The architecture of the AT&T OSPF monitor described in [1]. Components shown: LSAG (real-time monitoring), OSPFScan (off-line analysis), LSA archive, LSAR 1 and LSAR 2 (which “reflect” LSAs over a TCP connection), two LSA caches, and the OSPF domain (Area 0, Area 1, Area 2).

Having described data collection by the LSAR, let us now turn our attention to the LSAG, which processes LSAs in real time. The LSAG populates a model of the OSPF network topology as it processes the LSAs.
The model captures elements such as OSPF areas, routers, subnets, interfaces, links, and the relationships between them (e.g., an area object consists of a set of routers that belong to the area, a router object in turn consists of a set of interfaces belonging to the router, etc.). Using the model as a base, the LSAG identifies changes (such as router up/down, link up/down, link cost changes, etc.) to the network topology and generates messages about them. Even though there are only about five basic network events, about 30 different types of messages are generated by the LSAG because of how broadcast media (such as Ethernet) are supported in OSPF, how a change in one area propagates to other areas, and how external information is redistributed into OSPF. In addition to identifying changes to the network topology, the LSAG also identifies elements that are unstable, and generates messages about such flapping elements. The LSAG also generates messages for non-conforming behavior, such as when refresh LSAs are observed too often. Apart from using the topology model to identify changes, the LSAG also uses it to produce snapshots of the topology periodically and when network changes occur. One use of these snapshots is for performing an audit of link weights as described in Section 11.2.3.2.

⁸ A Router-LSA in OSPF is originated by every router to describe its outgoing links to adjacent routers along with their associated weights.

Finally, we turn our attention to OSPFScan, which supports off-line analysis of LSA archives. One thing worth mentioning about the AT&T OSPF Monitor is that the capabilities supported by OSPFScan for off-line analysis are mostly a superset of the ones supported in real time by the LSAG, the underlying idea being that anything that can be done in real time can be performed off-line as a playback.
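The topology change detection that the LSAG performs in real time, and that can equally be replayed off-line, can be pictured as a comparison of two successive topology snapshots. The Python sketch below is a simplified illustration under stated assumptions, not the LSAG's actual implementation: the Snapshot class, the event names, and the diff_topology function are hypothetical, and areas, subnets, and interfaces are left out of the model.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Simplified topology snapshot (hypothetical structure for illustration)."""
    routers: set = field(default_factory=set)   # OSPF router-ids
    links: dict = field(default_factory=dict)   # (from_router, to_router) -> cost


def diff_topology(old, new):
    """Return the basic topology events that turn snapshot old into snapshot new."""
    events = []
    for router in sorted(old.routers - new.routers):
        events.append(("ROUTER_DOWN", router))
    for router in sorted(new.routers - old.routers):
        events.append(("ROUTER_UP", router))
    for link in sorted(old.links.keys() - new.links.keys()):
        events.append(("LINK_DOWN", link))
    for link in sorted(new.links.keys() - old.links.keys()):
        events.append(("LINK_UP", link))
    # Links present in both snapshots may still have changed cost.
    for link in sorted(old.links.keys() & new.links.keys()):
        if old.links[link] != new.links[link]:
            events.append(("COST_CHANGE", link, old.links[link], new.links[link]))
    return events


if __name__ == "__main__":
    before = Snapshot(
        routers={"10.0.0.1", "10.0.0.2", "10.0.0.3"},
        links={("10.0.0.1", "10.0.0.2"): 10, ("10.0.0.2", "10.0.0.3"): 5},
    )
    after = Snapshot(
        routers={"10.0.0.1", "10.0.0.2"},
        links={("10.0.0.1", "10.0.0.2"): 20},
    )
    for event in diff_topology(before, after):
        print(event)
```

The sketch covers only the five basic events named in the text; it does not attempt the additional message types that arise from broadcast media, inter-area propagation, or external route redistribution.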
In terms of processing of LSAs, OSPFScan follows a three-step process: parse the LSA, test the LSA against a user-specified query expression, and analyze the LSA according to user interest if it satisfies the query. The parsing step converts each LSA record into what is termed a canonical form, to which the query expression and subsequent analysis are applied. The use of a canonical form makes it easy to adapt OSPFScan to support LSA archive formats other than the native one used by the LSAR. The query language resembles C-style syntax; an example query expression is “areaid == ‘0.0.0.0’”. When a query is specified, OSPFScan matches every LSA record against the query, carrying out subsequent analysis for the matching records, while filtering out the non-matching ones. For example, the expression above would result in the analysis of only those LSAs that were collected from area 0.0.0.0. In terms of analysis, OSPFScan provides the following capabilities:

1. Modeling Topology Changes. Recall that OSPF represents the network topology as a graph. Therefore, OSPFScan allows modeling of OSPF dynamics as a sequence of changes to the underlying graph, where a change represents the addition/deletion of vertices/edges to this graph. Furthermore, OSPFScan allows a user to analyze these changes by saving each change as a single topology change record. Each such record contains information about the topological element (vertex/edge) that changed along with the nature of the change. For example, a router is treated as a vertex, and the record contains the OSPF router-id to identify it. We should point out that the topology change records and LSAG message logs essentially describe the same thing, but the former is geared more for computer processing, whereas the latter is aimed at humans.

2. Emulation of OSPF Routing. OSPFScan allows a user to reconstruct a routing table of a given set of routers at any point of time based on the LSA archives.
For a sequence of topology changes, OSPFScan also allows the user to determine changes to these routing tables. Together, these allow calculation of end-to-end paths through the OSPF domain at a given time, and show how a path changed in response to network events over a period of time. The routing tables also facilitate analysis of OSPF’s impact on BGP through hot-potato routing [33].

3. Classification of LSA Traffic. OSPFScan allows various ways of “slicing-and-dicing” of LSA archives. For example, it allows isolating LSAs indicating changes from the background refresh traffic. As another example, it also allows classification of LSAs (both change and refresh) into new and duplicate instances. This capability was used in a case study that analyzed one month of LSA traffic for an enterprise network [36].

11.2.5 MPLS

Recall that MPLS has been deployed widely in service provider networks over the last few years. It has played a key role in evolving the best-effort service model of IP networks by enabling traffic engineering (TE), fast reroute (FRR), and class of service (CoS) differentiation. In addition, MPLS has also allowed providers to offer value-added services such as VPN and VPLS. Unlike traditional unicast forwarding in IP networks, where routers match the destination IP address to the longest matching prefix, MPLS uses a label switching paradigm. Each (IP) packet is encapsulated in an MPLS header, which contains, among other things, the label used by a router to determine the outgoing interface. The value of the label changes at every hop. Thus, while determining the outgoing interface, the router also determines the label with which it replaces the incoming label of the packet. This means that a router running MPLS has to maintain an LFIB (Label Forwarding Information Base), which contains a mapping from each incoming label to an (outgoing interface, outgoing label) pair.
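The label-swapping lookup just described can be illustrated with a small Python sketch. It is a toy model under stated assumptions rather than a router implementation: the router names, interface names, label values, the LFIBS and LINKS tables, and the trace_labels function are all hypothetical and chosen only to show how a packet's label is rewritten hop by hop.

```python
from typing import NamedTuple


class LfibEntry(NamedTuple):
    """One LFIB row: where to send the packet and which label to write."""
    out_interface: str
    out_label: int


# Hypothetical per-router LFIBs: incoming label -> (outgoing interface, outgoing label).
LFIBS = {
    "R1": {17: LfibEntry("if1", 42)},
    "R2": {42: LfibEntry("if3", 99)},
}

# Hypothetical wiring: (router, outgoing interface) -> neighboring router.
LINKS = {("R1", "if1"): "R2", ("R2", "if3"): "R3"}


def trace_labels(lfibs, links, router, label, max_hops=32):
    """Follow a labeled packet, returning (router, in_label, out_if, out_label) hops."""
    hops = []
    for _ in range(max_hops):  # hop limit guards against looping LFIB state
        entry = lfibs.get(router, {}).get(label)
        if entry is None:
            break  # no LFIB entry for this label at this router
        hops.append((router, label, entry.out_interface, entry.out_label))
        next_router = links.get((router, entry.out_interface))
        if next_router is None:
            break  # interface leads outside the modeled topology
        router, label = next_router, entry.out_label
    return hops


if __name__ == "__main__":
    for hop in trace_labels(LFIBS, LINKS, "R1", 17):
        print(hop)
```

Each lookup uses only the incoming label, never the destination IP address, which is the essential difference from longest-prefix-match forwarding.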
The sequence of routers an MPLS packet follows is known as an LSP (Label Switched Path). The first router along the LSP encapsulates a packet in an MPLS header, while the last router removes the MPLS header and forwards the resulting packet based on the underlying header.

The LFIB used for MPLS switching is populated by the MPLS control plane. This is done by creating and distributing mappings between a label and an FEC, or Forwarding Equivalence Class. An FEC is defined as a set of packets that need to receive the same forwarding treatment inside an MPLS network. A router running MPLS first generates a unique label for each FEC it supports, and uses one of the control plane protocols to distribute the label-FEC mappings to other routers. The dissemination of this information allows each router to determine incoming and outgoing labels and the outgoing interface for each FEC, and thereby populate its LFIB.

MPLS currently uses three routing protocols for distributing label-FEC mappings: LDP (Label Distribution Protocol) [37], RSVP-TE (Resource reSerVation Protocol with Traffic Engineering extensions) [38], and BGP [39, 40]. With LDP, a router exchanges label-FEC mappings with each of its neighbors using a persistent session. FECs, in the case of LDP, are generally IP prefixes. The labels learned from the neighbors allow the router to determine the mapping between incoming and outgoing labels. To determine the outgoing interface, LDP relies on the IGP (such as OSPF, IS-IS, etc.) running in the underlying IP network. Thus, LSPs created by LDP follow the paths calculated by the IGP from the source router to the destination prefix. RSVP-TE, on the other hand, is used for “explicitly” created and routed LSPs between two end points; the path need not follow the IGP path. The first router of the LSP initiates path setup by sending an RSVP message. The message propagates along the (to be established) LSP to the last router. Every intermediate router processes the message, creating an entry in its LFIB for the LSP.
RSVP also allows reservation of bandwidth along the LSP, making it ideal for TE and CoS routing. Finally, BGP is used for distributing prefix-to-label mappings (mostly) in the context of VPN services. With VPNs, different customers of a VPN service provider can use overlapping IP address blocks, and the BGP-distributed label-to-prefix mapping allows a provider’s egress edge router to determine which customer a given packet belongs to.

The flow of control messages through individual routers running LDP and RSVP-TE can be modeled in the same manner as traditional unicast routing protocols, as shown in Fig. 11.4. Thus, to monitor these protocols, one needs to collect incoming messages, outgoing messages, and changes occurring to the LFIB at every router. As a result, the various techniques described in Section 11.2.2 for data collection apply to these protocols as well. One caveat applies to RSVP, though, since it does not have a notion of a protocol session. Given this, it is not possible to collect information about RSVP messages through a session with an RSVP router. Collecting information about RSVP dynamics thus requires some mechanism for routers to send messages to a monitoring session when tunnels are set up and torn down; SNMP traps defined in RFC 3812 [41] provide such a capability.

Once routing data is collected from LDP or RSVP routers, it can be used in a similar fashion as described in Section 11.2.3. For example, knowing the label binding messages sent by LDP routers allows an operator to know whether LSPs are established correctly. As another example, knowing the size of an LFIB (i.e., the number of LSPs traversing a router) and how it is evolving can be a key parameter in capacity planning.

11.3 Multicast

Throughout its relatively brief but rapidly evolving history, the Internet has primarily provided unicast service. A datagram is sent from a single sender to a single receiver, where each endpoint is identified by an IP address.
Many applications, however, involve communication between more than two entities, and often the same data needs to be delivered to multiple recipients. As examples, software updates may be distributed from a single server to multiple recipients, and streaming content, such as live video, may be transmitted to many receivers simultaneously. When the network layer only supports one-to-one communication, it is the responsibility of the end systems to replicate data and transmit multiple copies of the same packet. This solution is inefficient both with respect to processing overhead at the sender and bandwidth utilization within the network.

Multicast [42], on the other hand, presents an efficient mechanism for network delivery of the same content to multiple destinations. In IP multicast, the sender transmits a single copy of a packet into the network. The network layer replicates the packet at appropriate routers in the network such that copies are delivered to all interested receivers and at most one copy of the packet traverses any network link. Multicast is built around the notion of a multicast group, which is a 32-bit identifier taken from the Class D portion of the IP address space (224.0.0.0–239.255.255.255). In multicast packets, the group address is contained in the destination IP address field in the header. Receivers make known their interest in receiving packets sent to the group address via a group membership protocol such as IGMP [43], and multicast routing protocols enable multicast packets to be delivered to the interested receivers.

Multicast was first proposed in the 1980s and was deployed on an experimental basis in the early 1990s. This early deployment, known as the MBone [44] (for Multicast Backbone), consisted of areas of the Internet in which multicast was deployed.
These areas were connected together using IP-in-IP tunnels enabling multicast packets to traverse unicast-only portions of the Internet. The predominant applications used in the MBone, videoconferencing and video broadcast, primarily supported small group collaboration and broadcast of technical meetings and conferences. After rapid initial growth, the MBone peaked and then began to flounder. The technology, while initially promising, did not find its way into service provider networks. Several reasons have been given for this. These include the lack of a clear business model (i.e., who would be charged for packets that are replicated and delivered to many receivers), security concerns (i.e., the original any-to-any IP multicast service model allowed any host in the network to transmit packets to a multicast group), and concerns about manageability (i.e., lack of tools to monitor, troubleshoot and debug this new technology). More recently, deployment of network layer multicast service within IP networks has been increasing. This deployment has occurred primarily in enterprise networks, in which some of the earlier concerns with multicast (e.g., security, business model) are more easily mitigated. Common multicast applications in enterprise networks include software distribution and dissemination of financial trading information. The deployment of multicast within enterprise networks has also driven deployment in service provider networks in order to support the needs of Virtual Private Network (VPN) customers who use multicast in their networks. The Multicast VPN solution defined for the Internet [45, 46] requires customer multicast traffic to be encapsulated in a second instance of IP multicast for transport across the service provider backbone. Finally, the widespread deployment of IPTV, an application that benefits greatly from multicast service, is creating further growth of IP multicast. 
Forwarding multicast packets within a network makes use of a separate FIB from the unicast FIB and depends on a new set of routing protocols to create and maintain these FIB entries. As such, the set of tools used to monitor unicast routing cannot be used. In this section, we review the basics of multicast routing, identify issues that make monitoring and managing multicast more difficult than monitoring unicast routing, and finally describe tools and strategies for monitoring this technology.

11.3.1 Multicast Routing Protocols

A multicast FIB entry is indexed by a multicast group and a source specification, where the latter consists of an address and mask. Packets that match the group address and source specification will be routed according to the FIB entry. The FIB entry itself contains an incoming interface over which packets matching the source and group are expected to arrive, and a set of zero or more outgoing interfaces over which copies of the packets should be transmitted. The union of FIB entries pertaining to the same group and source(s) across all routers forms a tree, denoting the set of links over which a packet is forwarded to reach the set of interested receivers. It is the job of multicast routing protocols to establish the appropriate FIB entries in the routers and thereby form this multicast tree.

Over the last two decades, several multicast routing protocols have been proposed and in some cases implemented and deployed. These include DVMRP [47], MOSPF [48], CBT [49], MSDP [50], and PIM [51, 52]. In this section, we give an overview of PIM and MSDP as they are the most widely deployed multicast routing protocols.

11.3.1.1 PIM

Protocol Independent Multicast, or PIM, is the dominant multicast routing protocol deployed in IP networks. PIM does not exchange reachability information in the sense that unicast routing protocols, such as OSPF and BGP, do.
Rather, it leverages information in the unicast FIB in order to construct multicast trees, and it is agnostic as to the source of the unicast routing information. There are multiple variants of PIM, including PIM Sparse Mode (PIM-SM), PIM Dense Mode (PIM-DM), Source Specific PIM (PIM-SSM), and Bidirectional PIM (PIM-Bidir). In this section, we present a brief overview of the basic operation of PIM-SM and PIM-SSM, as they are the most commonly deployed variants of PIM, in order to motivate the challenges in multicast monitoring and their solutions. Before turning to PIM, we discuss one key aspect of multicast trees and the protocols that construct them. Multicast trees can be classified as shared trees or source trees. A shared tree is one that is used to forward packets from multiple sources. In this case, the multicast routing entry is denoted by a group and a set of sources (e.g., using an address and a mask). For a shared tree, the set of sources usually includes all sources, and the routing table entry is denoted by the (*, G) pair, where G denotes the multicast group address and ‘*’ denotes a wildcard (indicating all sources). A source tree, on the other hand, is used to forward packets from a single source, and is denoted as (S, G), where G again refers to the multicast group and S refers to a single source. PIM-SM uses both shared and source trees, depending on both the variant and how it is configured. In both cases, multicast trees are constructed by sending Join messages from the leaves of the tree (the routers that are directly connected to hosts that want to receive packets transmitted to the multicast group) toward the root of the tree. In the case of a source tree, the root is a source that transmits data to the multicast group and the Join message is referred to as an (S, G) Join. For a shared tree, the root is a special node referred to as a Rendezvous Point, or RP, and the Join message is referred to as a (*, G) Join.
The RP for a group, which can be configured statically at each router or determined by a dynamic protocol such as BSR [53], must be agreed upon by all routers in a PIM domain (see footnote 9). PIM Join messages are transmitted hop-by-hop toward the root of the tree. At each router, the next hop is determined using the unicast FIB. Specifically, the Join message is transmitted to the next hop on the best route (as determined by the unicast routing table) toward the root (i.e., source or RP). As such, the Join message follows the shortest path from the receiver to the root of the tree. At each hop, the router keeps track of the neighbor router from which the Join message was received and the neighbor router to which it was forwarded. The latter is denoted as the upstream neighbor in the multicast FIB and the former is denoted as a downstream neighbor. When subsequent multicast data packets are received from the upstream neighbor, they will be forwarded to the downstream neighbor. When a router receives a subsequent (*, G) or (S, G) Join message for a FIB entry that already exists, the router from which the Join message is received is added to the list of downstream neighbors. However, the Join message need not be forwarded upstream as a Join message will have already been forwarded toward the root of the tree. In this way, Join messages from multiple downstream neighbors are merged, and when data packets are received, they will be replicated with a copy forwarded to each downstream neighbor. PIM uses soft state, so that Join messages are retransmitted hop-by-hop periodically, and state that is not refreshed is deleted when an appropriate timer expires. In PIM-SM, all communication begins on a shared tree. Last hop routers transmit Join messages toward the RP, forming a shared tree with the RP at the root and last hop routers as leaves. This process is depicted in Steps 1–3 in Fig. 11.7a, in which router R2 transmits a Join message toward the RP. This message is then forwarded by R1 to the RP. R3 subsequently transmits a Join message toward the RP, which is received by R1 and not forwarded further. When a source wants to transmit packets to the group, it encapsulates these packets in PIM Register messages transmitted using unicast to the RP. The RP decapsulates these packets and transmits them on the shared tree, so that they are delivered to all routers that joined the tree. The RP then sends an (S, G) Join message toward the source, building a source tree from the source to the RP. Steps 4–5 in Fig. 11.7a depict a Register message from a source S to the RP followed by a subsequent Join from the RP to S. Once this source tree is established, packets are sent using native multicast from the source to the RP and from the RP to the leaf routers, as shown in Fig. 11.7b. When multiple sources have data to send to the multicast group, each will send PIM Register messages to the RP, which in turn will send PIM Join messages to the sources, thereby creating multiple (S, G) trees. While all communication in PIM-SM begins on shared trees, the protocol allows for the use of source trees. Specifically, when a last hop router receives packets from a source, it has the option to switch to a source tree for that source. It does this by sending an (S, G) Join toward the source, joining the source tree (just as the RP did in the description above). Once it has received packets on the source tree, it then sends a Prune message for the source on the shared tree, indicating that it no longer wants to receive packets from that source on the shared tree. The Join messages needed to switch from the shared to source tree are shown in Fig. 11.7c, and the resulting flow of data packets is shown in Fig. 11.7d. Source trees allow for more efficient paths from the source to receiver(s) at the expense of higher protocol and state overhead.

Fig. 11.7 Example PIM Operation: (a) Sequence of control messages for shared tree creation. (b) Resulting flow of data packets. (c) Sequence of control messages for switchover to source tree. (d) Resulting flow of data packets

Footnote 9: A PIM domain is defined as a contiguous set of routers all configured to operate within a common boundary. All routers in the domain must map a group address to the same RP.

PIM-SSM (Source Specific Multicast) does away with the need for RPs, thereby simplifying multicast tree construction and maintenance while using a subset of the PIM-SM protocol mechanisms. PIM-SSM only uses source trees. The source of traffic is known to hosts interested in joining the multicast group (e.g., via an out-of-band mechanism). These receivers signal their interest in the group via IGMP, and their directly connected routers send (S, G) Join messages directly toward the source, thereby building a source tree rooted at the sender.

11.3.1.2 MSDP

In PIM-SM, there is a single RP that acts as the root of a shared tree for a given multicast group. (Note that a single router may act as an RP for many groups.) This provides a mechanism for rendezvous and subsequent communication between sources and receivers without either having any pre-existing knowledge of the other. However, there are two situations in which multiple RPs for a group may be desirable. The first involves multicast communication between domains. Specifically, two or more service providers may wish to enable multicast communication between them.
If there is only a single RP for a group, failure of the RP in one provider’s network may impact service in the other’s network, even if all of the sources and receivers are located in the latter’s network. Service providers may not be willing to depend on a critical resource (e.g., the RP) located in another service provider’s network for what may be purely intradomain communication. Further, even without RP outages, performance may be suboptimal if purely intradomain communication is required to follow interdomain paths. That is, a multicast tree between senders and receivers in one ISP’s network may traverse another ISP’s network. Thus, each provider may wish to have an RP located within its own domain. The second situation in which multiple RPs may be useful involves communication within a single PIM domain. Specifically, redundant RPs provide a measure of robustness, and this can be implemented using IP anycast [54]. Each RP is configured with the same IP address, and the RP mapping mechanism identifies this anycast address as the RP address. Each router wishing to join a shared tree sends a (*, G) Join message toward the RP address. By virtue of anycast routing, which uses unicast routing to route the message to the “closest” RP, the router will join a shared tree rooted at a nearby RP. As a result, multiple disjoint shared trees will be formed within the domain. Similarly, when a source transmits a PIM Register message to an anycast RP address, this message will only reach the nearest RP. As such, sources and receivers will only communicate with those subsets of routers closest to the same RP, and the required multicast connectivity will not be achieved. The problem of enabling multicast communication when multiple RPs exist for the same group (whether within or between domains) is solved by the Multicast Source Discovery Protocol (MSDP) [50].
MSDP enables multicast communication between different PIM-SM domains (e.g., operated by different service providers) as well as within a PIM-SM domain using multiple anycast RPs. MSDP-speaking RPs form peering relationships with each other to inform each other of active sources. Upon learning about an active source for a group for which there are interested receivers, an RP joins the source tree of that source so that it can receive packets from the source and transmit them within its own domain or on its own shared tree. We give a brief overview of MSDP. Each RP forms an MSDP peering relationship with one or more other RPs using a TCP connection. These MSDP connections form a virtual topology among the various RPs. RPs share information about sources as follows. For each source from which it receives a PIM Register message, an RP transmits an MSDP Source-Active (SA) message to its MSDP peers. This SA message, which identifies a source and the group to which it is sending, is flooded across the MSDP virtual topology so that it is received by all other MSDP-speaking RP routers. Upon receipt of an SA message, an RP (in addition to flooding the message to its other MSDP peers) determines whether there are interested receivers in its domain. Specifically, if the RP has previously received a Join message for the shared tree indicated by the group in the SA message, the RP will transmit a PIM Join message to the source. In this way, the RP joins the source tree rooted at the source in question, receives multicast packets from it, and multicasts these packets on the shared tree rooted at the RP. Thus, multicast communication is enabled when multiple RPs exist for the same group, whether within or across domains.
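
The SA handling just described can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not an implementation of MSDP: the class `RendezvousPoint`, its methods, and the helper `peer_up` are hypothetical names introduced here; the TCP peerings are modeled as in-memory object references; duplicate SA messages are suppressed with a simple "already seen" set, whereas real MSDP relies on peer-RPF forwarding rules, which are not modeled; and the originating RP is assumed to apply the same "interested receivers" test as every other RP.

```python
class RendezvousPoint:
    """Toy model of an MSDP-speaking RP (hypothetical class, for illustration only)."""

    def __init__(self, name, groups_with_receivers=()):
        self.name = name
        self.peers = []                          # MSDP peers (TCP peerings in reality)
        self.groups_with_receivers = set(groups_with_receivers)  # groups with (*, G) state
        self.seen = set()                        # (source, group) pairs already processed
        self.joined = set()                      # (S, G) source trees this RP has joined

    def on_register(self, source, group):
        """A PIM Register arrived from a source: originate an SA for it."""
        self.on_source_active(source, group, from_peer=None)

    def on_source_active(self, source, group, from_peer):
        """Process an SA message, join the source tree if needed, and flood it onward."""
        if (source, group) in self.seen:
            return                               # simplification: real MSDP uses peer-RPF checks
        self.seen.add((source, group))
        if group in self.groups_with_receivers:
            self.joined.add((source, group))     # stands in for sending an (S, G) Join
        for peer in self.peers:
            if peer is not from_peer:            # do not send the SA back where it came from
                peer.on_source_active(source, group, from_peer=self)


def peer_up(a, b):
    """Create a bidirectional MSDP peering between two RPs."""
    a.peers.append(b)
    b.peers.append(a)


# Three RPs peered in a chain; only rp-c has receivers for the group.
rp_a = RendezvousPoint("rp-a")
rp_b = RendezvousPoint("rp-b")
rp_c = RendezvousPoint("rp-c", groups_with_receivers={"239.1.1.1"})
peer_up(rp_a, rp_b)
peer_up(rp_b, rp_c)
rp_a.on_register("10.0.0.1", "239.1.1.1")
print([rp.name for rp in (rp_a, rp_b, rp_c) if rp.joined])
```

In this example the SA originated at rp-a reaches every RP, but only the RP that holds shared-tree state for the group records an (S, G) join, which mirrors the behavior described in the text.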
11.3.2 Challenges in Monitoring Multicast

In the early days of multicast, one of the often cited reasons for its slow deployment was the difficulty of monitoring and managing the service; commercial routers implemented the protocols, but network operators had little way of knowing how the service was working when they deployed it. While this was by no means the only impediment to its deployment, it did present a significant challenge to network operators. To some degree, the problems cited early on with multicast management remain true today. Before turning to specific tools and techniques used to monitor and manage multicast in order to provide a stable and reliable network service, we identify some of the generic challenges for managing the technology, while deferring some of the protocol-specific issues to Section 11.3.3. While multicast is by no means a new technology, it is not yet mature. Because it has only been deployed in a significant way in the last few years, there does not yet exist the experience and knowledge surrounding it as exists with unicast service. This manifests itself in two related ways. First, engineers and operators in many cases are unfamiliar with the technology and face a steep learning curve in troubleshooting and monitoring multicast. Second, due to a rather limited deployment experience, the kinds of tools that have evolved in the unicast world and that have been essential in route monitoring do not yet exist for multicast. Putting aside the relative newness of the technology, there are aspects of multicast that make it inherently more challenging to manage than unicast. Most obviously, the nature of what constitutes a route followed by a packet has changed. In unicast routing, the path taken by a packet from source to destination consists of a sequence of routers (usually no more than 20 or 30).
This path is easily identifiable (e.g., using tools such as traceroute) and can be presented to a network operator in a way that is easy to understand. In multicast routing, a packet no longer traverses an ordered sequence of routers, but rather follows a tree of routers from a source to multiple destinations. The tree can be very large, consisting of hundreds of routers. Identifying the tree becomes more challenging, and perhaps more significantly, presenting it to a network operator in a useful manner is difficult. In addition to being large, multicast trees are not static. That is, they are driven by application behavior, and the set of senders and receivers may change during the lifetime of an application. As such, branches may be added to and pruned from multicast trees over time, and these changes can happen on short timescales. Thus, understanding the state of multicast is made more difficult by the dynamic nature of the multicast trees. Finally, the multicast routing state used to forward a packet from a source to a set of receivers can be data driven. That is, the state may not be instantiated until an application starts sending traffic or expresses interest in receiving it. In contrast, with unicast routing, the FIB entries used to route a packet from a source to a destination are independent of the existence of application traffic. Thus, routing table entries can be queried (either directly with SNMP or indirectly with a utility like traceroute) in order to discover or verify a route. With multicast the analogous routing state may not exist until applications are started. Using PIM-SM as an example, the shared tree from the RP to receivers is formed as a result of receivers joining a multicast group. Similarly, the state needed to route a packet from a source to the RP is not created until the source sends a PIM Register message to the RP and the RP subsequently sends an (S, G) Join to the source.
Given this, answering a question such as “how would packets be routed from the source to receivers?” (as one might want to do in advance of a streaming broadcast) is problematic. Given the inherent difficulties in monitoring and managing multicast routing, there exists a need for new tools, methods and capabilities to assist in this process. We now turn to the challenges of monitoring specific protocols and the ways in which these challenges can be met.

11.3.3 Multicast Route Monitoring

Multicast routing involves complex protocols. In order to understand, troubleshoot and debug the state of multicast in a network, operators need to be able to answer several key questions. These include:
• What is the FIB entry for a particular source and group at a router?
• What is the multicast tree for an (S, G) or (*, G) pair?
• What route will a packet take from a source to one or more receivers? (As will be explained below, this question differs subtly from the preceding one.)
• Are multicast trees stable or dynamic?
• Are packets transmitted by source S to group G being received where they should be?
• Is multicast routing properly configured in the network?
Answering these and other questions about multicast requires a new set of management tools and capabilities. In this section, we describe how monitoring tools can be used to answer these questions. Before doing so, we briefly review the network management capabilities developed during earlier experiences with multicast.

11.3.3.1 Early MBone Tools

The MBone grew from a few dozen subnets in 1992 to over 3,000 four years later [55]. At its inception, it connected a small community of collaborating researchers, but it expanded to include a much broader set of users and applications. It was initially maintained by a few people who knew administrators at all the participating sites. Therefore, monitoring and debugging of the infrastructure developed in an ad hoc manner.
As the MBone grew, it faced an increasing set of management challenges. To meet these challenges, the researchers who managed and used it developed a broad set of tools. While we avoid an exhaustive review of these tools, we give a few representative examples here, which encompass both application and network layer tools.
• mrinfo discovered the multicast topology by querying multicast routers for their neighbors.
• mtrace was used to discover the path packets traversed to reach a receiver from a source.
• rtpmon was an application-level monitoring tool that provided end-to-end performance measurements for a multicast group.
• The DVMRP Route Monitor [56] monitored routing exchanges between multicast routers in the MBone.
The tools mentioned here, and the many others that were developed (see [57, 58] for a more complete list), provided great value to the early MBone users. They addressed real problems and allowed operators and users to understand, monitor, and troubleshoot the experimental network. While in many cases they provided insight and lessons that inform current efforts, they are unable to form the basis for a current multicast management strategy. Many of the tools use RTCP and monitor application performance. Others were built specifically to monitor mrouted, the public domain multicast routing daemon used in the early MBone. Neither kind of tool supports the needs of large ISPs to monitor their multicast infrastructure. Instead, today’s multicast management and monitoring strategy must be built around tools that work in the context of the multi-vendor commercial routers managed by the ISPs.

11.3.3.2 Information Sources

While the earlier experience with the MBone provided some valuable insight as to the challenges with managing multicast, it also showed the need for tools that worked with commercial routers and that could be deployed by service providers at scale.
Such tools must work in the confines of the capabilities available on the routers that support multicast. We discuss the options for gathering information about multicast in this section, in order to motivate the kinds of solutions described later. As described in Section 11.2.3, route monitors provide enormous capability with respect to monitoring unicast routing. BGP monitors peer with BGP speaking routers to collect routing updates and thereby monitor network reachability and stability, possibly from multiple vantage points. Similarly, OSPF monitors collect flooded LSAs to learn the topology of a network and emulate its route computation. Unfortunately, analogous route monitoring is more difficult with multicast. PIM is not a conventional routing protocol per se. That is, PIM routers do not exchange reachability information, nor do they flood information about their local topology or routing state. Instead, PIM makes use of the routes computed by another routing protocol, such as OSPF. Specifically, PIM uses the routes in a unicast FIB to forward PIM Join messages toward the root of a multicast tree. These Join messages cause the router to instantiate the multicast FIB entries needed to forward multicast packets. We do note that, in contrast, MSDP is amenable to monitoring akin to what is feasible with unicast. It is built upon information exchanges over peering connections (themselves using TCP). These advertisements are flooded to all MSDP speakers; therefore, an MSDP monitor could collect (possibly from multiple other routers) and analyze these exchanges. Since a route monitor cannot collect information about a PIM domain, other sources of information are needed upon which to build appropriate multicast monitoring and management capabilities. We review the two most readily available sources of information here: SNMP and CLI. SNMP provides a mechanism upon which to build multicast management applications.
It is an Internet standard presenting a common interface upon which to access information from routers. Service providers use it in other network management functions. Therefore, libraries, pollers and related expertise are abundant. Several multicast-related MIBs have been defined providing extensive information about multicast routing (e.g., [59, 60]). These MIBs provide information about interfaces on which multicast is enabled, multicast routing adjacencies, and multicast FIB entries. SNMP is not without its shortcomings. We identify three in particular. First, except for a relatively small number of traps defined in multicast-related MIBs, all SNMP-related information must be polled. Hence, changes in multicast routing entries, as occur when a tree changes, can only be discovered through polling. Learning about such changes in a timely and scalable manner may be challenging. Second, while SNMP is defined as an Internet standard, vendors can define and implement their own proprietary MIBs. By availing themselves of this option, vendors make the development of vendor-independent management tools more difficult. Finally, a single vendor may support different MIBs on different devices, as can be the case when a vendor undergoes a major revision of its operating system. Scripts that directly access the command line interfaces of routers present an alternative way of collecting multicast related information from routers. However, the command line interface does not return information in a structured machine readable format (as SNMP does) and therefore requires parsing of the output to obtain specific items. Further, because the command line interfaces are not standardized, building portable vendor-independent tools (and even tools that work with different platforms of a single vendor) can be difficult.
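
Because changes in multicast routing entries can only be discovered through polling, a monitoring tool typically has to compare successive polls of the same router. The following is a minimal sketch of that comparison, assuming each poll has already been collected (by SNMP or any other means) and reduced to a dictionary; the function name `diff_snapshots` and the snapshot layout are assumptions made for illustration, and the interface names and addresses are sample values only.

```python
def diff_snapshots(old, new):
    """Compare two polls of one router's multicast FIB.

    Each snapshot maps (source, group) to a pair
    (incoming_interface, frozenset_of_outgoing_interfaces); a source of "*"
    denotes a (*, G) entry. This layout is an assumption of the sketch.
    """
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = {key: old[key] for key in old.keys() - new.keys()}
    changed = {key: (old[key], new[key])
               for key in old.keys() & new.keys()
               if old[key] != new[key]}
    return added, removed, changed


# Two hypothetical polls of the same router, a polling interval apart.
poll_1 = {
    ("*", "239.1.23.5"): ("POS15/0", frozenset({"Loopback0"})),
    ("10.20.1.1", "239.1.23.5"): ("Loopback0", frozenset({"POS0/0"})),
}
poll_2 = {
    # A new downstream branch appeared on the (S, G) entry ...
    ("10.20.1.1", "239.1.23.5"): ("Loopback0", frozenset({"POS0/0", "POS15/0"})),
    # ... the (*, G) entry timed out, and a new group became active.
    ("*", "239.16.88.39"): ("POS0/0", frozenset({"POS15/0"})),
}

added, removed, changed = diff_snapshots(poll_1, poll_2)
print("added:", sorted(added))
print("removed:", sorted(removed))
print("changed:", sorted(changed))
```

Any change that appears and disappears between two polls is invisible to this approach, which is one concrete form of the timeliness problem noted above; shortening the polling interval reduces that blind spot at the cost of more load on the routers and the poller.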
11.3.3.3 Multicast Monitoring

SNMP is generally a more useful and flexible platform upon which to base multicast-related management and monitoring tools. Using SNMP, monitoring tools can retrieve relevant multicast routing information from routers and produce the kinds of reports and output that one might get from a conventional route monitor. In this section, we present examples of the kind of functionality that can be implemented. As a first example, SNMP-based tools can discover the multicast topology, i.e., the contiguous set of routers that implement PIM within a domain. Specifically, each PIM router will report its set of PIM neighbors (those adjacent routers that also run PIM). By starting at any router within a domain and recursively querying for lists of PIM neighbors, the entire topology can be discovered. The multicast topology can be used to verify that multicast is configured as expected (e.g., all routers are reachable in the multicast topology) and to track topology changes as they occur. As a next example, multicast-related MIBs can be mined to report and explore specific multicast FIB entries at a router. When multicast was first deployed on commercial routers, a common monitoring and debugging technique employed by operators was to log on to a router and to use the command-line interface to observe routing table entries. In particular, the show ip mroute command provides detailed information about one or more (S, G) or (*, G) entries. This includes information about the upstream router, outgoing interfaces by which downstream neighbors are reached, the RP (in the case of PIM-SM), and various timers related to the entry. In fact, much of the information provided by the command line is also exported in MIBs. Gathering the information in machine-readable format provides an ability to emulate the existing command-line output, while at the same time augmenting it and producing more valuable output using a graphical or web-based interface.
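
The topology discovery described earlier in this section (start at any router and recursively query for PIM neighbors) amounts to a graph traversal. The sketch below shows one way to express it, under stated assumptions: `get_pim_neighbors` is a hypothetical callable standing in for whatever SNMP query returns a router's PIM neighbor list, and the function names and the sample network are invented for illustration.

```python
from collections import deque


def discover_topology(start, get_pim_neighbors):
    """Breadth-first discovery of the PIM topology reachable from `start`.

    `get_pim_neighbors(router)` is assumed to return the PIM neighbors that
    `router` reports (in practice, the result of polling its PIM neighbor table).
    Returns a dict mapping each discovered router to its set of neighbors.
    """
    adjacency = {}
    queue = deque([start])
    while queue:
        router = queue.popleft()
        if router in adjacency:
            continue                      # already queried; avoids loops in the topology
        neighbors = set(get_pim_neighbors(router))
        adjacency[router] = neighbors
        queue.extend(n for n in neighbors if n not in adjacency)
    return adjacency


def unreachable_routers(expected, adjacency):
    """Routers expected to run PIM that were not found in the discovered topology."""
    return set(expected) - adjacency.keys()


# Hypothetical network: rtr-e is expected to run PIM but has no PIM adjacency.
network = {
    "rtr-a": ["rtr-b", "rtr-c"],
    "rtr-b": ["rtr-a", "rtr-d"],
    "rtr-c": ["rtr-a"],
    "rtr-d": ["rtr-b"],
    "rtr-e": [],
}
topology = discover_topology("rtr-a", network.__getitem__)
print(sorted(topology))
print(sorted(unreachable_routers(network, topology)))
```

Comparing the discovered set against an inventory of routers that should run PIM is one simple way to perform the configuration check mentioned above; comparing successive discoveries would similarly expose topology changes.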
Further, when gathering MIB data from a router, output need not be constrained to the format provided on router command lines. Groups can be filtered, for example, based on their importance, traffic volume or dynamicity, and automatic reports on critical information can be generated. Figure 11.8 depicts an example of SNMP-generated multicast routing state at a router. Adjacent routers in the display, as well as the Rendezvous Point, are clickable, yielding the analogous routing state at those routers.

    Router: attga-rtr1 (10.20.1.1)
    IP Multicast Routing Table
    Flags: D - Dense, S - Sparse, s - SSM Group
           C - Connected, L - Local, P - Pruned
           R - RP-bit set, F - Register Flag, T - SPT-bit set, J - Join SPT
    Timers: Uptime/Expires
    Interface state: Interface, Next-Hop, State/Mode

    Application: Customer 1 VPN
    (*, 239.1.23.5), uptime 49d17h, expires 00:02:56, AnycastRP is stlmo-rtr3 (10.21.3.2)
      Incoming interface: POS15/0, RPF neighbor attga-rtr2 (10.21.17.1)
      Outgoing interface list:
        Loopback0, Forward/Sparse, 49d17h/00:00:00

    Source: attga-rtr1 (10.20.1.1, 239.1.23.5), uptime 49d17h, expires 00:03:29
      Incoming interface: Loopback0, RPF neighbor 0.0.0.0
      Outgoing interface list:
        attga-rtr2 (POS0/0), Forward/Sparse, 42d15h/00:02:51
        attga-rtr3 (POS15/0), Forward/Sparse, 44d12h/00:03:10

Fig. 11.8 SNMP-generated (*, G) and (S, G) multicast routing table entries

While viewing multicast routing information gathered from a single router is valuable, the real power of SNMP comes from its ability to collect and synthesize data from multiple routers simultaneously. In the case of multicast, it can be used to discover multicast distribution trees in the network. This is critical in giving operators the ability to understand their networks, locate problems and develop solutions.
The automated collection of information from many routers simultaneously enables tree discovery at a scale not feasible using a manual router-by-router approach. Multicast tree discovery works as follows. A management application using SNMP can gather local state for a multicast FIB entry starting at any router on the tree (the router can be a known source, receiver, or a transit router such as an RP). The key pieces of state here include the upstream neighbor of the router, the downstream routers to which it will forward packets, and perhaps the identity of the RP (in the case of shared trees as in PIM-SM). Beginning from this initial router, the entire tree can be discovered by repeating queries recursively at upstream and downstream routers until the source and all leaf routers are reached. An example of the output of such a tree discovery is shown in Fig. 11.9. The routers in this tree display are clickable, allowing the user to drill down to router-specific state for the group (like that shown in Fig. 11.8). The ability to easily discover an entire tree is invaluable. It enables operators to see how packets will be forwarded, and in the case of problems, provides guidance as to where faults may be located and troubleshooting should begin. In the case of very large trees, graphical displays (like that shown in Fig. 11.10) are needed. In addition, zooming, panning and searching become critical as the number of routers on a tree exceeds a few tens.

Fig. 11.9 SNMP-generated multicast tree. The text display is headed Initial Router: nycny-rtr1 (10.20.167.22), Source: 0.0.0.0, Group: 239.16.88.39, Application: Customer 2 VPN. The tree is rooted at cgcil-rtr2 (10.20.14.27), labeled *Anycast RP (10.21.3.2)*, and lists the following routers in display order: seawa-rtr1 (10.20.121.12), ptdor-rtr4 (10.20.33.15), ptdor-rtr7 (10.20.33.87), seawa-rtr8 (10.20.52.119), seawa-rtr10 (10.20.51.42), ptdor-rtr6 (10.20.16.100), ptdor-rtr15 (10.20.61.115), cgcil-rtr2 (10.20.122.14), nycny-rtr1 (10.20.14.188), nycny-rtr6 (10.20.4.110), nycny-rtr5 (10.20.121.106), nycny-rtr10 (10.20.52.27)

Monitoring PIM-SM presents challenges beyond discovering a single multicast tree. Recall that receivers join a shared tree rooted at an RP and that the RP independently joins a source-specific tree rooted at each source. Thus, for a single multicast group, multiple FIB entries may exist at each router, corresponding to one or more (S, G) pairs and a (*, G) entry. Each entry may have different sets of incoming and outgoing interfaces. Since the packet forwarding rules are extremely complex, it may not always be easily understood how a particular packet will be forwarded. A packet may initially be forwarded along a source-specific tree, and then be replicated and transition to a shared tree at one or more points. This challenge can be addressed through the simultaneous discovery and display of multiple multicast trees, as shown in Fig. 11.11. This shows the source tree from a single sender to multiple RPs and shared trees from each of the RPs to associated last hop routers. Note that the trees may overlap and in many cases branches from different trees will flow in different directions on the same link.
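
The single-tree discovery procedure described earlier in this section (query one router on the tree, then repeat the query at its upstream and downstream neighbors) can be sketched as follows. This is an illustrative sketch rather than the implementation behind the figures: `get_entry` is a hypothetical callable that returns, for one fixed group and source specification, the upstream neighbor and downstream routers recorded in a router's multicast FIB entry, and the dictionary layout and router names are assumptions.

```python
def discover_tree(start, get_entry):
    """Discover one multicast tree starting from any router on it.

    `get_entry(router)` is assumed to return a dict with keys
    "upstream" (a router name, or None at the root) and
    "downstream" (a list of router names) for the FIB entry of interest.
    Returns (root, children), where children maps each router on the tree
    to the list of routers directly downstream of it.
    """
    # Phase 1: follow upstream neighbors until the root (source side or RP) is reached.
    root = start
    visited = {start}
    while True:
        upstream = get_entry(root)["upstream"]
        if upstream is None or upstream in visited:   # the second test guards against loops
            break
        visited.add(upstream)
        root = upstream

    # Phase 2: walk downstream from the root until all leaf routers are reached.
    children = {}
    stack = [root]
    while stack:
        router = stack.pop()
        if router in children:
            continue
        downstream = list(get_entry(router)["downstream"])
        children[router] = downstream
        stack.extend(downstream)
    return root, children


# Hypothetical (*, G) state for a small shared tree rooted at an RP.
entries = {
    "rp":  {"upstream": None,  "downstream": ["rtr1"]},
    "rtr1": {"upstream": "rp",  "downstream": ["rtr2", "rtr3"]},
    "rtr2": {"upstream": "rtr1", "downstream": []},
    "rtr3": {"upstream": "rtr1", "downstream": []},
}
root, children = discover_tree("rtr2", entries.__getitem__)
print(root, children)
```

Running the same routine once per (S, G) entry and once for the (*, G) entry of a group yields the set of trees that a combined display needs; overlaying them is a presentation step that this sketch does not attempt.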
Use of a display like that in Fig. 11.11 can show how packets are transmitted from a sender to a receiver, and can illustrate where problems in the tree(s) exist.

Fig. 11.10 Graphical display of a multicast tree

Fig. 11.11 Tree display depicting a single source tree (dashed lines) and two shared trees (solid lines) for a single multicast group

The multicast-related MIBs provide a level of detail not available with unicast. Specifically, whereas unicast MIBs provide information about destination networks, the multicast MIB entries, identified by the combination of a source and a group, or just a group, provide information about a particular application since at any point in time, a multicast group is generally used by a single application instance. This level of detail would be akin in the unicast world to a MIB entry per TCP connection. Hence, instead of asking a router about unicast routes to a destination, we can look at multicast routes used by a particular application. Such fine-grained information will clearly present scaling challenges as the scope of multicast deployment continues to grow. However, it also presents real opportunities in the area of multicast monitoring. This becomes especially relevant as network operators transition from viewing their jobs as monitoring routers and links to managing end-to-end services. Given the ability to associate a multicast group with an application, the provider can perform application-specific monitoring.
In fact, the tools outlined above already do this to some extent – the multicast trees being discovered and displayed give information about routing for the specific applications that use them. In addition, traffic or performance monitoring is also possible. That is, a provider can monitor a well-known application to verify both that multicast routing state exists, and that traffic is being received. For example, in an IPTV network, each TV channel is generally transmitted on its own multicast group. If the network is engineered so that groups are statically joined at certain routers, or if routers are monitored to dynamically determine where channels are being distributed, group-specific MIB variables, such as packet and byte counters, can be gathered to monitor application performance. While the majority of attention in the area of multicast management is focused on PIM, the deployment of MSDP will likely expand, as multicast grows within domains and as providers explore interdomain multicast. As a protocol, MSDP bears some resemblance to BGP (in the sense that routers form explicit peering connections), and lessons and techniques that have evolved in the management of BGP can be applied to MSDP. Recall that MSDP routers share information about active sources across domains. MSDP speaking routers in different domains form peering relationships and exchange information over reliable TCP connections (as with BGP). An MSDP router (which is generally also an RP), sends SA messages to its peers, informing them of the active sources for each group in the domain. These messages are flooded throughout the virtual topology formed by the MSDP peering relationships so that all MSDP routers have global information. An MSDP monitor can form a peering relationship with one or more MSDP speaking routers. The peering relationships would be entirely passive, so that the monitor learns information, but does not inject announcements into the network. 
The monitor could thus learn about active sources in other domains. This dynamically learned information could be used to drive other monitoring functions, such as tree discovery or per-application monitoring. In addition, peerings with multiple MSDP speakers could be used to verify the consistency of views at different vantage points in the network.

11.4 Summary and Future Directions

In this chapter, we have described tools and techniques to monitor the control plane for unicast as well as multicast services. The control plane, consisting of the routing protocols running in the network, determines how packets flow from a source to a destination. The proper functioning of the control plane is key to the overall functioning of the Internet, and as a result it is critical to monitor its performance on an ongoing basis. This chapter in particular focused on what data needs to be collected and how it can be used for effective monitoring of the control plane.

Although we have come a long way from the early days of the Internet when routing protocols were deployed with little thought given to their management, many challenges remain. This is due in part to the fact that network management needs to continually catch up with a constantly evolving infrastructure. The advent of MPLS and multicast are cases in point. Control plane monitoring for these new services could borrow techniques and lessons learned from earlier experiences, but they present their own unique requirements as well. Similarly, while we expect today’s tools to provide a foundation to support future control plane monitoring as the Internet evolves, new challenges demand additional tools and strategies. We provide a few examples of what we see as likely future challenges and requirements before closing the chapter.

As the Internet continues its explosive growth, scale remains a major challenge.
The number of routing table entries and messages in the control plane has been increasing rapidly as more and more people and businesses come online. This growth not only poses scalability challenges to the control plane itself, but also to the systems that monitor and manage it. Imagine monitoring and maintaining end-to-end paths between hundreds of thousands of router pairs in real time, and updating them as the network undergoes changes. Similarly, consider current SNMP-based tools that can query the entire multicast routing tables of a router. As the use of multicast expands and consequently multicast routing tables grow (e.g., to rival the size of unicast routing tables), the existing tools and techniques will not scale. The ability of network operators to gain information about individual groups and applications will diminish, and existing tools will need to be extended or replaced.

Another challenge, related in part to scale, lies in the area of inter-provider monitoring. Most control plane monitoring today takes place within the context of a single provider. There do exist facilities like RouteViews [15] for monitoring interdomain BGP changes across providers. However, understanding control plane behavior across providers could benefit from advances in tools to support inter-provider monitoring. Such tools would, of necessity, preserve privacy across providers and adhere to strict security requirements.

The Internet’s best-effort service model works well for applications such as file transfer and electronic mail. However, many new applications running on the Internet have significantly more stringent performance requirements. These performance requirements place additional requirements on the control plane as well. For example, as we saw in this chapter, traditional unicast routing protocols often take several tens of seconds to converge after a change. This, unfortunately, is totally inadequate for applications such as VoIP and IPTV. To fill this gap, service providers have attempted to improve the convergence time of existing routing protocols through better implementation and configurations. In addition, they have introduced new technologies such as IP and MPLS-based fast reroute (FRR) schemes. As providers deploy technology aimed to improve routing protocol behavior, there is a corresponding need for more advanced tools to monitor the performance and reliability of the control plane so that the results of routing protocol changes can be verified or measured.

During the last decade we have made substantial progress in our ability to monitor and measure the performance of the control and data planes. An exciting avenue for future work lies in closing the feedback loop by automatically adjusting network configuration to optimize its performance as information is gleaned from the monitoring systems. Achieving this will require sophisticated models to represent current network performance, and its performance under various “what-if” scenarios. These models then need to be translated into mechanisms for reliably and scalably adjusting device configurations and resource allocations, as well as for redesigning and re-architecting networks at various timescales in an automated fashion. Apart from application performance and resource usage, an automated “measure-model-control” loop will be crucial in running the networks in a more efficient and reliable manner.

References

1. Shaikh, A., & Greenberg, A. (2004). OSPF monitoring: architecture, design and deployment experience. In Proceedings of USENIX Symposium on Networked Systems Design and Implementation (NSDI), San Francisco, California, March 2004.
2. Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms, second ed. Cambridge, MA: MIT Press.
3. Malkin, G. (1998). RIP Version 2. IETF Request for Comments (RFC) 2453, November 1998.
4. Garcia-Luna-Aceves, J. (1989).
A unified approach to loop-free routing using distance vector or link states. In Proceedings of ACM SIGCOMM, Austin, Texas, September 1989.
5. Rekhter, Y., Li, T., & Hares, S. (2006). A border gateway protocol 4 (BGP-4). IETF Request for Comments (RFC) 4271, January 2006.
6. Moy, J. (1998). OSPF Version 2. IETF Request for Comments (RFC) 2328, April 1998.
7. Callon, R. (1990). Use of OSI IS-IS for routing in TCP/IP and dual environments. IETF Request for Comments (RFC) 1195, December 1990.
8. Huitema, C. (1999). Routing in the Internet. Prentice Hall PTR, second ed., Upper Saddle River, New Jersey, December 1999.
9. Stewart, J. W. (1998). BGP4: inter-domain routing in the Internet. Addison-Wesley, Upper Saddle River, New Jersey, December 1998.
10. Bates, T., Chen, E., & Chandra, R. (2006). BGP route reflection: an alternative to full mesh Internal BGP (IBGP). IETF Request for Comments (RFC) 4456, April 2006.
11. Moy, J. (1998). OSPF: Anatomy of an Internet routing protocol. Addison-Wesley, Reading, Massachusetts, February 1998.
12. Mirtorabi, S., Psenak, P., Lindem, A., & Oswal, A. (2008). OSPF multi-area adjacency. IETF Request for Comments (RFC) 5185, May 2008.
13. Mauro, D., & Schmidt, K. (2005). Essential SNMP. O’Reilly & Associates, second ed., Sebastopol, California, September 2005.
14. Patrick, N., Scholl, T., Shaikh, A., & Steenbergen, R. (2006). Peering dragnet: examining BGP routes received from peers. North American Network Operators’ Group (NANOG) presentation, October 2006.
15. University of Oregon Route Views Project. http://www.routeviews.org/.
16. RIPE Routing Information Service (RIS). http://www.ripe.net/ris/index.html.
17. Feamster, N., & Rexford, J. (2007). Network-wide prediction of BGP routes. IEEE/ACM Transactions on Networking, pp. 253–266, April 2007.
18. Rosen, E., & Rekhter, Y. (2006). BGP/MPLS IP Virtual Private Networks (VPNs).
IETF Request for Comments (RFC) 4364, February 2006.
19. Kim, C., Gerber, A., Lund, C., Pei, D., & Sen, S. (2008). Scalable VPN routing via relaying. In Proceedings of ACM SIGMETRICS, Annapolis, Maryland, June 2008.
20. Govindan, R., & Reddy, A. (1997). An analysis of Internet inter-domain topology and route stability. In Proceedings of IEEE INFOCOM, Kobe, Japan, pp. 850–857, 1997.
21. Labovitz, C., Malan, G. R., & Jahanian, F. (1998). Internet routing instability. IEEE/ACM Transactions on Networking, 6, pp. 515–528, October 1998.
22. Labovitz, C., Malan, G. R., & Jahanian, F. (1999). Origins of Internet routing instability. In Proceedings of IEEE INFOCOM, New York, New York, pp. 218–226, 1999.
23. Labovitz, C., Ahuja, A., Bose, A., & Jahanian, F. (2001). Delayed Internet routing convergence. IEEE/ACM Transactions on Networking, 9, pp. 293–306, June 2001.
24. Pei, D., Zhao, X., Wang, L., Massey, D., Mankin, A., Wu, S. F., & Zhang, L. (2002). Improving BGP convergence through consistency assertions. In Proceedings of IEEE INFOCOM, New York, New York, 2002.
25. Bremler-Barr, A., Afek, Y., & Schwarz, S. (2003). Improved BGP convergence via ghost flushing. In Proceedings of IEEE INFOCOM, San Francisco, California, 2003.
26. Pei, D., Azuma, M., Massey, D., & Zhang, L. (2005). BGP-RCN: improving BGP convergence through root cause notification. Computer Networks Journal, 48, pp. 175–194, June 2005.
27. Chandrashekar, J., Duan, Z., Krasky, J., & Zhang, Z.-L. (2005). Limiting path exploration in BGP. In Proceedings of IEEE INFOCOM, Miami, Florida, 2005.
28. Pei, D., Zhang, B., Massey, D., & Zhang, L. (2006). An analysis of path-vector convergence algorithms. Computer Networks Journal, 50, February 2006.
29. Mao, Z. M., Govindan, R., Varghese, G., & Katz, R. (2002). Route flap damping exacerbates Internet routing convergence. In Proceedings of ACM SIGCOMM, Pittsburgh, Pennsylvania, 2002.
30. Villamizar, C., Chandra, R., & Govindan, R. (1998). BGP route flap damping.
IETF Request for Comments (RFC) 2439, November 1998.
31. Panigl, C., Schmitz, J., Smith, P., & Vistoli, C. (2001). RIPE routing-WG recommendations for coordinated route-flap damping parameters. RIPE document ripe-229, October 2001. ftp://ftp.ripe.net/ripe/docs/ripe-229.txt.
32. Smith, P., & Panigl, C. (2006). RIPE routing working group recommendations on route-flap damping. RIPE document ripe-378, May 2006. http://www.ripe.net/ripe/docs/ripe-378.html.
33. Teixeira, R., Shaikh, A., Griffin, T. G., & Rexford, J. (2008). Impact of hot-potato routing changes in IP networks. IEEE/ACM Transactions on Networking, 16, pp. 1295–1307, December 2008.
34. Route Explorer from Packet Design Inc. http://www.packetdesign.com/products/rex.htm.
35. Route Analyzer from PacketStorm Communications, Inc. http://www.packetstorm.com/route.php.
36. Shaikh, A., Isett, C., Greenberg, A., Roughan, M., & Gottlieb, J. (2002). A case study of OSPF behavior in a large enterprise network. In Proceedings of ACM SIGCOMM Internet Measurement Workshop (IMW), Marseille, France, November 2002.
37. Andersson, L., Minei, I., & Thomas, B. (2007). LDP specification. IETF Request for Comments (RFC) 5036, October 2007.
38. Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V., & Swallow, G. (2001). RSVP-TE: extensions to RSVP for LSP tunnels. IETF Request for Comments (RFC) 3209, December 2001.
39. Rekhter, Y., & Rosen, E. (2001). Carrying label information in BGP-4. IETF Request for Comments (RFC) 3107, May 2001.
40. Rosen, E., & Rekhter, Y. (2006). BGP/MPLS IP virtual private networks (VPNs). IETF Request for Comments (RFC) 4364, February 2006.
41. Srinivasan, C., Viswanathan, A., & Nadeau, T. (2004). Multiprotocol label switching (MPLS) traffic engineering (TE) management information base (MIB). IETF Request for Comments (RFC) 3812, June 2004.
42. Deering, S. (1988). Multicast routing in internetworks and extended LANs.
In Proceedings of ACM SIGCOMM, Stanford, California, pp. 55–64, August 1988.
43. Cain, B., Deering, S., Kouvelas, I., Fenner, B., & Thyagarajan, A. (2002). Internet group management protocol, Version 3. IETF Request for Comments (RFC) 3376, October 2002.
44. Casner, S., & Deering, S. (1992). First IETF Internet audiocast. ACM Computer Communication Review, 22, July 1992.
45. Multicast in MPLS/BGP IP VPNs. Internet draft, July 2008. http://www.ietf.org/internet-drafts/draft-ietf-l3vpn-2547bis-mcast-07.txt.
46. Multicast virtual private networks. White paper, Cisco Systems, 2002. http://www.cisco.com/warp/public/cc/pd/iosw/prodlit/tcast wp.pdf.
47. Waitzman, D., Partridge, C., & Deering, S. (1988). Distance vector multicast routing protocol. IETF Request for Comments (RFC) 1075, November 1988.
48. Moy, J. (1994). Multicast Extensions to OSPF. IETF Request for Comments (RFC) 1584, March 1994.
49. Ballardie, T., Francis, P., & Crowcroft, J. (1993). Core based trees (CBT): an architecture for scalable inter-domain multicast routing. In Proceedings of ACM SIGCOMM, San Francisco, California, pp. 85–95, September 1993.
50. Fenner, B., & Meyer, D. (2003). Multicast source discovery protocol (MSDP). IETF Request for Comments (RFC) 3618, October 2003.
51. Fenner, B., Handley, M., Holbrook, H., & Kouvelas, I. (2006). Protocol independent multicast – sparse mode (PIM-SM): protocol specification (Revised). IETF Request for Comments (RFC) 4601, August 2006.
52. Adams, A., Nicholas, J., & Siadak, W. (2005). Protocol independent multicast – dense mode (PIM-DM): protocol specification (Revised). IETF Request for Comments (RFC) 3973, January 2005.
53. Bhaskar, N., Gall, A., Lingard, J., & Venaas, S. (2008). Bootstrap router (BSR) mechanism for protocol independent multicast (PIM). IETF Request for Comments (RFC) 5059, January 2008.
54. Partridge, C., Mendez, T., & Milliken, W. (1993). Host anycasting service. IETF Request for Comments (RFC) 1546, November 1993.
55. McCanne, S.
(1999). Scalable multimedia communication using IP multicast and lightweight sessions. IEEE Internet Computing, 3(2), pp. 33–45.
56. Massey, D., & Fenner, B. (1999). Fault detection in routing protocols. In Proceedings of International Conference on Network Protocols (ICNP), Toronto, Canada, 1999.
57. Saraç, K., & Almeroth, K. C. (2000). Supporting multicast deployment efforts: a survey of tools for multicast monitoring. Journal of High Speed Networks, 9(3,4), pp. 191–211.
58. Namburi, P., Saraç, K., & Almeroth, K. C. (2006). Practical utilities for monitoring multicast service availability. Computer Communications Special Issue on Monitoring and Measurement of IP Networks, 29, pp. 1675–1686, June 2006.
59. McCloghrie, K., Farinacci, D., Thaler, D., & Fenner, B. (2000). Protocol independent multicast MIB for IPv4. IETF Request for Comments (RFC) 2934, October 2000.
60. McCloghrie, K., Farinacci, D., & Thaler, D. (2000). IPv4 multicast routing MIB. IETF Request for Comments (RFC) 2932, October 2000.

Part VI Network and Security Management, and Disaster Preparedness

Chapter 12 Network Management: Fault Management, Performance Management, and Planned Maintenance

Jennifer M. Yates and Zihui Ge

12.1 Introduction

As the Internet grew from a fledgling network interconnecting a few university computers to a massive infrastructure deployed across the globe, the focus was primarily on providing connectivity to the masses. And IP has certainly achieved this. IP is used today to connect businesses and consumers across the globe. An increasingly diverse set of services has also come to use IP as the underlying communications protocol, including e-commerce, voice, mission-critical business applications, TV (IPTV), as well as e-mail and Internet web browsing. Nowadays, even people’s lives depend on IP networks – as IP increasingly supports emergency services and other medical applications.
This tremendous diversity in applications places an equally diverse set of requirements on the underlying network infrastructure, particularly with respect to bandwidth consumption, reliability, and performance. At one extreme, e-mail is resilient to network impairments – the key requirement being basic connectivity. E-mail servers continually attempt to retransmit e-mails, even in the face of potentially lengthy outages. In contrast, real-time video services typically require high bandwidth and are sensitive to even very short-term “glitches” and performance degradations.

J.M. Yates and Z. Ge, AT&T Labs – Research, Florham Park, NJ, USA; e-mail: jyates@research.att.com; gezihui@research.att.com
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_12, © Springer-Verlag London Limited 2010

IP is also a relatively new technology – especially if we compare it with technology such as the telephone network, which has now been around for over 130 years. As IP technology has matured over recent years, network availability has been driven up – a result of maturing hardware and software, as well as continued improvements in network management practices, tools, and network design. These improvements have enabled a shift in emphasis to focus beyond availability and faults to managing performance – for example, eliminating short-term “glitches,” which may not be at all relevant to many applications (e.g., e-mail), but can cause degradation in service quality to applications such as video streaming and online gaming.
The transformation of IP from a technology designed to support best-effort data delivery to one that supports a diverse range of sensitive real-time and mission-critical applications is a testament to the industry and to the network operators who have created technologies, automation, and procedures to ensure high reliability and performance.

In this chapter, we focus on the network management systems and the tasks involved in supporting the day-to-day operation of an ISP network. The goal of network operations is to keep the network up and running and performing at or above designed levels of service performance. Achieving this goal involves responding to issues as they arise, proactively making network changes to prevent issues from occurring, and evolving the network over time to introduce new technologies, services, and capacity. We can loosely classify these tasks into categories using traditional definitions discussed within the literature – namely, fault management, performance management, and planned maintenance.

At a high level, fault management is easy to understand: it includes the “break/fix” functions – if something breaks, fix it. More precisely, fault management encompasses the systems and workflows necessary to ensure that faults are rapidly detected, the root causes are diagnosed, and the faults are rapidly resolved. It can also include predicting failures before they occur, and remediating the problem before it actually happens.

Performance management can be defined in several different ways: for example, (1) designing, planning, monitoring, and taking action to prevent and recover from overload conditions once they happen, and (2) monitoring both end-network performance and per-element performance and taking actions to address performance degradation.
Performance can be measured and managed on the links between network elements (e.g., bandwidth utilization), on the network elements themselves (e.g., CPU utilization), on the traffic flow (e.g., packet loss, latency, or jitter), or on the quality of the service transactions (e.g., voice call quality or time to deliver a text message). The first definition focuses on traffic management and encompasses roles executed by network engineering organizations (capacity planning) and operations (real-time responses to network conditions). In this chapter, we will focus on the second definition of performance management: monitoring network performance and taking appropriate actions when performance is degraded. This performance degradation may be the result of an unplanned surge in traffic that exceeds engineered capacity (whether legitimate traffic or “attack traffic” from a Distributed Denial-of-Service attack), loss of available capacity (e.g., owing to a hardware failure), intermittent problems that cause high bit error rates on a link, or logical problems such as incorrect configurations or software errors that create a “black hole.”

Both network faults and degraded performance may require intervention by network operations, and the line between fault and performance management is blurred in practice. Sometimes, a fault can occur with no performance degradation, such as when a circuit board fails but its function is taken over by a redundant card. Alternatively, a performance problem can occur without any corresponding hardware fault, such as when a surge of customer traffic exceeds available capacity. Then in other situations, a fault can occur and result in degraded performance, such as when a link fails and results in the network’s inability to carry all customer traffic without traffic loss or degradation.
In this chapter, we refer to a network condition that may require the intervention of network operations as a network event. Events are triggered by an underlying incident; a single incident can result in multiple network events (e.g., link down, congestion, and packet loss). The events in question may have a diverse set of causes, including hardware failures, software bugs, external factors (e.g., flash crowds, outages in peer networks), or combinations of these. The impact resulting from different incidents also varies drastically, ranging from outages during which affected customers have no connectivity for lengthy periods of time (known as “hard outages”), through to those which result in little or no customer impact (e.g., loss of network capacity where traffic re-routes around it). In between these two extremes lie incidents which result in customers experiencing degraded performance to differing extents – e.g., sporadic packet loss, or increased delay and/or jitter. In addition to rapidly reacting to issues that arise, daily network operation also incorporates taking planned actions to proactively ensure that the network continues to deliver high service levels, and to evolve network services and technologies. We refer to such scheduled activities as planned maintenance. Planned maintenance includes a wide variety of activities, such as changing a fan filter in a router, hardware or software upgrades, capacity augmentation, preventive maintenance, or rearranging connections in response to external factors, including even highway maintenance being carried out where cables are laid. Such activities are usually performed at a specific, scheduled time, typically during nonbusiness hours when network utilization is low and the impact of the maintenance activity on customers can be minimized. This chapter covers both fault and performance management, and planned maintenance. 
Section 12.2 focuses on fault and performance management – predicting, detecting, troubleshooting, and repairing network faults and performance impairments. Section 12.3 examines how process automation is incorporated in fault and performance management to automate many of the tasks that were originally executed by humans. Process automation is the key ingredient that enables a relatively small operations group to manage a rapidly expanding number of network elements and customer ports, and growing network complexity. Section 12.4 discusses tracking and managing network availability and performance over time, looking across large numbers of network events to identify opportunities for performance improvements. Section 12.5 then focuses on planned maintenance. Finally, in Section 12.6, we discuss opportunities for new innovations and technology enhancements, including new areas for research. We conclude in Section 12.7 with a set of “best practices.”

12.2 Real-Time Fault and Performance Management

Fault and performance management comprise a large and complex set of functions that are required to support the day-to-day operation of a network. As mentioned in the previous section, network events can be divided into two broad categories: faults and performance impairments. The term fault is used to refer to a “hard” failure – for example, when an interface goes from being operational (up) to failed (down). Performance events denote situations where a network element is operating with degraded performance – for example, when a router interface is dropping packets, or when there is an undesirably high loss and/or delay experienced along a given route. We use the term “event management” to generically define the set of functions that detect, isolate, and repair network malfunctions – covering both faults and performance events.
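The distinction between the two event categories can be sketched in a few lines. The example below is a minimal illustration, assuming a hypothetical `InterfaceSample` record and loss and delay thresholds that are made up for the example and do not come from the chapter; an operational system would derive such thresholds from its own engineering targets.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative thresholds only; these values are assumptions, not from the text.
MAX_LOSS_FRACTION = 0.01
MAX_DELAY_MS = 150.0


@dataclass
class InterfaceSample:
    """Hypothetical snapshot of one interface at one polling instant."""
    oper_up: bool
    loss_fraction: float
    delay_ms: float


def classify(previous: InterfaceSample, current: InterfaceSample) -> Optional[str]:
    """Return "fault", "performance", or None when nothing needs reporting."""
    # A hard failure: the interface moved from operational (up) to failed (down).
    if previous.oper_up and not current.oper_up:
        return "fault"
    # Degraded operation: still up, but loss or delay is undesirably high.
    if current.oper_up and (
        current.loss_fraction > MAX_LOSS_FRACTION
        or current.delay_ms > MAX_DELAY_MS
    ):
        return "performance"
    return None
```

For instance, a sample that is still up but shows 5% loss would be classified as a performance event, whereas an up-to-down transition would be classified as a fault regardless of the measured loss.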
The primary goal of event management is to rapidly respond to network incidents as they occur so that any resulting customer impact can be minimized. However, achieving this within the complex environment of ISP networks today requires a carefully designed and scalable event management framework. Figure 12.1 presents a simplified view of such a framework. The goal of the design is to enable rapid and reliable detection and notification of network events, so that action can be taken to troubleshoot and mitigate them. However, as anyone who has experience with large networks can attest, this is in itself a challenging problem.

Fig. 12.1 Simplified event management framework. Components shown: IP/MPLS network; instrumentation layer (Syslog Collector, Application and Service Monitoring, Route Monitors, Traffic (e.g., Netflow), Router Performance (SNMP), Router Liveness Testing, Router Faults); Alarms/Alerts; Event manager with Event Management System (event correlation) and Ticketing System; Troubleshooting and repair

At the base of the event management framework lies an extensive and diverse network instrumentation layer. The instrumentation layer illustrated in Fig. 12.1 collects network data from both network elements (routers, switches, lower-layer equipment) and network monitoring systems designed to monitor network performance from vantage points across the network. The measurements collected are critical for a wide range of different purposes, including customer billing, capacity planning, event detection and notification, and troubleshooting. The latter two functions are critical parts of daily network operations. With regard to this, the goal is to ensure that the wide range of potential fault and performance events are reliably detected and that there are sufficient measurements available to troubleshoot network incidents.
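Fig. 12.1 places an event correlation stage between the alarm sources and the ticketing system. The sketch below shows one simple way such a stage might group notifications so that each group would yield a single ticket. The `Event` record, the `correlate` helper, and the grouping rule (same network resource, successive arrivals within a fixed time window) are assumptions made for illustration, not the correlation logic of any particular event management system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    """Hypothetical event notification from the instrumentation layer."""
    timestamp: float  # seconds since an arbitrary epoch
    resource: str     # made-up identifier of the affected link or router
    kind: str         # e.g., "link_down" or "packet_loss"


def correlate(events: List[Event], window_s: float = 60.0) -> List[List[Event]]:
    """Group events on the same resource whose successive gaps fit in window_s."""
    groups: List[List[Event]] = []
    latest_group: Dict[str, List[Event]] = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        group = latest_group.get(ev.resource)
        if group is not None and ev.timestamp - group[-1].timestamp <= window_s:
            # Presumed to belong to the same incident: no new ticket.
            group.append(ev)
        else:
            # Presumed new incident: this group would open one ticket.
            group = [ev]
            groups.append(group)
            latest_group[ev.resource] = group
    return groups
```

Keying only on a shared resource is the simplest possible rule; correlating across layers (for example, tying an IP link failure to a lower-layer failure) would additionally need a model of which resources depend on which.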
Both the routers and the collectors contained within the instrumentation layer detect events that can trigger notifications to the central event management system. Events detected include faults (e.g., link or router down) and performance impairments (such as excessive network loss or delay). These performance events are identified by looking for anomalous, “bad” conditions among the mounds of performance data, and logs collected within the instrumentation layer. Given the diversity of the instrumentation layer, any given network incident will likely be observed as multiple different events. For example, a fiber cut (the incident) could result in events detected in the lower layer (e.g., SONET layer failures) and in events detected in the IP/MPLS network layer (router links failing and potentially congestion, packet loss, and excessive end-to-end delay). Each event, in turn, could be detected by multiple monitoring points. Thus, it is likely that a single incident will result in a deluge of event notifications, which would swamp the operations personnel trying to investigate. Significant automation is thus introduced in the event management system to suppress and correlate multiple event notifications associated with the same incident. The resulting correlated event is used to trigger the creation of a ticket in the network ticketing system – these tickets are used to notify operations personnel about the occurrence of the incident. Once notified of an issue, operations personnel are responsible for troubleshooting, restoration, and repair. Troubleshooting, restoration, and repair have two primary goals: (1) restoring customer service and performance, and (2) fully resolving the underlying issue and returning network resources into service. 
These two goals may or may not be distinct – where redundancy exists, the network is typically designed to automatically reroute around failed elements, thus negating the need for manual intervention in restoring customer service. Troubleshooting and repair can then instead focus on returning the failed network resources into service so that there is sufficient capacity available to absorb future failures. In other situations, operations must resolve the issue to restore customer service. Rapid action from operations personnel is then imperative – until they can identify what is causing the problem and repair it, the customer may be out of service or experiencing degraded performance. This typically occurs at the edge of a network, where the customer connects to the ISP and redundancy may be limited. For example, if the nonredundant line-card to which a customer connects fails, the customer may be out of service until a replacement line-card can be installed.

We next delve into each of the different layers and functions of the event management framework in greater detail.

12.2.1 Instrumentation Layer, Event Detection, and Event Notification

The foundation of an event management system is the reliable detection and notification of network faults and performance impairments. We thus start by considering how these network events are detected, and how the corresponding event notifications are generated. The primary goal is to ensure that the event management system is informed of network events in a timely and reliable manner. But each underlying incident may trigger a number of different events, and may occur anywhere across the vast network(s) being monitored. Depending on the impairments at hand, it may be that the routers themselves observe the events and report them accordingly.
However, limitations in router monitoring capabilities, the inability to incorporate new monitoring capabilities without major router upgrades, and the need for independent network monitoring have driven a wide range of external monitoring systems into the network. These systems have the added benefit of being able to obtain different perspectives on the network when compared with the individual routers. For example, end-to-end performance monitoring systems obtain a network-wide view of performance; a view not readily captured from within a single router.

Chapters 10 and 11 have discussed the wide range of network monitoring capabilities available in large-scale operational IP/MPLS networks. These are incorporated within the instrumentation layer to support a diverse range of applications, ranging from network engineering (e.g., capacity planning), customer billing, to fault and performance management. Within fault and performance management, these measurements support real-time event detection, as well as other tasks such as troubleshooting and postmortem analysis, which are discussed later in this chapter.

The instrumentation layer, as illustrated in Fig. 12.1, is responsible for collecting measurements from both network elements (e.g., routers) and from external monitoring devices, such as route monitors and application/end-to-end performance measurement tools. Although logically depicted as a single layer, the instrumentation layer actually consists of multiple different collectors, each focused on a single monitoring capability. We next discuss these collectors in more detail.

12.2.1.1 Router Fault and Event Notifications

Network routers themselves have an ideal vantage point for detecting failures local to them. They are privy to events that occur on inter-router links terminating on them, and also to events that occur inside the routers themselves.
They can thus identify all sorts of hardware issues (link, line-card, and router chassis failures) and software issues (including software protocol crashes and routing protocol failures). Routers themselves detect events and notify the event management systems – one can view the instrumentation layer for these events as residing directly within the routers themselves as opposed to in an external instrumentation layer.

In addition to creating notifications about events detected in the router, routers also write log messages, which describe a wide range of events observed on the router. These are known as syslogs; they are akin to the syslogs created on servers. The syslog protocol [1] is used to deliver syslog messages from network elements to the syslog collector depicted in Fig. 12.1. These syslog messages report a diverse range of conditions observed within the network element, such as link and protocol-related state changes (down, up), environment measurements (voltage, temperature), and warning messages (e.g., denoting when customers send more routes than the router is configured to allow). Some of these events relate to conditions that are then reported to the event management system, while others provide useful information when it comes to troubleshooting a network incident.

Syslog messages are basically free-form text, with some structure (e.g., indicating date/time of event, location, and event priority). The form and format of the syslog messages vary between router vendors. Figure 12.2 illustrates some example syslog messages taken from Cisco routers – details regarding Cisco message formats can be found in [2].
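To illustrate how a collector might pull the structured fields out of such free-form text, the following is a minimal sketch in Python. It assumes the Cisco-style body `%FACILITY-SEVERITY-MNEMONIC: text` and the collector prefix (receive time, router name) seen in the messages of Fig. 12.2. The names `SyslogEvent` and `parse_syslog` and both regular expressions are hypothetical and are not taken from any particular product. Other vendors' formats would need their own patterns.

```python
import re
from typing import NamedTuple, Optional

# Assumed collector prefix: "Mon DD HH:MM:SS ROUTER rest-of-message".
PREFIX = re.compile(
    r"(?P<received>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+(?P<rest>.*)"
)
# Assumed Cisco-style body: "%FACILITY-SEVERITY-MNEMONIC: free-form text".
# Severity is a digit from 0 (most severe) to 7 (least severe).
CISCO_BODY = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+):\s*(?P<text>.*)"
)


class SyslogEvent(NamedTuple):
    received: str   # time stamp added on receipt
    host: str       # reporting router (the "location")
    facility: str
    severity: int   # event priority
    mnemonic: str
    text: str


def parse_syslog(line: str) -> Optional[SyslogEvent]:
    """Return the structured fields, or None if the line does not fit the assumed format."""
    prefix = PREFIX.match(line)
    if prefix is None:
        return None
    body = CISCO_BODY.search(prefix["rest"])
    if body is None:
        return None  # e.g., a continuation line with no %FACILITY-... tag
    return SyslogEvent(
        prefix["received"], prefix["host"], body["facility"],
        int(body["severity"]), body["mnemonic"], body["text"],
    )
```

Lines that do not carry the `%FACILITY-SEVERITY-MNEMONIC` tag are returned as `None` rather than guessed at; a production collector would more likely keep them as unparsed text for later troubleshooting.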
Mar 15 00:00:06 ROUTER_A 627423: Mar 15 00:00:00.554: %CI-6-ENVNORMAL: +24 Voltage measured at 24.63
Mar 15 00:00:06 ROUTER_B 289883: Mar 15 00:00:00.759: %LINK-3-UPDOWN: Interface Serial13/0.6/16:0, changed state to up
Mar 15 00:00:13 ROUTER_C 2267435: Mar 15 00:00:12.473: %CONTROLLER-5-UPDOWN: Controller T3 10/1/1 T1 18, changed state to DOWN
Mar 15 00:00:06 ROUTER_D 852136: Mar 15 00:00:00.155: %PIM-6-INVALID_RP_JOIN: VRF 13979:28858: Received (*, 224.0.1.40) Join from 1.2.3.4 for invalid RP 5.6.7.8
Mar 15 00:00:07 ROUTER_E 801790: -Process= "PIM Process", ipl= 0, pid=218
Mar 15 00:25:26 ROUTER_Z bay0007:SUN MAR 15 00:25:24 2009 [03004EFF] MINOR:snmp-traps:Module in bay 4 slot 6, temp 65 deg C at or above minor threshold of 65 degC.

Fig. 12.2 Example (normalized) syslog messages

12.2.1.2 External Fault Detection and Route Monitoring

Router fault detection mechanisms are complemented by other mechanisms to identify issues that may not have been detected and/or reported by the routers – either because the routers do not have visibility into the event, because their detection mechanisms fail to detect it (maybe due to a software bug), or because they are unable to report the issue to the relevant systems. For example, basic periodic ICMP “ping” tests are used to validate the liveness of routers and their interfaces – if a router or interface becomes unexpectedly unresponsive to ICMP pings, then this warrants concern and an event notification should be generated.

Route monitors, as discussed in Chapter 11, also provide visibility into control-plane activity; activity that may not always be reported by the routers. IGP monitors, such as OSPFmon [3], learn about link and router up/down events and can be used to complement the same events reported by routers. However, route monitors extend beyond simple detection of link up/down events, and can provide information about logical routing changes affecting traffic routing.
Even further, the route(s) between any given source/destination can be inferred using routing data collected from route monitors. This information is vital when trying to understand the events that have affected a given traffic flow.

12.2.1.3 Router-Reported Performance Measurements

Although hardware faults have traditionally been the primary focus of event management systems, performance events can also cause significant customer distress, and thus must be addressed. The goal here is to identify and report when network performance deviates from desired operating regions and requires investigation. However, the reliable detection of performance issues is really quite different from detecting faults. Performance statistics are continually collected from the network; from these measurements, we can then determine when performance departs from the desired operating regions.

As discussed in Chapter 10, routers track a wide range of different performance parameters – such as counts of the number of packets and bytes flowing on each router interface, different types of interface error counts (including buffer overflow, malformed packets, CRC check violations), and CPU and memory utilization on the central router CPU and its line-cards. These performance parameters are stored within the router in the Simple Network Management Protocol (SNMP) Management Information Bases (MIBs). SNMP [4] is the Internet community’s de facto standard management protocol. SNMP was defined by the IETF, and is used to convey management information between software agents on managed devices (e.g., router) and the managing systems (e.g., event management systems in Fig. 12.1). External systems poll router MIBs on a regular basis, such as every 5 minutes.
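As a rough illustration of the arithmetic a poller performs, the sketch below turns two successive readings of an interface byte counter into a per-interval byte count and a link utilization. It is a sketch under stated assumptions, not a description of any specific poller: the function names are hypothetical, the counter is assumed to be a wrapping 32-bit SNMP counter unless told otherwise, and at most one wrap is assumed between polls.

```python
def counter_delta(previous: int, current: int, bits: int = 32) -> int:
    """Increase between two readings of a wrapping counter.

    Assumes the counter wrapped at most once between the two polls.
    """
    modulus = 1 << bits
    return (current - previous) % modulus


def utilization(prev_octets: int, curr_octets: int,
                interval_s: int, link_bps: int, bits: int = 32) -> float:
    """Average link utilization (0.0 to 1.0) over one polling interval."""
    bits_sent = counter_delta(prev_octets, curr_octets, bits) * 8
    return bits_sent / (interval_s * link_bps)


# Hypothetical readings taken 300 s (5 min) apart on a 100 Mb/s link.
if __name__ == "__main__":
    print(utilization(0, 375_000_000, interval_s=300, link_bps=100_000_000))
```

On fast links a 32-bit byte counter can wrap more than once within a polling interval, in which case the single-wrap assumption fails and a wider counter (here, `bits=64`) would be needed.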
Thus, with a 5-min polling interval, the CPU measurement represents the average CPU utilization over a given 5-min interval; in polling packet counts, the poller can create measurements representing the total number of packets that have flowed on the interface in the given 5-min interval. The SNMP information collected from routers is used for a wide variety of purposes, including customer billing, detecting anomalous network conditions that could impact or risk impacting customers (congestion, excessively high CPU utilization, router memory leaks), and troubleshooting network events.

12.2.1.4 Traffic Measurements

While SNMP measurements provide aggregate statistics about link loads – e.g., counts of total number of bytes or equivalently link utilization over a fixed time interval (e.g., 5 min), they provide little insight into how this traffic is distributed across different applications or different source/destination pairs. Instead, as discussed in Chapter 10, Netflow and deep packet inspection (DPI) are used to obtain much more detailed measurements of how traffic is distributed across applications, network links, and routes. These measurement capabilities are especially critical in troubleshooting various network conditions. DPI, in particular, can be used to obtain unique visibility into the traffic carried on a link, which is useful in trying to understand what traffic may be related to the given network issues, and how. For example, DPI could be used to identify malformed packets and where they came from, or identify what traffic is destined to an overloaded router CPU.

12.2.1.5 End-to-End Network, Application and Service Monitoring

Although monitoring the state and performance of individual network elements is critical to managing the overall health of the network, it is not in itself sufficient for understanding the performance as perceived by customers.
It is imperative to monitor and detect events based on the end-to-end network and service-level performance. End-to-end measurements – even as simple as that achieved by sending test traffic across the network from one edge of the network to another – provide the closest representation of what customers experience as can be measured from within the ISP network. End-to-end measurements were discussed in more detail in Chapter 10. In the context of fault and performance management, they are used to identify anomalous network conditions, such as when there is excessive end-to-end packet loss, delay, and/or jitter. These events are reported to the event management system, and used to trigger further investigation.

As discussed in Chapter 10, performance monitoring can be achieved using either active or passive measurements. Active measurements send test probes (traffic) across the network, collecting statistics related to the network and/or service performance. In contrast, passive measurements examine the performance of traffic being carried across the network, such as customer traffic. Ideally, such measurements would be taken as close to the customers as possible, even into the customer domains. However, this is not always possible, particularly if the ISP does not have access to the customer’s end device or domain.

End-to-end performance measurements can be used both to understand the impact of known network incidents, and to identify events that have not been detected via other means. When known incidents occur, end-to-end performance measurements provide direct performance measures, which present insight into the incident’s customer impact. In addition, end-to-end measurements also provide an overall test of the network health, and can be used to identify the rare but potentially critical issues (e.g., faults) which the network may have failed to detect.
When it comes to faults, these are known as silent failures, and have historically been an artifact of immature router technologies – where the routers simply fail to detect issues that they should have detected. For example, consider an internal router fault (e.g., corrupted memory on a line-card), which is causing a link to simply drop all traffic. If the router fails to detect that this is occurring, then it will fail to reroute the traffic away from the failed link, and to report this as a fault condition. Thus, traffic will continue to be sent to a failed interface and be dropped – a condition known as black-holing. End-to-end measurements provide a means to proactively detect these issues; in the case of active measurements, the test probes will be dropped along with the customer traffic – this would be detected and appropriate notifications would be generated. Thus, the event can be detected even when the network elements fail to report them and (hopefully!) before customers complain. In addition to auditing the integrity of the network, simple test probes can also be used to estimate service performance (e.g., estimating how well IPTV services are performing). However, extrapolating from simple network loss and delay measures to understand the impact of a network event on any given network-based application is most often an extremely complex, if not impossible, task involving the intimate details of the application in question. For example, understanding how packet loss, delay and jitter impact the video streams is an area of active research. Ideally, we would like to directly measure the performance of each and every application that operates across a network. This may be an impossible task in networks supporting a plethora of different services and applications. However, if a network is critical to a specific application – such as IPTV – then that application also needs to be monitored, and appropriate mechanisms must be in place to detect issues. 
After all, the goal is to ensure that the service is operating correctly – ascertaining this requires direct monitoring of the service of interest.

12.2.1.6 Event Detection and Notification

The instrumentation layer thus provides network operators with an immense volume of network measurements. How do we transform these data to identify network issues that require investigation and action?

Let us start by identifying the events that we are interested in detecting. We clearly need to detect faults – conditions that may be causing customer outages or taking critical network resources out of service. We also aim to detect performance issues that may be causing degraded customer experiences. And finally, we wish to detect network element health concerns, which are either impacting customers or are risking customer impact.

The majority of faults are relatively easy to detect – we need to be able to detect when network elements or components are out of service (down). It gets a little more complicated when it comes to defining performance impairments of interest. If we are too sensitive in the issues identified – for example, reporting very minor, short conditions – then operations personnel risk expending tremendous effort attempting to troubleshoot meaningless events, potentially missing events that are of great significance among all the noise and false alarms. Short-term impairments, such as packet losses or CPU anomalies, are often too short to even enable real-time investigation – the event has more often than not cleared before operations personnel could even be informed about it, let alone investigate. And short-term events are expected – no matter how careful network planning and design is, occasional short-term packet losses, for example, will occur.
However, even a transient problem might warrant attention if it is recurring and/or a leading indicator of a more serious incident that is likely to occur. Instead, we need to focus on identifying events that are sufficiently large, chronic (recurring), or persist for a significant period of time. Simple thresholding is typically used to achieve this – an event is identified when a parameter of interest exceeds a predefined value for a specified period of time. For example, an event may be declared when packet loss exceeds 0.3% over three consecutive polling periods across an IP/MPLS backbone network. However, note that more complicated signatures can be and are indeed used to detect some conditions. Section 12.4.1 discusses this in more detail, including how appropriate thresholds can be identified. Once we have detected an issue – whether it is a fault or performance impairment – our goal is to report it so that appropriate action can be taken. Event notification is realized through the generation of an alarm or an alert, which is sent to the event management platform as illustrated in Fig. 12.1. We distinguish between these two – an alarm traditionally describes the notification of a fault, while an alert is a notification of a performance event. Alarms and alerts themselves have a life span – they start with a SET and typically end through an explicit CLEAR. Thus, explicit notifications of both the start of an event (the SET) and the end of an event (CLEAR) generally need to be conveyed to the event management system. SNMP traps or informs (depending on the SNMP version) are used by routers to notify the event management system of events. Given the predominance of SNMP in IP/MPLS, this same mechanism is also often used between other collectors in the instrumentation layer and the event management layer. 
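The thresholding rule and the SET/CLEAR life span described above can be sketched as follows. This is a minimal illustration in Python under our own assumptions: the function name `threshold_events` is hypothetical, a SET is raised once the parameter has exceeded the threshold for the required number of consecutive polling periods, and a CLEAR is raised on the first later sample that is back at or below the threshold. Production systems may well apply a different clearing rule (for example, requiring several good periods before clearing).

```python
from typing import Iterable, Iterator, Tuple


def threshold_events(samples: Iterable[float], threshold: float,
                     periods: int) -> Iterator[Tuple[int, str]]:
    """Yield (sample_index, "SET" | "CLEAR") for a simple persistence threshold."""
    run = 0          # consecutive samples above the threshold
    active = False   # True between a SET and its CLEAR
    for index, value in enumerate(samples):
        if value > threshold:
            run += 1
            if not active and run >= periods:
                active = True
                yield index, "SET"
        else:
            run = 0
            if active:  # assumed rule: one good sample clears the alert
                active = False
                yield index, "CLEAR"


# Packet loss (%) per polling period; threshold 0.3% over three consecutive periods.
if __name__ == "__main__":
    loss = [0.1, 0.4, 0.5, 0.2, 0.4, 0.5, 0.6, 0.7, 0.1]
    print(list(threshold_events(loss, threshold=0.3, periods=3)))
```

In this hypothetical series the first two high samples never produce an alert because the run is broken after two periods, which is exactly the short-lived noise that the persistence requirement is meant to suppress.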
Traps/informs represent asynchronous reports of events and are used to notify network management systems of state changes, such as links and protocols going down or coming up. For example, if a link fails, then the routers at either end of the link will send SNMP traps to the network management system collecting these. These traps typically include information about the source of the event, the details of the type of event, the parameters, and the priority.

Figure 12.3 depicts logs detailing traps collected from a router. The logs report the time and location of the event, the type of event that has occurred and the SNMP MIB Object ID (OID). In this particular case, we are observing the symptoms associated with a line-card failure, where the line-card contains a number of physical and logical ports. Each port is individually reported as having failed. This includes both the physical interfaces (denoted in the Cisco router example below as T3 4/0/0, T3 4/0/1, Serial4/1/0/3:0, Serial4/1/0/4:0, Serial4/1/0/5:0, Serial4/1/0/23:0, Serial4/1/0/24:0, Serial4/1/0/25:0) and the logical interfaces (denoted in the example below as Multilink6164 and Multilink6168). In the example shown in Fig. 12.3, the OID that denotes the link down event is “4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0” – and can be seen on each event notification.

1238081113 10 Thu Feb 21 03:10:07 2009 Router_A - Cisco LinkDown Trap on Interface T3 4/0/0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081117 10 Thu Mar 26 15:25:17 2009 Router_A - Cisco LinkDown Trap on Interface T3 4/0/1;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081118 10 Thu Mar 26 15:25:18 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/3:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081118 10 Thu Mar 26 15:25:18 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/4:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081119 10 Thu Mar 26 15:25:19 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/5:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081124 10 Thu Mar 26 15:25:24 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/23:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081124 10 Thu Mar 26 15:25:24 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/24:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081125 10 Thu Mar 26 15:25:25 2009 Router_A - Cisco LinkDown Trap on Interface Serial4/1/0/25:0;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081125 10 Thu Mar 26 15:25:25 2009 Router_A - Cisco LinkDown Trap on Interface Multilink6164;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0
1238081125 10 Thu Mar 26 15:25:25 2009 Router_A - Cisco LinkDown Trap on Interface Multilink6168;4.1.3.6.1.6.3.1.1.5.3.1.3.6.1.4.1.9.1.46 0

Fig. 12.3 Example logs from (anonymized) SNMP traps

12.2.2 Event Management System

The preceding section discussed the vast monitoring infrastructure deployed in modern IP/MPLS networks and used to detect network events. However, the number of event notifications created by such an extensive monitoring infrastructure would simply overwhelm a network operator – completely obscuring the real incident in an avalanche of alarms and alerts.
Manually weeding out the noise from the true root cause could take hours or even longer for a single incident – time during which critical customers may be unable to effectively communicate, watch TV, and/or access the Internet. This is simply not an acceptable mode of operation.

By way of a simple example, consider a fiber cut impacting a link between two adjacent routers. The failure of this link (the fault) will be observed at the physical layer (e.g., SONET) on the routers at both ends of the link. It will also be observed in the different protocols running over that link – for example, in PPP, in the intradomain routing protocol (OSPF or IS-IS), in LDP and in PIM (if multicast is enabled), and potentially even through BGP. These will all be separately logged via syslog messages; failure notifications (traps) would be generated by the routers at both ends of the link in question to indicate that the link is down. The same incident would be captured by route monitors (in this case, by the intradomain route protocol monitors [3]). Network management systems monitoring the lower-layer (e.g., layer one/physical layer) technologies will also detect the incident, and will alarm. Finally – should congestion result from this fault – performance monitoring tools may report excessively high load and packet losses observed from the routers, end-to-end performance degradations may be reported by the end-to-end performance monitoring, and application monitors would send alerts if customer services are impacted.

Thus, even a single fiber cut can result in a plethora of event notifications. The last thing we need is for the network operator to manually delve into each one of these to identify the one that falls at the heart of the issue – in this case, the alarm that denotes a loss of incoming signal at the physical layer. Instead, automated event management systems as depicted in Fig.
12.1 are used to automatically identify both an incident’s origin and its impact from among the flood of incoming alarms. The key to achieving this is event correlation – taking the incoming flood of event notifications, automatically identifying which notifications are associated with a common incident, and then analyzing them to identify the most likely explanation and any relevant impact. The resulting correlated events are input to the ticket creation process; tickets are used to notify operations personnel of an issue requiring their attention. In the above example, the event’s origin can be crisply identified as being a physical layer link failure; the impact of interest to the network operator is any resulting congestion, the extent to which any service and/or network performance degradation is impacting customers, and how many and which customers are impacted. The greater the impact of an event, the more urgent is the need for the network operator to instigate repair.

12.2.2.1 Managing Event Notification Floods: Event Correlation

So how can we effectively and reliably manage this onslaught of event notifications? First of all, not all notifications received by an event management system need to be correlated and translated into tickets to trigger human investigation. For example, expected notifications corresponding to known maintenance events can be logged and discarded – there is no need to investigate these, as they are expected side effects of the planned activities. Similarly, duplicate notifications can be discarded – if a notification is received multiple times for the same condition, then only one of these notifications needs to be passed on for further analysis. Thus, as notifications are received by the event management system, they are filtered to identify those which should be forwarded for further correlation.
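The two filters just described (maintenance suppression and duplicate removal) can be sketched as a single pass over the incoming notifications. The Python below is illustrative only: the `Notification` and `MaintenanceWindow` records, their fields, and the rule that a notification is a duplicate when it repeats the last reported state of the same condition are all assumptions made for this example, not the design of any particular event management product.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Notification(NamedTuple):
    timestamp: float
    device: str
    component: str
    event_type: str
    state: str  # "SET" or "CLEAR"


class MaintenanceWindow(NamedTuple):
    device: str
    start: float
    end: float


def filter_notifications(notifications: Iterable[Notification],
                         windows: List[MaintenanceWindow]) -> List[Notification]:
    """Return the notifications that should be forwarded for correlation."""
    forwarded: List[Notification] = []
    last_state: Dict[Tuple[str, str, str], str] = {}
    for n in notifications:
        # Expected side effect of planned work: drop it (a real system would log it).
        if any(w.device == n.device and w.start <= n.timestamp <= w.end
               for w in windows):
            continue
        # Duplicate: same condition reported again with an unchanged state.
        key = (n.device, n.component, n.event_type)
        if last_state.get(key) == n.state:
            continue
        last_state[key] = n.state
        forwarded.append(n)
    return forwarded
```

Keying duplicates on the last reported state, rather than on everything ever seen, means that a condition that clears and later sets again is forwarded as a new occurrence.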
Similarly, notifications of one-off events that are very short in nature – effectively “naturally” resolving themselves – can often be discarded as there is no action to be taken by the time that the notification makes it for further analysis. Thus, if an event notification SET is followed almost immediately by a CLEAR, then the event management system may choose to discard the event entirely. Such events can occur, for example, because of someone accidentally touching a fiber in an office, causing a very rapid, one-off impairment. There is little point to expending significant resources investigating this event – it has disappeared and is not something that someone in operations can do anything about – unless it continues to happen. Short conditions that continually recur are classified as chronics, and do require investigation. Thus, if an event keeps occurring, then a chronic event is declared and a corresponding ticket is created so that the incident can be properly investigated.

Event correlation, sometimes also referred to as root cause analysis (see footnote 1) within the event management literature, follows the event filtering process. Put simply, event correlation examines the incoming stream of network event notifications to identify those which occur at approximately the same time and are physically or logically related, and can be associated with a common explanation. These are then grouped together as being a correlated event. The goal here is to identify the originating event – effectively identifying the type and location of the underlying incident being reported. Note that the originating event may not have been directly observed – it is entirely possible that only symptoms of the incident were reported, without a direct event reporting the origin. For example, consider a router line-card that supports multiple different interfaces. It may not actually be possible to directly detect a line-card failure.
Instead, a line-card failure may be inferred from the observation of multiple (or all) interfaces failing within that line-card. The events reported to the event management system are thus the individual interface failures; the originating event is the line-card failure and must be inferred through event correlation. There are numerous commercial products available that implement alarm correlation for IP/MPLS networks, including HP’s Operations Center (previously OpenView) [5], EMC’s Ionix [6], and IBM’s Tivoli [7]. The basic idea behind these tools is the notion of causal relationships – understanding what underlying events cause what symptoms. Building this model requires detailed real-time knowledge and understanding of the devices and network behavior, and of the network topology. Given that network topology varies over time, the topology must be automatically derived. An event correlation engine then uses the discovered causal relationships to correlate the event notifications received. However, how the products implement event correlation varies. Some tools use defined rules or policies to perform the correlation. The rules are defined as “if condition(s) then conclusion” – conditions relate to observed events and information about the current network state, while the conclusion here for any given rule is the correlated event. These rules capture domain expertise from humans and can become extremely complex. The Codebook approach [8, 9] applies experience from graphs and coding to event correlation. A codebook is precalculated based on network topology and models of network elements and network behavior to model the signatures of different possible originating events. The event notifications received by an event management system implemented using codebooks are referred to as symptoms. 
Footnote 1: Note that although root cause analysis is a term often used by event management system vendors, we prefer to use the term “event correlation” here, as root cause more generally implies a far more detailed explanation than can be provided by event management systems. More details are provided later in this chapter.

These symptoms are compared against the set of known signatures; the signature that best matches the observed symptoms is selected to identify the originating event.

Event signatures are created by determining the unique combination of event notifications that would be received for each possible originating event. This can be inferred by looking at the network topology and how components within a router and between routers are connected. A model of each router is (automatically) created – denoting the set of line-cards deployed within a router, the set of physical ports in existence on a line-card, and the set of logical interfaces that terminate on a given physical port. The network topology then indicates which interfaces on which line-cards on which routers are connected together to form links. In large IP/MPLS networks, such information is automatically discovered. Once the set of routers in the network is known (either through external systems or via autodiscovery), SNMP MIBs can be walked to identify all the interfaces on the router, and the relationships between interfaces (for example, which interfaces are on which port and which ports are on which line-card).

Given knowledge of router structure and topology, event signatures can be identified by examining which symptoms would be observed upon each individual component failure. To illustrate this, let us consider the simple four-node scenario depicted in Fig. 12.4. In this example, we consider what would be observed if line-card C1 failed on router 2.
Such a failure would be observed on all of the four interfaces contained within the first line-card on router 2 (card C1), and also on interface I1 on port P1 and interface I1 on port P2 on card C1 of router 1, and on interface I2 on port P1 and interface I2 on port P2 on card C1 of router 3. These are all interfaces that are directly connected to interfaces on the failed line-card on router 2. If the failure of line-card C1 on router 2 were to happen, then alarms would be generated by the three routers involved and sent to the event management system. This combination of symptoms thus represents the signature of the line-card failure on router 2. This signature, or combination of event notifications, is unique – if this is observed, we conclude that the incident relates to the failure of line-card 1 on router 2.

[Fig. 12.4 Signature identification: four routers (Router 1 to Router 4), each with line-cards C1 and C2, ports P1 and P2 on each card, and interfaces I1 and I2 on each port]

Let us now return to the SNMP traps provided in Fig. 12.3. This example depicted the traps sent by an ISP router Router A to the event management system upon detecting the failure of a set of interfaces. Traps were sent corresponding to eight physical interfaces and two logical (multilink) interfaces – in this example, all of these interfaces connect to external customers, as opposed to other routers within the ISP. The ISP thus has visibility into only the local ISP interfaces (the ones reported in the traps in Fig. 12.3), and does not receive traps from the remote ends of the links – the ends within customer premises. Thus, the symptoms observed for the given failure mode are purely those coming from the one ISP router – nothing from the remote ends of the links as was the case for the previous example illustrated in Fig. 12.4.
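The derivation of line-card signatures from the router model and topology, and the selection of the best-matching signature, can be sketched in a few lines of Python. This is a simplified illustration and not the published codebook algorithm [8, 9]: the function names, the interface naming scheme (for example, "R2/C1/P1/I1"), and the use of a plain set-overlap (Jaccard) score as the measure of "best match" are all assumptions made here for clarity.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Set, Tuple

Interface = str  # hypothetical naming, e.g. "R2/C1/P1/I1"


def build_signatures(cards: Dict[str, Set[Interface]],
                     links: Iterable[Tuple[Interface, Interface]]
                     ) -> Dict[str, FrozenSet[Interface]]:
    """Signature of a card failure: its own interfaces plus their remote ends."""
    peer: Dict[Interface, Interface] = {}
    for a, b in links:
        peer[a] = b
        peer[b] = a
    signatures: Dict[str, FrozenSet[Interface]] = {}
    for card, interfaces in cards.items():
        symptoms = set(interfaces)
        symptoms.update(peer[i] for i in interfaces if i in peer)
        signatures[card] = frozenset(symptoms)
    return signatures


def best_match(observed: Set[Interface],
               signatures: Dict[str, FrozenSet[Interface]]) -> Optional[str]:
    """Return the card whose signature overlaps most with the observed symptoms."""
    def score(signature: FrozenSet[Interface]) -> float:
        union = observed | signature
        return len(observed & signature) / len(union) if union else 0.0

    best = max(signatures, key=lambda card: score(signatures[card]), default=None)
    if best is None or score(signatures[best]) == 0.0:
        return None  # nothing observed resembles any known signature
    return best
```

Because the match is scored rather than required to be exact, a signature can still be selected when one of the expected alarms is lost in transit. How much tolerance is appropriate is a design choice that this sketch does not attempt to settle.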
This set of received alarms is compared with the set of available signatures; comparison will provide a match between this set of alarms (symptoms) and the signature associated with the failure of a line-card (Serial4) on Router A. The resulting correlated event output from the event management system identifies the line-card failure, and incorporates the associated symptoms (namely the individual interface failures). Figure 12.5 depicts an example log of the correlated event that would have been output by an event management system for this failure example. The format of the alarm log here is consistent with alarms generated by a production event management system. In this particular example, slot (line-card) Serial4 on Router A was reported to be down. Supporting information included in the alarm log indicates that ten out of a total of ten configured interfaces on this line-card were down. This is a critical alarm, indicating that immediate attention was required.

Correlation can also be used across layers or across network domains to effectively localize events for more efficient troubleshooting. Consider the example illustrated in Fig. 12.6, in which a fiber cut occurs in the layer one (L1) network that directly interconnects two routers. If the layer one and the IP/MPLS networks are maintained by different organizations (a common situation), then there is little that can be done in the IP organization to repair the failure. However, the IP routers both detect the issue. From an IP perspective, cross-layer correlation can be used to identify that the incident occurred in a different layer; the IP organization should thus be notified, but with the clarification that this is informational – another organization (layer one) is responsible for repair.
Such correlations can save a tremendous amount of unnecessary resource expenditure, by accurately and clearly identifying the issue, and notifying the appropriate organization of the need to actuate repair.

03/26/2009 15:25:28: Incoming Alarm: Router_A:Interfaces|Slot Threshold Alarm: Router_A:Serial4 Down. There are 10 out of 10 interfaces down. Down Interface List: Serial4/0/0, Serial4/0/1, Serial4/1/0/3:0, Serial4/1/0/4:0, Serial4/1/0/5:0, Serial4/1/0/23:0, Serial10/1/0/24:0, Serial10/1/0/25:0, Multilink6164, Multilink6168|Critical

Fig. 12.5 Correlated event associated with the traps illustrated in Fig. 12.3

Fig. 12.6 Cross-layer correlation (two router interfaces, each connected to the L1 network)

The correlated (originating) event output by the event management system effectively identifies approximately where an event occurred and what type of event it was. However, this is still a long way from identifying the true root cause of an incident, and thus being able to rectify it. If we consider the example depicted in Fig. 12.4, event correlation is able to successfully isolate the problem as being related to line-card 1 on router 2. However, we still need to determine how and why the line-card failed. Was the problem in hardware or software? What in the hardware or software failed and why? Tremendous work is still typically required before reaching true issue resolution. Assuming that human investigation is required, this is achieved by automatically opening a ticket.

12.2.3 Ticketing

Tickets are used to notify operations personnel of events that require investigation, and to track ongoing investigations. Tickets are at the heart of troubleshooting network events, and record actions taken and their results. If the issue being reported has been detected within the network, then the tickets are automatically opened by the event management systems.
However, if the issue is, for example, first reported by a customer before it has been detected by the event management systems, then the ticket will likely have been opened by a human, presumably the customer or a representative from the ISP’s customer care organization. The tickets are opened with basic information regarding the event at hand, such as the date, time, and location of where the issue was observed, and the details of the correlated events and original symptoms that triggered the ticket creation. From there, operations personnel carefully record in the ticket the tests that they execute in troubleshooting the event, and the observations made. They record interactions with other organizations, such as operations groups responsible for other technologies in the network, or employees from equipment manufacturers (vendors). Clearly, carefully tracking observations made while troubleshooting a complex issue is critical for the person(s) investigating. Moreover, the tickets also serve the purpose of allowing easy communication about investigations across personnel, such as between the network management team and the customer service team, or to hand off across, say, a change of shift.

Once an investigation has reached its conclusion and the issue has been rectified, the corresponding ticket is closed. A resolution classification is typically assigned, primarily for use in tracking aggregate statistics and for offline analysis of overall network performance. Network reliability modeling is discussed in more detail in Chapter 4 and in Section 12.4 of this chapter.

Tickets provide the means to track and communicate issues and their current status. However, the real challenge lies in troubleshooting these issues.

12.2.4 Troubleshooting

Troubleshooting a network issue is analogous to being a private investigator – hunting down the offender and taking appropriate action to rectify the issue.
Drilling down to root cause can require keen detective instincts, knowing where to look and what to look for. Operations teams often draw upon vast experience and domain knowledge to rapidly delve into the vast depths of the network and related network data to crystallize upon the symptoms, and theorize over potential root causes. This is often under extreme pressure – the clock is ticking; customer service may be impaired until the issue is resolved.

The first step of troubleshooting a network incident is to collect as much information about the event as possible that may help with reasoning about what is happening or has happened. Clearly, a fundamental part of this involves looking at the event symptoms and impact. The major symptoms are generally provided in the correlated event that triggered the ticket’s creation. Additional information can be collected as the result of tests that the operator performs in the network during further investigation of the issue. This can also be complemented by historical data pulled from support systems or from analysis of actions taken previously within the network (e.g., maintenance activities) that could be related.

The tests invoked by the network operator range considerably in nature, depending on the type of incident being investigated. In general, they may include ping tests (“ping” a remote end to see if it is reachable), different types of status checks (executed as “show” commands on Cisco routers, for example), and on-demand end-to-end performance tests.

In addition to information about the event, it may also be important to find out about potentially related events and activities – for example, did something change in the network that could have invoked the issue? This could have been a recent change, or something that changed a while ago, waiting like a ticking time bomb until conditions were ripe for a dramatic appearance.
Armed with this additional information, the network operator works toward identifying what is causing the event, and how it can be mitigated. In the majority of situations, this can be achieved quickly – leading to rapid, permanent repair. However, some incidents are more complex to troubleshoot. This is when the full power of an extended operations team comes into force.

Let us consider a hypothetical example to illustrate the process of troubleshooting a complex incident. In this scenario, let us assume that a number of edge routers across the network fail one after another over a short period of time. As these are the routers that connect to the customers, it is likely that the majority of customers connected to these routers are out of service until the issue is resolved – the pressure is really on to resolve the issue as fast as possible!

Given that we are assuming that routers are failing one after another, it takes two or more routers to fail before it becomes apparent that this is a cascading issue, involving multiple routers. Once this becomes apparent, the individual router issues should be combined to treat the event as a whole. As discussed earlier, the first goal is to identify how to bring the routers back to being operational and stable so that customer service is restored. Achieving this requires at least some initial idea about the incident’s underlying root cause. Then, once the routers have been brought back in service, it is critical to drill down to fully understand the incident and root cause so that permanent repair can be achieved – ideally permanently eliminating this particular failure mode from the network.
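The observation that a cascade only becomes apparent once two or more routers have failed can be sketched as a sliding-window check over failure events. The 30-minute window, the threshold of two routers, and the function name detect_cascade are assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple


def detect_cascade(
    failures: Iterable[Tuple[float, str]],  # (timestamp in seconds, router)
    window_s: float = 1800.0,
    min_routers: int = 2,
) -> List[str]:
    """Return the routers in the first time window that contains failures on
    at least min_routers distinct routers, or [] if no such window exists."""
    events = sorted(failures)
    start = 0
    for end in range(len(events)):
        # Shrink the window until it spans at most window_s seconds.
        while events[end][0] - events[start][0] > window_s:
            start += 1
        routers = {router for _, router in events[start:end + 1]}
        if len(routers) >= min_routers:
            return sorted(routers)
    return []
```

Repeated failures of a single router do not count as a cascade here, since the set of distinct routers stays at one; such events would instead be candidates for the chronic-issue analysis discussed later in the chapter.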
The first step in troubleshooting such an incident is to collect as much information as possible regarding the symptoms observed, identify any interesting information that may shed light on the trouble, and create a timeline of events detailing when and where they occurred. This information is typically collated from the alarms, alerts, and tickets, by running diagnostic commands on the router, and from information collected within the various collectors contained within the instrumentation layer. Syslogs provide a huge amount of information about events observed on network routers. This information is complemented by that obtained from route monitors and performance data collected over time, and from logs detailing actions taken on the router by operations personnel and automated systems. The biggest challenge now becomes how to find something useful among the huge amount of data generated by the network. Operations personnel would painstakingly sift through all these data, raking through syslogs, examining critical router statistics (CPU, memory, utilization, etc.), identifying what actions were performed within the network during the time interval before the event, and examining route monitoring logs and performance statistics. Depending on the type of incident, and whether the routers are reachable (e.g., out of band), diagnostic tests and experimentation with potential actions to repair the issue are also performed within the network. In a situation where multiple elements are involved, it is also important to focus on what the routers involved have in common and how they may be different from other routers that were not impacted. Are the impacted routers all made by a common router vendor, are they a common router model? Do they share a common software version? Are they in a common physical or logical location within the network? Do they have similar number of interfaces, or load, or anything that may relate to the issue at hand? 
Have there been any recent network changes made on these particular routers that could have triggered the incident? Are there customers that are in common across these routers? Identifying the factors that are common and those that are not can be critical in focusing in on what may be the root cause of the issue at hand. For example, if the routers involved are from different vendors, then it is less likely to be a software bug causing the issue. And if there is a common customer associated with all the impacted routers, then it would make sense to look further into whether there is anything about the specific customer that may have induced the issue or contributed in some way. For example, was the customer making any significant changes at the time of the incident?

The initial goal is to extract sufficient insight into the situation at hand to indicate how service can be restored. Once this is achieved, appropriate actions can be taken to rectify the situation. However, in many situations, such as those induced by router bugs, the initial actions may restore service, but do not necessarily solve the underlying issue. Further (lengthy) analysis – potentially involving extensive lab reproduction and/or detailed software analysis – may be necessary to truly identify the real root cause. This is often known as postmortem analysis.

Large ISPs typically maintain labs containing scaled-down networks with the same types of hardware and software configurations as are operating in production. The labs provide a controlled environment in which to experiment without the risk of impacting customer traffic. The labs are used both to extensively test the hardware and software before it is deployed in the production network, and to understand issues as they arise in the production network.
When it comes to troubleshooting issues, testers use the lab environment to reproduce the symptoms observed in the field, evaluate the conditions under which they occur and take additional measurements in a bid to uncover the underlying root cause. The lab environment is also often used to experiment with potential solutions to address the issue, and to certify solutions before they are deployed in the production network.

Detailed analysis of software and hardware designs and implementations can also provide tremendous insight into why problems are occurring and how they can be addressed. These analyses are typically done by, or in collaboration with, the vendor in question.

12.2.5 Restore Then Repair

As discussed previously, recovering from an incident typically involves two goals: (1) restoring customer service and performance, and (2) fully resolving the underlying issue and returning network resources to service. These two goals may or may not be distinct – where redundancy exists, the network is typically designed to automatically reroute around failures, thus negating the need for manual intervention in restoring customer service. Troubleshooting and repair can instead focus on returning the failed network resources to service so that there is sufficient capacity available to absorb future failures. In other situations, customer service restoral and failure resolution may be one and the same. Let us consider two examples in more detail.

12.2.5.1 Core Network Failure

Fig. 12.7 Core network fiber cut (routers A–I in an IP/MPLS network)

Consider the example of a fiber cut impacting a link between two routers in the core of an IP/MPLS network. IP network cores are designed with redundancy and spare capacity, so that traffic can be rerouted around failed network resources.
This rerouting is automatically initiated by the routers themselves – the exact details of which depend on the mechanisms deployed, as discussed in Chapter 2. In the example in Fig. 12.7, traffic between the routers E and D is normally routed via router C. However, in the case where the link between routers C and D fails, the traffic is then rerouted via F and H.

Assuming that the network has successfully rerouted all traffic without causing congestion, the impetus for rapidly restoring the failed resources is to ensure that the capacity is available to handle future failures and/or planned maintenance events. If, however, congestion results from the failure, then immediate intervention is required by operations personnel to restore the customer experience. Immediate action would likely be taken to reroute or load-balance traffic, in a bid to eliminate the performance issues while the resources are being repaired – for example, by tuning the IGP weights (e.g., OSPF weight tuning). In the example of Fig. 12.7, if there were congestion observed on, say, the link between routers F and H, then operations personnel may need to reroute some of the traffic via routers F, I, and G. Note that this requires that operations have an appropriate network modeling tool available to simulate potential actions before they are taken. This is necessary to ensure that the actions to be taken will achieve the desired result of eliminating the performance issues being observed.

In this example, permanent repair is achieved when the fiber cut is repaired. This requires that a technician travel to the location of the cut, and resplice the impacted fiber(s).

12.2.5.2 Customer-Facing Failure

Let us now consider the failure of a customer-facing line-card in a service provider’s edge router. We focus on a customer that has only a single connection between the customer router and the ISP edge router, as illustrated in Fig. 12.8.
Cost-effective line-card protection mechanisms simply do not exist for router technologies today. Instead, providing redundancy on customer-facing interfaces requires that an ISP deploy a dedicated backup line-card for each working line-card. However, this may be prohibitively expensive; instead customers that need higher reliability can choose a multihomed redundancy option where the customer purchases two separate connections to either a common ISP or two different ISPs.

Fig. 12.8 Failure on customer-facing interface (customer router connected to a provider edge router of the IP/MPLS network; the failure is on the customer-facing interface)

In situations where redundancy is not deployed and customers are not multihomed, a failure of the ISP router line-card facing the customer will result in the customer being out of service until the line-card can be returned to service. If the line-card failure is caused by faulty hardware, the customer may be out of service until the failed hardware can be replaced, necessitating a rapid technician visit to the router in question. However, if the issue is in software, for example, service can potentially be restored via software rebooting of the line-card. Although this apparently fixes the issue and restores customer service, it is not a permanent repair. If this is likely to occur again, then the issue must be permanently resolved – the software must be debugged, recertified (tested) for network deployment, and then installed on each relevant router network-wide before permanent repair is achieved. This could involve upgrading potentially hundreds of routers – a major task, to say the least. In this case, the repair of the underlying root cause takes time, but is necessary to ensure that the failure mode does not occur again within the network. If the issue is extreme and a result of a newly introduced software release on the router, then the software may be “rolled back” to a previous version that does not suffer from the software bug.
This can provide a temporary solution while the newer software is debugged.

12.3 Process Automation

The previous section described the basic systems and processes involved in detecting and troubleshooting faults and performance impairments. These issues often need to be resolved under immense time pressure, especially when customers are being directly impacted. However, humans are inherently slow at many tasks. In contrast, computer systems can perform well-defined tasks much more rapidly, and are necessary to support the scale of a large ISP network. Although human reasoning about extremely complex situations is often difficult, if not impossible, to automate, automation can be used to support human analysis and to aid in performing simple, well-defined and repetitive tasks. This is referred to here as process automation.

Process automation is widely used in many aspects of network management, such as in customer service provisioning. Relevant to this chapter, it is also applied to the processes executed in network troubleshooting and even repair. Process automation, in combination with the event filtering and correlation discussed in Section 12.2.2.1, is what enables a small operations team to manage massive networks that are characterized by tremendous complexity and diversity. The automation also speeds trouble resolution, thereby minimizing customer service disruptions, and eliminates human errors, which are a fact of life, no matter how much process is put in place in a bid to minimize them.

Over time, operations personnel have identified a large number of tasks that are executed repeatedly and are very time-consuming when executed by hand. Where possible, these tasks are automated in the process automation capabilities illustrated in the modified event management framework of Fig. 12.9. The process automation system lies between the event management system and the ticketing system.
One of its major roles is to provide the interface between these two systems – listening to incoming correlated events and opening, closing, and updating tickets. But rather than simply using the output of the event management system to trigger ticket creation and updates, process automation can take various actions based on network state and the situation at hand. The system implementing the process automation collects additional diagnostic information, and reasons about relatively complex scenarios before taking action. On creating or updating a ticket, the process automation system automatically populates the ticket with relevant information that it collected in evaluating the issue. This means that there is already a significant amount of diagnostic information available in the ticket as it is opened by a human being to initiate investigation into an incident. This can dramatically speed up incident investigation, by eliminating the need for humans to go out and manually execute commands on the network elements to collect initial diagnostic information.

Fig. 12.9 Event management framework incorporating process automation (components shown: ticketing system; event manager with process automation and inventory database; event management system (event correlation) receiving alarms/alerts; instrumentation layer with application and service monitoring, route monitors, syslog collector, traffic (e.g., Netflow), router performance (SNMP), liveness testing, router faults, and a network interface for troubleshooting and repair; IP/MPLS network)

The process automation system interfaces with a wide range of different systems to execute tasks. In addition to the event management and ticketing systems, a process automation system also interacts closely with network databases and with the network itself (either directly, or indirectly through another system).
For example, collecting diagnostic information related to an incoming event will likely involve reaching out to the network elements and executing commands to retrieve the information of interest. This could be done either directly by the process automation system, or via an external network interface, as is illustrated in Fig. 12.9. Process automation is often implemented using an expert system – a system that attempts to mimic the decision-making process and actions that a human would execute. The logic used within the expert system is defined by rules or policies, which are created and managed by experts. The number of rules in a complex process automation system is typically in the order of 100s–1,000s given the complexity and extensive range of the tasks at hand. Rules are continually updated and managed by the relevant experts as new opportunities for automation are conceived and implemented, processes are updated and improved, and the technologies used within the network evolve and change. Figure 12.10 depicts an example that demonstrates the process automation steps executed upon receipt of a basic notification that a slot (or line card) is down (failed). A slot refers to where a single line card is housed within a router; the line-card in turn is assumed to support multiple interfaces. Each active interface terminates a connection to a customer, peer, or adjacent network router. As can be seen from this example, the system executes a series of different tests and then takes different actions depending on the outcome(s) of each test. The tests and actions executed are specific to the type of event triggering the automation, and also to the router model in question. Note that in this particular case, the router in question is a Cisco router, and thus Cisco command line interface (CLI) commands are executed to test the line-card. The process automation example in Fig. 12.10 is for supporting the network operations team. 
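One reading of the slot-down checks in Fig. 12.10 (slot diagnostics, then interface status, then a ping per interface) can be sketched as a small decision procedure. This is only a sketch under stated assumptions: the inputs stand in for already-parsed command output, the ping callable is supplied by the caller, the function name slot_down_triage is hypothetical, and the crash dump retrieval and SHOW TECH SUPPORT steps are omitted.

```python
from typing import Callable, Dict, Tuple


def slot_down_triage(
    slot_ok: bool,
    interfaces: Dict[str, Tuple[str, str]],  # name -> (status, address)
    ping: Callable[[str], bool],             # True if the address answers
) -> str:
    """Return the first finding that would be recorded in the ticket."""
    if not slot_ok:
        return "slot diagnostics not OK"
    for name, (status, address) in interfaces.items():
        # UP/DN and DN/DN halt the checks, as in the figure.
        if status in ("UP/DN", "DN/DN"):
            return f"interface {name} is {status}"
        # Any other status (normally UP/UP) is followed by a ping test.
        if not ping(address):
            return f"interface {name} is {status} but ping to {address} failed"
    return "slot and interfaces passed all checks"
```

Passing the ping function in as a parameter keeps the decision logic separate from network access, so the same rule can be exercised in a lab or a unit test without touching a router.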
The network operations team is focused on managing the network elements and is not responsible for troubleshooting individual customer issues – those are handled by the customer care organization. Thus, a primary goal of the process automation in this context is to identify issues with network elements, weeding out individual customer issues (and not creating tickets on them for this particular team).

PROCESS AUTOMATION SLOT DOWN RULES:
1. Alarm is captured by specific process automation rules for trouble.
2. Network interface script is executed to telnet to affected Network Element (NE).
3. SHOW DIAG command is executed on NE to obtain the affected slot diagnostics. Results are parsed.
4. Affected slot diagnostics are tested. If Slot reports not OK, ticket is generated.
5. Otherwise, if Slot reports ok, SHOW INTERFACE BRIEF command is executed to verify the state of the configured interfaces and results are parsed for associated interfaces on slot.
6. Status of associated interfaces on slot is checked. If interface status is UP/DN or DN/DN script is halted and ticket is generated.
7. Otherwise, if interface status is UP/UP, a PING attempt is made to the interface address to verify if it is accessible or not.
8. If PING fails, ticket is generated.
9. Otherwise, if PING is successful, check is made for additional interfaces and repeat from 6 until no additional interfaces.
10. Directory of bootflash is retrieved and parsed looking for a crash dump file. If crash dump is found, file is held and TFTP to network interface system and path to file is provided for ticket generation.
11. SHOW TECH SUPPORT command is executed and output is provided for ticket creation as well.
12. Ticket is generated with all captured information populated within ticket comments / log for further investigation.

Fig. 12.10 Example process automation rule – router “slot down” (the accompanying diagram shows the event manager, ticketing, process automation, event management system, instrumentation layer, network interface, and network)

In this example of a customer-facing line-card with multiple customers carried on different interfaces on the same card, the automation ensures that there are multiple interfaces that are simultaneously experiencing issues, as opposed to being associated with a single customer. This makes it likely that the issue at hand relates to the line-card (part of the network), rather than the customer(s).

The results of the tests executed by the process automation system are presented to the operations team through the ticket that is created or updated. Thus, as an operations team member opens a new ticket ready to troubleshoot an issue, he/she is immediately presented with a solid set of diagnostic information, eliminating the trouble and delay associated with logging into individual routers and running these same tests. This significantly reduces the investigation time, the customer impact, and the load on the operations team.

In addition to ensuring that only relevant tickets are created, process automation also distinguishes between new issues and new symptoms associated with an existing, known problem. Thus, if a ticket has already been created for a current incident, when a new symptom is detected for this same ongoing event, the process automation will associate the new information with the existing ticket, as opposed to creating a new ticket.

For a more complex scenario, now consider a customer-initiated ticket, created either directly by the customer or by a customer’s service representative. The process automation system picks up the ticket automatically and, with support from other systems, launches tests in an effort to automatically analyze the issue. If the system can localize the issue, it may either automatically dispatch workforce to the field or refer the ticket to the appropriate organization, which may even be another company (e.g., where another telecommunications provider may be providing access). On receiving confirmation that the problem has been resolved, the process automation system also executes tests to validate, and then closes the ticket if all tests succeed. If the expert system is unable to resolve the issue, it then collates information regarding the diagnostic tests and creates a ticket to trigger human investigation and troubleshooting.

The opportunity space for process automation is almost limitless. As technologies and modeling capabilities improve, automation can be, and is being, extended into service recovery and repair and even incident prevention. We refer to this as adaptive maintenance. Although it is clearly hard to imagine automated fiber repairs in the near future, there are scenarios in which actions can be automatically taken to restore service. For example, network elements or network interfaces can be rebooted automatically should such action likely restore service. Consider one scenario where this may be an attractive solution – a known software bug is impacting customers and the operator is waiting for a fix from the vendor; in the meantime, if an issue occurs, then the most rapid way of recovering service is through the automated response – not waiting for human investigation and intervention.

As another example, let us consider errors in router configurations (misconfigurations), which can be automatically fixed via process automation. Consider a scenario where regular auditing such as that discussed in Chapter 9 identifies what we refer to as “ticking time bombs” – misconfigurations that could cause significant customer impact under specific circumstances. These misconfigurations are automatically detected, and can also be automatically repaired – the “bad” configuration being replaced with “good” configuration, thereby preventing the potentially nasty repercussions.
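The audit-then-repair idea for misconfigurations can be illustrated with a minimal sketch that compares a router's settings against a reference ("golden") set and optionally replaces the deviating values. The dictionary representation of a configuration, the setting names in the usage note, and both function names are assumptions for illustration; the sketch defaults to a dry run so that nothing is changed unless explicitly requested.

```python
from typing import Any, Dict, Tuple

Deviations = Dict[str, Tuple[Any, Any]]  # setting -> (current, expected)


def audit_config(running: Dict[str, Any], golden: Dict[str, Any]) -> Deviations:
    """List the settings whose running value differs from the golden value."""
    return {
        key: (running.get(key), expected)
        for key, expected in golden.items()
        if running.get(key) != expected
    }


def repair_config(
    running: Dict[str, Any],
    golden: Dict[str, Any],
    dry_run: bool = True,
) -> Tuple[Dict[str, Any], Deviations]:
    """Return (resulting config, deviations found).

    With dry_run=True the config is returned unchanged, so the deviations can
    be reviewed and logged before any automated change is made.
    """
    deviations = audit_config(running, golden)
    result = dict(running)  # never modify the caller's copy
    if not dry_run:
        for key, (_current, expected) in deviations.items():
            result[key] = expected
    return result, deviations
```

For instance, with a hypothetical running configuration {"bfd": "disabled", "mtu": 4470} and golden configuration {"bfd": "enabled", "mtu": 4470}, the audit reports only the "bfd" setting, and a non-dry-run repair sets it to "enabled". Returning the deviations alongside the result supports the tracking of automated actions.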
While adaptive maintenance promises to greatly reduce recovery times and eliminate human errors that are inevitable in manual operations, a flawed adaptive maintenance capability can create damage at a scale and speed that is unlikely to be matched by humans. It is thus crucial to carefully design and implement such an adaptive maintenance system, and ensure that safeguards are introduced to prevent potentially larger issues from arising with the wrong automated response. Meticulous tracking of automated actions is also critical, to ensure that automated repair does not hide underlying chronic issues that need to be addressed. However, even with these caveats and warnings, automation often offers an opportunity for far more rapid recovery from issues than a human being could achieve through manually initiated actions.

The value of automation throughout the event management process (from event detection through to advanced troubleshooting) is unquestionable – reducing millions of event notifications to a couple of hundred or fewer tickets that require human investigation and intervention. This automation is what allows a small network operations team to successfully manage a massive network, with rapidly growing numbers of network elements, customers, and complexity.

12.4 Managing Network Performance Over Time

The event management systems discussed in Sections 12.2 and 12.3 primarily focus on real-time troubleshooting of individual large network events that persist over extended periods of time, such as link failures, or recurring intermittent (chronic) flaps on individual links. However, this narrow view of looking at each individual event in isolation risks leaving network issues flying under the event management radar, while potentially impacting customers’ performance.

Let us consider an analogy of financial management within a large corporate or government organization.
In keeping a tight rein over a budget, an organization would likely be extremely careful about managing large transactions – potentially going to great lengths to approve and track each of the individual large purchases made across the organization. However, tracking large transactions without considering the broader picture can result in underlying issues flying under the radar. A single user’s individual transaction may appear fine in isolation, but analysis over a longer time interval may uncover an excessively large number of such transactions – something that may justify further investigation. Focusing only on the large transactions, and allowing smaller transactions to proceed without attention, allows the system to scale – it would simply be impractical to have each individual request approved, independent of its cost. However, if no one is tracking the bottom line – the total expenditure – it may well be that the small expenditures add up considerably, and could lead to financial troubles in the long run. Instead, careful tracking of how money is spent across the board is critical, characterizing at an aggregate level how, where, and why this money is spent, whether it is appropriate and (in situations where money is tight) where there may be opportunities for reductions in expenditure. New processes or policies may well be introduced to address issues identified. However, this can only be seen with careful analysis of longer-term spending patterns covering both large and small expenditures.

Returning to the network, carefully managing network performance also requires examining network events holistically – exploring series of events – large and small – instead of purely focusing on each large event in isolation. The end goal is to identify actions that can be taken to improve overall network performance and reliability. Such actions can take many forms, including software bug fixes, hardware redesigns, process changes, and/or technology changes.
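To make the idea of examining series of events concrete, the sketch below counts all events per network element over a long interval, regardless of severity, and reports the elements whose totals reach a threshold. The event tuple layout, the threshold of five, and the function name chronic_offenders are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def chronic_offenders(
    events: Iterable[Tuple[str, str]],  # (network element, severity)
    min_count: int = 5,
) -> List[Tuple[str, int]]:
    """Return (element, event count) pairs for elements with at least
    min_count events, most frequent first (ties ordered by element name)."""
    # Severity is deliberately ignored: small events count as much as large.
    counts = Counter(element for element, _severity in events)
    frequent = [(e, c) for e, c in counts.items() if c >= min_count]
    return sorted(frequent, key=lambda item: (-item[1], item[0]))
```

An element that produces many minor events would never stand out when each event is judged in isolation, but it surfaces here in the same way that many small transactions surface in an aggregate spending review.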
An important step toward driving network performance is to carefully track performance over time, with periods of poor performance identified so that intervention can be initiated. For example, if it is observed that the network has recently been demonstrating unacceptably high unavailability due to excessive numbers of linecard failures, then investigation should be initiated in an effort to determine why this is occurring, and to take appropriate actions to rectify the situation. However, we do not need to wait for performance to degrade – regular root cause analysis of network impairments can identify areas for improvements, and potentially even uncover previously unrecognized yet undesirable network behaviors that could be eliminated. For example, consider a scenario where ongoing root cause analysis of network packet loss uncovered chronic slower-than-expected recovery times in the face of network failures. Once identified, efforts can be initiated to identify why this is occurring (router software bug?, hardware issue?, fundamental technology limitations?), and to then drive either new technologies or enhancements to existing technologies into the network to permanently rectify this situation.
Tracking network performance over time and drilling into root causes of network impairments typically involves delving into large amounts of historical network data. Exploratory Data Mining (EDM) is thus used to complement real-time event management through detailed analysis across large numbers of events, identifying patterns in failures and performance impairments and in the root causes of these network events. However, this is clearly a challenging goal; the volumes of network data are tremendous, the data sources are very diverse, and the patterns to be identified can be complex.
12.4.1 Trending Key Performance Indicators (KPIs)
So let us start by asking a seemingly simple question – how well is my network performing?
The first step here is to clearly define what we mean by network performance, so that we can provide metrics that can be evaluated using available network measurements. We refer to these metrics as Key Performance Indicators, or KPIs. When tracking how well the network is performing, it is important to ensure that metrics obtain a view that is as close as possible to what customers are experiencing. However, how well the network is performing is in the eye of the beholder – and different beholders have different vantage points, and different criteria. Some applications (e.g., e-mail or file transfers) are extremely resilient to short outages while others, such as video, are extremely sensitive to even very short-term impairments. Thus, KPIs need to capture and track a range of different performance metrics, which reflect the diversity of applications being supported. KPIs should track application measures, such as the frequency and severity of video impairments that would be observable to viewers. However, network-based metrics are also critical, particularly in networks where there is a vast array of different applications being supported. Thus, KPIs should include, but not be limited to, metrics tracking network availability (DPM – see Chapter 3 for details), application performance, and end-to-end network performance (packet loss, delay, and jitter). KPIs can also capture noncustomer-impacting measures of network health, such as the utilization of network resources. These provide us with the ability to track network health before we hit customer-impacting issues. There are many limited resources in a router – link capacity is an obvious one; router processing power and memory are two other key examples. Link capacity is traditionally tracked as part of the capacity management process discussed in Chapter 5, and is therefore not discussed further here. 
However, router CPU and memory utilization – both on the router’s central route processor and on individual line-cards – are also limited yet critical resources that are often less well analyzed than link capacity. One of the most critical functions that the central route processor is responsible for is control-plane management – ensuring that routing updates are successfully received and sent by the router, and that the internal forwarding within the router is appropriately configured to ensure successful routing of traffic across the network. Thus, the integrity of the network’s control plane is very much dependent on router CPU usage – if CPUs become overloaded with high-priority tasks, the integrity of the network control plane could be put at risk. Router memory is similarly critical – if memory becomes fully utilized within a router, then the router is at risk of crashing, causing a nasty outage. Thus, these limited resources must be tracked to ensure that they are not approaching exhaust either over the long term, or over shorter periods of time.
KPIs must be measurable – thus, a given KPI must map to a set of measurements that can be made on an ongoing basis. Availability-based KPIs can be readily calculated based on logs from the event management system and troubleshooting analyses. Chapters 3 and 4 discuss availability modeling and associated metrics; they are thus not discussed further here. Application-dependent performance metrics can be obtained through extensive monitoring at the application level. Such measurements can be implemented either using “test” measurements executed from sample devices strategically placed across the network (active measurements), or by collecting statistics from network monitors or from user devices where accessible (passive measurements).
Such application measurements are a must for networks that support a limited set of critical applications – such as an IPTV distribution network. However, it is practically impossible to scale this to every possible application type that may ride over a general-purpose IP/MPLS backbone, especially if application performance depends on a plethora of different customer end devices. Instead, end-to-end measurements of key network performance criteria, namely packet loss, delay, and jitter, are the closest general network measures of customer-perceived network performance. These measurements provide a generic, application-independent measure of network performance, which can (ideally) be applied to estimate the performance of a given application.
End-to-end packet loss, delay, and jitter would likely be captured from an end-to-end monitoring system, such as described in Chapter 10 and [10]. In an active monitoring infrastructure, for example, large numbers of end-to-end test probes are sent across the network; loss and delay measurements are calculated based on whether these probes are successfully received at each remote end, and how long they take to be propagated across the network. From these measurements, network loss, delay, and jitter can be estimated over time for each different source and destination pair tested. By aggregating these measures over larger time intervals, we can calculate a set of metrics, such as average or 95th percentile loss/delay for each individual source/destination pair tested. These can be further aggregated across source/destination pairs to obtain network-wide metrics. However, how loss is distributed over time can really matter to some applications – a continuous period of loss may have greater (or lesser) impact on a given application than the same total loss distributed over a longer period of time.
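To make the aggregation step concrete, the following is a minimal sketch rather than a description of any particular monitoring system. It assumes that each probe is recorded as a hypothetical tuple `(src, dst, delay_ms)`, with `delay_ms` set to `None` when the probe was lost, and that probes are listed in the order in which they were sent. The function names and the nearest-rank percentile are illustrative choices. Alongside the loss rate and delay statistics, the sketch records the longest run of consecutive lost probes, since two source/destination pairs with the same loss rate can differ greatly in how that loss is distributed over time.

```python
import math
from collections import defaultdict
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def longest_loss_run(lost_flags):
    """Length of the longest run of consecutive lost probes."""
    longest = current = 0
    for lost in lost_flags:
        current = current + 1 if lost else 0
        longest = max(longest, current)
    return longest


def pair_kpis(probes):
    """Aggregate probe results into per-pair KPIs.

    probes: iterable of (src, dst, delay_ms), in send order;
            delay_ms is None for a lost probe (assumed record format).
    """
    by_pair = defaultdict(list)
    for src, dst, delay in probes:
        by_pair[(src, dst)].append(delay)

    kpis = {}
    for pair, delays in by_pair.items():
        received = [d for d in delays if d is not None]
        kpis[pair] = {
            "loss_rate": (len(delays) - len(received)) / len(delays),
            "mean_delay_ms": mean(received) if received else None,
            "p95_delay_ms": percentile(received, 95) if received else None,
            "longest_loss_run": longest_loss_run(d is None for d in delays),
        }
    return kpis


if __name__ == "__main__":
    # Two hypothetical pairs with the same loss rate but different loss patterns.
    bursty = [("A", "B", d) for d in (10.0, 12.0, 11.0, 13.0, 50.0, 12.0)]
    bursty += [("A", "B", None)] * 4
    spread = [("C", "D", d) for d in
              (10.0, None, 10.0, None, 10.0, None, 10.0, None, 10.0, 10.0)]
    for pair, values in pair_kpis(bursty + spread).items():
        print(pair, values)
```

In the example data, both pairs lose four probes out of ten, yet the first loses them in one continuous burst while the second loses isolated probes. A loss-rate KPI alone would not distinguish the two.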
KPIs can thus also examine other characteristics of the loss beyond simple loss rate measurements – for example, tracking loss intervals (continuous periods of packet loss). Metrics that track the number of “short” versus “long” duration outages may be used to characterize network performance and its impact on various network-based applications.
Ideally, end-to-end measurements should extend out as far to the customer as possible, preferably into the customer domain. However, this is often not practical – scaling to large numbers of customers can be infeasible, and the customer devices are often not accessible to the service provider. Thus, comprehensive end-to-end measurements are often only available between network routers, leaving the connection between the customer and the network beyond the scope of the end-to-end measurements.
Tracking performance at the edge of the ISP network thus requires different metrics. One popular metric is BGP flap frequency aggregated over different dimensions (e.g., across the network, per customer, etc.). However, it is important to note that by definition the ISP/customer and ISP/peer interfaces cross trust domains – the ISP often only has visibility and control over its own side of this boundary, and not the customer and peer domains. It is actually often extremely challenging to distinguish between customer/peer-induced issues and ISP-induced issues. Thus, without knowing about or being responsible for customer and peer activities on the other side of trust boundaries, BGP event measures and other similar metrics can be seriously skewed. A customer interface that is flapping incessantly can significantly distort these metrics, making it extremely challenging to distinguish patterns that may be attributed to the ISP.
Once we have defined our key metrics, we can then track how these change over time. This is known as trending.
Trending of KPIs is critical for driving the network to higher levels of reliability and performance, and for identifying areas and opportunities for improvement. The goal is to see these KPIs improve over time, corresponding to network and service performance improvements. However, if KPIs turn south, indicating worsening network and service conditions, investigation would likely be required. KPIs can thus be used to focus operations’ attention on the areas that most need immediate attention. Let us consider a simple example of end-to-end loss. If the loss-related KPIs (e.g., average loss) degrade, then investigation would be required to understand the underlying root cause(s) and to (hopefully) initiate action(s) to reverse the negative trend. Obviously, the actions taken depend on the root cause(s) identified – but could include capacity augments, elimination of failure modes (e.g., if loss may be introduced by router hardware or software issues), or may even require the introduction of a new technology, such as faster failure recovery mechanisms.
Careful tracking of KPIs over time can also enable the detection of anomalous network conditions – thereby detecting issues that may be flying under the radar of the event management systems described in Section 12.2. Let us consider an example where the rate of protocol flaps has increased within a given region of the network. The individual flaps are too short to report on – each event has cleared even before a human can be informed of it, let alone investigate. Thus, the real-time event management system would only detect an issue if the number of flaps occurring during a given time duration and in a given location exceeds a predefined threshold, upon which the flapping is defined to be chronic. If the number of flaps on individual interfaces does not cross this threshold, then an aggregate increase in flaps across a region may go undetected by the event management system.
However, this aggregate increase could be indicative of an unexpected condition, and be impacting customers. It would thus require investigation.
Careful trending and analysis of KPIs can also be used to identify new and improved event signatures, which can be incorporated in the real-time event management system discussed in Section 12.2. These new signatures are designed to better detect individual events that should be reported to operations personnel. The identification of new signatures is a continual process, typically leveraging the vast and evolving experience gained by network operators as they manage the network on a day-to-day basis. However, this human experience can be complemented by data mining.
It is far from easy to specify what issues should be reported amidst the mound of performance data collected from the routers. For example, under what conditions should we consider CPU load to be excessive, and thus alarm on it to trigger intervention? At what point does a series of “one off” events become a chronic condition that requires immediate attention? Individual performance events used to trigger event notifications are typically detected using simple threshold crossings – an event is identified when the parameter of interest exceeds a predefined value (threshold) for a given period of time. However, even selecting this threshold is often extremely challenging. How bad and for how long should an event persist before operations should be informed for immediate investigation? Low level events often clear themselves; if we are overly sensitive at picking up events to react to, then we risk generating too many false alarms, causing operations personnel to spend most of their time chasing false alarms and risking them missing the critical issues among the noise.
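A threshold-crossing detector of the kind just described can be sketched in a few lines. The sketch below is illustrative only: it assumes samples arrive as hypothetical `(timestamp_seconds, value)` pairs sorted by time, and the parameter names `threshold` and `min_duration` are our own. It reports an event only when the value stays above the threshold for at least `min_duration` seconds, so that a short-lived excursion that clears itself is not reported.

```python
def threshold_events(samples, threshold, min_duration):
    """Find intervals where a metric stays above a threshold long enough.

    samples:      list of (timestamp_seconds, value), sorted by timestamp
    threshold:    value that must be exceeded
    min_duration: minimum time (seconds) between the first and last
                  consecutive samples above the threshold
    Returns a list of (start_timestamp, end_timestamp) tuples.
    """
    events = []
    start = last = None
    for ts, value in samples:
        if value > threshold:
            if start is None:
                start = ts          # excursion begins
            last = ts               # most recent sample above threshold
        else:
            if start is not None and last - start >= min_duration:
                events.append((start, last))
            start = last = None     # excursion has cleared
    # Handle an excursion still in progress at the end of the data.
    if start is not None and last - start >= min_duration:
        events.append((start, last))
    return events


if __name__ == "__main__":
    # Hypothetical CPU utilization (%) sampled every 5 minutes.
    cpu = [(0, 40), (300, 95), (600, 50), (900, 92),
           (1200, 96), (1500, 97), (1800, 60)]
    print(threshold_events(cpu, threshold=90, min_duration=600))
    print(threshold_events(cpu, threshold=90, min_duration=0))
```

With the hypothetical samples shown, requiring ten minutes of persistence suppresses the isolated spike at 300 seconds and reports only the sustained excursion, whereas setting `min_duration` to zero reports both. The choice of the two parameters directly controls the balance between false alarms and missed events.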
If we are not sensitive enough, then critical issues may fly under the radar and not be appropriately reacted to in a timely fashion. Analysis of vast amounts of network data can be key to selecting suitable thresholds so as to carefully manage the rate of false positives and false negatives. However, note that the thresholds selected may not actually be constant values – in some cases they could vary over time, or may vary over different parts of the network. Thus, it may actually be sensible to have these thresholds be automatically learned and adjusted as the network evolves over time.
Simple thresholding techniques can also be complemented by more advanced anomaly detection mechanisms. For example, consider a router experiencing a process memory leak. In such a situation, the available router memory will (gradually) decrease – the rate of decrease being indicative of the point at which the router will hit memory exhaust and likely cause a nasty router crash. Under normal conditions, router memory utilization is relatively flat; a router with a memory leak can be detected with a nonzero gradient in the memory utilization curve. Predicting the impending issue well in advance provides network operations with the opportunity to deal with the issue before it becomes customer impacting. These are known as predictive alerts, and can be used to nip a problem in the bud, thereby entirely preventing a potentially nasty issue. Again, detailed analysis of vast amounts of data is required to identify appropriate anomaly detection schemes, which can accurately detect issues, with minimal false alerts. There is a vast array of publications focused on anomaly detection on network data [11–15], although much of the work focuses on traffic anomalies.
12.4.2 Root Cause Analysis
KPIs track network performance; they do not typically provide any insight into what is causing a condition or how to remediate it.
However, driving network improvements necessitates understanding the root cause of recurring network issues – potentially down to the smallest individual events. Characterizing the root causes of network events (e.g., packet loss) and then creating aggregate views across many such events can enable insights into the underlying behavior of the network and uncover opportunities for longer-term improvements. However, investments made in improving network performance should ideally be focused on the opportunities with the greatest impact. By quantifying the contribution of different root causes for a given type of recurring network event (e.g., packet loss), a network operator can identify the most common root causes, and focus energies on addressing these. In the case of packet loss, for example, if a significant portion of the loss was determined to be congestion-related, then additional network capacity may be required. If significant loss was alternatively attributed to a previously unidentified issue within the network elements (e.g., routers), then the response will ideally involve actions that could permanently eliminate the issue. A Pareto analysis [16, 17] is a formal technique used to guide this process – it evaluates the benefits of potential actions, and identifies those that have the maximal possible impact. Troubleshooting individual network events was discussed in detail in Section 12.2.4. To identify the root causes of a class of events, such as hardware failures, packet losses, or protocol flaps, we need to drill down into multiple individual events in a bid to come up with the best explanation of their likely root causes. Root cause analysis here is similar to that described in Section 12.2.4 – with a couple of important distinctions. Specifically, we are typically examining large numbers of individual events, as opposed to a single large event, and we are typically examining historical events, as opposed to real-time events. 
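The quantification step mentioned above, in which the contributions of different root causes are ranked so that effort can be focused on the most common ones, can be illustrated with a short sketch. It is a deliberate simplification of a Pareto analysis: it assumes that each event has already been assigned a single root cause label and that every event carries equal weight, whereas a fuller analysis would weigh the benefit of each potential action. The labels and counts in the example are invented for illustration.

```python
from collections import Counter


def pareto_table(root_causes):
    """Rank root causes by frequency and report their cumulative share.

    root_causes: iterable of root cause labels, one per event
    Returns a list of (cause, count, cumulative_fraction) tuples,
    ordered from the most to the least frequent cause.
    """
    counts = Counter(root_causes)
    total = sum(counts.values())
    table = []
    cumulative = 0
    for cause, count in counts.most_common():
        cumulative += count
        table.append((cause, count, cumulative / total))
    return table


if __name__ == "__main__":
    # Hypothetical root cause labels for 100 packet loss events.
    labels = (["congestion"] * 50 + ["reroute"] * 30
              + ["router_internal"] * 15 + ["unknown"] * 5)
    for cause, count, cum in pareto_table(labels):
        print(f"{cause:16s} {count:4d} {cum:6.0%}")
```

Reading down the cumulative column shows how few root causes need to be addressed to cover most of the events, which is the information needed to prioritize remediation.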
As with troubleshooting individual real-time issues, root cause analysis of recurring events typically commences with a detailed analysis of available network data. Scalable data mining techniques are key to effectively making the most of the wealth of available data, given the large number of events generally involved, and the diversity of possible root causes. But data mining alone does not always reveal the underlying root cause(s) – especially in scenarios where anomalous network conditions or unexpected network behaviors are identified. Instead, such analysis would be complemented by targeted network measurements, lab reproduction, and detailed software and/or hardware analysis.
Targeted network measurements can be used to complement the regular network monitoring infrastructure in situations where the general monitoring is insufficiently fine-grained or focused to provide the detailed information required to troubleshoot a specific recurring issue. Obtaining additional measurements – particularly when trying to capture recurring events with very short symptoms (e.g., short bursts of packet losses) – may involve establishing an ad hoc measurement infrastructure or augmenting an existing infrastructure to make targeted measurements pursuant to the issue being investigated. For example, when troubleshooting a recurring issue related to router process (CPU) management, very fine-grained CPU measurements may be temporarily obtained from a small subset of network routers through targeted measurements (e.g., recording measurements every 5 seconds instead of every 5 minutes). Such measurements could not be obtained across the entire network on an ongoing basis (simply due to scale), but could be critical in getting to the bottom of an elusive recurring issue.
As another example, if malformed packets are causing erroneous network behavior, then detailed inspection of specific traffic carried over a network link could identify what this traffic is and where it is coming from, information which is likely to be critical to troubleshooting the issue. However, such measurements are not going to be collected on a regular basis; they simply do not scale and are too targeted to a specific issue. Note also that targeted measurements will likely need to be taken during the occurrence of an event of interest – which could be challenging to capture for an intermittent issue with very short symptoms (e.g., short bursts of packet loss). But once such measurements are available, they can complement the regularly collected data and be fed into the network analyses.
Lab testing and hardware/software analysis to troubleshoot recurring issues are similar to those discussed in Section 12.2.4 for troubleshooting individual issues. However, lab testing and detailed software/hardware analysis are more often than not immense and extremely time-consuming efforts, which should not be entered into lightly. It is thus critical to glean as much information from available network data as possible to effectively guide these other efforts. EDM techniques are at the heart of this. We thus return our focus to how we can effectively use EDM in analyzing recurring network conditions.
12.4.2.1 Data Integration
Constructing a good view of what is happening in a network requires looking across a wide range of different data sources, where each data source provides a different perspective and insight. Manually pulling all these data together and then applying reasoning to hypothesize about an individual event’s root cause is excessively challenging and time-consuming. The data are typically available in a range of different tools, as depicted in Fig. 12.11.
These tools are often created and managed by different teams or organizations, and often present information in different formats and via different interfaces. In the example in Fig. 12.11, performance data collected from SNMP MIBs is accessed via one web site, router syslogs may be obtained from an archive stored on a network server, a different server collates workflow logs, and end-to-end performance data are available in yet another web site. Thus, manually collecting data from all these different locations and then correlating it to build a complete view of what was happening in any given situation can be a painstaking process, to say the least. This situation may be further complicated in scenarios that involve multiple network layers or technologies, which are managed by different organizations. It is entirely possible that information across network layers/organizations is only accessible via human communication with an expert in the other organization. For example, obtaining information about lower-layer network performance (e.g., SONET network) may involve reaching out via the phone to a layer one specialist.
Fig. 12.11 Troubleshooting network events is difficult if data are stored in separate “data silos” (silos shown: topology views, routing reports, workflow, performance reports, syslogs, SONET, traffic, network alarms/alerts)
Of course, different data sources also use different conventions for timestamps (e.g., different time zones) and in naming network devices (e.g., router names may be specified as IP addresses, abbreviated names, or use domain name extensions). These further compound the complexity of correlating across different data sources. Thus, analyzing even a single event could potentially take hours by hand – simply in pulling the relevant data together.
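To make the convention mismatches concrete, the following sketch shows the kind of conversion that every correlation across sources would otherwise have to repeat. Everything in it is an assumption made for illustration: the alias table, the router name `nyc-cr1`, the timestamp format, and the idea that each source reports a fixed offset from UTC. It is not a description of any specific tool.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical alias table mapping source-specific identifiers
# (here, an IP address) to one canonical router name.
ALIASES = {
    "192.0.2.1": "nyc-cr1",
}


def normalize_router(name):
    """Map an IP address, abbreviated name, or fully qualified name
    to a single canonical, lowercase router name."""
    key = name.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    first_label = key.split(".")[0]
    # Leave unmapped IP addresses untouched rather than truncating them.
    return key if first_label.isdigit() else first_label


def normalize_timestamp(text, utc_offset_hours):
    """Convert a local 'YYYY-MM-DD HH:MM:SS' timestamp to UTC.

    utc_offset_hours: the source's fixed offset from UTC (an assumption;
    real feeds may need daylight-saving-aware time zone handling).
    """
    local = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    source_zone = timezone(timedelta(hours=utc_offset_hours))
    return local.replace(tzinfo=source_zone).astimezone(timezone.utc)


if __name__ == "__main__":
    # The same router and instant as reported by three hypothetical sources.
    print(normalize_router("192.0.2.1"), normalize_timestamp("2010-03-01 07:00:00", -5))
    print(normalize_router("NYC-CR1"), normalize_timestamp("2010-03-01 12:00:00", 0))
    print(normalize_router("nyc-cr1.example.net"), normalize_timestamp("2010-03-01 06:00:00", -6))
```

Once records from all sources share one naming convention and one time zone, they can be joined on router name and time directly.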
Such manual effort is barely practical when troubleshooting an individual event, but becomes completely impractical when troubleshooting recurring events with potentially large numbers of root causes. However, it has historically been the state of the art.
Data integration and automation are thus absolutely critical to scaling data mining in support of root cause analysis. It would be difficult to overstate the importance of data integration: making all the relevant network and systems data readily accessible. The data should be made available in a form that makes it easy to correlate significant numbers of very diverse data sources across extended time intervals.
AT&T Labs have taken a practical approach to achieving this [18] – collecting data from large numbers of individual “data silos” and integrating them into a common database. Above this common database is a set of tools and applications. This architecture is illustrated in Fig. 12.12. Scalable and automated feed management and database management tools [19, 20] are used to amass data into a common infrastructure and to load data into a massive data warehouse. The data warehouse archives data collected from various configuration-, routing-, fault-, performance-, and application-measurement tools, across multiple networks.
Fig. 12.12 Scaling network troubleshooting and root cause analysis (labels shown: Exploratory Data Mining (EDM) troubleshooting applications; recurring condition management (trending, root cause, learning); reusable toolset with visualization, anomaly detection, reporting framework, time series, correlation testing, and preprocessing; database (Daytona); data management with Data Depot, feed management, data warehouse, DB management and normalization, and access management)
Above the data warehouse resides a set of scalable and reusable analysis and reporting modules, which provide core capabilities in support of various applications. These components include a reporting engine (used for making data and more complex analyses available via web reports to end-users), anomaly detection tools, and different correlation capabilities (rules-based correlations and techniques for correlation testing and learning [18, 21, 22]). Above the reusable components lies a rapidly expanding set of data mining applications – ranging from web reports designed simply to expose the data to operations through a set of integrated data browsing and visualization tools, through to sophisticated trending reports and advanced statistical correlation testing and automated learning for root cause analysis [18].
The key to the infrastructure is scale – both in terms of the amount and the diversity of the network data being collated, and the range of different applications being supported. One of the key ingredients to achieving this is simple normalization of the incoming data, which is performed as the data are ingested into the database. This normalization ensures, for example, that a common time zone is used across all data sources, and common naming conventions are used to describe network elements, networks, etc. Performing data normalization and preprocessing as the data are ingested into the database removes the burden and corresponding complexity of continual data conversions by applications and human users alike. Although an enormous undertaking, such an infrastructure enables scale in terms of the very diverse analyses that can be performed [18–22].
12.4.2.2 Scaling Root Cause Analysis
So let us now return to the challenge of scaling root cause analyses for a series of recurring events.
We specify these events as a time series, referred to as a “symptom time series.” This time series is characterized by temporal and spatial information describing when and where an event was observed and event-specific information. For example, a time series describing end-to-end packet loss measurements would have associated timing information (when each event occurred and how long it lasted), spatial information (the two end points between which loss was observed) and event-specific information (e.g., the magnitude of each event – in this case, how much loss was observed). Note that the infrastructure depicted in Fig. 12.12 allows such a time series to be formed using a simple database query. The most likely root cause of each of our symptom events is then identified by correlating each event with the set of diagnostic information (time series) available in the data warehouse. Domain knowledge is used to specify which events should be correlated and how. If we consider our packet loss example, then we would correlate loss observations with events such as congestion, traffic reroutes, and internal router conditions known to cause loss. The correlations would be constrained to those which could have caused each observed loss event – namely those events along the path of the traffic experiencing the loss. The analysis then effectively becomes an automation of what a person would have executed – taking each symptom event in turn and applying potentially very complex rules to identify the most likely explanation from within the mound of available data. Aggregate statistics can then be calculated across multiple events, to characterize the breakdown of the different root causes. Appropriate remedies and actions can then be identified. Again, the data infrastructure depicted in Fig. 
12.12 makes scaling this root cause analysis far more practical, as the root causes are typically identified from a wide range of different data sources; a common data warehouse with normalized naming conventions ensures that the analysis infrastructure does not need to be painfully aware of the data origins and conventions.
However, it is often far from clear as to what all the potential causes or impacts of a given set of symptom events are. Take, for example, anomalous router CPU utilization events. Once CPU anomalies are defined (a challenge unto itself), we are then faced with the question of what causes these anomalous events, and what impacts – if any – they have. Causes and impacts can also be further categorized as “expected” versus “unexpected.” For example, we may know a priori to expect that a given router upgrade would result in an increase in average router CPU, or that other newly implemented features or services may increase CPU load. In contrast, unexpected causes of CPU anomalies may include software bugs or improper router configuration changes. As to impact – CPU anomalies should not result in any customer impact. If they have gotten to the point where routing protocols are timing out, for example, then immediate attention is imperative. However, identifying that such impact is occurring among the enormous number of ongoing events that are observed in large-scale networks is far from an easy task.
Domain knowledge can be heavily drawn upon in identifying potential causes and impacts – network experts can often deduce from both knowledge of how things should work and from their experience in operating networks to create an initial list.
However, network operators will rapidly report that networks do not always operate as expected nor as desired – routers are complicated beasts with bugs which can result in behaviors that violate the fundamentals of “networking 101” principles. Examination of real network data, and lots of them, is necessary to truly understand what is really happening. Domain knowledge can be successfully augmented through EDM, which can be used to automatically identify these relationships – specifically, to learn about the root causes and impacts associated with a given event time series of interest. Such analyses are instrumental in advancing the effectiveness of network analyses, both for troubleshooting recurring conditions and revealing issues that are flying under the radar. Although there are numerous data mining techniques available [23, 24], one approach that is being applied within AT&T Labs is to identify relationships or correlations between different time series and across different spatial domains [21, 22]. This approach identifies those time series that are statistically correlated with a given symptom time series; these are likely to be the root causes or impacts of the time series of interest (the symptoms). However, given the enormous set of time series and potential correlations involved here, domain knowledge is generally necessary to guide the analysis – where to look and under what spatial constraints to correlate events (e.g., testing correlations of events on a common router, common path, common router interface). When looking for anomalous or unexpected correlations, the real challenge is in defining what is normal/desired behavior and what is not. Let us consider another hypothetical example here, this time to illustrate how we can use statistical correlation testing to automatically identify the root causes of a particular recurring event – in this case, BGP session flaps (where the BGP session goes down and then comes up again shortly afterwards). 
We focus here on the connectivity between customers and an ISP – specifically, between a customer router (CR) and a provider edge router (PER). In particular, we focus on customers who use the BGP routing protocol to share routes with the ISP, and thus establish a BGP session between the CR and the PER. BGP may be used here, for example, in the case where customers are multihomed; in the event of a failure of the link between the CR and PER, BGP reroutes traffic onto an alternate route. The physical connectivity between the CR and PER is provided over metropolitan and access networks as illustrated in Fig. 12.13.
[Fig. 12.13 Customer – ISP access. Elements shown: customer router; metro and access lower-layer (e.g., TDM/Ethernet) network; provider edge router; IP/MPLS network]
These networks in turn may be made up of a variety of layer one and layer two technologies (see Chapter 2). We refer to these as the lower-layer networks. These metro/access networks often have built-in mechanisms for rapidly and automatically recovering from failures. Thus, in these situations, failure recovery mechanisms may exist at both the upper layer (through BGP rerouting) and the lower layers. It is highly desirable to ensure that failure recovery is not invoked simultaneously at both layers [25]. This is achieved in routers today using timers – the IP routers are configured with a delay timer designed to allow the lower layer to attempt to recover from an issue first. If the lower layer restores connectivity within the defined timeout (e.g., 150 ms), then the routers do not react. However, if the lower layer fails to recover from the issue within the defined time interval, then the routers will attempt to restore service at the IP layer. We can use correlation testing to help us investigate the potential root causes of BGP session flaps between PERs and CRs.
Specifically, we can test the statistical correlation between the symptom time series (BGP session flaps) and a wide range of other time series, which correspond to a variety of other network events. We refer to these other events as diagnostic events. The goal is to identify those diagnostic time series that are statistically correlated with our symptom time series (BGP session flaps). However, rather than testing all possible time series across the entire network (an impractically large number of time series), we typically constrain our correlations to those in the same locality. In this case, we examine events either on the same router or on the same router interface as the BGP session flaps. Our diagnostic events can be drawn from a range of different sources – workflow commands, router syslogs, lower-layer events, router performance events (e.g., high CPU, memory utilization, link loads, packet losses) and so on. The primary result of the correlation testing is a list of time series that are statistically correlated with the BGP session flaps; the idea being that these will reveal the root causes and impacts of the BGP session flaps. So now let us consider a situation in which there is an underlying issue such that BGP sessions flap even when lower-layer failure recovery mechanisms rapidly recover from failures. This breaks the principle of layered failure recovery used here – as discussed, recovery actions at the lower layer should prevent IP links and BGP sessions between routers from failing during these events. Thus, domain knowledge would conclude that lower-layer failure recovery actions would not be related (correlated) to BGP session flaps.
However, in the scenario we consider here, correlation testing would expose that the BGP session flaps are often occurring at the same time as lower-layer failure recovery events associated with the same link between the PER and the CR – more often than could be explained as pure coincidences. This would be revealed via strong statistical correlation between BGP flaps and failure recovery events on the corresponding lower layer – a correlation that violates normal operating behavior, and is indicative of erroneous network behavior. Correlation testing is in essence revealing that the network is not operating as designed – reality differs from intent. Erroneously failing the IP layer link and corresponding BGP session results in unnecessary customer impact – instead of seeing a few tens of milliseconds break in connectivity as the lower-layer recovery is performed, the customer may now be impacted for up to a couple of minutes. Also note the need here to bring in spatial constraints – we are explicitly interested in behavior happening across layers for each individual link between a PER and a given CR – it is this correlation that is not expected and indicates undesirable network behavior. It is entirely expected that failure recovery on a given PER – CR link correlates with BGP flaps on other links. This would be a result of a common lower-layer failure impacting both IP links that use lower-layer failure recovery mechanisms and those which do not. Those IP links without lower-layer recovery would experience the failure, causing their associated BGP session to fail. However, those with lower-layer failure recovery should not experience any impact on the higher layer protocols. Thus, the BGP failures on the links without failure recovery correlate with the lower layer failure recovery on the other, seemingly independent, links. 
This highlights the complexity here and the need for detailed domain knowledge and carefully designed spatial models in executing and analyzing correlation results. Thus, statistical correlation testing can be used to expose failure modes that might otherwise go undetected, yet cause significant customer impact over time. Statistical correlation testing can also be used to delve deeper into network behavior, once revealed. For example, correlation testing can be used to identify how the strong correlation between BGP session flaps and lower-layer recovery events varies across technologies. Does it only exist for certain types of lower-layer technologies or certain types of router technologies (routers, line-cards)? If the same behavior is observed across lower-layer technologies from multiple vendors, then it is unlikely to be the result of an erroneous behavior in the lower-layer equipment (for example, slower than designed recovery actions). If, however, the correlation exists for only a single type of router, then it would be highly advisable to look closer at the given router type for evidence of a software bug that could explain the observed behavior. Thus, analysis of what is common and what is not common across the symptom observations can help in guiding troubleshooting. Targeted lab testing and detailed software analysis can then follow so that the underlying cause of the issue can be identified and rectified, with the intent that the failure mode will be permanently driven out of the network. In general, the opportunity space for EDM is tremendous in large-scale IP/MPLS networks. The immense scale, complexity of network technologies, tight interaction across networking layers, and the rapid evolution of network software mean that we risk having critical issues flying under the radar, and that network issues are extremely complex to troubleshoot, particularly at scale.
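The correlation testing used in the BGP session flap example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of [21, 22]: event timestamps are binned into fixed-width intervals, the Pearson correlation is computed between the symptom series and each diagnostic series recorded at the same location (the spatial constraint), and significance is estimated by comparing against circular shifts of the diagnostic series. The bin width, the shift-based test, the threshold `alpha`, the function names, the interface labels, and the sample timestamps are all hypothetical.

```python
import math
import random

def bin_events(timestamps, start, end, width):
    """Convert event timestamps (seconds) into a 0/1 series, one entry per bin."""
    n_bins = int((end - start) // width)
    series = [0] * n_bins
    for t in timestamps:
        idx = int((t - start) // width)
        if 0 <= idx < n_bins:
            series[idx] = 1
    return series

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def correlation_score(symptom, diagnostic, n_shifts=200, seed=0):
    """Return (r, p): the observed correlation and the share of circular shifts
    of the diagnostic series that correlate at least as strongly."""
    rng = random.Random(seed)
    observed = pearson(symptom, diagnostic)
    n = len(diagnostic)
    hits = 0
    for _ in range(n_shifts):
        k = rng.randrange(1, n)
        shifted = diagnostic[k:] + diagnostic[:k]
        if pearson(symptom, shifted) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_shifts + 1)

def correlated_diagnostics(symptoms, diagnostics, start, end, width=60, alpha=0.01):
    """symptoms: {location: [timestamps]}
    diagnostics: {(location, event_name): [timestamps]}
    Only series recorded at the same location are compared."""
    results = []
    for (loc, name), times in diagnostics.items():
        if loc not in symptoms:
            continue  # spatial constraint: skip events from other locations
        s = bin_events(symptoms[loc], start, end, width)
        d = bin_events(times, start, end, width)
        r, p = correlation_score(s, d)
        if r > 0 and p <= alpha:
            results.append((loc, name, round(r, 3), p))
    return sorted(results, key=lambda row: -row[2])

# Hypothetical sample data: timestamps in seconds over a 10-hour window.
flaps = [125, 3610, 7300, 12010, 18100, 25215, 30500]
symptoms = {"per1:ge-0/0/1": flaps}
diagnostics = {
    ("per1:ge-0/0/1", "lower_layer_recovery"): [t - 2 for t in flaps],
    ("per1:ge-0/0/1", "high_cpu"): [900, 5000, 9900, 15000, 21000, 28000],
    ("per2:ge-1/0/0", "lower_layer_recovery"): flaps,  # different link, not tested
}
results = correlated_diagnostics(symptoms, diagnostics, start=0, end=36000)
```

With this sample data, only the lower-layer recovery events on the same interface as the flaps are reported. Keying both inputs by location is one simple way to encode the per-link constraint discussed above; a production system would need a richer spatial model (router, path, layer) and a more careful treatment of multiple testing.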
Driving network performance to higher levels will necessitate significant advances in applying data mining to the extremely diverse network data. This is an area ripe for further innovation.

12.5 Planned Maintenance

The previous sections focused on reacting to events and issues as they are identified. However, a large portion of the activity in a large operational network is actually a result of planned events. Managing a large-scale IP/MPLS network indeed requires regular planned maintenance – the network is continually evolving as network elements are added, new functionality is introduced, and hardware and software are upgraded. External events, such as major road works, can also impact areas where fibers are laid, and thus necessitate network maintenance. There are two primary requirements of planned maintenance: (1) successfully complete the work required, and (2) minimize customer impact. As such, planned network maintenance is typically executed during hours when the expected customer impact and/or the network load is at its lowest. This typically equates to overnight and early hours in the morning (e.g., midnight to 6 a.m.). However, the extent to which customers would be impacted by any given planned maintenance activity also depends on the location of the resources being maintained, and what maintenance is being performed. Redundant capacity would be used to service traffic during planned maintenance activities – where redundancy exists, such as in the core of an IP network. However, as discussed in Section 12.2.5, such redundancy does not always exist and thus some planned maintenance activities can result in a service interruption. This is most likely to occur at the edge of an IP/MPLS network, where cost-effective redundancy is simply not available within router technologies today.
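As a rough illustration of choosing a maintenance window from network load alone, the sketch below finds the contiguous block of hours with the lowest total load in a 24-hour profile, allowing the window to wrap past midnight. The load figures are invented, and the sketch deliberately ignores the other factors noted above (location of the resources and the type of maintenance).

```python
def quietest_window(hourly_load, hours):
    """Return (start_hour, total_load) of the lowest-load contiguous window.

    hourly_load: one value per hour of the day, index 0 = midnight to 1 a.m.
    hours: window length in hours; the window may wrap past midnight.
    """
    n = len(hourly_load)
    best_start, best_total = 0, float("inf")
    for start in range(n):
        total = sum(hourly_load[(start + k) % n] for k in range(hours))
        if total < best_total:
            best_start, best_total = start, total
    return best_start, best_total

# Hypothetical average link utilization (percent) for each hour of the day.
load = [22, 18, 15, 14, 15, 20, 35, 55, 70, 78, 80, 82,
        81, 80, 79, 78, 80, 85, 90, 88, 75, 60, 45, 30]
window = quietest_window(load, hours=6)
```

For this invented profile the quietest six-hour window starts at midnight, which matches the typical overnight window described above.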
12.5.1 Preparing for Planned Maintenance Activities

Competing planned maintenance activities occurring across multiple network layers in a large ISP network could be a recipe for disaster, if not carefully managed. Meticulous preparation is thus completed in advance of each and every planned maintenance event, ensuring that activities do not clash, and that there are sufficient network and human resources available to successfully complete the scheduled work. Planning for network activities involves careful scheduling, impact assessment, coordination with customers and other network organizations, and identifying appropriate mechanisms for minimizing customer impact. We consider these in more detail here. Scheduling of planned maintenance activities often requires juggling of a range of different resources, including operations personnel and network capacity, across different organizations and layers of the network. For example, in an IP/MPLS network segment where lower-layer network recovery does not exist, planned maintenance within the lower layer will likely impact the IP/MPLS network. If this is within the core of an IP/MPLS network (i.e., between ISP routers), the impacted IP/MPLS traffic will be rerouted, requiring spare IP network capacity. This same network capacity could also be required if, for example, IP router maintenance should occur simultaneously. This is illustrated in Fig. 12.14, where maintenance is required both on the link between routers C and D (executed by layer one technicians) and on router H (executed by layer three technicians). Much of the IP/MPLS traffic normally carried on the link between routers C and D would typically reroute onto the path E–F–H in the event of the link between C and D being unavailable. However, should router H also be unavailable (e.g., due to simultaneous planned maintenance), then this alternate path would not be available.
The traffic normally carried on the link between routers C and D, and that normally carried via router H, would all be competing for the remaining network resources. This has the potential to cause significant congestion and corresponding customer impact. This is clearly not an acceptable situation – the two maintenance activities must be coordinated so that they do not occur simultaneously, unless the network has adequate resources to successfully support them both. Thus, careful scheduling of planned maintenance within and across network layers is crucial. Such scheduling also necessitates carefully constructed processes to communicate and coordinate maintenance activities across organizations managing the different network layers.
[Fig. 12.14 Competing planned maintenance activities on a network link (between routers C and D) and a router (router H)]
However, how can a network operations team determine whether there are sufficient network resources to successfully execute planned maintenance activities with minimal customer impact? This is particularly complicated within the IP/MPLS core, where it is a nontrivial task to predict where and how much traffic will be rerouted in response to network events. Detailed “what if” simulation tools that can emulate the planned maintenance activities are key to ensuring that adequate resources are available for planned and unplanned activities. Such tools are used to evaluate the impact of planned activities in advance of the scheduled work, taking into account traffic, topology, planned activities, and ideally current network conditions. In situations where the planned maintenance activities would cause unacceptable impact, the simulation tools can also be used to evaluate potential actions that can be taken to ensure survivability (e.g., tweaking network routing).
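A “what if” evaluation of this kind can be sketched as follows. This is a deliberately simplified illustration rather than a production simulation tool: each demand is routed on a single shortest path by link weight (no equal-cost multipath and no MPLS traffic engineering), the links and routers under maintenance are removed, and any link whose load would exceed its capacity is reported. The topology, weights, capacities, and demands are hypothetical and only loosely modeled on the scenario of Fig. 12.14; all function names are assumptions.

```python
import heapq
from collections import defaultdict

def link_key(a, b):
    """Canonical name for an undirected link."""
    return tuple(sorted((a, b)))

def shortest_path(links, src, dst, down_links, down_nodes):
    """Dijkstra by link weight over the elements that remain in service."""
    adj = defaultdict(list)
    for (a, b), (weight, _capacity) in links.items():
        if (a, b) in down_links or a in down_nodes or b in down_nodes:
            continue
        adj[a].append((b, weight))
        adj[b].append((a, weight))
    best, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            break
        if dist > best[node]:
            continue  # stale heap entry
        for nxt, weight in adj[node]:
            if dist + weight < best.get(nxt, float("inf")):
                best[nxt] = dist + weight
                prev[nxt] = node
                heapq.heappush(heap, (dist + weight, nxt))
    if dst not in best:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def what_if(links, demands, down_links=(), down_nodes=()):
    """Return (overloaded_links, unrouted_demands) for a maintenance scenario."""
    down_links = {link_key(a, b) for a, b in down_links}
    down_nodes = set(down_nodes)
    load = defaultdict(float)
    unrouted = []
    for (src, dst), volume in demands.items():
        path = shortest_path(links, src, dst, down_links, down_nodes)
        if path is None:
            unrouted.append((src, dst))
            continue
        for a, b in zip(path, path[1:]):
            load[link_key(a, b)] += volume
    overloaded = sorted(l for l, used in load.items() if used > links[l][1])
    return overloaded, unrouted

# Hypothetical topology: link -> (IGP weight, capacity in Gb/s).
links = {
    link_key("C", "D"): (10, 10.0),
    link_key("C", "E"): (10, 10.0),
    link_key("E", "F"): (10, 10.0),
    link_key("F", "H"): (10, 10.0),
    link_key("D", "H"): (10, 10.0),
    link_key("C", "G"): (30, 4.0),   # thin backup path via G
    link_key("D", "G"): (30, 4.0),
}
demands = {("C", "D"): 6.0, ("E", "D"): 3.0}  # Gb/s

link_only = what_if(links, demands, down_links=[("C", "D")])
router_only = what_if(links, demands, down_nodes=["H"])
both = what_if(links, demands, down_links=[("C", "D")], down_nodes=["H"])
```

In this invented example, either activity alone leaves every link within capacity, while performing both at once pushes all traffic onto the thin path via G and overloads it, which is the kind of clash that scheduling is meant to avoid.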
If the planned maintenance is instead occurring at the provider edge router (the router to which customers connect), then it may be necessary to coordinate with, or at least communicate the planned maintenance to, the impacted customers. This is an especially important step when serving enterprise customers, who may need to make alternate arrangements during such activities. This communication is typically done well in advance of the planned activities – often many weeks. If the work needs to be repeated across many edge routers, as may occur when upgrading router software network-wide, then human resources must also be scheduled to manage the work across the different routers. This can become a relatively complex planning process, with numerous constraints.

12.5.2 Executing Planned Maintenance

Planning for scheduled maintenance activities generally occurs well in advance of the scheduled event, ensuring adequate time for customers to react and for internal network survivability evaluations to be completed. Maintenance can proceed after a successful last-minute check of current network conditions. However, service providers go to great lengths to carefully manage traffic in real time so as to further minimize customer impact. In locations where redundancy exists, such as in the core of the IP/MPLS network, gracefully moving traffic away from impacted network links in advance of the maintenance can result in significantly smaller (and potentially negligible) customer impact compared with having links simply fail while still carrying traffic. Forcing traffic off the links that are due to be impacted by planned maintenance also eliminates unnecessary traffic reroutes that would result should the link flap excessively during the maintenance activities. How this rerouting of traffic is achieved depends on the protocols used for routing traffic.
For example, if IGP routing alone is used, then traffic can be rerouted away from the links by simply increasing the weight of the links to a very high value. This act is known as costing out the link. Once the maintenance is completed, the link can be costed in by reducing the IGP weight back down to the normal value, thereby re-attracting traffic. Continual monitoring of network performance and network resources is also critical during and after the planned maintenance procedure, to ensure that any unexpected conditions that arise are rapidly detected, isolated, and repaired. Think about taking your car to a mechanic – how often has a mechanic fixed one problem only to introduce another as part of their maintenance activities? Networks are the same – human beings are prone to make mistakes, even when taking the utmost care. Thus, network operators are particularly vigilant after maintenance activities, and put significant process and automated auditing in place to ensure that any issues that may arise are rapidly detected and addressed. For example, in the previously discussed scenario where links are costed out before commencing maintenance activities, it is critical that monitoring and maintenance procedures ensure that these network resources are successfully returned to normal network operation after completion of the maintenance activities. Accidentally leaving resources out of service can result in significant network vulnerabilities, such as having insufficient network capacity to handle future network failures. Additionally, for the larger-scale planned upgrades mentioned earlier, KPIs need to be monitored against expected impacts both during and after deployment. Undesired results, such as unexpectedly high CPU loads, can then be quickly investigated.
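The costing out and costing in of links, together with a post-maintenance audit for links accidentally left out of service, can be sketched as follows. This is an illustration only: IGP weights are modeled as a plain dictionary rather than router configuration, the value of `COSTED_OUT_WEIGHT` is an arbitrary placeholder (the usable maximum depends on the IGP in use), and the link names and function names are hypothetical.

```python
COSTED_OUT_WEIGHT = 65535  # hypothetical "very high" value; the real maximum depends on the IGP

def cost_out(weights, link):
    """Return a copy of the weight table with the link costed out."""
    updated = dict(weights)
    updated[link] = COSTED_OUT_WEIGHT
    return updated

def cost_in(weights, link, baseline):
    """Return a copy of the weight table with the link restored to its normal weight."""
    updated = dict(weights)
    updated[link] = baseline[link]
    return updated

def audit(weights, baseline):
    """List links whose current weight differs from the recorded normal weight."""
    return sorted(link for link in baseline if weights.get(link) != baseline[link])

# Hypothetical normal weights recorded before the maintenance window.
baseline = {"C-D": 10, "E-F": 10, "F-H": 10}

current = cost_out(cost_out(baseline, "C-D"), "E-F")  # before maintenance
current = cost_in(current, "C-D", baseline)           # after maintenance; E-F was forgotten
leftover = audit(current, baseline)
```

Here the audit flags the link that was never costed back in. Comparing the live state against a recorded baseline, rather than trusting that each step of the procedure was executed, is one simple form the automated auditing described above could take.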
Other unexpected results, such as a slow memory leak due to a bug in newly deployed software, may not be immediately apparent but can be detected through appropriate monitoring.

12.6 The Importance of Continued Innovation

The past 10 years or so have seen tremendous improvements in IP/MPLS network fault and performance management. However, there are still opportunities for exciting innovations. We herein outline a few directions in which we believe that further advances in the state of the art promise great operational benefits.

Router reliability remains an important area where innovation is needed. Although dramatic improvements have been achieved in recent years, router failures and maintenance are still the dominant cause of customer service outages. Router technologies must evolve to allow router software to be upgraded without impacting customers, to effectively manage control plane resources in the presence of overload conditions, and to support hardware monitoring and cost-effective redundancy [26, 27] so that outage durations are minimized. Improvements in these areas depend on a combination of technical disciplines including real-time software systems, software engineering, as well as an increased emphasis on hardware “design for maintainability.”

As demonstrated in earlier sections within this chapter, service providers have typically mastered the detection, troubleshooting, and repair of commonly occurring faults and performance impairments. However, the same cannot always be said for dealing with the more esoteric faults and performance issues. Significant advancements are crucial in detecting issues that “fly under the radar,” and in troubleshooting complex network issues. Both of these present opportunities for advanced exploratory data mining. Tools for effectively and rapidly aiding in troubleshooting complex issues are particularly lacking; significant innovation is well overdue here.
This is primarily because it is a challenging problem and one best understood by the small teams of highly skilled engineers to which such issues are escalated. These teams work in a demanding environment – each new line of investigation may be different from previous ones, and may (at least initially) defy understanding. Operations personnel, while under tremendous pressure, have to sift through immense quantities of data from diverse tools, collect additional information, and theorize over potential root causes. Arming these teams with appropriate data analysis tools for best achieving this is a challenging but necessary advancement. Significantly advancing the state of the art here will likely require a melding of data mining experts and network experts. As a final topic, we consider process automation – specifically, how far can we and should we proceed with automating actions taken by operations teams? As highlighted in Section 12.3, process automation is already an integral part of at least some large ISP network operations. The ultimate goal may well be to fully automate common fault and performance management actions, closing the control loop of issue detection, troubleshooting, mitigation strategy identification and evaluation, and actuation of the devised responses in the network (e.g., rerouting traffic, repairing databases, fixing router configurations). There are vast opportunities for innovation in identifying new scenarios for such automated recovery and repair of both networks and supporting systems (e.g., databases, configuration). As such cases are revealed and proposed, it will undoubtedly often be challenging to replicate the complex logic that humans execute in identifying courses of action to mitigate network issues, particularly dealing with the “corner cases” that arise in large networks.
Creating appropriate safeguards to prevent potentially catastrophic actions should flawed reasoning be introduced into the system is also a challenge that must be addressed.

12.7 Conclusions

In this chapter, we described a wide range of network management and operational tasks designed to ensure that ISP networks operate at high levels of reliability and performance. We have organized these network operation activities into three threads, each serving a conceptually different purpose, although they overlap in practice. The first thread included a series of operations covering monitoring network health and service performance, detecting and notifying operations of fault and performance issues, localizing and troubleshooting issues, problem mitigation and service restoration, and finally repair and restoration of the impacted network resources. These tasks are typically handled in a “real-time” fashion, as they involve an ongoing service impact, pressing for immediate care. The second thread focused on offline exploratory data analysis for driving continued performance improvements. This includes defining and monitoring key performance indicators, conducting trending analyses to track network performance and health over time, applying root cause analysis and data mining to uncover underlying issues, conducting targeted measurement and lab testing to pinpoint the problem, and finally driving the problem out of the network where possible. These tasks are less time-pressured. However, they are often more complex as they require advanced analytic systems and experienced operations personnel to quickly focus on the anomalous network behaviors deeply hidden among a vast amount of network data.
While the first two threads deal with detecting and reacting to events that occur within the network and service, the third thread focused on planned events – activities that operations execute to maintain and evolve the network. These planned maintenance activities involve replacing equipment, upgrading software, and deploying new hardware and network capabilities. The major challenges in planned maintenance lie in the careful planning and preparation for planned events, and the prudent execution of these tasks such that customer impact is minimized. We conclude with some final “best-practice” principles for both fault and performance management, and planned maintenance.

Fault and performance management “best-practice” principles for large IP/MPLS networks:
- Incorporate network management requirements whenever new technologies are being introduced – do not make network measurement, fault and performance management an afterthought
- Develop a comprehensive fault and performance data collection infrastructure
- Carefully design network alarms and alerts to ensure that network issues are rapidly detected and appropriate notifications are generated
- Deploy a scalable event management system, which effectively filters and correlates the onslaught of network alarms to rapidly isolate network issues
- Deploy a ticketing system, which is tightly integrated with the event management and process automation systems. The ticketing system is used to notify operations personnel of events requiring investigation, and to track analyses and final root cause
- Automate commonly executed operations tasks to speed issue resolution and free up staff for more complex tasks – but be careful and incorporate appropriate safeguards to protect the network
- Arm network operations teams with the necessary tools, network measurements, and skills for troubleshooting network issues
- Create and utilize lab environments for replicating and troubleshooting issues observed within the operational network
- Build close partnerships between vendors for collaborative troubleshooting of network events, particularly those related to vendor equipment
- Work closely with network engineering teams to ensure that automatic failure recovery is available where possible – this can (1) improve service availability through rapid failure recovery, and (2) remove the need for operations to respond in real time. Planned maintenance can then be scheduled at a convenient time
- Dedicate expert staff to ongoing analysis of recurring problems, and bring their learning back into the mainstream management systems
- Track and trend element-level and end-to-end key performance indicators over time – both those that indicate customer impact (e.g., network loss, delay, key service-level metrics) and those associated with network health (e.g., router CPU, memory utilization)
- Create scalable root cause analysis techniques and processes for investigating recurring performance issues
- Data integration: create a scalable infrastructure for exploratory data mining – where data can be readily accessed and correlated across multiple diverse time series.

Planned maintenance “best-practice” principles for large IP/MPLS networks:
- Instantiate processes that ensure that human beings “think twice” as they are touching the network to avoid unnecessary mistakes
- Plan and schedule maintenance activities carefully to minimize customer impact, including scheduling activities across network layers and across network organizations where necessary
- Execute careful validation of planned activities to ensure that there is sufficient spare network capacity to absorb load during core network maintenance
- Validate network state before executing planned maintenance to ensure that maintenance is not executed when the network is already impaired
- Minimize customer impact by taking routers and network resources “gracefully” out of service where possible (e.g., within the network core)
- Provide appropriate customer notifications (e.g., to enterprise customers) of upcoming planned maintenance activities, so that customers have the opportunity to take proactive actions where necessary
- Suppress relevant network alarms during planned maintenance activities to avoid operations chasing events, which they are in fact knowingly inducing
- Carefully monitor network and service performance before, during, and after maintenance, ensuring that all resources and services are successfully returned to operation
- Validate planned maintenance actions after completion (e.g., verify router configurations) to ensure that the correct actions were taken and that configuration errors or other bad conditions were not introduced during the activities
- Ensure where possible that mechanisms are available for rapid back out of maintenance activities, should issues be encountered during maintenance activities.

Acknowledgments

The authors thank the AT&T network and service operations teams for invaluable collaborations with us, their Research partners, over the years.
In particular, we thank Bobbi Bailey, Heather Robinett, and Joanne Emmons (AT&T) for detailed discussions related to this chapter and beyond. Finally, we acknowledge Stuart Mackie from EMC, for discussions regarding alarm correlation.

References

1. Gerards, R. (2009). The Syslog Protocol. IETF. RFC 5424.
2. Della Maggiora, P., Elliott, C., Pavone, R., Phelps, K., & Thompson, J. (2000). Performance and fault management. Cisco Press.
3. Shaikh, A., & Greenberg, A. (2004). OSPF Monitoring: Architecture, Design and Deployment Experience. USENIX Symposium on Networked Systems Design and Implementation (NSDI).
4. Mauro, D., & Schmidt, K. (2005). Essential SNMP. O’Reilly.
5. HP’s Operations Center. [Online] https://h10078.www1.hp.com/cda/hpms/display/main/hpms_content.jsp?zn=bto&cp=1-11-15-28^1745_4000_100
6. EMC’s Ionix platform. [Online] http://www.emc.com/products/family/ionix-family.htm.
7. IBM’s Tivoli. [Online] http://en.wikipedia.org/wiki/IBM_Tivoli_Framework.
8. Kliger, S., et al. (1995). A Coding Approach to Event Correlation. Fourth International Symposium on Integrated Network Management. pp. 266–277.
9. Yemini, S., Kliger, S., Mozes, E., Yemini, Y., & Ohsie, D. (May 1996). High speed and robust event correlation. IEEE Communications Magazine, 34, 82–90.
10. Ciavattone, L., Morton, A., & Ramachandran, G. (June 2003). Standardized active measurements on a Tier 1 IP backbone. IEEE Communications Magazine, 41.
11. Barford, P., Kline, J., Plonka, D., & Ro, A. (2002). A Signal Analysis of Network Traffic. ACM Internet Measurement Workshop. pp. 71–82.
12. Huang, Y., Feamster, N., Lakhina, A., & Xu, J. (2007). Diagnosing Network Disruptions with Network-Wide Analysis. ACM Sigmetrics. 35, pp. 61–72.
13. Lakhina, A., Crovella, M., & Diot, C. (2005). Mining Anomalies Using Traffic Feature Distributions. ACM SIGCOMM. Vol. 35, pp. 217–228.
14. Zhang, Y., Ge, Z., Greenberg, A., & Roughan, M. (2005). Network Anomography. ACM Usenix. Internet Measurement Workshop. pp. 317–330.
15. Venkataraman, S., Caballero, J., Song, D., Blum, A., & Yates, J. (2006). Black Box Anomaly Detection: Is It Utopian? ACM 5th Workshop on Hot Topics in Networking (HotNets). pp. 127–132.
16. Tague, N. R. (1995). The Quality Toolbox. Amer Society for Quality.
17. Juran, J., & Gryna, F. (1998). Juran’s quality control handbook. New York: McGraw-Hill.
18. Kalmanek, C., Ge, Z., Lee, S., Lund, C., Pei, D., Seidel, J., Van der Merwe, J., & Yates, J. (October 2009). Darkstar: Using Exploratory Data Mining to Raise the Bar on Network Reliability and Performance. Design of Reliable Communication Networks International Workshop.
19. Golab, L., Johnson, T., Seidel, J., & Shkapenyuk, V. (2009). Stream Warehousing with Data Depot. ACM SIGMOD.
20. Golab, L., Johnson, T., & Shkapenyuk, V. (2009). Scheduling Updates in a Real-Time Stream Warehouse. IEEE International Conference on Data Engineering (ICDE). pp. 1207–1210.
21. Mahimkar, A., Yates, J., Zhang, Y., Shaikh, A., Wang, J., Ge, Z., & Ee, C. (2008). Troubleshooting Chronic Conditions in Large IP Networks. Madrid, Spain: ACM International Conference on Emerging Network Experiments and Technologies (CoNEXT).
22. Mahimkar, A., Ge, Z., Shaikh, A., Wang, J., Yates, J., Zhang, Y., & Zhao, Q. (2009). Towards Automated Performance Diagnosis in a Large IPTV Network. ACM SIGCOMM.
23. Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. Wiley.
24. Nisbet, R., Elder, J., & Miner, G. (2009). Handbook of statistical analysis & data mining applications. Academic.
25. Demeester, P., Gryseels, M., Autenrieth, A., Brianza, C., Castagna, L., Signorelli, G., Clemente, R., Ravera, M., Jajszczyk, A., Janukowicz, D., Van Doorselaere, K., & Harada, Y. (August 1999). Resilience in multilayer networks. IEEE Communications Magazine, 37, pp. 70–76.
26. Sebos, P., Yates, J., Li, G., Greenberg, A., Lazer, M., Kalmanek, C., & Rubenstein, D. (2003). Ultra-Fast IP Link and Interface Provisioning with Applications to IP Restoration. IEEE/LEOS Optical Fiber Communications Conference. pp. 557–558.
27. Sebos, P., Yates, J., Li, G., Rubenstein, D., & Lazer, M. (2004). An Integrated IP/Optical Approach for Efficient Access Router Failure Recovery. IEEE/LEOS Optical Fiber Communications Conference.

Glossary of Terms and Acronyms

Term: Definition
Fault management: Set of functions that detect, isolate, and correct faults in a telecommunications network
Fault: “Hard” failure (e.g., link down)
Performance management: Set of functions that detect, isolate, and correct performance issues in a telecommunications network
Performance events: Situations where a network element or the network is operating with degraded performance (e.g., packet loss, excessive delay)
Event management: Set of functions that detect, isolate, and correct events in a telecommunications network
Incident: An occurrence that affects normal network operation
Event: A fault or performance anomaly or impairment. A single incident may result in multiple events
Originating event: The event directly associated with a given incident, as opposed to being a side-effect or symptom of the incident
Alarm: Notification of a fault
Alert: Notification of a performance event (e.g., threshold crossing, traffic anomaly)
Event notification: Generic term covering alarms and alerts
Event correlation: Taking multiple incoming events (observations related to an incident) and correlating them to identify a single correlated event to capture the incident
Event management system: A system that collects incoming event notifications, and filters and correlates these to output correlated events
Correlated event: Output from the event correlation
Ticket: A document that is used to notify operations of an issue that requires investigation and to track the analysis performed in diagnosing the issue
Ticketing system: System that manages tickets
Event manager: A person who manages event resolutions
Troubleshooting: A form of problem-solving applied to diagnosing the underlying root causes of network impairments
Silent failure: A condition where network elements fail to detect and report an impairment
Black hole: Traffic is dropped (lost) within the network. Black holes are often associated with silent failures
Process automation: Automation of process-related tasks
Trending: Track how parameters of interest (e.g., KPIs) behave over time
Key performance indicators: Metrics designed to measure network performance and health
Root cause analysis: Identifying the root cause of network event(s)
Pareto analysis: Statistical technique in decision-making that is used to select a limited number of tasks that produce significant overall effect
Planned maintenance: Planned activities in the network, such as for upgrading hardware and software, scheduled hardware replacements, and network growth and evolution

Acronym: Definition
IP: Internet Protocol
MPLS: Multi-Protocol Label Switching
ISP: Internet Service Provider
ICMP: Internet Control Message Protocol
IPTV: Internet Protocol Television
IGP: Interior Gateway Protocol
OSPF: Open Shortest Path First
IS-IS: Intermediate System to Intermediate System
BGP: Border Gateway Protocol
CPU: Central Processing Unit
CRC: Cyclic Redundancy Check
SNMP: Simple Network Management Protocol
IETF: Internet Engineering Task Force
MIB: Management Information Base
DPI: Deep Packet Inspection
LDP: Label Distribution Protocol
PIM: Protocol-Independent Multicast
PPP: Point-to-Point Protocol
CR: Customer Router
PER: Provider-Edge Router
KPI: Key Performance Indicator
EDM: Exploratory Data Mining
DPM: Defects Per Million

Chapter 13
Network Security – A Service Provider View
Brian Rexroad and Jacobus Van der Merwe

13.1 Introduction

In keeping with the theme of this book, this chapter on security explores the actual and potential impact of security threats and concerns on network stability and robustness.
We specifically take a service provider centric view of network security by considering the actions a service provider can take to ensure the integrity of the network and to protect network services and users (see footnote 1).

Many of the security concerns providers and network users face are related to the fundamental fact that networks are shared resources, and their purpose is to provide connectivity and the means of interaction between network users and devices. Unfortunately, this very functionality also provides the means for unwanted interaction and exploitation. As enablers of communication, service providers are in a unique position to also protect users and inhibit traffic unwanted by the intended recipient. Indeed, protection against some network security threats, such as distributed denial of service (DDoS) attacks, is nearly impossible to achieve without network support (see footnote 2). Further, security services are enhanced by being network aware and utilizing network derived intelligence.

Dealing with security threats in the network by necessity requires monitoring of network activity and in some instances interfering with, or blocking, unwanted traffic. Traffic monitoring and manipulation are both important issues that may have legal and regulatory implications. We acknowledge this tension and argue that “the network”, or more generically cyberspace, has become such a critical part of our society that finding workable solutions to these non-technical issues is critical.

B. Rexroad and J. Van der Merwe
AT&T Labs, Florham Park, NJ 07932, USA
e-mail: brian.rexroad@att.com; kobus@research.att.com

Footnote 1: We explicitly use the term service provider to emphasize the fact that, in addition to Internet access, other internet protocol (IP) based networks (e.g., virtual private networks (VPNs)) and IP-based services (e.g., hosting, VoIP, IPTV, content distribution, etc.), all relate to the security concerns of a provider and are therefore considered in scope.

Footnote 2: The brute force nature of many DDoS attacks means that access links are often overwhelmed, which renders premises based protection mechanisms ineffective.

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_13, © Springer-Verlag London Limited 2010

Our goal for this chapter is, first, to serve as a practical guide by identifying best practices and describing specific monitoring and mitigation mechanisms that can be utilized. Second, we hope to aid the reader in developing a framework or philosophy for dealing with security from the point of view of a service provider, i.e., understanding which problems are inherent to the current Internet architecture, protocol suite and trust model, understanding the incentives, strengths and weaknesses of different role players, and developing strategies of where to spend resources going forward.

Covering the complete network security subject in a single chapter is not feasible, not even if the coverage is perfunctory. Indeed, many excellent security books have been written to cover specific subsets of network security problems and solutions. As such we will largely focus on security from a service provider perspective given the current Internet architecture, set of protocols and business relationships. Within this context we will cover security related procedures, mechanisms, tools and services that can be utilized by service providers to protect the network infrastructure as well as the services that it enables. In the final section of the chapter we will, however, deviate somewhat from this near term focus to offer a more forward looking perspective.

The outline of the chapter is as follows. In Section 13.2 we provide an exposition of the underlying network security threats and their causes. Some underlying security causes are technical in nature, e.g., the Internet best-effort service model.
Others are the result of current business practices (e.g., service providers being both retail competitors with each other, as well as interconnection partners) and indeed the development of nefarious uses such as spam, phishing, data theft, and DDoS related extortion, which are enabled by broad use of the Internet. Because a single exploit can often be successfully launched against many Internet users, the economic balance tends to be weighted in the favor of bad actors.

In Section 13.3 we present an overall framework for service provider network security. We discuss the seven pillars that make up this framework.

Having the means to know when there is a security threat or incident, and having the necessary information to then deal with the problem, is fundamental to any security strategy. Section 13.4 addresses the importance of developing good network security intelligence. We articulate a strategy for monitoring of network activity and systems to maintain security awareness.

In Section 13.5 we present a number of operational network security systems used for the detection and mitigation of security threats. A significant challenge for any network-based security system is the need for scalability. We describe several highly scalable systems, covering informational, compulsory and supplementary security services.

We consider the role of security operations as an essential part of the broader network operations in Section 13.6.

Finally, in Section 13.7 we summarize important insights and then briefly consider important new and developing directions and concerns in network security as an indication of where resources should be focused both tactically and strategically.

13.2 What Is the Problem?

Despite, or perhaps because of, its undeniable success and utility, the Internet and all networks that derive from the Internet architecture and protocols are suffering from a litany of security concerns.
In the best case these concerns are annoying and impede progress. However, because of the ever increasing use of networks for virtually all aspects of modern society, in the worst case these security concerns have the potential to negatively impact economies and governments at a global scale.

Interestingly, the Internet can trace its roots, in part, to a desire to create more secure communications systems [16, 54]. Specifically, concerns for physical attacks against centralized control systems motivated the conception of distributed communication networks [16]. Consequently, there was a decision to prioritize availability of the network over confidentiality and integrity. These protections were left to the end user to consider. This and other design goals were articulated in a retrospective paper on the design philosophy of the DARPA internet protocols [24]. In priority order the design goals for the Internet architecture were:

1. Internet communication must continue despite loss of networks or gateways.
2. The Internet must support multiple types of communications service.
3. The Internet architecture must accommodate a variety of networks.
4. The Internet architecture must permit distributed management of its resources.
5. The Internet architecture must be cost effective.
6. The Internet architecture must permit host attachments with low level of effort.
7. The resources used in the Internet architecture must be accountable.

Given that its roots were in the defense community, it is not surprising that robustness against physical loss ranked highest in this list. This external threat model is, however, quite different from current day attacks, which come from use of the network. These attacks exploit protocol and architectural characteristics of the network itself and therefore effectively constitute an internal threat model.
Further, while interworking between different network technologies was part of the architectural thinking right from the start, the Internet predecessors mostly interconnected closed groups of trusted users. The architecture that emerged offered communication on a best-effort basis, specifically limited the amount of per-flow information that network elements are required to maintain, instead relying on end systems to do that [24], and did not require global operational control [54].

While providing a highly scalable system that is robust against physical failure, these guiding principles are somewhat problematic from a network security perspective. Best-effort delivery significantly simplifies the network forwarding mechanics because the network does not have to be [overly] concerned about dropping packets, i.e., transport protocols (e.g., Transmission Control Protocol (TCP)) take care of reliable delivery from the edge of the network. The fact that end-systems are entrusted to maintain connection state in effect means that they become part of the implied network trust model. This works well when end-systems can be trusted and when all traffic being forwarded to a particular destination is wanted by that destination, as would be the case in a closed community of trusted users. However, when end-systems are malicious and generate unwanted traffic, the best-effort delivery and the lack of per-flow information in network elements effectively become a conduit for delivering denial-of-service (DoS) attacks. Indeed, DoS attacks (or their close cousin, distributed denial of service (DDoS) attacks) remain a fundamental problem for the current Internet. The fact that IP source addresses are not authenticated and therefore easily spoofed exacerbates the situation because the perpetrators of the attack are effectively untraceable and therefore unaccountable. The implication is that the final goal in the above list has never been achieved.
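The point about unauthenticated source addresses can be made concrete with a minimal sketch of source-address validation at a provider edge, in the spirit of network ingress filtering. The interface names, the prefix table (built from documentation address ranges) and the function name are hypothetical and are not taken from this chapter; in practice such checks would be implemented in router filters rather than in Python.

```python
import ipaddress

# Hypothetical table: prefixes the provider has assigned to each
# customer-facing interface (documentation address ranges only).
ALLOWED_PREFIXES = {
    "customer-a": [ipaddress.ip_network("192.0.2.0/24")],
    "customer-b": [ipaddress.ip_network("198.51.100.0/25")],
}


def source_is_plausible(interface: str, src: str) -> bool:
    """Return True if src lies inside a prefix assigned to the interface.

    A packet that fails this check carries a source address the sender
    could not legitimately own, i.e., it is likely spoofed.
    """
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in ALLOWED_PREFIXES.get(interface, []))


if __name__ == "__main__":
    # An address inside customer-a's assigned prefix.
    print(source_is_plausible("customer-a", "192.0.2.17"))
    # An address that customer-a was never assigned.
    print(source_is_plausible("customer-a", "203.0.113.9"))
```

A check of this kind only works at the edge where the provider knows which prefixes belong to which attachment point; once traffic has crossed into another network, the source address can no longer be validated this way, which is why spoofed traffic is so hard to trace.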
Dealing with unwanted traffic provides a strong argument for source authentication and accountability. However, some argue that such measures would allow easy identification of endpoints and, by association, users, which raises privacy concerns. Such identifiability concerns present an inherent tension that will have to be addressed in network architectures that provide strong accountability [76].

Obviating the need for global operational control and instead allowing for distributed management was a great equalizer, which allowed networks with different levels of operational sophistication (among other differences) to be interconnected with relative ease. From a network security perspective, however, this lack of formal operational interworking hampers the ability of service providers to deal with major security incidents. Further, the volume of minor security concerns is such that providers are left to fend for themselves via local approaches, especially when the root of the problem originates from a remote network with which the provider has no formal relationship and which has no vested interest in assisting.

In the remainder of this section we will elaborate on these issues by first examining both the stated and actual Internet threat model, as well as the somewhat implied trust model of the Internet. We then consider the role of security protocols before looking at the incentives of different role players in network security, illustrating that the economic balance is heavily biased in favor of bad actors. Finally, we briefly consider the fact that cyberspace has become a critical infrastructure which impacts virtually all aspects of society, well beyond its cyber limits. Effectively dealing with many of the concerns identified in this section might require architectural changes to the Internet, or changes to well entrenched business practices, and as such is well beyond the scope of this chapter.
13.2.1 Threat Model

The Internet Engineering Task Force (IETF) is an open international community concerned with the operation and evolution of the Internet. All IETF documents are required to specifically address security and the IETF provides guidelines for this in RFC 3552 [67]. The Internet threat model as defined in RFC 3552 in essence states that (i) end-systems are assumed not to be compromised and (ii) attackers are assumed to have near complete control over the communication channel.

This threat model is clearly unrealistic. First, security vulnerabilities in operating systems and applications (e.g., browsers) result in end-systems that are in fact routinely compromised and unwittingly utilized for nefarious activities. Despite end-system security receiving significant attention by operating system vendors and communities, the openness of these platforms and the plethora of applications that they enable suggest that end-system vulnerability will continue to be a concern for the foreseeable future (see footnote 3). Software piracy, among other things, exacerbates this situation since pirated software typically will not be updated with vendor patches. Consequently, vulnerabilities remain which can be exploited by attackers. Further, the ease with which end users can be lured into installing malware themselves, e.g., by downloading and executing electronic postcards from illegitimate websites [48], suggests that the social engineering aspect of the problem might be the most difficult challenge.

Second, while complete control of the communication channel by attackers remains a possibility, this has in practice proved to be much more difficult and unusual. Yes, end-systems and network elements, such as routers, contain software that is subject to flaws and bugs and are therefore not inherently more secure than end-systems.
However, in general, commercial network providers have a vested interest to be more cautious [than end users] when deploying new software and more vigilant in working with vendors to identify and correct vulnerabilities. After all, commercial providers’ business not only depends on the network, it is the network.

More realistic threat assumptions, from a service provider point of view, would be:

• End-points (broadly defined) can and will be compromised and used to launch attacks against the network, its users and the services it provides.
• Necessary precautions must be taken to ensure the security of network elements that are under control of the provider.

These assumptions lead to a threat model where everything outside the periphery of the provider network is assumed to be potentially hostile and untrustworthy, while the objective is to make everything inside the network secure and trustworthy.

This simplistic threat model is, however, only part of the picture. In the Internet any single provider is only part of a set of interconnected networks that provide end-to-end connectivity. Therefore, to enable the most basic communication services between arbitrary hosts on the Internet, a provider has to trust, to some extent, entities that are outside of its sphere of control. We consider this somewhat implied trust model next (see footnote 4).

Footnote 3: Data from the National Vulnerability Database (nvd.nist.gov) show a 25-fold increase in the annual number of published software flaws across all software systems from 1997 to 2007.

13.2.2 Trust Model

There is, somewhat surprisingly, no formal trust model for the Internet. However, by virtue of providing the means to communicate between different parties and across networks operated by different organizations there is an implied trust model.
This implied trust model is largely defined first by the business and functional relationships between all involved parties and second by the underlying architecture, protocols and technologies.

13.2.2.1 Business and Functional Relationships

The Internet is a largely unaffiliated and loosely coupled set of organizations that interwork to realize its functionality. As such, every organization involved in the end-to-end delivery of a packet has to be relied upon to do “the right thing”:

• Network equipment is assumed to be configured correctly and operators are assumed to follow best practice guidelines.
• Network equipment software is assumed to operate correctly and be bug free.
• Protocol endpoints are assumed to be who they claim to be.
• Internet users are assumed to act in good faith.

We know that these assumptions are not realistic. Incorrect network configurations routinely result in network incidents. For example, in a well known YouTube hijacking incident, an attempt by Pakistan Telecom to locally block access to YouTube prefixes in effect resulted in YouTube traffic from all over the Internet being redirected to Pakistan [68]. Like all software systems, network equipment software often has bugs that could have security implications [4]. Finally, while the owners of most network endpoints do act in good faith, their computers might be controlled by those who do not. As a result, the protocol endpoint could exploit the unauthenticated nature of the IP protocol to claim any identity.

Further, these business and functional relationships are transitive in the sense that end-customers have to trust their access providers. Access providers (might) have to trust higher tier providers. All providers have to trust other providers
(technically, autonomous systems (ASes)) along the path to the ultimate destination. This transitive trust exists despite the fact that formal business arrangements are typically limited to the closest neighbors in this chain.

Footnote 4: In cases where the complete end-to-end path is under control of a single provider, e.g., in the case of virtual private network (VPN) services provided by a single provider, the simplified threat and trust models hold, and this results in higher confidence levels of security of the communications.

13.2.2.2 Technology Drivers

At the most basic level, the Internet best-effort unaccountable service model implies that the network should trust end-users to not send unwanted traffic. We have already discussed the fallacy of this misplaced trust, which enables denial of service attacks.

The inherent and implied Internet trust model can be explored (from an end-user perspective) by means of the common-place action of downloading a Web page [38]. First, the domain name system (DNS) is trusted to correctly map the domain name in the Uniform Resource Locator (URL) of the Web page (e.g., the “www.att.com” part of http://www.att.com/index.jsp) into an IP address. Second, in establishing a TCP session with the Web page in question, the Internet routing system is trusted to route packets along the intended path between the browser and the server. Third, all intermediate network elements are trusted to faithfully convey packets in transit. Finally, the end-systems in this interaction, i.e., the user host (client) that runs the browser and the server that provides the content, are both trusted to not be compromised so that the content intended by the content owners/creators (and only that content) is displayed to the user.

Unfortunately, none of the elements in this chain of trust is built on solid ground, as each element is subject to inherent vulnerabilities.
• DNS can provide no guarantees about the validity of domain name to IP address mappings [2, 13]. DNS as currently deployed does not have any strong security protection of messages, which are thus subject to modification in transit like those of all unprotected Internet protocols. The request/response nature of DNS queries, the fact that most DNS queries are conducted as connectionless transactions, the relatively small message identification space (which can be guessed relatively easily), and the capability to perform source address spoofing on the Internet allow an attacker to provide bogus responses to legitimate requests. The hierarchical caching nature of the DNS architecture means that these types of attacks are particularly problematic, as the (bogus) response may be cached until the time-to-live field in the response expires, which would typically be set to a long time period by an attacker. In fact, an attacker may originate the query in an effort to poison the cache of DNS servers with bogus answers. This is a technique that has been used to redirect victims to phishing sites.

• Internet routing protocols and in particular BGP provide no guarantees about the correctness or validity of routes [17]. BGP messages could be tampered with in transit as there are no per-message or per-session security mechanisms. More serious, though, is the fact that there is no guarantee that a BGP speaker advertising a route to a particular destination is authorized to advertise that route, or in fact has a route to the destination, or would be forwarding packets to the destination if it has a valid path [20]. In particular, an attacker could hijack a prefix belonging to another AS, either to intercept the traffic en route to the actual destination, or to send traffic while taking the identity of the hijacked address space. This would, for example, allow address-based firewall filters to be bypassed. It has been reported that this technique is used by email spammers to temporarily create the appearance that their mail servers are associated with reputable organizations and evade filtering techniques [66].

• TCP provides no guarantees about the actual identity of the system that terminates the TCP connection. For example, TCP connections are routinely terminated by intermediate devices such as Web proxies, although that is not necessarily an indication of malicious activity. Further, TCP does not ensure that content is not tampered with in transit.

• Finally, end-system exploits allow the compromise of both clients and servers. With a compromised server, even if the TCP session is terminated on the intended server and is not tampered with in transit, it is possible that the content on the server itself might have been tampered with. Compromised server content might cause a client to download not only the intended content but also unintended content (i.e., malware) from a malicious website. Alternatively, a compromised client may be fooled into unknowingly visiting and disclosing data to a malicious server. (This attack might also be perpetrated via DNS exploits even when the client system is not compromised.) Finally, a compromised client computer, which is generally operated by someone who is not an IT professional, can be used as a tool to perform other compromises, to generate unwanted traffic, to serve as relay points, as temporary illegitimate servers, etc.

13.2.2.3 Towards an Internet Trust Model?

In an ITU recommendation [43] that deals with network security, trust is defined as follows:

Generally, an entity can be said to “trust” a second entity when it (the first entity) assumes that the second entity will behave exactly as the first entity expects. This trust may apply only for some specific function.
Underlying this definition is an assumption that the entities in question can be reliably identified to be who they claim to be, i.e., can be authenticated. Based on how this original authentication is performed, three major trust models have been articulated [7]:

• Direct Trust, where the two entities involved validate each other’s credentials without relying on a third party.
• Transitive Trust, where trust between two entities is imputed by virtue of a third party, or parties, trusted by one of the entities in question, having validated and established an original trust relationship. That is, A validates and trusts B, B validates and trusts C, therefore A trusts C without performing any validation.
• Assumptive (or Spontaneous) Trust, where there is no mandatory explicit validation of credentials.

Above we have argued that many of the implied trust assumptions regarding business/functional relationships and the underlying technology that makes the Internet work are in fact very weak at best. That is, with respect to the “trust” definition provided above, many entities cannot be assumed to behave in the expected way. It is also clear from this discussion that the relationships and dependencies are very complicated so that, perhaps, it is not too surprising that there is no well defined trust model for the Internet.

Depending on the functionality being considered, elements of all three trust models defined above are present in the Internet. For example, considering the monetary relationships between participants, service providers typically have a direct trust relationship with their paying customers and with other providers. As such there is a transitive trust relationship with the customers of other providers. However, because of lack of accountability and associated controls, this monetary relationship may fail to influence the way traffic flows across the Internet, so that an assumptive trust model will in effect be operational.
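The transitive trust model defined above (A trusts B, B trusts C, therefore A trusts C) can be illustrated as reachability in a graph of direct trust relationships. The following is a minimal sketch only; the entity names and the `DIRECT_TRUST` table are hypothetical and merely mirror the customer, access provider and higher tier provider chain discussed earlier.

```python
from collections import deque

# Hypothetical direct-trust relationships: each entity maps to the set of
# entities whose credentials it has validated itself.
DIRECT_TRUST = {
    "customer": {"access-provider"},
    "access-provider": {"tier1-provider"},
    "tier1-provider": {"remote-provider"},
    "remote-provider": set(),
}


def transitively_trusted(start, direct=DIRECT_TRUST):
    """Return every entity reachable from start via direct-trust edges.

    Entities beyond the first hop are trusted without any validation by
    start itself, which is exactly the transitive trust case.
    """
    seen = set()
    queue = deque(direct.get(start, ()))
    while queue:
        entity = queue.popleft()
        if entity in seen or entity == start:
            continue
        seen.add(entity)
        queue.extend(direct.get(entity, ()))
    return seen


if __name__ == "__main__":
    print(sorted(transitively_trusted("customer")))
```

In this toy example the customer has validated only its access provider, yet ends up implicitly relying on every provider further along the chain, which is the situation the text describes for end-to-end Internet paths.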
Finally, while the above discussion might seem dire, we note that this assumptive trust model works reasonably well where incentives do align. Specifically, many service providers generally work hard to do well by each other, thereby ensuring that the aggregate behavior is good. As we address below, however, assumptive trust breaks down when incentives are not aligned.

13.2.3 Secure Protocols to the Rescue?

Given the deeply embedded nature of the business/functional relationship component of the trust model described above, it is imperative that solutions to the weaknesses in the technical part be found. Of course, the security vulnerabilities described above are well known in the networking and systems communities and a variety of counter measures have been developed over many years. Unfortunately, while solutions or partial solutions exist, they lack deployment for a variety of reasons.

For example, DNSSEC [11], the secure version of DNS, will eliminate many of the currently known DNS vulnerabilities [13]; mature implementations are available, and operational experience exists from several DNSSEC trials. However, widespread adoption of DNSSEC has not yet happened. This is due, in part, to technical and operational concerns: (i) DNSSEC makes use of public-key cryptographic signatures and as such will require significantly more resources than current DNS systems. (ii) DNSSEC is a much more complex protocol than DNS and will therefore require more sophisticated operational support. (iii) There is a chicken-and-egg dilemma where lack of widespread deployment means the usefulness of DNSSEC is diminished, which in turn hinders further deployment.

In addition, some of the theoretical attributes of DNSSEC may exacerbate several important practical security considerations. For example, one of the techniques used by DNS providers to protect users from visiting malicious sites or joining a malicious botnet is called a DNS Sinkhole [70].
When implementing a DNS Sinkhole, domain name resolution is overridden so that it differs from the authoritative response. DNSSEC will interfere with this protection technique. Other examples include DDoS attacks toward or related to DNS services, such as DNS amplification attacks. These attacks are a much more insidious and frequently occurring problem than DNS cache poisoning or spoofing [5]. Since DNSSEC requires more processing resources and also will create larger query responses, it has become a mechanism to facilitate or worsen similar types of attacks.

Of far greater concern, there are vetting or accountability concerns associated with the DNS registry. It is well known that many DNS registry entries are incorrect [79]. DNSSEC only validates that the resolution of a domain name to IP address is consistent with what the authoritative name servers intended. Bogus registry information allows attackers to use domain names for malicious purposes and remain unaccountable. DNSSEC does little, or nothing, to maintain any enforcement of registry accuracy or integrity. The threat that the registry providers themselves may be compromised remains. This suggests a need to rein in the hierarchy to some set of trusted authorities, to establish standards for identity and authority management for domain names, and to set security standards for management of the systems that maintain the assignments of fully qualified domain names (FQDNs) to IP addresses.

Ironically, the greatest stumbling block in DNSSEC deployment, however, has been the controversial issue of which entity, or entities, would be responsible for signing of the root zone and how the management of the key signing key would be handled [52, 85]. These issues concern Internet governance, which is well beyond the scope of this chapter.
At the time of writing there appears to be increased pressure for these issues to be resolved to pave the way for DNSSEC deployment (see footnote 5).

The picture is somewhat less promising for finding an imminent solution to the security concerns of BGP. A number of comprehensive architectures have been proposed to deal with BGP security [32, 50, 61]; however, there currently appears to be no consensus on which, if any, of these solutions will be adopted [20]. Similar to DNS, many of the proposed solutions assume the existence of an accurate routing registry which would provide information concerning organizations and the autonomous system (AS) numbers and prefixes that are allocated to them. Unfortunately, current registries are known to be highly inaccurate and fixing that presents significant problems in itself. Similar to DNSSEC, there are concerns about the processing resources that would be required to accommodate strong security mechanisms. In the case of BGP this concern is exacerbated by the dramatic increase in the number of routes that BGP is required to handle, as well as the fact that the control processors of deployed routers might be lacking in processing power.

Footnote 5: This change might have been helped along by recent publicity around the so-called Kaminsky DNS vulnerability [88]. The vulnerability involves a DNS cache poisoning attack, where an attacker (i) fakes a DNS response, e.g., by guessing the transaction identifier in the response, and (ii) provides incorrect information in the additional section (or “glue” section) of the response. While this attack can be mitigated through security analysis of DNS activity, there is no inherent solution without a change in the DNS protocol itself [81].

In the absence of comprehensive security mechanisms for critical infrastructure services like DNS and BGP, end-to-end application level security mechanisms provide significant protection.
Specifically, in the case of client/server Web interaction, the use of HTTP over a secure transport protocol (e.g., Transport Layer Security (TLS)) provides cryptographic protection for Web sessions as well as server authentication in the form of a digital certificate that is typically signed by a certification authority. Certification authorities provide different levels of monetary guarantees for different strengths of certificates, with correspondingly more stringent identity verification by the certification authority.

While secure protocols are clearly needed and can eliminate some of the more basic problems, ultimately, secure protocols do not make systems secure. For example, as described above, a modern browser on an uncompromised host will be able to verify the validity of a certificate issued by a certification authority. However, with this approach the end user in effect trusts the certification authority to validate and vouch for the identity of the server, or more correctly the identity of the organization or individual who buys the certificate. Similarly, DNSSEC will be able to authenticate that a response originated from an authoritative name server, but will ultimately depend on some form of identity verification to allow DNS mappings to be entered into the system in the first place. The trust put into these verification steps then becomes the weakest link in the overall security chain [28]. Finally, the security mechanisms themselves do not necessarily provide iron-clad protection. For example, the practical feasibility of generating fake certification authority certificates has recently been demonstrated [75]. This, in combination with falsified DNS or BGP entries, can direct users to counterfeit sites which closely resemble the real sites and trick the users into providing private information.
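As a minimal sketch of the client-side certificate check discussed in this section, the following Python fragment builds a TLS client context that requires a server certificate signed by a trusted certification authority and checks that the certificate matches the requested host name. It uses only the Python standard library; the helper names `make_verifying_context` and `fetch_peer_subject` are illustrative assumptions and do not come from the chapter.

```python
import socket
import ssl


def make_verifying_context() -> ssl.SSLContext:
    """Return a client context that validates the server certificate chain."""
    ctx = ssl.create_default_context()  # loads the platform's trusted CA store
    # Both settings are already the defaults of create_default_context();
    # they are restated here so the trust decisions are explicit.
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def fetch_peer_subject(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Connect to host:port over TLS and return the certificate subject fields.

    Raises ssl.SSLCertVerificationError if the chain or host name check fails.
    """
    ctx = make_verifying_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # 'subject' is a tuple of relative distinguished names, each a tuple of
    # (key, value) pairs; flatten it into a plain dictionary.
    return {key: value for rdn in cert.get("subject", ()) for key, value in rdn}
```

A successful handshake here shows only that some certification authority in the local trust store vouched for the name. It says nothing about how carefully that authority verified the certificate holder, which is the residual trust issue described above.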
13.2.4 Motivations, Incentives and Economics

Having looked at the technical and functional properties that make networks vulnerable to abuse and attack, we now briefly consider the motivations and incentives of different role players. There appear to be at least three motivations for malicious network behavior, namely mischief, economic/financial and political/ideological, perpetrated respectively by script kiddies, hackers/criminals and nation states/cyber terrorists.

From the attacker’s side, early attacks were often performed by technically savvy villains who carried out their deeds for the associated boasting rights, or indeed to attack some of their fellow villains in the cyber equivalent of turf wars [30]. It is clear that there has been a progression from these early mischievous acts to economically motivated cyber attacks and activities. The prevalence of spam email, while not a security threat in itself, is evidence of the “success” of questionable economic practices and is often also the entry point for more serious attacks such as identity theft. There is also evidence that social engineering attacks via well-crafted email campaigns have become an effective means for botnet operators to replenish their armies [48]. Other evidence of an underground economy enabled by the Internet includes the trading of compromised hosts [56], which are then enlisted in botnet armies and used to attack targets, often with economic consequences for the target. Alternatively, such botnets are used as part of extortion threats against targets, where short disabling attacks are used to convince the target to pay money to prevent a repeat attack [25]. Other economic incentives for cyber related attacks include the theft of intellectual property and identity theft. For example, some bots are harvested and/or purchased with the objective of extracting personal data or to be used as access points into otherwise closed networks.
We note that a significant factor in the booming underground cyber economy relates to the fundamental economics of Internet communication, which are very favorable towards villains. Specifically, botnets can be hired cheaply [36], which translates to very low business costs for these activities. Flat-rate service models, and the always-online practices that they enable, mean that consumer systems are easy targets for botnet recruiting and that users are less vigilant about monitoring their network usage. This situation is exacerbated by the difficulty of effective law enforcement against cyber criminals, which often requires an arduous process of coordinating with law enforcement agencies in different parts of the world. This essentially means that cyber criminals can operate with very small investment and risk. It is also interesting to note that these economically motivated cyber criminals need the network to remain operational, or at least to remain operational to the level where it enables their objectives. That is, they have no incentive to bring down the network as a whole.

While it is in the interest of economically motivated miscreants to keep the network operational, the same is not necessarily the case for politically or ideologically motivated cyber crimes. An example concerns multiple massive cyber attacks against web sites in Estonia in 2007 [82]. These attacks severely disrupted Internet functions, and they continued and evolved for an extended period of time. The attacks followed street violence after actions by the Estonian authorities which proved highly controversial with Estonians of Russian descent. For this reason some feared state involvement or endorsement by Russia. No such linkage has been proved; however, the incident does serve to illustrate the vulnerability of the Internet, or parts thereof, to concerted efforts by those with extreme political and ideological motivations.
This is unfortunately not an isolated incident, as evidenced by similar more recent cases. For example, in August 2008 Georgia accused Russia of launching cyber attacks against Georgian web sites [29], and in January 2009 DoS attacks from computers in Russia were launched against Kyrgyzstan ISPs [55]. Again, no government involvement has been proven; however, the attacks did coincide with political tension between the countries.

Since service providers are commercial endeavors, their incentives to deal with network security are also highly influenced by the fundamental economics involved. For providers, the economics are a difficult balance between commercial viability, flexibility, resiliency, and capacity. For example, while in principle it is feasible to build a network with enough capacity to withstand DDoS attacks (or more generally to add mitigation technology to that effect), such a network would be economically infeasible to operate [84]. Further, service providers need to provide network services to enable the legitimate traffic on their networks, which involves significant operational costs. At the same time there are very limited means to prevent any unwanted/illegitimate traffic from using the same resources.

As we will show in the remainder of this chapter, one way in which this economic imbalance can be addressed, at least in part, is to offer opt-in network security services which users pay for. Indeed, security services, like other services provided by a service provider, are provided in a competitive commercial environment. Like other service offers, the business reason for such an investment is typically that it would provide a competitive advantage and thus attract customers. Some security services might be provided on a subscription basis, thus directly garnering paying customers. In other cases, security services might be provided on-demand, e.g., to protect a customer against a massive DDoS attack.
In such cases, having the ability to provide on-demand protection services differentiates providers in the competitive landscape.

13.2.5 Critical Infrastructure Cyber-Security Concerns

We have already mentioned the fact that “the network” has become a critical part of our everyday lives. The extent to which this is true is well articulated in this quote from a U.S. Department of Homeland Security report [27]:

“Without a great deal of thought about security, the Nation shifted the control of essential processes in manufacturing, utilities, banking, and communications to networked computers. As a result, the cost of doing business dropped and productivity skyrocketed. A network of networks directly supports the operation of all sectors of our economy – energy (electrical power, oil and gas), transportation (rail, air, merchant marine), finance and banking, information and telecommunications, public health, emergency services, water, chemical, defense industrial base, food, agriculture, and postal and shipping. The reach of these computer networks exceeds the bounds of cyberspace. They also control physical objects such as electrical transformers, trains, pipeline pumps, chemical vats, and radars.”

This dependency between cyberspace and real world objects is exemplified by the 2003 power blackout in the northeastern U.S. and Canada [80]. This event was not caused by a network problem, nor were there any malicious actors, and indeed many different factors contributed to the final outcome. However, software (and hardware) malfunctions in the alarm system that was used were found to have contributed to the outage. In a more direct cyber/physical security incident, which fortunately had no ill effect, the Slammer worm was responsible for taking monitoring computers offline at a nuclear power plant [64].

There was a time when telecommunications in the United States was a completely regulated business.
This may have enhanced the ability of the Government to influence how the telecommunications infrastructure was engineered and operated to protect it as a critical national infrastructure component. Since the Cold War, deregulation and a highly competitive telecommunications market have evolved. Prices have fallen dramatically. Technology has evolved. Further, in a highly competitive environment where service providers are accountable to their shareholders, the ability of the government to easily influence the strategic direction of telecommunications may be more limited. The robustness of services is defined by customer preferences and private business models that seek maximum return on investment. Markets for telecommunications services are some of the most competitive. Consequently, the robustness of the buildings, equipment, software, security measures, bandwidth overhead, testing, personnel training, and a myriad of other measures becomes a function of the price and quality standards demanded by customers.

Dealing with these concerns is well beyond the scope of this chapter. However, the potential impact on national and, by extension, international security suggests that it is important for all role players to carefully consider the reliance of modern society on the Internet and the role they should be playing to ensure its continued robust operation. The way to influence the robustness of services is through purchasing practices: the best services are not necessarily the cheapest.

13.3 Service Provider Network Security

Network security for service providers consists of establishing a basic foundation for security implementation and correctly executing on the details. The network needs to be architected and configured in a way that resists the myriad of security threats it faces. Some lessons are learned from experience.
And since security threats evolve over time, other lessons can be learned through observation of smaller evolving threats and establishing protection mechanisms on a broad scale to prevent those threats from becoming customer affecting.

In this section we describe a general framework for service provider network security. Specifically, we describe a framework for network security that has been developed over time by the AT&T security organization and articulates some of AT&T’s philosophy on network security. We elaborate with some of the configuration and protection mechanisms that can be used to help protect the infrastructure. Since each network environment is unique and depends on the equipment used, the architecture, and the customers or users of the network, defining specific configurations is out of scope for this text. However, use of the framework should provide a foundation from which a solid network security program can be established. Generally, the network needs to be configured to be secure against attacks and exploits that can be prevented or neutralized. Similarly, there will remain inherent shortcomings in network infrastructure, such as weaknesses in BGP, that need to be overcome or compensated for as part of the network security framework.

13.3.1 A Framework for Network Security

The AT&T framework for network security defines seven pillars: Separation, Testing, Control, Automation, Monitoring, Response & Recovery and Innovation [14] (footnote 6). These pillars build upon having established security policy standards and engineering best practices, as well as being informed of network management practices recommended by vendors and operator groups such as NANOG [3]. All of these pillars are interdependent. As we discuss the seven pillars below, these co-dependencies will be apparent, illustrating how the true strength of the approach derives from the framework as a whole.
Separation: This fundamental security principle dictates that things that do not belong together be separated, both in terms of network functionality and in terms of duties performed by service provider personnel.

Testing: Testing and certification of all network elements and services, both before and after deployment, is crucial to ensure the functionality is as expected.

Control: The network needs to assure that authorized operators are the only ones in control of the network. Operators also need the ability to control the network both in terms of how traffic flows through the network and how the network is protected against protocol and architectural vulnerabilities.

Automation: The scale and complexity of provider networks, as well as the consistency and timeliness of actions, compel automation of as many aspects as possible.

Monitoring: Measuring and monitoring network health, changes, and activity at different levels of granularity and from different viewpoints give operators insight into the normal and anomalous behavior of elements and traffic on the network.

Response and Recovery: Knowing what attacks and threats traffic or penetration attempts present to the network is of limited utility unless operators also have the necessary tools, mechanisms, and procedures to respond and mitigate the threats. In order to react to sometimes fast developing security events, 24/7 security analysis and remediation operations are required with well developed execution plans.

Innovation: The continually evolving nature of networking in general, and of network security threats in particular, requires a reciprocal continuous investment by providers in innovation to ensure robust and secure operation of their networks.

We now consider each pillar in turn.

Footnote 6: Note that the service provider context implies a broader operational approach compared to the more traditional CIA security principles of confidentiality, integrity and availability. For example, both control and response take into account the need for dynamics in managing the security of a network; separation not only helps to maintain confidentiality of traffic on the network, but also helps to provide assurance of network management functions.

13.3.1.1 Separation

The separation principle finds application along several dimensions in a provider network. Two major categories are separation of traffic traversing the network and separation of duties involved in operating the network.

Separation of Traffic

Traffic separation is used to maintain priority of various traffic types on the core network. Multi-protocol Label Switching (MPLS) technology provides a powerful capability to provide traffic separation on the network core. Separation of management (i.e., network command and control) traffic is the first priority. As we will discuss later, maintaining control of the network under all circumstances is a fundamental factor in maintaining services and operational continuity. MPLS networks can carry different protocol families, e.g., IPv4 and IPv6, on the same infrastructure while essentially keeping their operation separate. MPLS can be used to separate virtual private networks (VPNs) from other Internet traffic, and to provide separation between various customer networks. MPLS can be used to isolate traffic associated with specialized services such as VoIP and DDoS defense traffic. MPLS can also be used to prioritize delivery of certain traffic in times of stress (e.g., when segments of the network become overwhelmed with traffic floods). Since MPLS provisioning and separation features are implemented logically in the network with minimal or no physical configuration changes, MPLS lends itself to automation, which we will see is an important security attribute. Maintaining the separation of traffic sometimes necessitates the deployment of dedicated equipment at the edge of the network.
For example, dedicated equipment is needed when switching elements (e.g., routers or firewalls) do not provide inherent separation mechanisms and cannot support the appropriate delivery prioritization, or might have limited bandwidth and can be overwhelmed with traffic.

Separation of Duties

Strict rules are enforced regarding who can view the status of network elements and who can modify them. With rare exceptions, only operations teams have direct access to network elements and support systems. Select development/engineering team members may have read-only access to operational systems to help diagnose and reproduce behavior in the lab, but they are required to work through operations teams for any changes. This model helps to establish a consistent authorization model, which we will see is important to maintain secure control of the network and the associated support systems. Those with access need to be appropriately trained and practiced in appropriate network operations. Similarly, operations teams are held accountable for the reliability and operations of the network. Therefore, they generally only support elements that have been “certified” through a battery of tests. Similarly, testing teams should be largely independent of engineering teams to assure no bias in compliance testing and to add an additional factor of variability in verification.

13.3.1.2 Testing

Testing is necessary to verify that the network will operate as expected in terms of operations and security. Testing applies to the certification of network elements as well as tests or audits of the configuration and installation environment in actual practice.

Certification Testing

Certification tests are formulated for each element type (e.g., device, model, version, and patch) to assure products behave as they are supposed to. Service providers cannot depend on vendor marketing claims.
Nor can vendors be expected to emulate all of the situations that are encountered in an operational environment. Therefore, independent tests help to validate that devices are interoperable, reliable, maintainable, monitorable, replaceable and securable. Security specific tests are used to verify that security requirements and security policy can be satisfied. Examples include verifying access control and authentication compatibility, verifying that controls limiting the exposure of management interfaces to user networks work as expected, and validating that the equipment responds appropriately to unorthodox traffic. Naturally, this testing necessitates a reasonable amount of test facilities, effort, and time. But the effort is well worth it, and there are peripheral benefits such as the ability to provide rapid lab-reproduction for problems, patches, and installation processes when needed most. Many product vendors are willing to partner with large providers to support the testing, since the testing process provides significant insight into the operational needs and “opportunities for improvement” of their products.

Configuration Testing and Audits

Configuration testing and audits are also regularly performed on systems as they have been installed in or around the network. Security tests include vulnerability scanning with commercial security scanning products to help identify potentially exposed access points. Occasional “white-hat” security penetration tests are performed that study the circumstances of installations and try to identify points of vulnerability that can be exploited. This includes social engineering exploitation attempts, a context that no scanning tool can consider.

Audits come in multiple forms. Some aperiodic regulatory audits validate that the proper security practices are in place. Large network providers are frequently exposed to this type of audit in the course of providing communications services to certain industries such as financial and health.
Regular system configuration audits are performed against systems by collecting the configuration of individual devices and vetting those configurations against expected configurations. Tests include validation that devices are at proper version and patch levels, appropriate access control lists (ACLs) are in place, only appropriate services are active, etc. In a very large network consisting of millions of manageable elements and thousands of configuration scenarios, there is no choice but to automate these types of checks.

13.3.1.3 Control

Network control functions can be categorized into several general and sometimes overlapping categories:

- Operational availability controls: measures taken to assure operations personnel have full control of the network at all times in terms of situational awareness and the ability to make changes to the network.
- Device access controls: measures taken to assure that only authorized network operations personnel and the tools they employ have access to devices in the network.
- Passive router controls: measures taken to avoid exposing network forwarding elements to protocol attacks or other denial of service attacks.
- Traffic flow controls: measures to control where traffic can and cannot go on the network, especially during security incidents.

Operational Availability Controls

To the extent possible, operational tasks are scripted as Methods and Procedures (M&Ps) to establish and maintain consistency in practices and maintain positive control of the network. As we will discuss in the automation section, the need to provide automation of network management functions presents some important security attributes. However, there are still many scenarios that cannot be automated and require some form of human judgment and intervention.
For example, when unexpected network events occur, M&Ps establish appropriate collaboration teams, scenario development, and operational change guidance to resolve any issue. During a security event, the first order of business is to assure network control and operations are maintained. We detail some structure and methods for this response capability in Section 13.6.

To ensure operations staff can access network elements under all circumstances, both out-of-band and in-band access to network elements should be provided. Much of today’s network management and operational tasks are performed in-band. That is, network management traffic is carried like any other traffic through the network (possibly with a higher priority) and specifically addressed to an IP address associated with the device being managed, typically an interface address of the router. This arrangement simplifies network management tasks because tools can directly access routers without having to negotiate the intricacies of out-of-band access. In-band access to the control plane should, however, be restricted to specific trusted source address ranges to prevent access from external, untrusted parts of the network. While in-band operational access is preferred, all network devices should also be reachable via out-of-band means, which should not be dependent on the correct operation of the network being managed. Out-of-band access is needed in cases where a severe network problem (not necessarily security related) prevents in-band device access. Typical out-of-band access is provided via dial-in access to terminal servers, which in turn provide device access via serial ports on network elements.

Device Access Controls

Device access controls are used to protect against the significant potential for abuse and service disruption through unauthorized access to network elements. In the first instance this involves appropriate authentication, authorization and accounting (AAA).
Network device access should be limited to authorized users to allow them to perform (only) the specific functions they are authorized to perform. Further, user actions should be logged for auditing and debugging purposes. The de facto standard for providing AAA functions is the TACACS+ protocol. TACACS+ started out as a vendor proprietary protocol, but has since been widely adopted [21]. TACACS+ simplifies the management of AAA functions by utilizing a centralized database that defines the functions specific users can perform, and it can further be configured to log all configuration actions. To prevent a single point of failure, TACACS+ servers can be replicated and network elements can be configured to try them in turn.

While remaining a mainstay of network management, older versions of the simple network management protocol (SNMP) have poor security properties. For example, for SNMP versions prior to SNMPv3, SNMP access control is provided by clear text “community strings”, which are susceptible to compromise via simple packet sniffing techniques. These shortcomings have been addressed through security mechanisms in SNMPv3; however, not all equipment is SNMPv3 capable. For such devices, SNMP security limitations can be addressed by: (i) limiting SNMP to read-only access, (ii) installing access control lists to limit SNMP access to specific SNMP server addresses and (iii) compartmentalizing SNMP access between different SNMP based tools (i.e., providing each tool with separate and function-specific access).

Network devices generally use telnet as the default access protocol. Like SNMP community strings, telnet passwords are transmitted in clear text. Telnet access should be disabled, and access should be provided via an encrypted transport such as ssh.

Passive Router Controls

Network elements are in essence special purpose “computing devices” and are therefore subject to many of the same vulnerabilities as general purpose computing devices.
Specifically, attackers can exploit software or other vulnerabilities to launch an attack against the functionality of the network element. At the most basic level, protecting the router as a whole requires operators to explicitly use/allow what is needed while explicitly not using/allowing what is not needed. For example, because of the diverse services and configurations that they enable, routers are capable of a myriad of features and protocols, and not even the most sophisticated networks make use of all such features. Routers should therefore specifically be configured with the services and protocols desired, and those that are not needed should be disabled. Further, physical router interfaces that are not currently in use should be explicitly disabled.

The router control plane provides functions that are typically too complicated to perform on linecards and thus constitute a so-called “slow path” through the router. As such, the control plane is a potential attack target. For example, an attacker can attempt to exhaust the control plane resources by simply sending large numbers of packets that require processing in the router control plane. Similarly, an attacker might send malformed packets in an attempt to trigger bugs in the control plane software. Such attacks might cause protocol daemons, or the router itself, to crash, or might allow the network element to be compromised by allowing unauthorized access.

To protect the router control plane, again the basic approach is to allow all wanted and/or needed communication, while prohibiting all other communication. First, access control lists should be defined to restrict which network entities and which protocols are allowed to interact with the control plane. Second, within each allowed protocol, options with security concerns should be explicitly filtered out.
Examples include filtering packets with IP options set, filtering all fragmented packets, and limiting ICMP to a safe subset of all ICMP packet types (Destination Unreachable, Time Exceeded, Echo Reply).

While a more obvious target, the router slow path is not the only potential attack target. Specifically, attacks against the router data-plane or “fast path” are also feasible. These attacks might take the form of resource exhaustion attacks against network elements that maintain state, e.g., stateful firewalls. Network elements in core provider networks typically do not maintain such state (by design), and data-plane attacks are therefore typically limited to edge devices.

In addition to per-router protection mechanisms, filtering, in the form of access control lists (ACLs), should be performed at the perimeter of the network to protect both the provider network and the customers of the provider. Again, only the limited safe subset of ICMP packet types should be allowed to cross the provider edge. Access to infrastructure IP addresses (footnote 7) within the provider network should not be allowed from external networks, i.e., from the Internet and from customer networks. On links from customer networks, source address validation should be performed to prevent address spoofing. Since network operations’ traffic should only enter the network from the NOC, source address ranges associated with this function should be blocked from non-provider networks.

There need to be filters and policy management functions on routes that are received from other network providers. BGP filters should be deployed to defend the network against basic routing exploits. So-called “bogon” routes, i.e., the default route, the loopback network, RFC 1918 routes and IANA-reserved routes [39], should be filtered out. In the same way, routes with private autonomous system numbers (ASNs) should not be accepted or announced.
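The bogon and private-ASN checks just described can be sketched as follows. This is a minimal illustration under stated assumptions, not router configuration: the function names are hypothetical, the bogon list is a small IPv4 subset chosen for illustration (an operational list would be derived from current IANA registry data), and the private-use ASN ranges are those given in RFC 6996.

```python
import ipaddress
from typing import Sequence

# Illustrative subset of IPv4 bogon space (assumption: not a complete list).
BOGON_NETWORKS = [
    ipaddress.IPv4Network(n)
    for n in (
        "0.0.0.0/8",       # "this network"
        "127.0.0.0/8",     # loopback
        "10.0.0.0/8",      # RFC 1918
        "172.16.0.0/12",   # RFC 1918
        "192.168.0.0/16",  # RFC 1918
        "169.254.0.0/16",  # link local
        "224.0.0.0/4",     # multicast
        "240.0.0.0/4",     # reserved
    )
]
DEFAULT_ROUTE = ipaddress.IPv4Network("0.0.0.0/0")

# Private-use ASN ranges (16-bit and 32-bit) per RFC 6996.
PRIVATE_ASN_RANGES = [(64512, 65534), (4200000000, 4294967294)]


def is_bogon(prefix: str) -> bool:
    """True if the IPv4 prefix is the default route or lies inside bogon space."""
    net = ipaddress.IPv4Network(prefix)
    if net == DEFAULT_ROUTE:
        return True
    return any(net.subnet_of(bogon) for bogon in BOGON_NETWORKS)


def has_private_asn(as_path: Sequence[int]) -> bool:
    """True if any AS number in the path falls in a private-use range."""
    return any(
        low <= asn <= high for asn in as_path for low, high in PRIVATE_ASN_RANGES
    )


def accept_route(prefix: str, as_path: Sequence[int]) -> bool:
    """Apply both checks; a route must pass each one to be accepted."""
    return not is_bogon(prefix) and not has_private_asn(as_path)
```

For instance, `accept_route("192.168.1.0/24", [64496])` is rejected because the prefix is RFC 1918 space, while a publicly routed prefix with an AS path free of private-use numbers passes both checks. The sample AS numbers used here are from the range reserved for documentation.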
Since in practice AS path lengths are known to be constrained [78], limiting the acceptable BGP path length provides basic protection against exploit attempts. Further, as a basic measure against prefix hijacking, routers should be configured to not accept or announce routes with prefixes longer than a specific length, e.g., /24. (Recall that longer, or more specific, prefixes are more preferred and if accepted can therefore override legitimate shorter prefixes.) Similarly, for customer peering, only routes for prefixes that are assigned to the customer should be accepted. Route processing is compute intensive and uses memory on the router control plane. To prevent starvation of processor and memory resources, the rate and number of routes that are accepted from peers should be limited. Finally, all address space allocated to the provider should be blackholed in the aggregate (i.e., effectively dropped). Since all used address space will be specifically advertised via more specific addresses, this practice prevents abuse of currently unused but allocated address space.

Footnote 7: Infrastructure IP addresses refer to those addresses through which the equipment itself can be reached as an IP destination.

Note that these measures provide a necessary first step in protecting the routing plane. However, because of the inability to precisely filter routes received across peering links, significant vulnerabilities remain. This was exemplified by the previously mentioned YouTube hijacking incident [68], where an incorrect route-map installed by a local provider caused a more specific route to be leaked to the Internet as a whole, thereby accidentally hijacking all traffic over the Internet that was destined for the content provider. To address these scenarios, current monitoring and analysis of routing data should be performed to determine when and where likely events occur.
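One simple form of such routing-data analysis is to compare observed announcements against the prefixes and origin ASes that a provider expects. The sketch below is an assumption about how such a check could look, not a description of any operational system: the `EXPECTED_ORIGINS` table, the function name, and the sample data (a documentation prefix and documentation AS numbers) are all hypothetical.

```python
import ipaddress
from typing import Iterable, List, Tuple

# Hypothetical table: prefixes we originate -> AS numbers allowed to originate them.
EXPECTED_ORIGINS = {
    ipaddress.ip_network("203.0.113.0/24"): {64496},
}


def find_suspect_routes(
    observed: Iterable[Tuple[str, int]],
) -> List[Tuple[str, int, str]]:
    """Flag observed (prefix, origin AS) pairs that fall inside our address space
    but either come from an unexpected origin or are more specific than expected."""
    alerts = []
    for prefix, origin in observed:
        net = ipaddress.ip_network(prefix)
        for ours, allowed in EXPECTED_ORIGINS.items():
            # Only announcements equal to, or more specific than, one of our
            # prefixes are of interest here.
            if net.version != ours.version or not net.subnet_of(ours):
                continue
            if origin not in allowed:
                alerts.append((prefix, origin, "unexpected origin AS"))
            elif net.prefixlen > ours.prefixlen:
                alerts.append((prefix, origin, "more specific than expected"))
    return alerts
```

Given the observations `[("203.0.113.0/24", 64496), ("203.0.113.0/25", 64511)]`, only the second entry is flagged, since it is a more specific announcement from an origin that is not in the expected set. A flagged route is a candidate for operator review rather than proof of a hijack; a more specific announcement from an expected origin may, for example, be legitimate traffic engineering.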
Operational processes are also needed to coordinate the mitigation of rogue routes within the network and to coordinate with other providers to remediate rogue routes.

A summary of some of the operational availability, device access and passive router controls needed in a network is depicted in Fig. 13.1. The figure shows the separation of the different functional entities in the overall network operation and the security controls associated with each entity. While network-based services and the network operations center (NOC) are in a sense part of “the network”, their specific higher level functions (compared with the basic packet forwarding functionality of the network) raise unique security concerns.

Fig. 13.1 Securing components of a provider network. The figure shows customer networks, the Internet, the IP/MPLS backbone and access networks, network-based services (DNS, hosting, CDN, VoIP, etc.) and the network operations center (NOC), annotated with the following controls:
- Provider networks (backbone and access): authenticated device access; disable unused services/protocols
- Interface with customer networks: block access to infrastructure addresses; anti-spoofing for NOC addresses; ingress filters (address assurance); routing stability filters
- Interface with network-based services and interface with NOC: authenticated device access; ingress/egress packet filters; block allocated unused address space
- Interface with Internet: block access to infrastructure addresses; anti-spoofing for NOC addresses; max AS limit check; route dampening; route filtering (RFC 1918)

Traffic Flow Controls. When security events do occur, appropriate dynamic traffic flow control mechanisms should be available to mitigate and/or eliminate the security threat. Examples include:
- Deploying access control lists (ACLs) to prevent the spread of an impending worm epidemic.
- Mitigating the effect of a DDoS attack by dropping DDoS related traffic or redirecting it to a scrubbing complex (see Section 13.5.2).
- Adjusting routing policies in response to temporary peering link overload conditions.

13.3.1.4 Automation

Automation is imperative when operating a large network. Reliability and scalability are the primary factors that necessitate automation. Operations support systems are implemented for provisioning, route policy management, billing, audit, network element scanning, log collection, security analysis and a myriad of other network operations functions. These systems provide consistency and accuracy to the process of managing the millions of elements that make up the network. They also provide a form of separation that assists with security. For example, it should be possible to designate the specific operations support systems that need, and are permitted, to perform SNMP probes on network elements. This makes the activity patterns for the SNMP protocol somewhat predictable, and the security analysis algorithms needed to evaluate the validity of SNMP probes should be relatively straightforward. Support systems also present security challenges since they tend to hold the keys to the kingdom. These platforms need to be engineered, tested, and operated with particular attention to security.

Automation allows operators to concentrate on exceptions. As we outline in later sections, automated responses to security related events can be particularly challenging since the events are inherently unpredictable and may be deliberately deceptive. In some cases, available network data alone may not be sufficient to determine if an event is a security issue. For example, distinguishing between a legitimate flash crowd and a DDoS attack may require application or service specific information [46].
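The SNMP example from this section can be sketched as a simple validity check: any SNMP probe whose source is not one of the designated operations support systems is flagged for review. The record layout (`src`, `dst`, `proto`, `dst_port`) and the function name are assumptions made for illustration. Real flow or log records will differ.

```python
import ipaddress

SNMP_PORT = 161  # standard UDP port on which SNMP agents listen


def unexpected_snmp_probes(flows, authorized_sources):
    """Return the flows that probe SNMP from outside the designated systems.

    flows: iterable of dicts with 'src', 'dst', 'proto' and 'dst_port' keys
           (hypothetical record layout).
    authorized_sources: prefixes of the designated operations support systems.
    """
    allowed = [ipaddress.ip_network(p) for p in authorized_sources]
    flagged = []
    for flow in flows:
        # Only SNMP requests towards network elements are of interest here.
        if flow["proto"] != "udp" or flow["dst_port"] != SNMP_PORT:
            continue
        src = ipaddress.ip_address(flow["src"])
        if not any(src in net for net in allowed):
            flagged.append(flow)
    return flagged
```

Because the set of legitimate probe sources is small and known in advance, the check reduces to a membership test, which is what makes this protocol's activity comparatively easy to validate.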
13.3.1.5 Monitoring

In the context of security, monitoring falls into two primary categories: (i) analyzing traffic behavior on the network for security anomalies, and (ii) analyzing control activities to ensure there has been no breach of control systems. Because network traffic monitoring in the context of large provider networks is a relatively complex subject, we deal with this topic separately in Section 13.4. Compared to monitoring and analyzing network traffic, monitoring the control activities of the network is more closely related to security analysis of business enterprise networks. When monitoring control activities, it is possible, for example, to perform much more focused checks for policy violations, or to flag specific exceptions to normal behavior.

The AT&T Threat Management Solution is an example of a control activity monitoring system. This Security Information Management (SIM) system performs security analysis on data collected from a variety of sources related to network management and operations systems. A unique aspect of the system involves the use of a highly scalable data management system called Daytona [1] that allows the system to scale in depth and breadth. The Daytona technology also provides a means to perform analytical functions with significant flexibility and performance. Scalability allows the system to collect not only security event data but also a variety of other activity event data associated with network management activities. Such inputs include firewall logs, flow data, alarms (e.g., intrusion detection system (IDS) alarms), inputs from subordinate SIMs and syslog data collected from a variety of network elements. Performing security related analysis across such a broad range of sources allows the identification of security events that might go undetected in systems that perform specialized security detection.
Scalability is also important to allow online retention of a significant history of activity, e.g., dating back many months. Suspected events may only come to light weeks or months after the fact. The ability to forensically isolate specific suspect events can often help to determine the root cause and aid in resolution. Attackers will try to hide their tracks, but they will have difficulty hiding from a system that collects and analyzes data from many points in an independent repository.

13.3.1.6 Response and Recovery

Given the global nature of today’s economy and the “flat world” nature of interactions on the Internet, it should be apparent that while provider networks experience well established daily peaks and valleys in terms of demand, these networks carry significant traffic volumes throughout the day. Providers are therefore required to have the necessary support in place to provide a commensurate response on a 24×7 basis. Further, to ensure any event receives appropriate attention from technical experts, a tiered support structure is essential. Tiered support allows routine events to be handled through documented operational procedures, or ideally through automated operational procedures, thus leaving domain experts free to deal with unexpected and/or sophisticated events. This subject is explored in greater detail in Section 13.6. Further, while to date most security incidents have not constituted disasters, it is conceivable that a massive security event might develop into the cyber equivalent of a physical disaster. Service providers should therefore have a well developed recovery program in place. This topic is dealt with in detail in Chapter 14.
13.3.1.7 Innovation

As we will outline in the remainder of this chapter, the unique requirements of each provider’s network, service offerings and users suggest that provider-unique innovations are required to supplement security vendor offerings and to integrate vendor offerings into a comprehensive security solution.

13.4 Importance of Network Monitoring and Security Intelligence

A securely configured network is a necessary first step in service provider network security. Unfortunately, as outlined in Section 13.2, even a perfectly configured network remains vulnerable to attack and exploitation because of, for example, inherent protocol vulnerabilities and the fact that different role players and protocol components have inherent dependency and trust relationships. To address these concerns, diligent service providers are required to develop and deploy extensive network monitoring capabilities and to develop systems and algorithms to derive actionable intelligence from such data.

In this section we will first expound a number of principles associated with security related provider network monitoring. We will then consider sources of security monitoring and touch on the challenging aspect of implementing reliable monitoring mechanisms and tools. Finally, we will consider the use of network flow records as a specific source of data from which network intelligence can be derived. We will show real world examples of network intelligence derived from flow data, which illustrate its utility but, more importantly, also show how network intelligence can provide early indicators of potential future security events. We end this section by considering the importance of automated analysis of network intelligence.

13.4.1 Principles of Provider Network Monitoring

Network monitoring is a critical but challenging component in the security arsenal of network service providers.
Below we discuss some of the challenges and opportunities associated with these actions and articulate a number of principles associated with security related network monitoring. These principles are listed in the text box titled “Principles of Provider Network Monitoring” and discussed in detail below.

Principles of Provider Network Monitoring
- Providers have broad visibility and coverage.
- Network monitoring is an integral part of network operations.
- Proper base-lining helps prevent false monitoring.
- Combine external information with analysis-derived intelligence.
- Perform only appropriate network monitoring.
- Security monitoring has broader benefits.
- Security monitoring helps providers understand the bigger context.

Providers Have Broad Visibility and Coverage. While network monitoring is an essential part of simply keeping the network operational, we note that from a security perspective service providers are in a strong position to detect and react to security concerns. For example, compared to individual users and enterprises, provider networks with many client users have an entirely different perspective of network threats such as botnets, which are typically used as the platform for a variety of nefarious network activities. In the case of enterprise networks, visibility is typically restricted to activity in one’s own address space, where a relatively small number of addresses will generally be active. When a probe or exploit is attempted in this space, it is difficult to assess whether this is a random attack or a targeted attempt against a specific organization. Any statistical measurement could have significant error because the observations are not sufficiently widely distributed.

Network Monitoring Is an Integral Part of Network Operations. Service providers that do not perform the appropriate network monitoring are essentially blind to what is happening in their network, with potentially dire consequences for them and their customers.
Events such as self-propagating network worms, email viruses, massive exploit events, distributed denial of service (DDoS) flooding attacks, and spam floods can occur at scales that can potentially congest network services. Many of these activities are associated with botnets that are either actively attacking or in the process of recruiting more bots. As long as botnets are able to generate revenue through DDoS attack extortion, spam campaigns, and other questionable activities, they will present a major and growing threat to network services.

Proper Base-Lining Helps Prevent False Monitoring. It is crucially important for service providers to perform network monitoring continuously, not only when a network event is taking place. In cases where little or no monitoring is performed until an event happens, everything becomes an event. False monitoring is the result of not performing the necessary monitoring during “normal” network conditions, which means that there is no baseline for comparing normal against abnormal when security events occur. The Internet continually carries background noise due to exploit scans for new and old vulnerabilities, survey and research probes, DDoS backscatter, and other unexplained activity. Consequently, it is possible to look at traffic at any time and conclude an attack is underway. It is important to conduct some sort of baseline monitoring at all times and to assess the relative impact of the undesired activity to determine if an attack really is underway, or if the activity should simply be ignored. Ideally, this would be a science. But the Internet is not ideal, and consequently distinguishing “attack” from “noise” is something of an art. As with any art, it requires practice and skill, which in the current context translates to service providers maintaining a staff of well trained security analysts.
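To make the role of a baseline concrete, the following minimal sketch compares a current measurement against measurements taken under normal conditions. The use of the median and the median absolute deviation (MAD), the threshold of 5, and the function name are illustrative assumptions and not a method prescribed in this chapter. In practice the judgment of a trained analyst is still needed.

```python
from statistics import median


def is_anomalous(history, current, threshold=5.0):
    """Compare a current measurement against a baseline of normal periods.

    history: measurements (e.g., packets per minute towards a service) collected
             at comparable times under normal conditions.
    Returns True when `current` deviates from the baseline median by more than
    `threshold` times the median absolute deviation (MAD).
    """
    if len(history) < 2:
        # Without prior monitoring there is nothing to compare against.
        raise ValueError("a baseline is needed before anything can be called abnormal")
    base = median(history)
    mad = median(abs(x - base) for x in history)
    if mad == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return current != base
    return abs(current - base) / mad > threshold
```

The explicit error for a missing history mirrors the point made above: without monitoring during normal conditions, there is no basis for calling anything abnormal.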
Combine External Information with Analysis-Derived Intelligence. In the past, computer attacks targeted select victims that had weak or flawed security. Now, botnets take advantage of anyone with even minor security weaknesses. Some attacks, such as DDoS, do not depend on any basic flaw in the target systems to succeed. There are some sources of threat information based on honeypots/honeynets, such as shadowserver.org and cymru.com. But the information available regarding the sizes of botnets and the threats they present is often based on statistical models rather than actual measurements. There are few reliable sources of data that characterize the threats botnets present to network and application service providers. To understand the types and sizes of threats, it is necessary to merge externally available data with specific information about activity on your network. Examples of external security related data sources include organizations that track sources of spam, analyze malware or track nefarious activity across the Internet.

Perform Only Appropriate Network Monitoring. It is important to establish a strictly enforced policy regarding network monitoring. It may not be permissible to perform traffic content analysis on carrier networks without appropriate justification (see footnote 8). There may also be legal differences between using monitoring systems to detect activity profiles that are indicative of malicious behavior and less discriminating perusal of traffic [62]. For example, it is a generally accepted (indeed expected) practice for ISPs to scan email content for virus attachments and/or links to malicious content. In general, users do not complain about email scanning, which scans content for virus and spam signatures, because the utility of such actions hugely outweighs potential customer concerns.
Security Monitoring Has Broader Benefits. Some of the benefits of good network intelligence are somewhat peripheral to the operation of a robust network. The primary objective of operating a reliable and robust network is maintaining the service. DDoS attacks can threaten the service by clogging pipes. Spam originating from user clients can overload email systems and can result in other providers blocking email from your customers. On the other hand, phishing, identity theft, and to some extent network exploit attempts are less likely to affect the network services, but they can have a detrimental effect on customer satisfaction. If early detection and mitigation results in fewer affected customers/clients, then customer satisfaction is improved, and subsequently, customer service calls and service cancellations may be reduced.

Security Monitoring Helps to Understand the Bigger Context. Finally, it is important to understand the threats that affect services and your customers/users. Specifically, when a customer is under a DDoS attack it does not necessarily follow that the attack is negatively affecting the customer [84]. For example, large content providers are typically under constant attack and have to deal with it as part of staying in business. Unilaterally mitigating such attacks might make the situation worse, especially since many mitigation strategies have negative side effects. Also, the type of business customers are conducting will directly impact the type of traffic they expect to see on their network and will therefore impact the type of mitigation strategies that would be appropriate. For example, traffic in a corporate private network can be expected to be more predictable, lending itself to protection strategies that take advantage of that predictability [47, 59]. The business model of e-commerce and content web sites, on the other hand, is built on the premise of attracting a less predictable audience, requiring alternative mitigation strategies [86].

(Footnote 8: In the U.S., network monitoring is allowed by a provision in the so-called wiretap law. Specifically, U.S. Code, Title 18, Chapter 119, § 2511 deals with “Interception and disclosure of wire, oral, or electronic communications prohibited” and states the following: “(2) (a) (i) It shall not be unlawful under this chapter for an operator of a switchboard, or an officer, employee, or agent of a provider of wire or electronic communication service, whose facilities are used in the transmission of a wire or electronic communication, to intercept, disclose, or use that communication in the normal course of his employment while engaged in any activity which is a necessary incident to the rendition of his service or to the protection of the rights or property of the provider of that service, except that a provider of wire communication service to the public shall not utilize service observing or random monitoring except for mechanical or service quality control checks.”)

13.4.2 Network Monitoring

Deriving good network intelligence builds on good basic network monitoring. As such, we will now in turn look at network monitoring, i.e., the mechanisms and infrastructure needed to collect network data, and then discuss how this data is used to derive network intelligence. We first consider the various sources of network related monitoring data and then discuss some of the practical issues related to developing the appropriate infrastructure for collecting such data.

13.4.2.1 Types and Sources of Security Monitoring

There is no single good source for network security data. Further, network security concerns often impose contradictory requirements on security data. For example, a global view of the security state of the network demands complete network coverage of all traffic on the network.
Such complete coverage by necessity will have to be provided through an aggregate view (or a variety of aggregate views). On the other hand, determining the payload signature of an evolving worm epidemic requires very detailed monitoring of a subset of the traffic on the provider network. A good security monitoring approach will include all, or a significant subset, of the data sources discussed below and summarized in Table 13.1.

Table 13.1 Types and sources of security related data

Category              | Example source            | Information
Infrastructure data   | Asset databases           | Node/link locations
                      | Node configuration        | Configured protocols and services
Traffic dynamics      | Flow records              | Network wide traffic
                      | SNMP data                 | Node health
                      | Route monitors            | Internal and external routing
                      | Packet inspection devices | Detailed traffic characterization
Service specific data | DNS logs                  | Botnet/phishing activity
                      | Spam traps                | Spam sources
                      | Honeypots and honeynets   | Malware characterization

Monitoring of network node resources, such as link bandwidth utilization, CPU load, and memory use, is a necessary and useful part of network health monitoring. These SNMP-based monitoring mechanisms also have a purpose in security, since some types of network events that can affect network performance are consequences of malicious activities at large scale. The objective is to recognize potential performance impact to the network and applications. Ideally, the goal is to recognize events as they develop, prior to the point where impact has occurred.

Infrastructure data includes information from asset databases and network element configuration information.

Other than highly aggregated information, SNMP derived node health data reveals little about the traffic dynamics on the network. Flow records collected by network elements provide significantly more granular information [15].
Specifically, flow based data analysis can provide insight into volume, protocol/port, source addresses, destination addresses, byte-to-packet ratios, and timing characteristics of events. Flow records can be generated with packet sampling and still provide useful insight into significant events such as DDoS attacks and network worms. Identification and characterization of some more subtle events, such as network reconnaissance, attack forensics, and botnet controller identification, require unsampled flow data generation. Sampling of flow data is considered in detail in Chapter 10, and we consider the use of flow data to derive network security intelligence in more detail below in Section 13.4.3.

Given the crucial role of routing in the wellbeing and overall operation of the network, monitoring all aspects of routing is critically important both for normal network operations and from a network security perspective. Routing data aids the analysis of security incidents and supplements other data, e.g., by showing where traffic might have entered or left the network. A number of route monitoring tools with a range of capabilities exist. In their most basic form, monitoring tools take part in routing exchanges with routers in the network to allow an accurate real time view of routing. This includes tools to monitor interior gateway protocol (IGP) routing, such as OSPF [72] and IS-IS [40], as well as interdomain routing. More sophisticated monitoring tools allow the detection of inconsistent route advertisements across differing peering points from the same peer [31], look for more general violations of peering agreements [63] or attempt to detect prefix hijacking [90]. Route monitoring is covered in significantly more detail in Chapter 11.

Flow data, by its nature, does not provide any information regarding the payload of packets. More granular data plane monitoring can be realized through so-called deep packet inspection (DPI) devices [26].
Such information might be crucial to understand, for example, the type of payload of a targeted infrastructure attack that is causing router malfunction [23]. As mentioned earlier, another example where payload information might be needed would be to understand the exploit method of an evolving worm outbreak. Unlike flow monitoring, the equipment and operational costs associated with DPI monitoring make ubiquitous deployment prohibitive. One approach to address this problem is to deploy DPI equipment at strategic locations in the network and then to redirect traffic of interest to these locations for further inspection, e.g., by using a very specific BGP prefix. This approach has significant limitations though, as traffic of interest cannot always be easily identified and redirected. An alternative is to utilize a mobile approach, whereby a DPI device is dynamically deployed to the physical network location from where detailed information is needed. This approach works best when the data collection of the DPI device is completely passive. The mobile DPI approach also has limitations because of the time involved for deployment; however, it could provide a practical compromise.

In addition to data derived directly from network traffic, data from network based services and security specific data sources fill out the quiver of potential data sources. An example of a service specific data source is logs from DNS caching resolvers. DNS is designed to translate domain names into IP addresses. It is much less effective at reversing the process, i.e., identifying domain names that point to known IP addresses. Internet registries have different requirements for maintaining reverse mapping information, and reverse DNS is not uniformly implemented and, when implemented, is not always well maintained [71]. Attackers use this situation to their advantage. Botnets use domain names in malware to identify malware update sites and control points.
Phishing sites create domain names that appear legitimate. As malicious IP addresses are identified and blocked by ISPs, or at enterprise firewalls, domain names can be pointed to new IP addresses, allowing attacks and operations to continue. By recording DNS logs or DNS response metadata from the network, it is possible to map IP addresses to the domain names used in these malicious activities [49]. It is also possible to perform a variety of analyses, such as temporal analysis of domain names to identify fast-flux and transient domains [41], which can be used to help discover botnets and phishing activities.

Security specific data sources include various approaches to intentionally attract unwanted traffic or attacks to a controlled environment where they can be analyzed to make useful security related observations. Generically, these systems are called honeypots, and the basic concept is nicely captured by this definition [77]: “A honeypot is an information system resource whose value lies in unauthorized or illicit use of that resource.”

Honeypots present a popular way to gather information about malware. When computers with common vulnerabilities are hosted on the Internet or in enterprise networks, attackers will eventually locate the machines and execute exploits against them. By hosting honeypots on many IP addresses, the probability of becoming a victim of attack increases. The objective is to capture malware, detect the event, and provide an opportunity to analyze and characterize the malware. This technique has limited utility if the honeypot does not have the correct vulnerability, or if user action is a factor in the infection process, as is the case with many application exploits. For example, the virtual honeypot framework [65] interacts with attack and exploit attempts only at the network level.
This means that the actual end system software is never compromised, which reduces the risk associated with running the honeypot, but also limits the amount of information about malicious activity that can be learned from it. Honeypot technology that understands how to become a victim of social engineering attacks is still under development and probably will evolve for some time to come.

To study attacks, exploits and malware behavior in a more holistic way requires that attackers be allowed more freedom on the honeypots, i.e., allowing the honeypots to become infected. Mechanisms need to be in place to prevent the honeypot from subsequently conducting attacks as a consequence of the infection. For example, honeynets [37] have been developed as a means to allow honeypot end-systems to be completely compromised, and for the installed malware to be allowed to execute in order to study its behavior. This is achieved by selectively allowing interaction of the compromised honeypot with systems outside the honeynet through a filtering device called a honeywall [37]. Approaches to better scale honeynets have been developed that make use of virtual machines to host a large number of honeypots on a significantly smaller number of physical machines [87]. Unfortunately, in the ever evolving cat-and-mouse game between attackers and defenders, newer malware is capable of detecting execution in a virtual machine environment. The malware takes this as an indication of a possible detection attempt and automatically disables itself [53]. The result is that honeynets had to evolve to detect virtual machine aware malware and facilitate execution of such malware on non-virtualized hardware [44].

Another form of honeypot is the so-called spam trap, or spam sinkhole, which creates email accounts or complete email domains with the express purpose of attracting spam email [66, 77].
Obviously, there is no shortage of spam; roughly 90% of Internet email is characterized as spam [58]. In the effort to manage spam and understand its relationship to customers, there is value in understanding when characteristics of spam activity change, how they developed, the purpose or objectives of the spam, and how that activity relates to customers. There are significant efforts by a variety of organizations to track sources of spam and use that information to identify problematic IP addresses, IP address blocks, and domains (e.g., http://www.spamhaus.org/). They also characterize attributes of the spam messages themselves for the purpose of detecting the spam. As a contribution to these efforts, and also for an ISP’s own use, there is value in creating spam traps within the ISP’s email systems. By creating some proportion of spam trap accounts and seeding account information into a variety of places, it is possible to learn valuable information about spamming activity as it pertains to the ISP’s network and email domain. By tracking spam associated with different account seeding techniques, it may be possible to determine how email addresses are being harvested by spammers.

Spamming activities may come in surges. In some cases, spamming campaigns will seek to flood messages to the brink of an email system’s capacity. Measuring changes in the volume of spam, and whether that spam is detected as spam or not, will help prepare appropriate mitigation strategies. It will be possible to quickly recognize the onset of new spam campaigns. Understanding the motives behind those campaigns can be valuable. Spam can be characterized in categories of malware, phishing, or mundane solicitation. Malware spam (or email viruses) may contain a malware payload or may contain links to malware drop points. These emails are often cleverly crafted, with content related to current news events, to lure victims into downloading and executing malware.
Understanding the malware that is being sent to customers will help in understanding threats against customers, which may result in customer service calls or could consequently impact the network. Phishing email payload typically contains a URL for a phishing site that attempts to convince recipients to divulge user credentials (username and password) for misuse. Again, it is helpful to understand how customers are being targeted and how to prepare protections.

In the course of forensic analysis, spam analysis, and botnet analysis, encountering samples of malware is highly likely. Malware analysis is specialized, difficult, labor intensive, and time consuming. This challenge is not accidental. Malware developers are continually creating increasingly sophisticated techniques to hide their malware, mutate characteristics, change behavior, and prevent analysis. Anti-virus vendors have developed some of the most advanced capabilities to analyze malware and the indicators that are left behind on infected computers. However, for a network provider, some of the most valuable characteristics are going to be the network observables associated with the malware behavior and activity. Understanding the functional capabilities of the malware, such as back-door ports that might be opened, malware update capabilities, command and control mechanisms, DDoS tools it might contain and defensive actions it might utilize, is valuable to a service provider. Using such information in conjunction with information about the quantitative presence of infected devices in and around the network will provide insights into the threat level they present to offered services. Understanding the command and control mechanisms and characteristics will provide insight into possible methods to surgically disable these threats without disrupting the services customers need.
These indicators can also be used to help identify infected customers for notification and remediation assistance.

13.4.2.2 Implementation Considerations

Complete coverage of all implementation issues related to network security monitoring is well beyond the scope of this chapter. However, here we do address some of the concerns and contradictory requirements presented by a comprehensive security monitoring framework.

Perhaps the single most pressing implementation concern for provider based network monitoring is the tension between scalability, fidelity, and coverage. In an ideal world, fine grained measurements of all parts of the network would be available instantaneously and be archived over long periods of time. This is clearly not a feasible goal, as the infrastructure needed to realize such a monitoring system would be of similar (if not higher) complexity and cost as the network it is supposed to monitor.

We already mentioned the ubiquity of flow based monitoring. By definition, flow monitoring aggregates traffic into per-flow measurements. However, despite this aggregation, traffic volumes are such that unsampled flow collection is still problematic, both from the point of view of the load imposed on network elements and from that of the capacity requirements of systems that process the flow data. A common approach to address this concern is to generate flow records based on packet sampled data, e.g., one in every 500 or 1,000 packets is used to generate flow records. This approach is attractive from a scalability point of view, but requires caution when interpreting the data. We consider this in more detail in Section 13.4.3.1. Scalability concerns about the collected data volume might be significantly magnified when DPI data collection is utilized in a naive manner.
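To make the caution about packet sampled data concrete, the following minimal sketch scales a sampled packet count back to an estimated total and attaches a rough relative error. It assumes independent random 1-in-N packet sampling, which is an idealization (deployed routers may sample differently), and the function name `estimate_from_sample` is hypothetical, not taken from any product.

```python
import math


def estimate_from_sample(sampled_packets: int, sampling_interval: int) -> tuple[float, float]:
    """Scale a 1-in-N sampled packet count to an estimated total.

    Returns (estimated_total, approximate_relative_standard_error).
    Assumption: each packet is sampled independently with probability 1/N.
    """
    if sampling_interval < 1:
        raise ValueError("sampling_interval must be >= 1")
    if sampled_packets < 0:
        raise ValueError("sampled_packets must be >= 0")
    if sampled_packets == 0:
        # Nothing was sampled: the estimate is zero and carries no confidence.
        return 0.0, math.inf
    p = 1.0 / sampling_interval
    estimate = float(sampled_packets * sampling_interval)
    # Plug-in approximation of the binomial relative standard error.
    rel_err = math.sqrt((1.0 - p) / sampled_packets)
    return estimate, rel_err


if __name__ == "__main__":
    for sampled in (1, 100, 10_000):
        total, err = estimate_from_sample(sampled, 1000)
        print(f"sampled={sampled:>6}  estimated total={total:>12.0f}  relative error ~{err:.1%}")
```

Under these assumptions, an aggregate built from many sampled packets is estimated reasonably well, while a flow represented by a single sampled packet carries an error on the order of the estimate itself, which is why individual sampled flow records should not be read literally.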
One of the primary objectives of security monitoring is to improve the reliability and availability of network services. It is therefore desirable to consider a passive implementation (e.g., physical-layer splitters). This adds other important attributes as well. Whereas network services require meticulous change control processes to maintain high levels of reliability, security monitoring and data analysis require the capability to remain very flexible and reactive to new attack techniques and investigation of suspect events. A passive approach provides valuable autonomy between the diverse operational requirements of the active network elements and the passive monitoring elements. The added costs of a passive approach can provide some clear benefits in a large network where reliability is a primary consideration.

Once data is captured, there is a trade-off between sending all captured data in unprocessed form to a centralized location for further analysis, versus performing initial processing locally and only sending processed data to a central point. One of the factors that plays into the centralized versus distributed decision concerns the flexibility that is required in the post processing of the data. For example, if the ultimate network intelligence that will be extracted from the data is well known and well understood, it might be relatively simple to partition the work such that a distributed solution is feasible and provides good scalability properties. On the other hand, distributed processing invariably leads to a loss of information, and the lost information might be crucial in analyzing security incidents that are new or not well understood. While true for all network monitoring systems, flexibility is of particular concern for network security given the ever changing nature of network threats.

Another scalability concern involves the storage, processing and retention of security data.
In the case of unsampled data, volumes are such that data typically cannot be stored in unaggregated form for very long periods of time (see footnote 9). There is a common tendency to put data into a database and then consider how to process that data. In many circumstances this is the right approach. But a large network can also generate a significant amount of metadata about network activity. Consequently, an attempt to insert and index all of the data can quickly become a task that consumes all of the available processing resources and accomplishes nothing. There is a balance that must be achieved between the types of analysis that are performed and the aspects of that data entered into a database. As a general rule, only enter data into a database when that data will be retrieved many more times than it is entered. For example, in the informational security system described in Section 13.5.1, no raw flow data is entered into a database as part of the unsampled flow data analysis. Rather, the flows are processed and then attributes regarding the volume of activity on each of the ports and protocols are stored in a database. Raw flows are retained for a short period of time and discarded, but the volumetric attributes are retained in a high-performance database and used for a variety of analysis functions. Further, off-the-shelf database technologies or naive processing methodologies are typically not sufficient for the data processing that needs to be performed. Instead, specialized database technologies, such as the Daytona data management system [1], and streaming processing technologies, such as the Gigascope network stream database system [26], are often required.

Metadata is a generic term for data that is derived from other data. There is no fundamental beginning to metadata, and there is an even more ambiguous end. As analysis tools develop, the output from one analysis step starts as a report, but it invariably becomes the input to another analysis step.
It is valuable to recognize this early and to settle on a relatively standard format for all data that analysts, researchers, developers and downstream systems are comfortable with. Binary formats are the most compact, but they tend to be less flexible and more difficult to work with in ad-hoc ways, since most Unix tools tend to manipulate ASCII files. A compromise may be to ensure there are sufficient conversion tools to allow ad-hoc manipulation of data stored in binary format.

Footnote 9: Section 4.3 in Chapter 10 discusses the volume reduction associated with sampling.

Finally, there is typically a trade-off between the robustness and scalability of a system versus the flexibility it allows to enable prototyping and ad-hoc investigations. An ideal realization will allow scalable and reliable processing of well understood analyses, while at the same time facilitating ad-hoc investigations using the same data. For example, the informational security system described in Section 13.5.1 implemented four phases of analysis that are all connected. (i) Ad-hoc analysis is needed to perform analytical functions that have never been performed or need to address a new type of situation. The tools range from simple commands to complex analytical tools. (ii) As analytical needs are better understood and can be articulated in conceptual terms, researchers can apply mathematical tools to improve the accuracy and performance of the analysis. (iii) From this point, a proof-of-concept implementation is used by analysts in actual use to determine how effective the tool is and assess readiness. (iv) Finally, the tools are migrated to the production platform for life-cycle support, performance enhancements, and where applicable, automated reporting. This type of evolutionary model has been very successful in getting complex capabilities into the hands of analysts in the shortest possible time.
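The practice described earlier in this section, where raw flows are processed into volumetric attributes per port and protocol and only those attributes are stored, can be sketched as follows. This is an illustrative reduction under simplifying assumptions, not the platform's actual implementation; the `FlowRecord` fields and the function `summarize_by_port` are hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical, simplified flow record used only for this sketch."""
    src_ip: str
    dst_ip: str
    protocol: int      # IP protocol number, e.g. 6 = TCP, 17 = UDP
    src_port: int
    dst_port: int
    packets: int
    byte_count: int


def summarize_by_port(flows: Iterable[FlowRecord]) -> dict:
    """Reduce raw flows to per-(protocol, destination port) volume attributes."""
    totals = defaultdict(lambda: {"flows": 0, "packets": 0, "bytes": 0, "sources": set()})
    for flow in flows:
        entry = totals[(flow.protocol, flow.dst_port)]
        entry["flows"] += 1
        entry["packets"] += flow.packets
        entry["bytes"] += flow.byte_count
        entry["sources"].add(flow.src_ip)
    # Keep only compact counters; the raw flows can then be aged out.
    return {
        key: {
            "flows": entry["flows"],
            "packets": entry["packets"],
            "bytes": entry["bytes"],
            "unique_sources": len(entry["sources"]),
        }
        for key, entry in totals.items()
    }


if __name__ == "__main__":
    sample = [
        FlowRecord("192.0.2.1", "198.51.100.7", 6, 40000, 445, 3, 180),
        FlowRecord("192.0.2.2", "198.51.100.8", 6, 40001, 445, 1, 60),
        FlowRecord("192.0.2.1", "198.51.100.9", 17, 5353, 53, 1, 72),
    ]
    for key, attrs in sorted(summarize_by_port(sample).items()):
        print(key, attrs)
```

The design point is that one small summary row per port and protocol per interval is written once and read many times, which fits the stated rule of thumb, whereas the raw flows would be written many times and rarely read.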
13.4.3 Network Intelligence

First-order analysis of network activity determines whether there is an existing problem. Denial of service events can clog network bandwidth, overwhelm routers, and/or overwhelm servers on the network. While there has not been a massive network worm for a while, the possibility of events similar to Slammer, Blaster, and MyDoom still fundamentally exists (see footnote 10). Monitoring bandwidth usage, router buffer and CPU usage, firewall state space, and host performance can provide insights into the health of these resources as part of normal network analysis. These metrics are fairly reliable indicators of massive network events, whether security related or otherwise. However, by themselves these metrics do not provide enough information about why events are taking place or how to mitigate them. Also, smaller events that could still be customer impacting might not necessarily be discernible in these metrics.

Given the prevalence of flow-based monitoring capabilities in most network forwarding equipment, flow data is a particularly attractive source to form the basis of a comprehensive network intelligence infrastructure. In Section 13.4.3.1 we detail the types of network intelligence that can be readily derived from network flow data.

Footnote 10: While the Conficker worm of 2009 exemplifies the continuing potential for massive network worms to exist, Conficker generally did not cause the same sort of network disruption as some of the earlier worms.

To complement reacting to security knowledge, there are definite benefits to taking a pro-active approach to the analysis. In Section 13.4.3.2, we present a history of cases where early indications of exploit development have been identified through generic traffic profiling of Internet activity. The objective is to detect threats as they develop rather than wait for events to have adverse effects on network or service application performance.
Given the massive amounts of monitoring data that large provider networks produce on a daily basis, automating procedures for deriving network intelligence is imperative. We discuss this topic in Section 13.4.3.3.

13.4.3.1 Intelligence from Flow Data

Commercial products are on the market to help measure activity on the network using flow data (e.g., netflow or cflowd). For coarse grained analysis, packet sampled flow data can be used to measure relative byte traffic levels at various access points in the network. It is important to realize that sampling results in loss of information. As a result, sampled flow data cannot be used to accurately interpret certain events. For example, individual events that may have occurred are most likely not in the data, TCP flag information is not complete, and packet and byte counts in individual flows are not correct. Even interpreting packet sampled or flow sampled records in aggregate can be difficult. These analysis points seem obvious, but they are easy to forget, and results from analyzing sampled data can be easily misinterpreted.

Below we describe how flow data can be interpreted to detect some types of security and non-security events, namely DDoS attacks, flash crowds, address scanning and network worms. When an anomaly is detected (generally a relative increase in packets or bytes), analysis is used to help determine the origination points and destination points for the changes. Interpretations of the activity can help diagnose what might be happening, as outlined below.

DDoS Attacks. Even with packet sampled flow data, the characteristics of DDoS attacks are such that most attacks can be detected with reasonably high confidence. The text box “Flow characteristics associated with DDoS attacks” lists these characteristics.

Flow Characteristics Associated with DDoS Attacks

Large Increase in Packet Rate. Changes in packet rate relative to normal are often an indicator of a denial of service attack.
Attacks might use large packets in an attempt to overwhelm bandwidth resources and/or can use many connections in an attempt to overwhelm end-host or firewall session capacity.

Many Source IPs. If there is a high proportion of source IP addresses in a given address block, then spoofing is likely. Determining what constitutes a high proportion depends on a number of factors, so some experience comparing normal and attack traffic is helpful. Identifying the presence of spoofed sources helps to develop greater confidence that the activity is an attack (i.e., of malicious intent).

Consistent TCP Flag Combinations. If nearly all flows have the same flag combination (e.g., SYN only) in combination with a large increase in packets, this is a supporting indication of an attack where connections are generally unsuccessful. However, care should be taken concerning the interpretation of TCP flags when analyzing packet sampled flows: the TCP flag field in a flow record is the logical “OR” of all TCP flags of the sampled packets observed by the router and associated with that flow. Statistical analysis of many flows can be used to interpret typical TCP flag activity, but the flag combination for a single flow is subject to the effects of sampling.

Consistent Single-Packet Flows. If nearly all flows have only one packet and/or have nearly all the same packet size, this suggests connection attempts are not being acknowledged. This might simply be the result of a non-DDoS related host failure. However, if the protocol is not TCP, then this may be the best indicator of a DDoS attack. Even so, this is not a particularly strong indicator unless it can be compared with normal activity. There are protocols that use only one packet in a session and can have relatively consistent packet sizes in normal operation. DNS (53/udp) is an example.
Maximum Size Packets. If the attacker’s objective is to flood the byte bandwidth capacity of the target, the attacker may choose to use maximum size packets. While maximum size packets are not unusual when transferring large amounts of data between two points, it is very unusual to see many long sequences of maximum size packets in UDP, and particularly in ICMP, for legitimate purposes. Traditionally, maximum packet size has been 1500 bytes as defined by Ethernet. Larger sizes will likely become more popular with higher access speeds.

Backscatter. Backscatter refers to the phenomenon of observing unsolicited response traffic because some DDoS attacks use spoofed IP addresses [60]. Typical indications are TCP SYN-RST flows or ICMP “Destination unreachable” response flows sent from a target IP address and using the spoofed IP address as destination.

Flash Crowds. It is not unusual to mistake a flash crowd for a DDoS attack. Flash crowds are caused by events that are generally not malicious, such as a really good online sale or a very popular webcast. Many flows that appear to be successful connections could suggest a possible flash crowd, i.e., a rush of visitors to a particular website. Flows with a variety of flag combinations that together appear to show successful connections suggest an innocuous flash-crowd event. But there is a possibility that it is a DDoS attempt in progress. That is, with a sufficiently large botnet, a sophisticated attacker can emulate user behavior that cannot be distinguished from a flash crowd simply from flow level characteristics.

Address Scanning. Flow level information can be used to detect scanning, which is used to identify potentially vulnerable machines on a network. Massive address scanning is performed by botnets or worms (which are not necessarily distinct scenarios).
Again, these activities have telltale characteristics, which are listed in the text box “Flow characteristics associated with address scanning”.

Flow Characteristics Associated with Address Scanning

Many flows from one or possibly many distinct source IP addresses to many destinations.

Flows to darkspace (or greyspace) destinations as well as active destinations. (Darkspace refers to IP address blocks with no legitimate hosts but with advertised BGP routes on the Internet. Greyspace refers to unused addresses within address blocks that have active addresses.)

Activity to darkspace and greyspace can manifest itself as ICMP “Destination unreachable” or ICMP “Time exceeded” backscatter messages from probed addresses toward the scanning sources.

Most connections are unsuccessful, with only occasional indications of successful connections. Successful connections might be determined based on the types of TCP flag combinations or identification of response traffic toward the IP addresses that are suspected to be scanning.

Network Worms. The fundamental distinction between massive address scanning and a network worm is that worms demonstrate a progressive increase in scanning activity and an increasing number of source addresses performing the scanning.

13.4.3.2 Early Indicators

Network intelligence derived from flow data might not be accurate enough to pinpoint specific security events, e.g., to detect with high accuracy a DDoS attack against a specific customer. However, because of its ubiquitous coverage, it serves as a very effective early warning system regarding the increase of suspicious activity in the network. Below we describe a number of such early indicators identified in real exploit scenarios, namely intent to exploit, exploit trials and worm propagation.

Intent to Exploit. When a vulnerability is announced for a network application, invariably there will be some reconnaissance activity (network scanning) for the associated application, i.e., intent to exploit.
The reconnaissance is presumably a survey for hosts that might be potential targets for exploit. The activity may be collecting signatures of hosts that might indicate the version of application software and underlying operating system. The presence of sufficient hosts with potential vulnerabilities will likely guide the priority and amount of effort attackers will devote to developing and refining an exploit. Indications of such efforts were very clear in the case of the Blaster worm, where reconnaissance activity started immediately following the vulnerability announcement, and the level of reconnaissance increased slowly over days until the presence of the worm was clearly evident. Figure 13.2 illustrates the increase in activity on port 135/tcp for the weeks prior to and including the Blaster worm event.

[Figure 13.2 plot omitted. Title: Blaster Worm TCP135, Change Factor Relative to 5 Week Mean. Series: change factor (flows) and 7 day moving average, with the start of Blaster marked. X-axis: weekly dates from 4/9/03 to 8/27/03.]
Fig. 13.2 Activity metrics on TCP port 135 in weeks prior to and including the Blaster worm

Exploit Trials. As an exploit is developed, it is often developed in stages and usually does not work to full potential in early phases. Such exploit trials can manifest themselves in a few different ways, often as phases of on-again and off-again activity. For example, this occurred in the weeks leading to the Slammer worm with activity on a MS-SQL port (1434/udp). Figure 13.3 shows packet counts for UDP port 1434 in the weeks prior to the Slammer worm. Note that the y-axis is in log scale. During the weeks of January 1, 2003 and January 8, 2003 there was more than two orders of magnitude increase in packet counts.
Some of these existed for extended periods of time, which might have been an initial worm attempt that fizzled out. During the weeks of January 15, 2003 and January 22, 2003 there were two short periods of increased activity, which might again have been “test runs” before the start of the actual Slammer event on January 25, 2003.

[Figure 13.3 plot omitted. Title: Port 1434/udp packets. Y-axis: packet count, log scale from 1.0E+03 to 1.0E+09. X-axis: Date, 1/1/03 to 1/29/03. Annotations: typical port activity, anomalies (early indicators), 2 orders of magnitude, 25 days, worm event.]
Fig. 13.3 Packet counts on UDP port 1434 in weeks prior to Slammer worm

[Figure 13.4 plot omitted. Title: ICMP/8 Flows. Y-axis: Flows (Millions), log scale from 1.00 to 1000.00. X-axis: Time (UTC), 8/16/03 to 8/20/03. Annotation: worm growth is easily detectable by this point.]
Fig. 13.4 Number of ICMP flows leading up to Nachi worm

Worm Propagation. There have been some rare examples of very rapid worm propagation. Noted examples are the Slammer worm (once past the trials phase) and the Witty worm. These worms have illustrated that it is not always possible to react to worm propagation once the worm is launched. However, these examples were exceptionally aggressive/efficient worms, and most other worms have not been nearly as aggressive. The Nachi worm, for example, existed on the Internet for more than 8 hours before it became visible to even a small portion of the Internet (see Fig. 13.4). This is sufficient time to recognize the propagation, recognize a behavior profile, and prepare a mitigation plan. The activity is recognizable as a continual increase in the number of source IP addresses that are performing reconnaissance activity on a given port.
In the case of the Nachi worm, the underlying behavior was a little more difficult to recognize since the reconnaissance was performed using ICMP type 8 (echo request or ping) while the exploit was predominantly on port 135/tcp.

13.4.3.3 Automated Analysis

An important aspect of any data analysis effort is the need for automation. It is not practical to hire a team of analysts to continually assess traffic activity for millions of subscribers and billions of flows. Automated analysis functions are needed to help determine what is important and to lead a small team of analysts in the right direction to isolate relevant security events. Below we describe considerations concerning automated analyses for developing security intelligence.

Ongoing Measurement of Key Parameters. Select parameters need to be measured on a periodic basis. Some obvious parameters to measure are the number of packets and number of bytes on each IP protocol and port. It can also be useful to count flows, where a flow is defined as a unique combination of source IP address, destination IP address, IP protocol, source port, and destination port. Some less obvious measurements include the number of active source IP addresses and the number of active destination IP addresses. Selection of parameters is a trade-off between simplicity and manageability of the data, system performance, and sacrificing information that might be useful. It is not possible to maintain all parameter information, but when dealing with unknown security concerns, it is desirable to err on the conservative side and maintain as much information as possible.

Baseline Generation. A key aspect of identifying what is an anomaly is defining what is normal. We call the process of calculating the normal activity “baseline generation”. As with selection of parameters to measure, there are a variety of ways to determine a baseline, which invariably involves some method of averaging over time.
In traditional POTS phone call patterns, it has in general been possible to use the previous week of activity as a baseline measurement for call volume. But the Internet, in an era of multi-gigabyte transfers and DDoS attacks, is much more volatile than the metered 64 Kbps calls of POTS. Further, specific ports and protocols subdivide segments of activity into smaller and more volatile behaviors. Each hour of the day, as well as each day of the week, has unique characteristics. Methods that use decaying averages and compensate for diurnal behavior are effective for short-term averaging. For longer-term baselines, a moving average over several weeks for the same hour of day and same day of week has been shown to be effective.

Another normalizing factor that we call “share value” can also be used in measurement. For example, rather than measuring the absolute count of packets on port 25/tcp, measure the share/percentage of 25/tcp packets relative to all packets on the network. This can help compensate for normal network changes or even anomalies in the availability of certain data in the analysis platform.

Alarm Detection. Alarm detection becomes a comparison of current activity with the baseline. We already mentioned the volatility of measured parameters, and in anomaly detection it is sometimes useful to consider how volatile the data normally is. To account for this, it can be useful to set alarm thresholds as multiples of the standard deviation for a given measured parameter. For some ports and protocols, there is no defined application. But not surprisingly, there is occasionally traffic on nearly every port, protocol, and address possible. Some of this traffic may be accidental, but much has some sort of nefarious intent, so it is useful to monitor this activity. But measuring the relative change of activity from a baseline that is effectively zero presents a challenge.
It may be sufficient to define a minimum measurement of activity that is considered important for alerting and investigation, and this can overcome some of the problems created by overly volatile or otherwise unused ports and protocols.

Threshold Management. It is desirable to have a standard threshold set for all types of activity, but we have found some thresholds need to be more sensitive than others. For example, there are certain ports and protocols that are frequently targeted for attack, while other ports and protocols do not present a significant threat. Consequently, we have found it useful to set thresholds for certain special interest protocols to be tighter than the default group thresholds. None of this remains fixed, i.e., as the environment and attacks change, so should the measurements and the thresholds. It is useful to hold periodic analyst meetings that review thresholds along with current attack and exploit trends, and determine whether threshold changes (upward or downward) are needed and, if so, which ones.

Reporting. Generating reports is another activity highly amenable to automation. Specific examples include:

Alarms. Alarms are an obvious report type that is needed to raise attention to specific events. Depending on the users, it is useful to provide a summary console of recent alarms and also provide alerting subscription capability. Email seems to be the most flexible means for alarm delivery. If pertinent alarm information can be squeezed into the limits of SMS messages without inappropriate disclosure of information, analysts on call can easily receive pertinent alerts on a conventional cell phone.

Traffic Trending. Invariably, there will be a need to look at activity and traffic trends over short and long periods of time. Examples include reports on traffic volume attributes such as flows, packets, bytes, bytes per flow, and bytes per packet.
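The baseline, share value, and alarm threshold ideas described in this section can be combined in a short sketch. It assumes the baseline is the mean of the same hour of the same weekday over several preceding weeks, that the threshold is a multiple `k` of the standard deviation, and that a fixed `min_activity` floor suppresses alarms on effectively idle ports. The function names and the default values are illustrative assumptions, not the thresholds used by any deployed system.

```python
import statistics
from typing import Sequence


def baseline(history: Sequence[float]) -> tuple[float, float]:
    """Return (mean, standard deviation) of same-hour, same-weekday samples
    taken from the preceding weeks."""
    if not history:
        raise ValueError("history must contain at least one sample")
    return statistics.fmean(history), statistics.pstdev(history)


def share_value(port_packets: int, total_packets: int) -> float:
    """Share of all packets attributable to one port/protocol (0.0 to 1.0)."""
    return port_packets / total_packets if total_packets else 0.0


def is_alarm(current: float, history: Sequence[float],
             k: float = 3.0, min_activity: float = 1000.0) -> bool:
    """Alarm when current activity exceeds the baseline by k standard deviations.

    min_activity is an assumed floor: below it, activity is ignored so that
    a near-zero baseline on an unused port does not trigger on noise.
    """
    if current < min_activity:
        return False
    mean, stdev = baseline(history)
    return current > mean + k * stdev


if __name__ == "__main__":
    # Packets seen on one port during the same hour over the last five weeks.
    weekly_same_hour = [100_000, 110_000, 90_000, 105_000, 95_000]
    for observed in (108_000, 200_000):
        print(observed, "alarm" if is_alarm(observed, weekly_same_hour) else "normal")
    print("share of 25/tcp:", share_value(2_500, 100_000))
```

The same `is_alarm` check can be applied to share values instead of absolute counts by passing fractions for `current` and `history` and lowering `min_activity` accordingly. Tighter thresholds for special interest ports then amount to a smaller `k` for those ports.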
13.5 Network Security Systems

In this section we consider several example network security systems that service providers utilize to protect network users from security threats. Such service provider actions are typically manifested as service offers or features. User protection services can be classified in three categories, namely Informational, Automated/Compulsory, and Supplementary/Optional.

Informational services generally provide information to users about the status of the network and issues that may pertain to them. Acting on such information is, however, left up to the end user. Naturally, there must be sources of data to provide information to users, and this is necessarily the result of gathering and analyzing data from a variety of sources including abuse complaints, security advisories, traffic flow analysis, and sometimes in-depth analysis of suspect activity. In Section 13.5.1 we describe AT&T’s Security Analysis Platform as a specific example of a sophisticated network security intelligence apparatus which, among other things, can be used to provide informational security services to users.

Automated or compulsory security services are normally realized in the form of filtering. This category includes services that are performed to prevent collateral damage to users or services that might indirectly suffer as a result of an attack. For example, in order to provide quality service to all customers on a network, it is sometimes necessary to block the congestion caused by a denial of service flooding attack. Security services that are automatically provided as part of “standard” service offers also fall in this category. A canonical example is the filtering or tagging of suspected spam email when email services are provided. In Section 13.5.2 we describe DDoS blackholing as well as the AT&T email platform as specific examples in this category.
Supplementary or optional security services are those that customers specifically select or opt in to use. Supplementary services often involve dedicated security infrastructure and as such often require service specific payment by customers. As noted earlier in this chapter, security as a service helps to correct the economic imbalance that is otherwise skewed in favor of bad actors. In Section 13.5.3, we describe customer specific DDoS filtering and network based firewall services as two specific service options in this category.

By necessity the categorization provided above is not absolute or perfect. For example, while derived network intelligence can be provided to customers as informational services, the same information might be utilized to trigger compulsory DDoS filtering, or customer specific DDoS filtering. Similarly, some informational services might be provided to all users as a default part of their service, or informational services might be offered as an optional service feature.

13.5.1 Informational Security Services

In this section we describe AT&T’s Security Analysis Platform as a specific end-to-end example of how informational security intelligence is derived. As noted above, such security information is not only used for informational purposes, but also forms the basis for mitigation actions. A generic high level overview of the platform is depicted in Fig. 13.5.

[Figure 13.5 block diagram omitted. Components: Network Flow Records, Other Network Data, External Data Sources, Short Term Storage, Analysis/Detection, Long Term Storage, Security Analysts, External Information Sources, Customer Alerts, Mitigation Platforms.]
Fig. 13.5 Generic overview of AT&T’s Security Analysis Platform

The primary source of dynamic network traffic information for this system is in the form of network flow records.
Network flow records convey information regarding the source, destination, IP protocol, source port, destination port, TCP flags, packet count, byte count, start time, and end time for activity on the network. In select portions of the AT&T Internet backbone, unsampled flow generation has been implemented, and the resulting data is processed in a variety of ways to help identify anomalies. The unsampled flow data complements packet sampled and smart sampled flow data that is more ubiquitously available.

As shown in Fig. 13.5, flow records are combined with other network data, e.g., topology information, as well as other external data sources, such as external sources of unwanted traffic, e.g., sources of spam email. Data from all these sources is saved in short term storage, and an analysis and/or detection component combines all the data and performs automated analysis using predefined rules and algorithms. For simplicity we represent this component as a single entity; in reality, however, it consists of a variety of sub-systems which we describe in more detail below. Further, the output of particular sub-systems might be used as input to other sub-systems, which is again not shown in the figure. For example, the output of port-scanning detection is used as input to botnet detection.

The output of the analysis/detection component is a set of alarms. These alarms are typically stored in long term storage and made available to a group of security analysts who investigate the alarms to determine whether action should be taken. The analysts also make use of other external information sources, for example CERT alerts or reports from virus protection vendors. Based on their domain knowledge and the intelligence provided by the platform, the analysts could generate detailed customer alerts to warn customers of emerging security concerns. The analysts could also trigger appropriate mitigation actions to be performed in the available mitigation platforms.
In some cases alarms generated from the analysis/detection component can directly feed into a mitigation system. We note, however, that because most mitigation activities have some negative side effects, and because all detection systems are subject to false positives, automated response to network security threats is not trivial.

We will now describe various aspects of the analysis/detection component of the AT&T Security Analysis Platform in more detail.

13.5.1.1 Scan Detection and Trending

We consider several scanning related detection activities and/or algorithms, namely general scanning, worm detection, scan volume alarming, and summarization of scanning activity in a Reconnaissance Index.

General Scanning

General scanning activity is characterized by source IP addresses that make a very large number of connection attempts to destinations. This type of activity is generally suspect and usually represents an intent to exploit vulnerabilities in network applications. For this reason, the Security Analysis Platform detects and records the sources and some general characteristics of scanning activity on the Internet. The results of this analysis are used by a variety of subsequent analysis algorithms, including worm detection, scan volume alarms, the reconnaissance index, and botnet detection.

For example, the graphic in Fig. 13.6 depicts the number of unique source IPs that have been associated with scanning activity on port 445/tcp over a 200 day period leading into and through the evolution of the Conficker worm. The graphic was generated from data on the AT&T Internet Security Analysis Platform. Figure 13.6 clearly shows the evolution of the Conficker worm over time. The graph shows initial significant activity starting on November 21. The graphic also shows the changes in Conficker behavior as new variants were released, and provides a relative measure of the worm’s “success” at reaching previously infected hosts for update to later variants.

Fig. 13.6 Scan activity on port 445/tcp in Unique Source IP addresses/hour over a 200 day period (chart title: Port 445/tcp Scanning Activity - Unique Source IPs; y-axis: unique source IPs, in thousands; x-axis: date; series: Unique SIPs and its 24-period moving average; annotations: Conficker.A, Conficker.B, Conficker.C, Conficker.D, Conficker.E)

Worm Detection

The Worm Detection algorithm provides early detection of worm activity on the Internet. When a worm propagates on a network, it performs the following steps: (i) seeks exploitable hosts by scanning network addresses on target ports, (ii) performs an exploit against identified targets, (iii) replicates itself to the target, and (iv) repeats from step (i). The worm detection algorithm tracks the number of unique hosts scanning, and alerts analysts to any significant increase in the number of hosts scanning on a given port. We perform this analysis at the Internet circuit level, i.e., on physical links at the perimeter of the network.

Worm alarms are valuable for identifying mass use of new exploits early in the deployment or even development phases. Events of this type have been responsible for a number of network disruptions and problems on the Internet as well as within enterprises, making this capability invaluable for early warning and mitigation.

Table 13.2 shows (in reverse chronological order) a sample of alarms that were triggered during the period leading up to the Conficker worm event. Each line corresponds to an alarm being triggered on a specific circuit. These alarms signify an increase in the number of unique source IP addresses that are detected actively scanning on port 445/tcp.
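The per-circuit comparison behind these alarms can be sketched as follows. This is a minimal illustration in Python, not the platform's implementation: the function name `worm_alarms`, the layout of the input tuples, and the `ratio` and `min_sources` thresholds are assumptions introduced here. The source states only that the number of unique scanning sources on a port is compared with a baseline average for each circuit.

```python
from collections import defaultdict


def worm_alarms(scan_events, baselines, port, ratio=2.0, min_sources=30):
    """Return alarm messages for circuits with unusually many scanners.

    scan_events: iterable of (circuit, src_ip, dst_port) tuples observed in
                 one analysis interval (hypothetical input format).
    baselines:   dict mapping circuit -> average number of unique scanning
                 sources seen on that circuit in previous periods.
    ratio, min_sources: hypothetical thresholds, not taken from the source.
    """
    sources = defaultdict(set)
    for circuit, src_ip, dst_port in scan_events:
        if dst_port == port:
            sources[circuit].add(src_ip)  # count each source once

    alarms = []
    for circuit, ips in sorted(sources.items()):
        count = len(ips)
        average = baselines.get(circuit, 0.0)
        # Alarm only when the count is both non-trivial and well above baseline.
        if count >= min_sources and count > ratio * average:
            alarms.append(
                f"{circuit}: Scans from {count} source IPs "
                f"compared with {average:.2f} ave."
            )
    return alarms
```

Requiring a minimum absolute count as well as a relative increase is one simple way to avoid alarming on circuits whose baseline is close to zero; a production system would likely use a more careful statistical test.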
For each circuit where an alarm has triggered, the number of detected source IPs scanning on this port is compared with the baseline average that has been observed on this circuit in previous periods. As the activity increases and becomes visible on more circuits with greater change, the frequency of alarms increases until the worm reaches a saturation point. Interestingly, the alarms data in Table 13.2 show indications of developing activity on November 20 and perhaps as early as November 13.

Table 13.2 Example worm alarms which provided early indication of the Conficker worm propagating on port 445/tcp. As time progresses, more alarms are triggered in the same hour, indicating that more circuits are affected by the event. The alarms are listed in reverse chronological order since analysts are generally interested in the most recent activity first

Date        Hour   Alarm type  Target port    Message
11/21/2008  6:00   Worm        tcp.dport.445  Scans from 56 source IPs compared with 16.36 ave.
11/21/2008  6:00   Worm        tcp.dport.445  Scans from 53 source IPs compared with 15.81 ave.
11/21/2008  6:00   Worm        tcp.dport.445  Scans from 52 source IPs compared with 15.69 ave.
11/21/2008  6:00   Worm        tcp.dport.445  Scans from 61 source IPs compared with 21.76 ave.
11/21/2008  6:00   Worm        tcp.dport.445  Scans from 282 source IPs compared with 84.31 ave.
11/21/2008  5:00   Worm        tcp.dport.445  Scans from 196 source IPs compared with 56.39 ave.
11/21/2008  5:00   Worm        tcp.dport.445  Scans from 44 source IPs compared with 9.76 ave.
11/20/2008  13:00  Worm        tcp.dport.445  Scans from 35 source IPs compared with 10.00 ave.
11/20/2008  12:00  Worm        tcp.dport.445  Scans from 34 source IPs compared with 10.00 ave.
11/20/2008  7:00   Worm        tcp.dport.445  Scans from 158 source IPs compared with 55.91 ave.
11/19/2008  6:00   Worm        tcp.dport.445  Scans from 33 source IPs compared with 9.33 ave.
11/19/2008  6:00   Worm        tcp.dport.445  Scans from 33 source IPs compared with 9.18 ave.
11/13/2008  20:00  Worm        tcp.dport.445  Scans from 159 source IPs compared with 74.66 ave.

Scan Volume Alarms

Another algorithm produces Scan Volume Alarms by evaluating changes in scanning activity across the Internet. As malicious botnets embark on efforts to draft new hosts into their control, network scanning is sometimes used to identify exploitable hosts. An increase in scan activity on a given port or protocol can be indicative of a new exploit in use, which analysts can investigate before it affects an enterprise. Increases in scanning activity can also be indicative of a botnet ramping up efforts to draft new bots and facilitate malicious acts such as a spamming campaign or a DDoS attack.

Reconnaissance Index

The AT&T Security Analysis Platform summarizes scanning activity measurements and normalizes these over time to generate a long term trending report called the Reconnaissance Index. The purpose of this index is to assess long-term changes in network exploit threat activity, in a manner analogous to a financial index. This index takes into account both the number of sources that are performing reconnaissance and the aggregate number of probes performed by those sources. A recent image of the AT&T Threat Reconnaissance Index is shown in Fig. 13.7.

Not surprisingly, the reconnaissance index has shown a relative decrease over the past few years. This trend is indicative of efforts by attackers to deemphasize the rapid spread of network worms and minimize attention to their activities. Attackers are more motivated to gain control of computers without drawing attention so they can use the exploited computers for undesirable activities such as sending spam, DDoS attacks, identity theft, intellectual property theft, and even illegal distribution of media.
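The idea of an index built from two normalized components can be illustrated with a small sketch. The source does not give the formula for the Reconnaissance Index, so everything below is an assumption made for illustration: the function name `index_series`, the use of the first `ref_days` observations as the reference period, and the `scale` of 50 points per component.

```python
def index_series(daily, ref_days=7, scale=50.0):
    """Turn raw daily counts into two index components.

    daily: list of (source_count, probe_count) tuples, one per day
           (hypothetical input format).
    Each component is the day's value relative to the mean of the first
    `ref_days` days, multiplied by `scale`. A day that matches the reference
    period therefore contributes `scale` points per component.
    This normalization is an illustrative assumption, not the AT&T formula.
    """
    reference = daily[:ref_days]
    ref_sources = sum(s for s, _ in reference) / len(reference)
    ref_probes = sum(p for _, p in reference) / len(reference)
    return [
        (scale * sources / ref_sources, scale * probes / ref_probes)
        for sources, probes in daily
    ]
```

Summing the two components for a given day yields a single index value, while keeping them separate shows whether a change is driven by more sources or by more probes per source.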
In late 2006, as operating system vulnerabilities were starting to be patched more quickly, some exploit discoveries in applications initiated a surge in network scanning for vulnerabilities in applications including anti-virus software, weak database application passwords, and remote access applications. Some surges in pop-up spam activity to promote system scanning tools are noted. The number of sources scanning increased significantly in late 2008 due to propagation of the Conficker worm. There have also been some recent surges in DNS amplification DDoS attacks that appear as probing activity to our analysis and affect the index.

Fig. 13.7 Reconnaissance Index shows the contributions in the numbers of probes and the number of sources that are conducting the probes over time (y-axis: Index Value; x-axis: Date, Sep 04 to Mar 09; series: Probe Count Index, Source Count Index; annotations: Sasser, Bobax, Korgo worm era; AV exploit used by botnets; pop-up spam surges; DNS amplification DDoS attacks; Conficker.B worm)

13.5.1.2 Botnet Detection and Tracking

Based on the sources of malicious activity such as scanning for exploits and spam activity, analysis methods have been developed to correlate the activity of these malicious actors to likely control points and server hubs associated with botnets [49]. A high-level illustration of this analysis algorithm is shown in Fig. 13.8.

As shown in Fig. 13.8, the botnet detection analysis takes as input various reports of suspicious host activity, e.g., sources involved in scanning, spamming or DDoS attacks. These suspicious activity reports are further processed to extract the set of IP addresses that were involved in the suspicious activities.
Next, all flow records associated with the suspect IP addresses near the time of the activities are isolated. These flow records are then analyzed together to identify candidate botnet controllers. As shown in Fig. 13.8, DNS metadata is also used in the analysis. However, not all botnets use domain names as pointers to botnet servers; some point directly to IP addresses. We therefore do not rely on the DNS metadata as a primary factor in detection.

There are some types of legitimate services that have behavioral profiles that are very similar to botnet command and control. These cases are relatively few and can be easily white-listed from alerting. There are also some cases where a high correlation may exist between clusters of suspected bots/zombies that can lead to false positives. For example, the indexing server of a P2P file-sharing network may trigger an alarm as a result of analysis of spam sources. In some cases, we have suspected this correlation may be attributed to use of the P2P file-sharing network as a distribution channel for Trojan malware that consequently drafted these computers into the spamming botnet.

Fig. 13.8 Botnet detection algorithm summary (stages shown: Scan Sources and SPAM Sources; Isolate Source Flows; Correlate shared activity & behavior patterns; Suspect botnet servers; Analyst validation; inputs: External Info, Flow Records, DNS Metadata)

As shown in Fig. 13.8, the resulting suspect IP addresses are investigated by analysts to verify the type of function and validate association with botnet activities. Likely domain names and sometimes port information can be used to identify and track other control points of the botnets, and subsequent activity is used to help identify the members of the botnet(s) and further determine the types of activity the botnets are performing.
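The isolation and correlation steps described above can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions: the function name `candidate_controllers`, the flow tuple layout, and the `min_suspects` threshold are hypothetical, and the only behavior pattern considered is "many suspects contact the same remote endpoint". The platform's actual correlation uses additional flow characteristics and DNS metadata that are not modeled here.

```python
from collections import defaultdict


def candidate_controllers(flows, suspects, min_suspects=3, whitelist=frozenset()):
    """Rank remote endpoints contacted by many suspect hosts.

    flows:     iterable of (src_ip, dst_ip, dst_port, proto) tuples taken
               from flow records near the time of the suspicious activity.
    suspects:  set of IP addresses extracted from scan/spam/DDoS reports.
    whitelist: endpoints of legitimate services whose profile resembles
               command and control (excluded from the result).
    """
    contacts = defaultdict(set)
    for src_ip, dst_ip, dst_port, proto in flows:
        # Step 1: isolate flows that originate from suspect hosts.
        if src_ip in suspects and dst_ip not in suspects:
            endpoint = (dst_ip, dst_port, proto)
            if endpoint not in whitelist:
                contacts[endpoint].add(src_ip)

    # Step 2: keep endpoints shared by enough distinct suspects.
    ranked = [
        (endpoint, len(hosts))
        for endpoint, hosts in contacts.items()
        if len(hosts) >= min_suspects
    ]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked
```

The output of such a step is only a list of candidates; as the text notes, analysts still validate each suspect before it is treated as a botnet server.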
These methods enable estimation of the relative sizes of botnets and validation of the intent to do harm through illegal and abusive activities such as exploits, DDoS attacks, spamming activities, identity theft, etc. The primary purpose of the botnet detection analysis is to determine if a given botnet presents a threat to network operations and services, customers, or to critical infrastructure. The information gained about the behavioral profiles of specific botnets, the malicious IP addresses, and the associated domain names can also be used to assist with isolation and blocking of malicious activity in the enterprise environment. General knowledge of the botnet technology, methods, and motives can be used to develop tools and operations functions that improve detection methods and automate filtering and/or alerting on suspected infections as part of routine network security operations.

Table 13.3 shows an example alarm from the botnet detection processing. A given alarm identifies an IP address and service port that is suspected to be supporting the botnet in some capacity. Triggers are identified as part of the alarm. For example, the first alarm indicates that 17 suspected botnet clients (zombies) associated with this controller were detected scanning (i.e., “sp:”) on port 135 TCP (i.e., IP protocol 6). The range of analysis for this suspect is identified in a YYYYMMDDHH format, and the time of the latest alarm is also noted. Finally, a confidence score is provided that takes into account a number of additional flow characteristics that are generally indicative of botnet activity. A score that reaches a defined threshold is issued as an alarm to analysts for further investigation.

When botnet operators have purposeful tasks to perform, they are forced to engage in botnet recruiting in order to add new bots into their botnet(s).
Fortunately, improvements in spam source controls and DDoS scrubbing technologies have caused increased volatility of bots engaged in these actions. Specifically, sustaining attacks requires a continual influx of new bots or at least a well established inventory. As the recruiting bots are exposed, an opportunity arises to expose the command & control infrastructure of the associated botnet(s). While it is generally difficult to mitigate botnets, it is possible to squelch their strength and force activities on the part of botnet operators to maintain the botnet. That is, the botnet operator is forced to perform botnet recruiting, which helps to reduce the attack power of the botnet.

Table 13.3 An example botnet alarm. The first line shows a controller IP address (masked) with associated scanning activity (“sp”) on port 135 TCP (protocol 6) from 17 zombies. Each alarm also contains times of activity and a confidence score

Server IP  Server port  Triggers               Earliest activity  Latest activity  Alarm time  Score
x.x.x.167  65146        sp:135-6(17)           2009050320         2009050704       2009050707  63
x.x.x.84   9517         sp:445-6(58),135-6(1)  2009050309         2009050707       2009050707  57.9
x.x.x.29   1122         sp:445-6(15)           2009050415         2009050619       2009050707  51.9

13.5.1.3 Volumetric Anomaly Detection

Volumetric analysis is performed on each IP protocol, TCP port, UDP port, and ICMP type for changes in flow, packet, and byte volumes. This analysis measures significant changes in activity of each parameter relative to expected values. Baseline or “expected” values are generally calculated from historical activity. The baseline must account for the diurnal characteristics of network traffic activity and must also reasonably isolate any historical anomalies. Once anomalies are identified, further automated analysis is performed by the platform to identify contributing attributes that are reported to security analysts as alarm details for evaluation.
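One simple way to meet both baseline requirements stated above (diurnal patterns and isolation of historical anomalies) is sketched below. It is an illustration only: bucketing history by hour of the week and taking the median of each bucket are choices made here, not techniques attributed to the platform, and the names `hour_of_week_baseline`, `volumetric_alarm`, and the `threshold` value are hypothetical.

```python
from statistics import median


def hour_of_week_baseline(history):
    """Build expected values from (hour_of_week, value) observations.

    Grouping by hour of the week captures the daily and weekly rhythm of
    traffic. The median of each group is used so that a single historical
    spike does not inflate the expected value.
    """
    buckets = {}
    for hour, value in history:
        buckets.setdefault(hour, []).append(value)
    return {hour: median(values) for hour, values in buckets.items()}


def volumetric_alarm(hour, observed, baseline, threshold=1.0):
    """Return the relative change if it is significant, otherwise None.

    threshold=1.0 means a change of at least 100% relative to the expected
    value (a hypothetical setting).
    """
    expected = baseline.get(hour)
    if not expected:  # no history for this hour, or an expected value of 0
        return None
    change = (observed - expected) / expected
    return change if abs(change) >= threshold else None
```

In this sketch the same two functions would be applied separately to flow, packet, and byte counts for each protocol, port, or ICMP type being tracked.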
Volumetric analysis is a catch-all mechanism for detecting various types of events on the Internet. In addition to the alarming analysis, graphical tools provided by the platform allow analysts and customers to look at short-term and long-term activity levels for specific ports and protocols. For example, Fig. 13.9 shows changes in network activity that resulted from patches that were applied to DNS servers in response to the recent disclosure of DNS cache poisoning attack techniques [88].

13.5.1.4 DDoS Detection

The AT&T Security Analysis Platform also integrates commercial sub-systems to perform analysis and detection. In particular, multiple commercial DDoS detection systems [10] form part of the platform. One instance provides detection at a coarse “infrastructure” level. That is, it is used to alert network operations and security analysts to significant traffic volume events that might have an impact on Internet service delivery. Because it is configured to look for large volume events in the core network of a Tier-1 ISP, this DDoS detection system will not detect smaller DDoS events, which, although not impacting the network as a whole, might still be customer impacting. A second DDoS detection instance is therefore utilized to perform detection for customers who subscribe to this service, typically in combination with a DDoS mitigation service described below. This capability provides added sensitivity to customer designated network interfaces and address blocks, and it represents an analysis and reporting capability that complements the standard infrastructure DDoS detection. Finally, a third flow analysis platform is used to provide, among other types of analysis, intranet DDoS detection for private enterprise (VPN) customers. The alarms generated by these DDoS detection systems are ingested into the Security Analysis Platform for analysis and correlation with other detected anomalies.
Fig. 13.9 Relative change in source-port 53/udp activity volume shows the effects of patching activities on the network in response to the recent disclosure of DNS cache poisoning vulnerabilities. The solution to help alleviate the DNS cache poisoning threat was to have each DNS query use a unique source port, thus assigning a new session to each query. For performance reasons and simplicity of firewall rules, it had previously been common to use a fixed source port for DNS queries. This worked since DNS queries on port 53/udp are by definition single-packet sessions. As patches were installed in the wake of new information about the vulnerability and later exploit code, there were surges in patching efforts. These were revealed in network traffic behavior by a relative increase in the flow count on port 53/udp with no significant increase in byte count for the associated traffic. Randomization of the source port increased the number of flow records generated in DNS transactions. (Chart title: Change in Flow Count for 53/udp (DNS); y-axis: Percent Change; x-axis: Date, 6/18/08 to 8/2/08; series: Bytes, Flows; annotations: CERT Alert; Increasing flows, result of patching; Details Leak, inspires more patching)

13.5.2 Automated or Compulsory Security Services

With the informational security services we dealt with in the preceding section, users were provided with information about possible or impending threats. Acting on such information, however, was largely left up to the users receiving the information, with the service provider specifically not taking any mitigative action. There are, however, cases where service providers take unilateral action to prevent or mitigate specific security concerns. Below we consider two examples, namely DDoS and spam mitigation.

13.5.2.1 DDoS Mitigation

In general service providers do not attempt autonomous mitigation of DDoS attacks.
As we have indicated earlier, it is not always easy to distinguish between a DDoS attack and a legitimate surge in traffic, e.g., a flash crowd. DDoS mitigation techniques typically involve some negative side effects, so the possibility of false positives in the DDoS detection mechanisms could be problematic. Further, some Internet destinations are almost constantly under attack and simply consider that as part of their operational costs.

Fig. 13.10 DDoS mitigation techniques (elements shown: Network wide DDoS Detection; Route Control with DROP and REDIRECT actions; routers (R); Attack Traffic and Original attack traffic; Destination under attack; Impacted Customer; Infrastructure Service under attack; Scrubbing Complex)

There are, however, two scenarios where service providers do react to DDoS attacks as a normal course of action. One is when the attack in question is of such a magnitude that it starts to cause collateral damage in the network. A classic example involves an attack against a target that starts to indirectly impact other customers or network services. This example is depicted in the top part of Fig. 13.10. A second scenario involves an attack against a specific infrastructure service such as DNS. This example is illustrated in the bottom part of Fig. 13.10.

Figure 13.10 also shows two possible mitigation strategies that the provider might employ. As shown in the figure there is an implied DDoS detection mechanism, using techniques such as those described above, which precedes mitigation. The simplest DDoS mitigation technique involves an approach called “blackholing”, where the route to the attack target is tagged with a label that semantically means “drop” and is distributed to ingress routers by a special route control function. Ingress routers in the provider network receive these routes and forward traffic to a specifically configured null interface, thus effectively dropping all traffic towards that destination.
This approach is crude since all traffic destined for the advertised prefix will be dropped, whether attack traffic or wanted traffic. As such, this approach is best suited to the case where a very specific destination, e.g., a host specific route, is null routed. Of course all traffic to that specific destination will still be dropped, but the collateral damage is minimized. Fortunately, DDoS attacks are typically not as distributed as one might expect [57]. A more desirable approach is to surgically distribute the drop-labeled route to only those ingress routers that carry the majority of attack traffic [83]. This approach is illustrated at the top of Fig. 13.10.

Despite its shortcomings, blackholing is a useful mitigation mechanism for service providers. In cases where significant DDoS traffic volumes cross boundaries between provider networks, smaller providers often seek blocking assistance from larger providers if the traffic is overwhelming their network. Unfortunately, this type of control does little to help the target of the attack. It is desirable to provide more surgical and more customer friendly mitigation services.

A more sophisticated DDoS mitigation strategy involves deploying dedicated DDoS mitigation devices [22] at strategic “scrubbing” locations in the provider network. In this case, the route control function advertises a prefix associated with the attack target to ingress routers in such a way that traffic towards the attack target is effectively “redirected” towards the scrubbing complex. This is shown in the bottom part of Fig. 13.10. This approach is attractive because in principle only attack traffic is filtered out at the scrubbing complexes so that wanted traffic can still be forwarded to the ultimate destination.
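Both mitigation strategies rely on the same mechanism: a more specific route, injected by the route control function, overrides normal forwarding at the ingress router. The following Python sketch models that effect with a longest-prefix-match lookup. It is a toy model under stated assumptions, not router configuration: the next-hop labels `null0` and `scrubbing-complex`, the function name `next_hop`, and the example prefixes (taken from documentation address ranges) are all hypothetical.

```python
import ipaddress

NULL_INTERFACE = "null0"        # hypothetical label for the DROP action
SCRUBBER = "scrubbing-complex"  # hypothetical label for the REDIRECT action


def next_hop(dst, routing_table, default="normal-forwarding"):
    """Return the next hop for dst using longest-prefix match.

    routing_table: dict mapping prefix strings to next-hop labels. Host
    routes (/32) injected by route control win over the covering customer
    prefix because they are more specific.
    """
    address = ipaddress.ip_address(dst)
    best = None
    for prefix, hop in routing_table.items():
        network = ipaddress.ip_network(prefix)
        if address in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, hop)
    return best[1] if best else default


# Example table at one ingress router: the customer prefix is routed
# normally, one attacked host is blackholed, another is sent for scrubbing.
table = {
    "192.0.2.0/24": "customer-edge",
    "192.0.2.10/32": NULL_INTERFACE,
    "192.0.2.20/32": SCRUBBER,
}
```

With this table, traffic to 192.0.2.10 is dropped in its entirety, traffic to 192.0.2.20 is diverted for filtering, and all other hosts in the customer prefix are unaffected, which mirrors the trade-off between the two techniques described above.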
On the flip side, the fact that dedicated infrastructure has to be deployed means that this approach typically cannot be utilized for all provider traffic but is limited to protecting infrastructure services, or, as we will outline in Section 13.5.3, to cases where it is offered as part of (paid for) supplementary security services.

13.5.2.2 Spam Mitigation

In addition to Internet access and basic services like DNS, ISPs typically provide a number of consumer or business-grade end user services to their customers including email services, Web-hosting, chat-rooms, Web portals, etc. Of these services, email typically has the highest take rate and also requires special care from a security perspective. In this section we will describe the essential functionality of the security frontend to the AT&T consumer email platform as an example ISP email infrastructure which effectively adapts to ongoing email security concerns in a highly scalable manner.

Figure 13.11 depicts the major functional components of the multi-tiered AT&T email platform. External SMTP connections first encounter the connection management component. The connection management component maintains a reputation system whereby external SMTP connections are classified based on the historic and/or current behavior of their source IP addresses and are allocated resources accordingly. The resources of concern here are the number of SMTP processes that are allowed to be spawned. First, connections from known and trusted IPs (friends) are given unlimited resources. Connections from unknown source IPs are classified into a default class. The default class receives enough resources so that under normal operating conditions, e.g., when there are no email DoS attacks, no blocking occurs. However, the resources allocated to the default class are constrained to prevent impact on the friends class when there are attacks. In the case of resource exhaustion, the sending SMTP connection may be terminated with a temporary unavailable SMTP response (i.e., response code 450: Requested action not taken; mailbox unavailable or busy). The third class of source IPs, which are generally known to be spam sources, falls in the throttle classification, where the resources allocated to the set and/or to individual IPs are significantly constrained. Specifically, connections in this group would typically receive a 450 SMTP response 80–90% of the time on average.

Fig. 13.11 Logical depiction of the security frontend of the AT&T email platform (components shown: Connection Management, Real Time Blacklist, Content Filtering, Honeypot, Log Analysis, Security Analyst, User verdicts; flows shown: all SMTP connections, accepted SMTP connections, email messages accepted for processing, email for honeypot mailboxes, email messages to user mailboxes, detected SPAM sources, analyst reputation list, analyst filter list, rule updates from content filter vendor)

SMTP connections that pass through this first level of defense are passed to the real time blacklist (RBL) component. The RBL parses and analyzes the SMTP protocol to determine the trustworthiness of an SMTP source. For example, it performs reverse-DNS lookups on the domain name reported by the SMTP source and notes discrepancies between this lookup and the SMTP source IP address. As the name suggests, the RBL is also dynamically updated based on feedback from subsequent tiers in the email platform. For example, information concerning detected spam sources is fed back from the next tier, so that this information can be used in ensuing rounds to decide whether or not to accept a connection from the same source. Based on this analysis, the RBL may terminate the connection with an SMTP unavailable response (i.e., response code 550: Requested action not taken: mailbox unavailable).
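The class-based resource allocation performed by the connection management tier can be sketched as follows. This is a minimal model under stated assumptions, not the AT&T implementation: the class `ConnectionManager`, its method names, and the use of a simple per-class counter of active SMTP processes are introduced here for illustration. Only the three classes (friends, default, throttle) and the 450 response on resource exhaustion come from the text.

```python
class ConnectionManager:
    """Toy model of reputation-based admission of SMTP connections."""

    def __init__(self, friends, throttled, default_limit, throttle_limit):
        self.friends = set(friends)      # known and trusted source IPs
        self.throttled = set(throttled)  # source IPs known to send spam
        # Maximum concurrent SMTP processes per constrained class
        # (hypothetical representation of "resources").
        self.limits = {"default": default_limit, "throttle": throttle_limit}
        self.active = {"default": 0, "throttle": 0}

    def classify(self, ip):
        if ip in self.friends:
            return "friends"
        if ip in self.throttled:
            return "throttle"
        return "default"

    def accept(self, ip):
        """Return 'accept', or '450' when the class has no resources left."""
        cls = self.classify(ip)
        if cls == "friends":
            return "accept"              # friends are never limited
        if self.active[cls] >= self.limits[cls]:
            return "450"                 # temporary failure; sender may retry
        self.active[cls] += 1
        return "accept"

    def release(self, ip):
        """Free the resources held by a finished connection."""
        cls = self.classify(ip)
        if cls != "friends" and self.active[cls] > 0:
            self.active[cls] -= 1
```

Because each constrained class draws from its own pool, a flood of connections from unknown or throttled sources exhausts only that pool and cannot starve connections from the friends class.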
The RBL also receives a set of Sieve filtering rules [33] from security analysts. These rules are utilized to queue accepted email messages for differential treatment. For example, most email messages are simply passed to the next defense layer; however, some might be queued as possible phishing attacks for analyst attention.

The third tier in the email processing platform consists of a commercial email content screening product. At this level email content is analyzed to determine whether any spam and/or phishing rules are triggered. Like the RBL, the rules of the content filtering component are constantly updated. As mentioned above, detected spam sources are fed back to the RBL tier for automatic inclusion in the RBL filter set, and the output is also provided to security analysts for further manual evaluation. Finally, email messages are delivered to user mailboxes, with an indication as to whether the email platform considered each message to be spam or not.

The email processing platform also maintains honeypot or spamtrap [77] accounts. As we explained earlier, these are bogus email accounts, set up explicitly to attract spam email. As such, all email destined for a spamtrap is spam by definition. Sending this mail through the content filtering system therefore serves as an indication of spam that may have been missed by the content filters. This information is also made available to security analysts.

Users who are the final recipients of the email are ultimately the judges of the accuracy of the classification from the content filtering system. The email platform allows users to provide this judgment back to the system in the form of user verdicts, i.e., by indicating that mail the system classified as spam was in fact not spam, or, conversely, that mail that passed through the system unflagged was spam.
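The two feedback signals described above, spamtrap mail and user verdicts, can be combined into simple counts of filter errors. The sketch below is illustrative only: the function name `summarize_feedback`, the record format, and the three source labels are assumptions introduced here and do not describe the platform's data model.

```python
def summarize_feedback(records):
    """Count filter misses and false positives from feedback records.

    records: iterable of (source, flagged) pairs, where `flagged` is True
    if the content filter classified the message as spam and `source` is
    one of (hypothetical labels):
      'spamtrap'           - mail sent to a honeypot account (always spam)
      'user_says_spam'     - a user verdict that the message was spam
      'user_says_not_spam' - a user verdict that the message was not spam
    """
    records = list(records)  # allow generators; we iterate twice
    missed_spam = sum(
        1 for source, flagged in records
        if source in ("spamtrap", "user_says_spam") and not flagged
    )
    false_positives = sum(
        1 for source, flagged in records
        if source == "user_says_not_spam" and flagged
    )
    return {"missed_spam": missed_spam, "false_positives": false_positives}
```

Counts of this kind are the sort of signal that could be surfaced to analysts when tuning filter rules; how the platform actually aggregates its feedback is not stated in the text.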
In addition to all the sources of information already mentioned, security analysts also perform log analysis of the various system component logs. Indeed the human security analysts remain an integral component of this system. While well understood components of the analyses can and should be automated, the role of skilled analysts in reacting swiftly to changes in the strategy of perpetrators remains crucial to the success of dealing with unwanted email.

The success of this tiered approach is best appreciated by considering the fact that the platform receives in excess of 1.4 billion spam email messages on a typical day and is ordinarily successful in blocking more than 99.3% of them.

13.5.3 Supplementary or Optional Security Services

We now consider security services that users might specifically subscribe to. Service providers might offer these services as a means to differentiate their existing service offers, or might offer them as stand-alone services. Either way, there is an implied economic incentive for service providers to provide these services.

13.5.3.1 Customer Specific DDoS Mitigation

Fig. 13.12 Customer specific DDoS mitigation (elements shown: Customer specific DDoS Detection; Route Control with REDIRECT actions; routers (R); Attack Traffic; Customer under attack; Scrubbing Complex)

Customer specific DDoS mitigation is technically accomplished in much the same way as the DDoS mitigation described earlier in Section 13.5.2. Traffic destined for attack targets is passed through scrubbers to allow only wanted traffic to pass to the subscribing customers. The key difference for subscribing customers is in the detection of DDoS attacks. Specifically, as shown in Fig. 13.12, a customer specific DDoS detection mechanism is needed in order to detect attacks at the granularity where customer links might be impacted, as opposed to the much higher capacity provider links.
This might involve deploying DDoS detection tools at the customer premises, or might be a network based detection capability that is tuned to detect attacks at the appropriate granularity. Since severe DDoS attacks might overload the regular customer links, a back channel might be needed to alarm on attacks when detection is deployed at the customer premises. As before, once an attack is detected, route control mechanisms are utilized to redirect customer traffic to an appropriate scrubbing complex, and “cleaned” traffic is sent on for delivery to the customer.

13.5.3.2 Network Based Security Services

The complexities of dealing with network security make the outsourcing of network policy enforcement to service providers as a network based managed security service an attractive option for customers. The acquisition, installation and maintenance of security equipment is handled by the service provider, while customers maintain the freedom to specify (and modify) their own security policies. Security services that can be provided in this manner provide bi-directional protection of customer networks from Internet-based security threats through stateful firewalls, network address translation (NAT), URL filtering, intrusion detection systems (IDSs) and content inspection.

Figure 13.13 depicts a high level view of an architecture that enables network based security services. As shown in the figure, security services are provided via security data centers that are directly connected to the provider network. Customer traffic to and from the Internet passes through these security data centers en route to the customer’s network(s).

Fig. 13.13 Network based security services (elements shown: Internet; Security Data Center with Public Side and Private Side; Security Appliance holding the A policy and B policy; routers R, Rx and Ry; Customer Internet Traffic; Customer A Network; Customer B Network)

It is critically important that individual customers’ private traffic remains separate as it is routed to and traverses the secure data center, where policy is applied. Another key concern is ensuring that individual customers’ security policies remain distinct (or virtualized) within the devices that are enforcing the set of security services, and that those policies can be easily and securely administered. As illustrated in the figure, separation of customer traffic is achieved by logically separate “connections” from the customer’s network to the data center. The “connections” are realized in practice through tunneling or VPN technologies. Similarly, within the data center VLAN technologies are used to maintain the traffic separation. Modern security appliances [19, 34] are also capable of per-VLAN security policies and processing so that the per-customer separation is maintained all the way from the Internet gateway router, i.e., router Ry in Fig. 13.13, to the customer’s network(s).

13.6 Security Operations

Network operations are a standard consideration in any environment where reliability is a factor. There are some added considerations when addressing the security operations needs of a network. The network security environment is continually changing. New attacks are created. Variations on old attacks are perfected. Human behavior is manipulated in creatively new ways to allow exploitation. New exploit methods are cumulative with old methods. Even as workarounds and patches are introduced, it is not unusual for old vulnerabilities to be reintroduced or similar ones to be created through the life-cycle of even stable systems. This suggests that an automation strategy is imperative to assist with detection and response methods as security threats evolve.
It also suggests that even with the most aggressive strategy to automate detection and response methods, some manual operations are going to be necessary to maintain a secure operating profile in the network. Generally, the distinction between network operations and security operations comes down to the following characteristics, which stem from the malicious intent associated with security events:
• Failures occur randomly, but attacks are deliberately timed.
• Failures will present themselves in a predictable manner, but attacks may be intentionally deceiving.
Components of the network will fail somewhat randomly, but ultimately, the results are predictable and can be characterized. While new devices, software, and systems will develop new modes of failure over time, the randomness of the events generally does not have significant consequences on the overall performance of the network. Of course there are isolated exceptions (see footnote 11). Conversely, security events are malicious, and thus will be planned to occur at an opportune time to place the attacker at an advantage. For example, an attacker will attempt to take advantage of a newly discovered vulnerability before there is an opportunity to create and/or deploy a suitable patch. Just as network failures occur at random times, they also present themselves in a predictable manner. While the diagnosis of a root cause of a network failure may not be straightforward, it is generally possible to create logical rules to diagnose the cause (see Chapter 12). Again, considering that security events are generally malicious, they can be intentionally disguised to appear as one thing while in reality being something different. Or events can be created to divert attention from the real event. For example, an attacker may launch a denial of service attack against one resource to divert attention from a penetration attack or to mask penetration probes against a target.
Tactics such as diversion, concealment, and obscurity can and will be used to achieve the objective. So while network operations will seek to find the simple explanation for a problem, a security operations team will need to dig deeper, always considering what motivation and technique may have been used by an attacker to trigger events. For this reason, it is generally recommended to have a security operations team represented in the analysis of root cause for network events, particularly if there is any suggestion of strange coincidences or unusual traffic activity.

Footnote 11: Case in point is an outage in the AT&T frame relay network in 1998, where a complete network outage resulted because of a software issue that propagated between switches [69].

13.6.1 Components of Security Operations

In this section we consider the components or entities involved in security operations and the relationships between those entities. Figure 13.14 depicts the organizational components involved in network security operations. The structure follows the typical tiered approach which, as we discussed in Section 13.3.1, ensures that security events can be responded to in a timely fashion by the appropriate technical experts.

Fig. 13.14 Generic overview of the organizations involved in operations security (diagram: service provider and non service provider entities arranged in tiers: Tier 1 & 2, event identification and coordination; Tier 3, incident management; Tier 4, tool specialists; Tier 5, products and research)
Figure 13.14 also illustrates the fact that security operations is not a standalone function. In particular, security specific organizations (or functions) interact with other organizations within the service provider, interact with vendors and other relevant communities outside of the service provider organization, and of course interact with the customer community. Having such a holistic view of network security operations is critically important to ensure its success. That is, the complexity involved in any particular function represented in Fig. 13.14 implies that the function be fulfilled by a specialist, who might not necessarily be aware of the holistic view. For example, network engineers tend to be focused on making the network operate; similarly, analysts tend to be focused on the activities they need to perform. Consequently, some of the tools that will be needed to help the analysts do their job and to help the network engineers be more successful can easily be overlooked without taking a more holistic approach. To form a holistic view, below we list and discuss some of the functions that form part of security operations.

Functions Associated with Network Security Operations

Event Detection: Sensors that detect events that are either directly security related or provide information that contributes to identifying and understanding security events.
Data Collection and Management: In large networks, there will be large amounts of data from detection sensors, which includes general metadata that can assist with security analysis. The collection and management of this data needs to be a deliberate activity and function.
Data Analysis Tools: As manual analysis becomes more routine, there are opportunities for automation and refinement. It will be necessary to have people that focus on tools development while analysts focus on the analysis requirements.
Data Analysis: Some manual analysis of data will be needed to interpret security events. This implies some appropriately trained personnel will be on hand to perform that analysis.
Algorithm Research: Continual research and analysis is needed to keep abreast of known attack methods and to identify emergence of new attack methods.
Vulnerability Database: Tools that collect vulnerability information of many kinds that relate to hardware, operating systems, and application software used in and around the network environment.
Network Device Vendor Relationships: Ultimately, the creators of network and network security products know the most about their products. The product vendors do not always know the network environment. A cooperative relationship with vendors is necessary to merge the two.
Network Event Root Cause Analysis: When events occur, it may be easy to make the problem go away, but understanding the root cause with a balanced consideration of potential malice is important to recognizing and preventing future security issues.
Situational Awareness Tracking: This is the activity to be cognizant of events that are taking place in the world and how they might influence the security posture of the network and influence network activity.
Coordination and Collaboration Tools: As security analysis teams become more complex and disciplines become more specialized, tools are needed to exchange and preserve information efficiently.
Case Management: Tools that provide the capability to track and record the status of network and security events.
Customer/User Support: Customers and users of network services will depend on your network expertise, data, and controls to help maintain a good security posture.
Mitigation Mechanisms: It is not enough to simply identify and understand problems. Policies, tools and procedures need to be in place to remediate and hopefully prevent problems.
Network Engineers and Tools Development: Automation is paramount to recognizing relevant issues in large-scale networks.

Event Detection, Data Collection and Management and Data Analysis Tools
These functions are covered in detail in the earlier part of this chapter.

Data Analysis
Security Intelligence is discussed at length in this chapter. It is important to develop a team of analysts that are well acquainted with what various security attack events look like in the available data. Further, it is equally important to be able to recognize what normal traffic activity looks like. Every network is architected a little differently, data is collected a little differently, metadata is generated, collected, and managed differently, and traffic profiles vary depending on the user demographics. For these reasons, there is no substitute for training and practice of capable analysts. By detecting, characterizing, and addressing small network events, management of larger events becomes a matter of routine rather than a stressful experience, and the chances of accurate diagnosis and action are improved.

Algorithm Research
Network security is an arms race between the ability of attackers to exploit and the ability of defenders to anticipate, detect, prevent, and remediate. In an enterprise environment, there are already a number of commercial tools that provide reasonably sophisticated analysis. In service provider environments, far fewer commercially available tools can provide the appropriate perspective. Therefore, there may be a need to perform research and development into algorithms that characterize data, detect events, and help determine appropriate courses of action. Highly skilled and well trained analysts become very good at recognizing specific types of activities. But no analyst possesses all the skills necessary to recognize or characterize complex events, and analysts each possess different strengths.
Researchers with backgrounds in mathematics, data presentation/visualization, and algorithm creation help to create the tools analysts will ultimately use. It is advisable to include algorithm researchers in forensics activities. For example, when an event was not automatically detected but perhaps should have been, researchers should be employed to look for evidence of the event in the underlying historical data, develop algorithms or methods for detecting future events, and test the algorithms against on-going activity to help validate them.

Vulnerability Database
Many security issues for systems are known. But there are so many systems, network elements, operating system versions, and platforms that it is virtually impossible for any one person to know the full set of vulnerabilities and their implications for network operations. Managing this complexity is paramount to managing a secure network. Collecting a database of potentially relevant vulnerability information (and exploit information when practical) is a helpful tool for engineers, systems managers, and security analysts alike. Theoretically, we should be able to identify all vulnerabilities for systems and fix those vulnerabilities as soon as they are discovered, so that many of the security threats will be mitigated. In reality, not all vulnerabilities are known, and not infrequently vulnerabilities are exploited before they become public knowledge (so-called “zero-day” events). In other cases, vulnerabilities are known by a few and not disclosed to the users for lack of suitable fixes. The only practical defense is to engineer systems and networks using a defense in depth strategy. This type of engineering is regularly performed in the context of reliability engineering, where it is commonly known as engineering for “no single point of failure”.
In security engineering, a similar strategy should be used to assure no single point of failure in a security mechanism will significantly compromise the assets that need to be protected. Network Device Vendor Relationships There is a mutual benefit to developing a strong relationship with vendors that provide your network products. Most vendor designers do not actually use their products in real-world operations. While they may perform robust testing of their products, it is enormously difficult to anticipate all of the negative test cases that would be needed to thoroughly identify any issues in products. There is no substitute for reality, where many unanticipated circumstances will be encountered. Obviously, it will be desirable to report behavior that appears to be a direct security threat. But there is also benefit to reporting behavior that appears to be innocuous, since that behavior could be used to create a much more insidious problem. Similarly, having developed a strong relationship, vendors will become more comfortable and feel more obligated to communicate suspected issues that they have discovered such as vulnerabilities and exploits. One should never assume that a vendor’s product is either secure or accurate, even if it is a security product. Network Event Root Cause Analysis When network events occur, it is common practice to consider the “root cause” for the network event. Security representation should always be a part of network event analysis to consider the potential malicious motivations and techniques that could have been related to or the cause of the event. As stated previously, there may be malicious intent involved and attempts to make one type of event look like another. While network engineers and operators diagnose what appears to be the problem, security analysts should consider the possibility of other types of activities that may be taking place. In other words, it is the security analyst’s role to act as a conspiracy theorist. 
Obviously, there is a balance that must be maintained. There is little value in disrupting normal operations and consuming excessive resources to investigate elaborate conspiracy theories. The security analysts should consider the possibilities and follow only those that have merit considering the likely impact to business risk and operations. Again, practice is the best way to develop an appropriate balance.

Situational Awareness Tracking
There are activities and events that reside outside the network but can still affect the state of the network. Situational awareness is the practice or art of keeping abreast of conditions that could affect the network. For a service provider, national security status and terrorist threats, natural disasters, political events and other major events, personnel issues or threats, existing network events/outages, and other factors will all have a potential influence on the focus of security operations. Sometimes events can come from unlikely places. For example, Fig. 13.15 shows the effects of online viewing of the 2009 Inauguration event for President Barack Obama on UDP traffic volumes. This type of activity change could easily have been interpreted as either a DDoS attack or a worm had outside influences not been taken into consideration. Situational awareness and investigation of the event dynamics helped avert suspicions of malicious behavior in this case.

Fig. 13.15 UDP traffic change due to online viewing of the Inauguration of President Barack Obama (plot title: Change in Total UDP Traffic due to Online Coverage of the Presidential Inauguration; y-axis: UDP change ratio, 0 to 2.5; x-axis: date-hour (GMT), 1/15/2009 to 1/22/2009)
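The reasoning behind Fig. 13.15 can be illustrated with a minimal sketch. It is not the authors' tooling: the function names, the 1.5 threshold, the median baseline, and the sample volumes and event window are all assumptions chosen for illustration. The idea is simply to compute a change ratio against a baseline and to check a calendar of known outside events before treating an elevated ratio as suspicious.

```python
from datetime import datetime
from statistics import median


def change_ratio(current, baseline):
    """Ratio of the current traffic volume to the median of baseline samples."""
    base = median(baseline)
    if base <= 0:
        raise ValueError("baseline volume must be positive")
    return current / base


def classify(ts, current, baseline, known_events, threshold=1.5):
    """Label a measurement; `threshold` is an assumed, illustrative value.

    known_events is a list of (start, end, label) tuples maintained by
    analysts as part of situational awareness tracking (hypothetical format).
    """
    ratio = change_ratio(current, baseline)
    if ratio < threshold:
        return ratio, "normal"
    for start, end, label in known_events:
        if start <= ts <= end:
            return ratio, f"elevated, coincides with known event: {label}"
    return ratio, "elevated, unexplained: escalate to an analyst"


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the figure.
    events = [(datetime(2009, 1, 20, 15), datetime(2009, 1, 20, 20),
               "inauguration webcast")]
    baseline = [100.0, 98.0, 102.0, 101.0]
    print(classify(datetime(2009, 1, 20, 17), 220.0, baseline, events))
```

A label of "coincides with known event" is only a hint for the analyst, not proof of benign traffic, since an attacker could deliberately time activity to hide behind a major event.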
The most common situational awareness “tools” in use are likely to be current affairs newscasts, e.g., CNN and TWC, available on video monitors in the operations centers.

Coordination and Collaboration Tools
While not discounting the value of this form of situational awareness, it is not sufficient. It will be useful to have some tools available that allow analysts to make notes on relevant events to track. Blogging tools might be adapted for this purpose. As new analysts start their work-day or shift, the situational awareness notes and recent cases provide a good starting point. At the end of each shift, analysts should check several internal and external sources of information that may have updates and relevant news that could have implications for network operations. Relevant points should be extracted from sites and included in the notes. Similarly, activities that have taken place during the shift and have been determined to be irrelevant or, more importantly, activities that need to be watched should be noted. Collaboration tools should keep a record of activities that extend beyond simple management of alerts and cases.

Case Management
There will be a need for recording security events, delegating the investigation and/or the mitigation of those events and tracking them to closure. For a network of any significant size, there will be security issues and events that range from small and routine to large and complex. Issues might include situations or scenarios that are identified to help prevent security events from occurring. Addressing issues to prevent events is a good thing and worthy of tracking. For example, it may be necessary to implement a critical security patch in network devices to assure a particular vulnerability cannot be exploited.
In the course of implementing the patch, temporary mitigation actions may be implemented to assure attempts to exploit the vulnerability are not successful or at least are detected. There will be a need to track the status of security issues and events. A network of any significant size will have a case tracking system to address network events, and it may be possible to integrate security issue tracking as an integral part of network event tracking. However, since network security is often managed and functions as a distinct specialized discipline that complements network operations and engineering, it may be desirable to track security issues and cases as a distinct entity. Customer/User Support Invariably, customers and/or users of network services will have security and operational difficulties. In the interest of maintaining customer satisfaction with the service, some amount of security support in the form of event management, data analysis, forensic & root cause analysis is needed to help customers resolve their issues. This seems somewhat obvious, but supporting customer events can become a burden on resources if the security operations costs do not account for the time and effort needed to support this function. There are significant peripheral benefits to providing this type of support since understanding the issues and concerns of customers can lead to developing services and functions in the network that address customer needs. Network Engineers and Tools Development While not an intrinsic part of security operations, the people that develop tools that are used for monitoring, analyzing, and protecting the security of the network and network users often have the best knowledge about what works and what would constitute a misuse of tools that have been developed. 
Since security operations can be forced to extend outside the envelope of normal or expected use, it is best to plan for and include development representatives in the response execution for fringe events. For example, network devices, analytical tools, and systems almost always have features that may not have been fully tested or have not been integrated into normal network operations. Engineering may know what these features are and how they can be used. That could mean the difference between avoiding an event and having to recover from a disruptive outage.

13.7 The Indefinite Arms Race

More than any other aspect of networking, security appears set to be an indefinite arms race between those providing network services and services enabled by networks, and those who seek to use the same resources for illegitimate activities. The fact that a significant part (perhaps the majority) of such nefarious activity is economically motivated, together with the continuing growth in the reliance of modern society on such networks, suggests that this situation will persist. Indeed, realizing that service provider network security is as much a process and an approach as it is a technical discipline is perhaps the single most important message in this chapter. Below we summarize key insights from the approach to service provider network security presented in this chapter and provide thoughts on important future directions.

13.7.1 A Service Provider Approach to Network Security

Understand the Problem in Context
A critical first step in dealing with network security is to understand the nature of the problem in the context of the specific provider network in question. That is, understand intrinsic threats and how those are shaped by business relationships, fundamental technological dependencies or limitations and the incentives of different role players.
Develop and Follow a Comprehensive Framework to Network Security
A comprehensive security framework is as much about technology as it is about structure, procedures and relationships. The most basic technical component of such a framework involves the configuration of network elements to be robust against exploitation and abuse.

Derive Actionable Network Intelligence
Network monitoring at different granularities and timescales provides the raw data, which should be combined with other information sources to derive security related network intelligence. Automation is crucial to support and enable analysis by security experts.

Pro-actively Deploy Network Security Services
Acting on network intelligence to protect the network and to provide security services provides a “closed-loop” environment and to some extent offsets the economic imbalance between legitimate and illegitimate economies.

Take a Holistic Approach to Network Operations
Network security operations should be performed within a holistic context with appropriate relationships and interactions between security functions, other service provider functions and non service provider functions like vendors and customers.

13.7.2 Future Directions

Safe Sharing of Security Intelligence
We have emphasized the importance of each service provider developing good network security intelligence. To the extent that each provider is only part of the global network infrastructure, there is a need to share such security intelligence across service providers. There are some existing proposals and solutions along these lines [6, 8, 9]. One example is the anomaly Fingerprint Sharing Alliance [9], which allows summary information about an attack (i.e., a fingerprint) to be shared between participants. Another example is the ATLAS initiative [8], whereby each service provider deploys devices in its network that run a type of honeypot.
Data collected from the honeypot is then shared with the third party ATLAS-service provider and made available to all participants. These initiatives are moving in the right direction. However, more work is needed, especially to address the delicate balance between sharing richer information and not running afoul of privacy concerns or revealing information that can be abused by a competitor.

Secure Protocols
While we cautioned in Section 13.2 that secure protocols are not the proverbial “silver bullet”, such initiatives should be encouraged and secure protocols should be developed and deployed where they address security needs.

Secure Network Architectures
Despite receiving much attention from the networking research community [12, 73, 74, 89], the security implications of the best-effort, unaccountable service model of the Internet architecture remain an unresolved problem. While the role of the network in mitigating other “higher layer” security concerns might be debatable, it seems clear that this problem can fundamentally only be solved “in the network”. Some argue that these shortcomings can only be addressed in a clean-slate network design that is unencumbered by backwards compatibility with the current network [18].

Improve Incentives
In the context of this chapter we have focused on one aspect of attempting to balance the playing field between good and bad actors, namely economics. Generalizing this to architectures and protocols that aim to provide mechanisms to correct the imbalance appears to be a promising direction. We consider a number of examples:
• At the transport protocol level “client puzzles” have been proposed [45] as a means for servers under DDoS attack to selectively accept connections from (presumably) legitimate clients that have successfully solved a puzzle. This approach shifts the balance of power as the client is required to perform work before any server resources are allocated to it.
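To make the client-puzzle idea concrete, the following is a minimal sketch of a generic hash-based puzzle. It is an assumption-laden illustration rather than the specific scheme proposed in [45]: the function names, the use of SHA-256, and the "leading zero bits" difficulty measure are all choices made here for clarity. The point it shows is the asymmetry: the client searches over many hashes, while the server verifies a solution with a single hash.

```python
import hashlib
import os


def leading_zero_bits(digest: bytes) -> int:
    """Count the number of leading zero bits in a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def make_challenge() -> bytes:
    """Server side: issue a fresh random challenge (hypothetical helper)."""
    return os.urandom(16)


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force a counter whose hash meets the difficulty."""
    counter = 0
    while True:
        digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return counter
        counter += 1


def verify(challenge: bytes, counter: int, difficulty: int) -> bool:
    """Server side: one hash suffices to check the client's work."""
    digest = hashlib.sha256(challenge + counter.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty


if __name__ == "__main__":
    challenge = make_challenge()
    solution = solve(challenge, difficulty=12)  # about 2**12 hashes on average
    print(verify(challenge, solution, difficulty=12))
```

A practical design would also need to bind each challenge to the requesting client and give it an expiry time so that solutions cannot be precomputed or replayed; those details are omitted from this sketch.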
• At the architectural level a DoS-resistant architecture has been proposed [35]. This architecture proposes to change the any-to-any service model of the Internet by being explicitly aware of whether a node acts as a client or a server, and further explicitly aims to tilt the cost of communication in favor of the server.
• Economic disincentives have been proposed as a spam mitigation mechanism [51]. In essence this scheme associates cryptographic “stamps” with each sent email message, and canceled stamps (i.e., email that was deemed unwanted by the recipient) eventually result in the sender either having to pay to continue sending email or being blocked from sending email altogether.
The adoption of these specific proposals remains uncertain. However, the fact that they all attempt to specifically address incentives appears to be a promising direction.

Scalability
As we have indicated throughout this chapter, scalability is a significant network security concern, both in terms of the volumes of data used to derive network intelligence, and in terms of mitigation mechanisms and services employed to protect the network and its users. Most existing network security solutions are aimed at the enterprise market. This is expected because, first, enterprises are more security aware and are therefore willing to spend money to protect the applications they use. Second, enterprises present a sweet spot in terms of scalability. Finding scalable security solutions in the consumer market remains a significant challenge. The fact that this market is very price sensitive will further exacerbate the problem. However, it could be argued that the lack of security solutions in the consumer space is a significant contributor to the overall security problem, as unmanaged home networks make easy prey for botnet recruiting efforts. It is therefore important that security solutions be found in this space.
Internet Governance
We touched in passing on the role of Internet governance when considering DNSSEC in Section 13.2.3. Internet governance is largely orthogonal to core network security concerns. However, understanding the proper role for national and international bodies in governing the Internet and the potential impact on the security of the Internet, depending on how those roles are defined, is an important open question.

Cyber Critical Infrastructure
Similarly, we considered the fact that commercial entities are responsible for running and maintaining critical infrastructure in Section 13.2.5. Given the global dependence on this infrastructure, both economic and governmental, it would behoove governments and other role players to better understand the implications of this. In the U.S., recognition of these dependencies led to the establishment of Information Sharing and Analysis Centers (ISACs) [42]. ISACs have been established on a per-sector basis to share information concerning cyber threats between critical infrastructure owners, operators and government. These industry initiatives might be complemented by government making the best quality security solutions available to commercial industry, and encouraging the special solutions that are being devised for protection of government to be used commercially as well. Similarly, government procurements of services should help to establish infrastructure that can be applied to protect the telecommunications of all of the critical infrastructure categories that depend on reliable communications in worst-case scenarios.

Acknowledgments
We would like to acknowledge the contributions of many of our colleagues in AT&T Labs and the AT&T Chief Security Office (CSO) organization. To a large extent we are simply reporting on their efforts over many years.
We especially want to thank Steve Wood for explaining to us his work on the AT&T email platform, Adrian Cepleanu and Tom Scholl for expounding details of network security configuration, Joseph Blanda Jr. for sharing details about network based security services, Dave Gross who pioneered much of the security analysis work and Ed Amoroso, AT&T’s Chief Security Officer whose technical leadership provides the structure in which much of this work takes place. We would also like to thank the editors as well as Bill Beckett, Dave Gross, Patrick McDaniel, Subhabrata Sen and Tom Scholl for providing insightful comments on earlier versions of this chapter. 512 B. Rexroad and J. Van der Merwe References 1. Daytona Data Management System. Retrieved from http://www.research.att.com/daytona/. 2. DNS Threats & DNS Weaknesses. Retrieved from http://www.dnssec.net/dns-threats.php. 3. North American Operator’s Group. Retrieved from www.nanog.org. 4. US-CERT United States Computer Emergency Readiness Team. Retrieved from www.us-cert. gov. 5. The continuing denial of service threat posed by DNS recursion (v2.0). (2006). Retrieved from http://www.us-cert.gov/reading room/DNS-recursion033006.pdf 6. Allman, M., Blanton, E., Paxson, V., & Shenker, S. (2006). Fighting coordinated attackers with cross-organizational information sharing. In Workshop on Hot Topics in Networks (HotNets), Irvine, CA. 7. Andert, D., Wakefield, R., & Weise, J. (2002). Trust modeling for security architecture development. Retrieved from Sun BluePrints OnLine. Retrieved from http://www.sun.com/blueprints. 8. Arbor Networks. ATLAS initiative services & requirements – A service provider’s guide to participating in the ATLAS initiative. Retrieved from www.arbornetworks.com. 9. Arbor Networks. Fingerprint sharing alliance – A community for coordinated, rapid attack resolution. Retrieved from www.arbornetworks.com. 10. Arbor Networks. (2009). 
Arbor peakflow SP pervasive network visibility, security and profitable managed services. Retrieved from http://www.arbornetworks.com/peakflowsp.
11. Arends, R., Austein, R., Larson, M., Massey, D., & Rose, S. (2005). Protocol modifications for the DNS security extensions. IETF RFC 4035.
12. Argyraki, K., & Cheriton, D. (2005). Network capabilities: The good, the bad and the ugly. In Workshop on Hot Topics in Networks (HotNets), November 2005.
13. Atkins, D., & Austein, R. (2004). Threat analysis of the Domain Name System (DNS). IETF RFC 3833.
14. AT&T Laboratories, Information Security Center of Excellence. Seven pillars of carrier-grade security in the AT&T MPLS network.
15. Claise, B. (Ed.). (2004). Cisco systems NetFlow services export version 9. IETF RFC 3954. Retrieved from http://www.ietf.org/rfc/rfc3954.txt.
16. Baran, P. (1964). On distributed communications: I. Introduction to distributed communications networks. In RAND Memorandum, RM-3420-PR.
17. Barbir, A., Murphy, S., & Yang, Y. (2006). Generic threats to routing protocols. In IETF RFC 4593, October 2006.
18. Bellovin, S., Clark, D., Perrig, A., & D. S. (Eds.). (2005). A clean-slate design for the next-generation secure Internet. www.geni.net/documents.html. In Community Workshop Report (GDD-05-02).
19. Bouchard, M., & Mangum, F. Beyond UTM – The value of a purpose-built network security platform. Available from http://www.fortinet.com.
20. Butler, K., Farley, T., McDaniel, P., & Rexford, J. (2007). A survey of BGP security issues and solutions. Retrieved from http://www.cs.princeton.edu/jrex/papers/bgp-security08.pdf.
21. Carrel, D., & Grant, L. (1997). The TACACS+ Protocol. IETF draft: draft-grant-tacacs-02.txt, January 1997.
22. Cisco Systems. Defeating DDOS attacks. Retrieved from http://www.cisco.com/en/US/products/ps5888/prod white papers list.html.
23. Cisco Systems. (2009). Cisco security advisory: Cisco IOS software multiple features crafted UDP packet vulnerability.
Retrieved from http://www.cisco.com/warp/public/707/cisco-sa-20090325-udp.shtml.
24. Clark, D. (1988). The design philosophy of the DARPA internet protocols. In SIGCOMM ’88: Symposium Proceedings on Communications Architectures and Protocols (pp. 106–114).
25. Computer Crime Research Center. (2005). Hackers: Companies encounter rise of cyber extortion. Retrieved from http://www.crime-research.org/news/24.05.2005/Hackerscompanies-encounter-rise-cyber-extortion/.

13 Network Security 513

26. Cranor, C., Johnson, T., Spatscheck, O., & Shkapenyuk, V. (2003). Gigascope: A stream database for network applications. In Proc. ACM SIGMOD, San Diego, CA (pp. 647–651).
27. Department of Homeland Security. (2003). The national strategy to secure cyberspace. Retrieved from http://www.dhs.gov/xlibrary/assets/National Cyberspace Strategy.pdf.
28. Ellison, C., & Schneier, B. (2000). Ten risks of PKI: What you’re not being told about public key infrastructure. Computer Security Journal, 16(1), 17.
29. Espiner, T. (2008). Georgia accuses Russia of coordinated cyberattack. Retrieved from http://news.cnet.com/83011009 31001415083.html.
30. Evers, J. (2005). Is latest can of worms a cyber-crime turf war? Retrieved from http://software.silicon.com/malware/0,3800003100,39151483,00.htm.
31. Feamster, N., Mao, Z. M., & Rexford, J. (2004). BorderGuard: Detecting cold potatoes from peers. In IMC ’04: Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, New York, NY (pp. 213–218).
32. Goodell, G., Aiello, W., Griffin, T., Ioannidis, J., McDaniel, P., & Rubin, A. (2003). Working around BGP: An incremental approach to improving security and accuracy in interdomain routing. In Proceedings of the NDSS, San Diego, CA.
33. Guenther, P., & Showalter, T. (2008). Sieve: An email filtering language. IETF RFC 5228, January 2008.
34. Gupta, M. Single PAss Inspection Engine: The architecture for profitable MSSP services. Available from http://www.ipolicynetworks.com/.
35.
Handley, M., & Greenhalgh, A. (2004). Steps towards a DoS-resistant internet architecture. In FDNA ’04: Proceedings of the ACM SIGCOMM Workshop on Future Directions in Network Architecture, Portland, OR.
36. Hellweg, E. (2004). When bot nets attack. Retrieved from http://www.technologyreview.com/Infotech/13771/.
37. Honeynet Project. (2006). Know your enemy: Honeynets. Retrieved from http://www.honeynet.org/papers.
38. Huston, G. (2007). The ISP Column – Trust. Retrieved from http://www.isoc.org/pubs/isp/.
39. IANA. IANA IPv4 Address Space Registry. Available from http://www.iana.org/assignments/ipv4-address-space/.
40. Iannaccone, G., Chuah, C.-N., Mortier, R., Bhattacharyya, S., & Diot, C. (2002). Analysis of link failures in an IP backbone. In IMW ’02: Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement, New York, NY (pp. 237–242).
41. ICANN Security and Stability Advisory Committee. (2008). SSAC advisory on fast flux hosting and DNS. Retrieved from http://www.icann.org/en/committees/security/sac025.pdf.
42. ISACCOUNCIL.ORG. (2009). The role of information sharing and analysis centers (ISACs) in private/public sector critical infrastructure protection. Available from http://www.isaccouncil.org.
43. ITU-T telecommunication standardization sector of ITU. Series X: Data networks, open system communications and security. Information technology – Open systems interconnection – The directory: Public-key and attribute certificate frameworks. ITU-T Recommendation X.509, 2008.
44. John, J. P., Moshchuk, A., Gribble, S. D., & Krishnamurthy, A. (2009). Studying spamming botnets using botlab. In Proceedings of the Second Symposium on Networked Systems Design and Implementation (NSDI).
45. Juels, A., & Brainard, J. (1999). Client puzzles: A cryptographic countermeasure against connection depletion attacks. In Proceedings of the 1999 Network and Distributed System Security Symposium.
46. Jung, J., Krishnamurthy, B., & Rabinovich, M. (2002).
Flash crowds and denial of service attacks: Characterization and implications for CDNs and web sites. In Proceedings of the 11th International Conference on World Wide Web, ACM Press, Honolulu, Hawaii (pp. 252–262).
47. Kalafut, A. J., Van der Merwe, J., & Gupta, M. (2009). Communities of interest for Internet traffic prioritization. In Proceedings of IEEE Global Internet Symposium.
48. Kanich, C., Kreibich, C., Levchenko, K., Enright, B., Voelker, G., Paxson, V., & Savage, S. (2008). Spamalytics: An empirical analysis of spam marketing conversion. In 15th ACM Conference on Computer and Communications Security (CCS), Alexandria, VA.
49. Karasaridis, A., Rexroad, B., & Hoeflin, D. (2007). Wide-scale botnet detection and characterization. In Conference on Hot Topics in Understanding Botnets (HotBots), Cambridge, MA.
50. Kent, S., Lynn, C., & Seo, K. (2000). Secure border gateway protocol (S-BGP). IEEE JSAC, 18(4), 582–592.
51. Krishnamurthy, B., & Blackmond, E. (2004). SHRED: Spam harassment reduction via economic disincentives. Retrieved from http://www.research.att.com/bala/papers/shred-ext.ps.
52. Kuerbis, B., & Mueller, M. (2007). Securing the root: A proposal for distributed signing authority. Retrieved from http://internetgovernance.org/pdf/SecuringTheRoot.pdf.
53. Lau, B., & Svajcer, V. (2008). Measuring virtual machine detection in malware using DSD tracer. Journal in Computer Virology. Retrieved from http://www.springerlink.com/content/d71854121143m5j5/ and http://www.citeulike.org/article/3614541.
54. Leiner, B. M., Cerf, V. G., Clark, D. D., Kahn, R. E., Kleinrock, L., Lynch, D. C., Postel, J., Roberts, L. G., & Wolff, S. (2003). A Brief History of the Internet, version 3.32. Available from www.isoc.org.
55. Lemos, R. (2009). Cyber attacks disrupt Kyrgyzstan’s networks. Retrieved from http://www.securityfocus.com/brief/896.
56. Leyden, J. (2004). The illicit trade in compromised PCs. Retrieved from http://www.theregister.co.uk/2004/04/30/spam biz/.
57. Mao, Z. M., Sekar, V., Spatscheck, O., van der Merwe, J., & Vasudevan, R. (2006). Analyzing large DDoS attacks using multiple data sources. In SIGCOMM Workshop on Large Scale Attack Defense (LSAD).
58. Marshall8e6. TRACElabs. Retrieved from http://www.marshal8e6.com/TRACE/.
59. McDaniel, P., Sen, S., Spatscheck, O., Van der Merwe, J., Aiello, B., & Kalmanek, C. (2006). Enterprise security: A community of Interest based approach. In Proceedings of Network and Distributed Systems Security 2006 (NDSS).
60. Moore, D., Voelker, G., & Savage, S. (2001). Inferring Internet denial-of-service activity. In Proceedings of the USENIX Security Symposium (pp. 9–22).
61. Ng, J. (2004). Extensions to BGP to Support Secure Origin BGP (soBGP). Internet Draft: draft-ng-sobgp-bgp-extensions-02.txt.
62. Ohm, P., Sicket, D., & Grunwald, D. (2007). Legal issues surrounding monitoring during network research. In Internet Measurement Conference (IMC).
63. Patrick, N., Scholl, T., Shaikh, A., & Steenbergen, R. (2006). Peering dragnet: Examining BGP routes received from peers. North American Network Operators’ Group (NANOG) presentation.
64. Poulsen, K. (2003). Slammer worm crashed Ohio nuke plant network. Retrieved from SecurityFocus, http://www.securityfocus.com/news/6767.
65. Provos, N. (2004). A virtual honeypot framework. 13th USENIX Security Symposium.
66. Ramachandran, A., & Feamster, N. (2006). Understanding the network-level behavior of spammers. In Proceedings of the ACM SIGCOMM, Pisa, Italy.
67. Rescorla, E., & Korver, B. (2003). Guidelines for writing RFC text on security considerations. IETF RFC 3552.
68. RIPE NCC. (2008). YouTube Hijacking: A RIPE NCC RIS case study. Retrieved from http://www.ripe.net/news/study-youtube-hijacking.html.
69. Rohde, D., & Gittlen, S. (1998). AT&T frame relay net goes down for the count. Retrieved from http://www.networkworld.com/news/0414frame2.html.
70.
Security and Prosperity Steering Group APEC Telecommunications and Information Working Group. (2008). Best Practice for cooperative response based on public and private partnership. Available from http://www.apec.org/.
71. Senie, D., & Sullivan, A. (2008). Considerations for the use of DNS reverse mapping. Internet draft: draft-ietf-dnsop-reverse-mapping-considerations-06.
72. Shaikh, A., & Greenberg, A. (2004). OSPF monitoring: Architecture, design, and deployment experience. In Proceedings of the First Symposium on Networked Systems Design and Implementation (NSDI).
73. Simon, D. R., Agarwal, S., & Maltz, D. A. (2007). AS-based accountability as a cost-effective DDoS defense. In Conference on Hot Topics in Understanding Botnets (HotBots).
74. Snoeren, A. C., Partridge, C., Sanchez, L. A., Jones, C. E., Tchakountio, F., Kent, S. T., & Strayer, W. T. (2001). Hash-based IP traceback. In Special Interest Group on Data Communication (SIGCOMM) Conference.
75. Sotirov, A., Stevens, M., Appelbaum, J., Lenstra, A., Molnar, D., Osvik, D. A., & de Weger, B. (2008). MD5 considered harmful today – Creating a rogue CA certificate. Retrieved from http://www.win.tue.nl/hashclash/rogue-ca/.
76. Spiekermann, S., & Faith Cranor, L. (2009). Engineering privacy. IEEE Transactions on Software Engineering, 35(1), 67–82.
77. Spitzner, L. (2003). Honeypots: Definitions and value of honeypots. Retrieved from http://www.tracking-hackers.com/papers/honeypots.html.
78. TEAM CYMRU. BGP/ASN Analysis Report. Retrieved from http://www.cymru.com/BGP/summary.html.
79. United States Government Accountability Office. (2005). Prevalence of false contact information for registered domain names. Retrieved from http://www.gao.gov/new.items/d06165.pdf.
80. U.S.-Canada Power System Outage Task Force. (2004). Final report on the August 14, 2003 blackout in the United States and Canada: Causes and recommendations. Available from https://reports.energy.gov/.
81. US-CERT.
Vulnerability note VU#800113 – Multiple DNS implementations vulnerable to cache poisoning. Retrieved from http://www.kb.cert.org/vuls/id/800113.
82. Vamosi, R. (2007). Cyberattack in Estonia – What it really means. Retrieved from http://news.cnet.com/Cyberattack-in-Estonia-what-it-really-means/2008-7349 3-6186751.html.
83. Van der Merwe, J., Cepleanu, A., D’Souza, K., Freeman, B., Greenberg, A., Knight, D., McMillan, R., Moloney, D., Mulligan, J., Nguyen, H., Nguyen, M., Ramarajan, A., Saad, S., Satterlee, M., Spencer, T., Toll, D., & Zelingher, S. (2006). Dynamic connectivity management with an intelligent route service control point. SIGCOMM Workshop on Internet Network Management (INM).
84. Vasudevan, R., Mao, Z. M., Spatscheck, O., & Van der Merwe, J. (2007). MIDAS: An impact scale for DDoS attacks. In 15th IEEE Workshop on Local and Metropolitan Area Networks (LANMAN).
85. VeriSign. (2008). Root zone signing proposal. www.ntia.doc.gov/DNS/VeriSign DNSSECProposal.pdf.
86. Verkaik, P., Spatscheck, O., van der Merwe, J., & Snoeren, A. (2006). PRIMED: A community-of-interest-based DDoS mitigation system. In Proceedings of SIGCOMM Workshop on Large Scale Attack Defense (LSAD).
87. Vrable, M., Ma, J., Chen, J., Moore, D., Vandekieft, E., Snoeren, A., Voelker, G., & Savage, S. (2005). Scalability, fidelity and containment in the Potemkin virtual honeyfarm. In Proceedings of ACM Symposium on Operating Systems Principles (SOSP).
88. Wright, C. (2008). Understanding Kaminsky’s DNS Bug. Retrieved from http://www.linuxjournal.com/content/understanding-kaminskys-dns-bug.
89. Yu, W., Fu, X., Graham, S., Xuan, D., & Zhao, W. (2007). DSSS-based flow marking technique for invisible traceback. In IEEE Symposium on Security and Privacy.
90. Zheng, C., Ji, L., Pei, D., Wang, J., & Francis, P. (2007). A light-weight distributed scheme for detecting IP prefix hijacks in real-time. SIGCOMM Computer Communication Review, 37(4), 277–288.
Chapter 14
Disaster Preparedness and Resiliency

Susan R. Bailey

14.1 Introduction

The most important thing to remember in this chapter is its title. The previous working version of the title was “Disaster Recovery,” which is certainly the most common phrase used to describe the set of activities associated with managing operations (including networks) through the most severe catastrophic incidents. Indeed, the kinds of activities that get publicity, make the headlines, and become the material for rewards and recognition are the recovery activities that take place following a major disaster. These activities certainly do involve heroic acts and significant achievements worthy of credit. However, the problem with the term “Disaster Recovery” is the word “Recovery,” which places emphasis on the recovery activities that by definition take place after an event happens. What is missing from the term are the events leading up to a disaster. To be most effective in recovering from a disaster, the bulk of the investment of time and money, as well as the most significant point of leverage to substantially improve recovery performance, should come before the disaster occurs, so that we are prepared to act, and can act quickly and efficiently. This is true for any enterprise in any industry, and is most certainly true in running networks. Even the industry’s premier educational and certification program for those engaged in the practice of disaster management, formed in 1988 and known as the “Disaster Recovery Institute,” has changed its name to “DRI International: The Institute for Continuity Management,” signaling the importance of the full scope of activities that take place before, during, and after a disaster. The way to achieve successful disaster recovery is to implement disaster preparedness.
S.R. Bailey, AT&T Global Network Operations, e-mail: srbailey@att.com
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_14, © Springer-Verlag London Limited 2010

The terrorist attacks of September 11, 2001 demonstrate many of the dimensions of disaster planning and management, as well as the resiliency challenges that are involved in managing networks. The terrorist-piloted airplanes that crashed into the World Trade Center in New York City destroyed major communication hubs housed in the World Trade Center itself and its nearby buildings. These hubs were a core component of the network infrastructure serving lower Manhattan as well as the broader New York and East Coast area. At exactly the time when significant network capacity and connectivity were destroyed, a huge surge of traffic hit the network as people tried to reach their loved ones using any and all means possible, at least doubling volumes on the telecommunications infrastructure. And this volume was not nicely distributed around the USA and the world, but rather was primarily concentrated into and out of lower Manhattan. We call this phenomenon focused overload. This scenario is a network manager’s nightmare: trying to handle an extraordinarily high traffic surge when you have less capacity available to handle the load. The same network that was being used for mass communications was also being used for many command-and-control activities by police, emergency management, and government officials, requiring real-time traffic prioritization decisions during the times of peak congestion.
During times of peak traffic volume on that day, phone calls destined for edge switches (known as “end offices”) that were known to be damaged and out of service were restricted from entering the network and consuming capacity, since it was clear that those calls could not complete successfully anyway. Phone traffic that did not need to travel through the New York area (e.g., traffic from Atlanta to Boston) was redirected away from New York through the use of network management traffic controls.

Yet, in the face of this enormously disastrous scenario, the network infrastructure did not collapse despite localized congestion. Ninety-six percent of AT&T’s Government Emergency Telecommunications Services (GETS) traffic completed successfully even at the height of the event, AT&T’s network hub was recovered and ready for service within 48 hours, and the New York Stock Exchange and the financial industry of lower Manhattan reopened in less than a week. Quite simply, this rate of recovery would have been impossible without the “silent heroes” who planned and practiced disaster preparedness well before that awful day.

A couple of years later, in October 2003, severe wildfires threatened the area surrounding San Diego, California. One of AT&T’s mission critical network management work centers was dangerously close to the fire, so to protect its operational functions, the work center invoked its business continuity plan. Temporary operations were established at an alternate site several hundred miles away. Network managers were deployed to the alternate site, operational support system (OSS) access was established, and phone calls to the San Diego work center were redirected so that the staff at the alternate site could do all the mission critical work normally done in the San Diego work center. This was all done while still running the network, with no loss of operational functionality during the transition. The work center was ultimately not damaged by the fire.
Had it been damaged, the alternate location was prepared to operate indefinitely until a permanent replacement could be built.

Hence, the title of this chapter, and for that matter the content of the chapter, focuses on disaster preparedness, which includes the creation, exercising, and ongoing management of disaster recovery plans. Maintaining a state of readiness enables quick, disciplined recovery to minimize service disruptions. With good disaster preparedness, disaster recovery becomes the disciplined management of the execution of disaster recovery plans.

Section 14.2 addresses the role of carrier networks as national critical infrastructure, and the resulting expectations for sustained operational service in the face of disasters. Section 14.3 reviews the types of considerations involved in sustaining continuity of operations in a network environment, pointing out that a full operational continuity program includes much more than simply protecting the network itself. Section 14.4 provides an overview of the discipline of business continuity management, including techniques to structure a business impact assessment and risk-management program (see footnote 1). Section 14.5 addresses some considerations for designing resiliency into the architecture of the network itself. Section 14.6 addresses preparations involved when a specific disaster such as a hurricane is predicted. Section 14.7 reviews the operational activities that come into play once a disaster happens. Section 14.8 highlights some important technologies associated with disaster recovery. The chapter closes with Section 14.9, a discussion of open questions and future research to further improve disaster preparedness and resiliency.

14.2 Networks as Critical Infrastructure

Network carriers have a lot of responsibility.
The networks provide emergency lifeline communication for tens of millions of customers in the communities the carriers serve, including capabilities such as contacting fire and police departments or 9-1-1 emergency services. The fact that life and safety are involved makes it absolutely essential that these services operate continuously in spite of a disaster, because that is precisely when these services are needed the most.

Carrier networks carry huge volumes of daily communication traffic. At AT&T in 2008, for example, network traffic volume for all Internet Protocol (IP), transport, and voice services exceeded 16 petabytes per day. While all this traffic is valuable to those communicating, a relatively small but growing percentage of this huge volume is truly “mission critical.” Government and other emergency management agencies such as FEMA and the Department of Defense depend on carriers’ networks to perform the data and voice communication required for command, control, and communication functions activated to manage disasters. In addition, as more industries become technology-based, communications networks become an increasingly mission critical component of other national “critical infrastructure” industries, such as the financial sector, and power generation and distribution. As the Internet backbone serves as an increasingly essential core for business operations and commerce, and infrastructure industries implement more electronics-enabled and automated processes, government and industry depend on the continuous availability of their network infrastructure and the carriers who provide it.

Footnote 1: The term “business continuity” is generally used to encompass aspects of planning and managing operational continuity for any type of operation, and in this chapter is not limited strictly to commercial businesses. The fundamental techniques are equally applicable for government, academic, and not-for-profit operations.
In the government environment, business continuity is often referred to as Continuity of Government (COG) or Continuity of Operations (COOP).

14.3 Business Continuity in a Network Environment

Networks are subject to many kinds of threats, some obvious and well-known, others less obvious but equally devastating. To characterize the threats, it is helpful to understand an abstract view of what a network entails. This section describes three major components: the network itself, network management work centers, and operational support systems (OSSs).

In simple terms, a network involves links and nodes. Nodes are computing equipment housed in buildings of various shapes and sizes, ranging from small aggregation or regeneration equipment housed in tiny “huts” with a footprint of only tens of square feet, to huge data centers and central office buildings spanning tens of thousands of square feet. The links are the connections between the nodes, which are typically carried at a physical level on fiber cables that are buried underground, laid under the oceans, or in some cases strung aerially. The AT&T network, for example, involves more than 9,000 major buildings and another 200,000 smaller locations. There are about 900,000 sheath-miles of fiber just in the core of the AT&T backbone, and that does not even include the extensive cabling that connects each customer to the AT&T backbone.

The network itself is not very useful without the operational functions that operate 24×7 to keep the network running, including maintenance and repair, configuration management, capacity management, and provisioning customer services. These functions are executed in work centers, which are typically administrative buildings staffed with network managers on a 24-hour-per-day, 7-day-per-week basis.
Work center functions can vary in their mission criticality, usually based on whether these functions must be fully operational in order to keep a production network functioning. For example, an enterprise might determine that provisioning of new customer orders can be suspended temporarily at the time of a disaster, in which case recovery of the work center functions involved with provisioning of orders might take days or weeks. By comparison, work center functions directly involved in maintaining network traffic flow and repairing network problems are usually deemed mission critical, and must be able to recover almost immediately.

These work centers interact with the network itself using operational support systems (OSSs): software applications and their associated databases that perform functions such as alarm management and the tracking of individual work activities such as orders and tickets. Without the OSSs, the people operating out of the work centers are unable to interact with the network to execute their required functions. The mission criticality of an OSS is correlated to the mission criticality of the work centers that use it.

So, a threat to the network can be anything that impacts the nodes and links of the network itself (whether physically or logically), the work centers and operational processes executed in them, or the OSSs. The list of potential threats is practically endless, but some examples include:
• Physical damage to network nodes, which can be due to incidents such as fires and floods.
• Physical damage to network links. By their very nature of being geographically distributed and exposed to outdoor conditions, many network components (especially fiber cable) are exposed to significant environmental and man-made threats such as train derailments, ships dragging anchor and snagging an undersea cable, earthquakes, or mudslides.
• Widespread and extended loss of electrical power.
Since all the electronics in a network require electrical power, a loss of commercial electrical power can be a significant threat to the function of a network. The more widespread the power loss and the longer the duration, the more significant the disaster can be.
• Denial-of-Service attacks, or any other mass traffic events, which inject mass traffic toward specific components of the network, disabling them by overloading them.
• Worms and viruses, which disable network components or OSSs by destroying their logic or their databases.
• Physical threats to work center buildings and their inhabitants, which require the inhabitants to evacuate their normal operating environment. Work centers are subject to the same kinds of physical threats as network nodes (fires, floods, etc.). In addition, work centers can be impacted by threats that do not necessarily damage the building, but require the people who work in the building to evacuate and/or stay out of their normal operating environment. For example, a gas or chemical leak on the ground or in the air within a building or in the surrounding geographic area can require rapid evacuation. More severe examples can include bomb threats, or even worse, intentional attack using chemical, biological, or radiological weapons or “dirty bombs.” Other threats to work center personnel can range from the loss of mass transit that impacts the movement of personnel between facilities, to a job action related to union contract negotiations (commonly known as a “strike”), to a health pandemic that disables the workforce directly through illness (or death) or indirectly through the need to care for ill family members or fear of entering the work environment and becoming ill.
• Failures in operational support systems used to run the network, such as alarm management systems, ticketing systems, and remote testing platforms.
These systems are subject to many of the same risks as the network itself, such as loss of power to data centers and other modes of failure affecting the flow of telemetry, the server hardware, and the application software.

To achieve a level of network resilience, a comprehensive disaster preparedness and business continuity program should encompass the physical components of the network itself (e.g., electronics and cabling), as well as work centers, their mission critical functions, and the tools they use. One way of representing this is in Fig. 14.1. The diagram shows a pyramid for each of the three major business continuity components: Work Centers, Network, and Operational Support Systems. Each step up the pyramid shows an increasingly significant recovery mechanism.

[Fig. 14.1 Enterprise business continuity protocol. Three pyramids on a shared foundation.
Work Centers: Site Recovery (warm/hot standby, mobile recovery); Functional Recovery (failover, load sharing, decentralized “virtual” operations via telecommuting).
Network: Network Disaster Recovery; Traffic Switching (IP/OSPF, filters, bandwidth controls); Facilities (optical mesh restoration, SONET rings, equipment switching).
Operational Support Systems: Applications (data center diversity); Data Warehouses (file backup and storage, disc mirroring); Servers (load balancing, redundancy and failover).
Shared foundation: People and Process (Change/Configuration Management, Incident Management); Infrastructure (power backup, cooling, physical diversity, plant hardening and protection, security); Business Continuity Discipline (Business Impact Analysis, Threat Analysis, Risk Management, Exercises).]

Many of these can share common foundational elements, at the base of each pyramid, which can be applied to any kind of asset, whether it is a work center, a component of the network, or the operational support systems and databases that are used to run the network. These include:
• Business Continuity Discipline: the disciplined approach to Business Impact Analysis and Risk Management that is described in much of this chapter, including the analysis of threats and vulnerabilities, risk mitigation, development of disaster recovery plans, and ongoing exercising of those plans.
• Infrastructure: the technologies required to provide basic environmental and power conditions to an office. All buildings require power and air handling, whether they house network equipment such as routers and switches, operational support system servers and storage equipment, or work space for people.
• People and Process: a broad category covering the operational discipline in executing well-defined procedures, with vigilance and constant attention to the impact of any action on the network, its services, and its customers.

Above the shared foundation, the specific approaches to achieving business continuity can differ.
• In the center of the diagram, Network recovery focuses on moving service to diverse physical components that are unaffected by a disaster. This can include recovery of service on alternate facilities such as backup equipment or alternate physical paths, or redirection of traffic onto alternative available capacity. Facility and traffic prioritization can be used to identify high-priority services, so that these services can be restored first with the least amount of downtime, as is discussed in more detail in Section 14.7.2. In situations of most extreme damage, rebuilding of components that are destroyed can be accomplished using specialized disaster recovery equipment.
• On the left of the diagram, Work Center recovery involves functional relocation and assuring the availability of skilled staff to pick up the operations when a primary work center is rendered unavailable.
It can include the distribution of critical functions into alternative work centers or a telework model, or a full site recovery, in which a dedicated backup site is maintained for purposes of recovering all of a work center’s mission critical functionality.

On the right of the diagram, Operational Support System recovery focuses on restoration of the applications in backup data centers, and also involves recovery of databases so that any data loss is minimized. It includes backup servers to recover from hardware failures, data backup and storage to protect against loss of data, and application-level recovery through diversifying the platform into multiple data centers.

14.4 Business Continuity Management

So, you have decided that you want to be prepared for disasters and resilient against their impact. Where do you start? The almost limitless list of potential disasters and the breadth of assets to worry about can make the task seem unwieldy. While it would be nice to protect everything from any disaster, that is virtually impossible. The key is understanding the most significant problem areas and effectively prioritizing investments in mitigation. The following terms are a useful foundation for structuring this process.

Threats are factors that have the possibility of causing damage. Threats can take many forms, including physical, logical, economic, political, or social. Examples of threats include weather-related disasters such as hurricanes, floods, and tornadoes; financial disaster; disease pandemic; and the outbreak of war. One way of reducing the risk is to eliminate or reduce the possibility that a threat will occur. For example, a medical vaccine can reduce the possibility of a health pandemic occurring. Often, it is impossible to reduce or eliminate the probability of a threat, as with natural disasters, so the focus shifts to protection despite the existence of the threat.
Vulnerabilities are characteristics of an asset that make it susceptible to damage by a threat. If it is difficult to eliminate a threat, another way of reducing an overall risk profile is to eliminate or reduce an asset’s vulnerabilities. A classic example of reducing vulnerability is sandbagging to prevent the impact of a flood. The high water will still rise with the same probability whether or not there are sandbags, but the sandbags reduce vulnerability by physically holding back flood waters and protecting buildings, people, and equipment from damage.

Impact is the magnitude of damage to an asset in the event that a specific threat exploits a vulnerability. If you cannot adequately address a risk by eliminating threats and vulnerabilities, it can be possible to reduce risk exposure by providing mechanisms to control the impact. Continuing with the flood example mentioned above, an example would be providing pumps and other equipment to move water away once the flood waters have breached a vulnerable area.

14.4.1 Know What Is Important

The first step is to enumerate all the assets critical to sustained operation. Assets can be physical equipment, people and processes, customers, databases, or operational support systems. Assets can also include external assets including third-party suppliers and equipment vendors.

Not all assets are created equal, of course. An important step is to understand which assets are mission critical, meaning that it is impossible to sustain operation effectively without them. Assets that are not quite mission critical, but are important to recover within reasonable time can be assigned ratings of lower importance. Once assets are prioritized, this enables attention to be focused on those with the highest priority.
14.4.2 Analyze Risks

Business Impact Analysis is a formal process that provides structure to the identification and prioritization of threats, based on an understanding of their potential impact on an operation or business. It starts with identifying credible threats that could cause an interruption to an organization’s business. Each asset can be evaluated against each threat to determine whether that asset has a vulnerability associated with the threat. Those vulnerabilities can be evaluated and scored quantitatively or qualitatively on three measures:

• The probability that the threat will exploit the vulnerability
• The magnitude of the service impact if the above happens
• The ability to control the impacts.

These scores can then be rank-ordered and summarized into a risk matrix, such as the example shown in Fig. 14.2. This matrix is one of the most important elements of a risk assessment, and is used to prioritize the building of plans to mitigate the most critical risks.

Fig. 14.2 Risk matrix example

In this example, the weight (W) is a judgment of the overall level of importance on a scale of 0–1.0, the probability factor (P) represents the likelihood that the vulnerability is exploited by a threat, the service impact (S) is the magnitude of the impact if a vulnerability is exploited, and controllability (C) represents the ability to control the impacts. These variables are then combined (in this case, multiplied, though other means of combining are possible) to provide an overall score. A high score indicates the risk factor that poses the highest exposure and therefore warrants the most focused attention to mitigate.

14.4.3 Develop a Plan

Risk mitigation involves identifying ways to reduce risk, by eliminating or reducing threats, eliminating or reducing vulnerabilities, or reducing impact by providing control mechanisms.
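As a rough illustration of the scoring described in Section 14.4.2, the sketch below multiplies the four factors W, P, S, and C and rank-orders the results. The asset and threat names, the numeric values, and the scales for P, S, and C are hypothetical and are not taken from Fig. 14.2. The sketch also assumes that C is scored so that a higher value means the impact is harder to control, which is what makes a higher product correspond to higher exposure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    asset: str              # hypothetical asset name
    threat: str             # hypothetical threat name
    weight: float           # W: overall importance, 0 to 1.0
    probability: float      # P: likelihood the vulnerability is exploited
    service_impact: float   # S: magnitude of impact if exploited
    controllability: float  # C: ASSUMED higher = harder to control

    @property
    def score(self) -> float:
        # Combine by multiplication, as in the example in the text.
        return (self.weight * self.probability
                * self.service_impact * self.controllability)


def rank_risks(risks):
    """Return the risks ordered from highest to lowest overall score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Illustrative values only.
risks = [
    Risk("data center", "flood", 1.0, 0.2, 5, 4),
    Risk("work center", "pandemic", 0.8, 0.1, 3, 2),
    Risk("fiber route", "cable cut", 0.9, 0.5, 4, 3),
]

ranked = rank_risks(risks)
for r in ranked:
    print(f"{r.asset:12s} {r.threat:10s} {r.score:.2f}")
```

Other combining rules, such as a weighted sum, could replace the product in `score` without changing the ranking step.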
The various proposed solutions can be evaluated by their overall impact on reducing risk, feasibility, and time and cost to implement.

When addressing the recovery of any asset, whether it is a network itself, a support system, or a work center that operates the network, two important variables should be considered when designing risk-mitigation solutions.

Recovery Time Objective (RTO): This is the targeted duration of time between the occurrence of a disaster and the time that functionality is restored. For the network itself, this is measured as the outage downtime. For a work center, this would be the time between the declaration of a disaster impacting the work center and the recovery of its functionality in an alternate arrangement at one or multiple backup locations.

Recovery Point Objective (RPO): This is the amount of data that is expected to be lost as a part of the recovery. It can be thought of as the point that you can roll back to and recover all critical configuration data. It is measured as the time between the start of the disaster and the last time before the disaster when the databases and configurations were updated and able to be recovered. For network components, this translates into the amount of provisioning activity prior to a disaster that is lost after the network itself is recovered, and therefore must be reprovisioned.

RTO and RPO are depicted in Fig. 14.3. Ideally, these variables would be near zero, indicating instantaneous recovery with no loss of data. Typically, shorter RTO and RPO require more expensive solutions. Defining RTO and RPO objectives therefore requires very careful consideration of exactly what an operation really needs and how much loss it can handle, based on the impact of downtime to the business. Once defined, they also provide a very straightforward measurement, useful during drills and exercises to evaluate the adequacy of the execution of disaster recovery plans.

Fig. 14.3 Timeline of a disaster

Fig. 14.4 Risk mitigation example

Frequently, an enterprise faces choices between alternative mechanisms to mitigate risk, which can vary in cost to implement and feasibility. One way to assess risk-mitigation options is to quantitatively compare their relative impact on the overall risk exposure. An example is shown in Fig. 14.4. Assume that there are two options for mitigating risks, portrayed in this figure as Plan A and Plan B. In this case, Plan B significantly improves the overall risk exposure compared to Plan A.

Once risk-mitigation strategies are selected and implemented, a full disaster recovery or business continuity plan should be documented, outlining the exact procedural steps that should be taken. Anyone involved in managing or implementing the disaster recovery or business continuity plan needs to be advised and fully trained on what they are required to do in support of the plan.

14.4.4 Test and Manage the Plan

Conditions constantly change. New threats emerge as world events change. Asset bases evolve. New technologies are introduced. People move into and out of organizations. To sustain a state of preparedness, life-cycle management of disaster recovery and business continuity plans is extremely important.

One aspect of life-cycle management is maintaining up-to-date documentation of disaster recovery and business continuity plans. Contact information, process updates, and other information must be kept current.

Another aspect is training and communication, to maintain overall awareness as well as more detailed training on specialized duties as people transition in and out of organizations. People need to be so familiar with what they need to do after a disaster that when it happens, they do what they need to do almost by habit.

Finally, practice, practice, practice!
To maintain a state of preparedness, there is absolutely no substitute for actually invoking a business continuity and disaster recovery plan and seeing how it works. Simulate the actual implementation of disaster recovery and business continuity plans. Make it as real as possible. If equipment is involved, use the equipment and make sure it works. Measure RTO and RPO against organizational objectives. Make sure people are trained, including newcomers to an organization. Record problems and findings. Create a list of improvements and changes identified as a result of the exercise, with clear ownership and accountability.

Sometimes, it is impossible to actually test a business continuity and disaster recovery plan. Sometimes, the scenario being tested is too broad in scope to feasibly be exercised. Sometimes, the act of practicing the loss of functionality and the failover to an alternate arrangement introduces undue risk to customer service and network traffic. Where it is impossible to actually practice a real recovery situation, it might be necessary to practice on a testbed environment separate from the production network, or use traffic simulations based on mathematical models of network traffic across a network topology. Walk-throughs and “table-top exercises” can be used to approximate the movement of functions like work centers in a disaster scenario.

14.5 Design for Resiliency

Being prepared to handle a disaster scenario starts with the design and architecture of the network itself. If network traffic is impacted by the failure of a component, that component is identified as a single point of failure (SPOF), and is a point of high vulnerability. Any component can be an SPOF, including network components, physical cabling within a building, long-distance fiber segments, undersea cable systems, power substations, or power cabling within an office.
A component that is designed to have a backup or alternate arrangement, so that network traffic can persist even if the component fails, is said to have diversity. Generally speaking, the more simultaneous failures a particular platform can handle without impacting traffic, the more resilient is the network design. Applications with extremely high requirements for uninterrupted traffic flow under any circumstance, in a global design that contains many components spread across a wide variety of geographic environments, have been designed with eight or more layers of protection (meaning that the architecture can support seven simultaneous failures without impacting traffic flow). It is painstaking work to design a resilient network, systematically eliminating all SPOFs, ranging from physical fiber diversity to software applications and the servers on which they ride. Traffic simulations based on specific real-time network topology and traffic patterns, such as those described in Chapter 2, can simulate the impact to network traffic under various failure conditions across a network topology. Diversity and redundancy also apply to data storage and backup. Mission critical data, such as circuit layouts, routing policies, and customer configurations, can be replicated and stored in multiple diverse data centers, so that the data can be recovered even in spite of the complete loss of an entire data center and all the equipment in it. Data recovery is so important in the financial industry that it has regulatory guidelines issued by the Financial Industry Regulatory Authority (FINRA), requiring explicit plans for data backup and recovery (Financial Industry Regulatory Authority, FINRA Manual, section 3510). 
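Returning to the single points of failure discussed earlier in this section, the sketch below shows one simple way to find them in a small topology: fail each node and each link in turn and check whether traffic can still get from a source to a destination. The topology, the node names, and the function names are hypothetical, and the brute-force search is only an illustration. It is not the simulation tooling referred to in the text, and it ignores capacity, routing policy, and simultaneous failures.

```python
from collections import deque


def connected(adj, src, dst, removed_node=None, removed_link=None):
    """Breadth-first search from src to dst, skipping one failed node or link."""
    if removed_node in (src, dst):
        return False
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj[node]:
            if nxt == removed_node or nxt in seen:
                continue
            if removed_link is not None and {node, nxt} == set(removed_link):
                continue
            seen.add(nxt)
            queue.append(nxt)
    return False


def single_points_of_failure(links, src, dst):
    """Return (nodes, links) whose individual failure cuts src off from dst.

    Parallel links between the same pair of nodes are collapsed into one,
    so link-level diversity between two nodes is not modeled here.
    """
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    spof_nodes = [n for n in sorted(adj)
                  if n not in (src, dst)
                  and not connected(adj, src, dst, removed_node=n)]
    spof_links = [link for link in links
                  if not connected(adj, src, dst, removed_link=link)]
    return spof_nodes, spof_links


# Hypothetical topology: two diverse paths from A to D, then one link to E.
links = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
result = single_points_of_failure(links, "A", "E")
print(result)
```

In this example the paths through B and C are diverse, so neither is an SPOF, while node D and the link from D to E are. Checking tolerance to several simultaneous failures would require removing combinations of components, which grows quickly with network size.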
Because of the significant dependence on uninterrupted power required to operate a carrier-grade network, carrier-grade network design normally includes the following elements:

• Completely redundant power cabling in major network buildings, with diverse building entrances and connectivity to geographically separated electrical substations on the power grid.
• Battery backup, with near seamless transfer to battery power.
• Diesel or natural gas-powered generators with autostart capabilities and switchgear so that the generators activate even if the office is unstaffed.

Because batteries have limited storage, they are not sufficient for long-term power outages lasting days or weeks. With proper maintenance and a well-run refueling program, generators can run almost indefinitely. Carrier network buildings have been known to operate for months on generator power, for example after significant hurricanes.

A typical power design engages the batteries immediately on loss of commercial power. The batteries provide power for the first few minutes until the generators can ramp up and are ready to carry the load, at which time the load switches over to the generators for longer-term service.

Remote equipment installed in small nodes outside of major network offices poses significant challenges in the area of power protection. These small nodes serve as aggregation points or provide signal regeneration or amplification, and can number in the hundreds of thousands in a large network. The cost to install permanent generators at all these locations can be prohibitive, so these types of nodes often rely on battery backup, followed by dispatch and hookup of portable generators to power the nodes.

A work center without a business continuity plan can be an SPOF as well. Business continuity decisions factor into the design of work centers and their operational support systems. There are many approaches to designing work center business continuity.
One possibility is to design a backup site, geographically distant from the primary work center location, which exists solely for the purpose of recovering the primary site. Another possibility is to build two or more centers that share the workload under normal operating conditions, with mechanisms to flow work inputs away from a center that is unable to operate due to a disaster, e.g., by redirecting phone calls, alarms, tickets, or other work drivers. Since many work centers perform multiple functions, another approach is to flow each function to an alternate arrangement, without necessarily moving all the work in the center as a whole to another work location.

The operational support systems and their associated databases used by employees in a work center should not be SPOFs either, and the time to work SPOFs out of the system architecture is during design and installation. Typically, this involves installation of servers at geographically distant data centers, with software designed so that it can fail over from one server or site to another.

14.6 When You See Disaster Coming

In many cases, we are lucky enough to see an impending disaster before it happens. For example, we can watch hurricanes form and travel across the water, and we are glued to the weather forecasts, which predict when and where the storm will strike and how severe it will be. This advance warning gives us precious lead time to take very specific preparatory actions, well beyond the broad planning and preparedness discussed so far. It is helpful to maintain a checklist of pre-event activities to perform once a specific disaster risk is identified. Here are a few kinds of activities to do in preparation for a disaster.

Batten down the hatches. Take action to protect assets from physical damage. This can include sandbagging, boarding up windows and doors, closing marine doors to prevent floodwaters from entering a building, installing chicken wire around roofing to help hold the shingles on, or, in preparation for mass political demonstrations, welding manholes shut to prevent unauthorized access by those intending to do harm. Work centers that are in harm’s way might elect to activate disaster plans and move mission critical functions to other facilities outside of the risk area.

Lock down the network. Any time the network is being “touched,” it is being put at risk. Any change can introduce a new problem. People can make mistakes, and software changes can introduce new bugs. During a disaster, the focus should be on handling the disaster, and any other problem is a distraction. The idea here is to call off as much unnecessary work as possible (planned or routine maintenance, software upgrades, etc.), so that as much effort as possible is available to focus on handling the disaster itself. The scope of such a lock-down should include all assets at risk of damage, as well as any assets that might be needed to restore service due to other damage.

Fix everything possible. Anything that is broken in the network can be thought of as capacity that is not available to use, even if it is not directly service impacting. For example, if a line-card in a router has failed, but a redundant card is being used to sustain service, the failed line-card is not available to restore service in the event that the redundant card fails. Carriers call this situation a simplex condition, and all simplex conditions should be remediated before the disaster strikes. Even if the particular simplex component is not directly threatened by the disaster itself, disasters often make it difficult to move equipment and supplies, so the component would be difficult to replace even if it fails due to nondisaster-related conditions. Besides, that capacity might be needed to restore other service.

Know what is at risk.
Take stock of what physical components (buildings, equipment, and cable routes) are at risk, and what services and what customers are riding on those components. When simulation tools are available, simulate the failure of those components to determine the potential service impact, and take preventive action where possible.

Move services out of harm’s way. In many cases, services can be moved to alternate facilities and nodes that are not at risk of damage, for example, by ring-switching Synchronous Optical Network (SONET) facilities or assigning an extremely high logical cost to IP backbone connections that are carried on fiber paths likely to sustain damage. The advantage of moving traffic in advance of an impending disaster, rather than waiting to see exactly what is damaged and triggering an automatic failover, is that the traffic move can be implemented under more controlled conditions, which can result in less overall service impact.

Stage emergency equipment and supplies. If repair and rebuilding is likely, identify the equipment and supplies that would be needed and stage them in preparation for deployment after the disaster. But be careful: you do not want your emergency supplies to be too close, or they can be subject to the same threats as your primary components. It can be prudent in some cases to leave the equipment in protected warehouses, to be dispatched after the threat has passed.

Preplan the recovery. Any design work that can be done in advance of a disaster eliminates precious time lost after a disaster doing “engineering on the fly.” In some cases, network traffic can be rerouted to alternative equipment, which has been permanently installed in the network. For example, an IP-based application can reside on servers that are installed into geographically diverse data centers, with data backup and failover capabilities.
In other cases, especially where connectivity needs to be re-established, equipment needs to be deployed near the disaster area. For this kind of scenario, AT&T’s Network Disaster Recovery program utilizes software that preplans exactly which equipment and which trailers need to be deployed, provides a design for the specific connections between the trailers, and preprints labels for the intertrailer cabling. In addition, configuration management software maintains current configurations of each component in the production network. These configurations can be downloaded en masse so that the trailerized disaster recovery equipment can take on the identity of the damaged equipment without the need to rebuild the mappings and configurations from scratch. This configuration management alone saves weeks (and potentially months) of restoration time.

Communicate to customers. In advance of a disaster, customer communication tends to focus on actions being taken to protect their service and on the special communications that the carrier will implement to keep customers apprised as the event unfolds. It can be a calming influence for customers to know that their service is in professional hands and that the carrier is acting proactively to protect the network and customer traffic.

14.7 When Disaster Strikes

Regardless of the best efforts to prevent it, eventually a disaster will hit a network. That is when the network management team kicks into high gear, focusing all their energies on keeping the network alive, restoring any service that has been impacted, and addressing customer needs.

14.7.1 First and Foremost, Exercise Disciplined Command and Control

Most network carriers have a command center or emergency management center that exists for the primary purpose of managing disasters of various sizes. AT&T’s command center is its Global Network Operations Center (GNOC), as shown in Fig. 14.5.
The GNOC has a military-style command-and-control structure, complete with predefined threshold-triggered actions and specific 24×7 duty officer assignments. A fundamental component of most command-and-control processes is an emergency management bridge, a secure conference call that serves as the focal point for critical communication of status, obtaining resources, and providing overall strategic priorities and direction. Kicking off this bridge is normally the very first order of business in almost any disaster recovery activity.

Fig. 14.5 AT&T Global Network Operations Center

A good command-and-control process is structured with predefined actions based on identified scenarios including outages, incidents, attacks, crises, indicators, and threats. It should cast a wide net of information sources, including network management, government, customer, and other sources of information.

The control bridge serves a number of purposes, including:

• Ensuring proper flow of information and response decisions across the company
• Approving tactical plans and making critical decisions
• Ensuring that adequate resources are provided
• Coordinating incident response across organizations
• Assessing impact in near real time
• Prioritizing restoration
• Authorizing communications, including press releases and customer notifications.

Every person participating on the control bridge should have a specific assignment. Participants come from various functional organizations within a large carrier’s company, including network operations, real estate, public relations, customer servicing, and security. Because 24×7 coverage and immediate response are fundamental to the success of a command-and-control structure, reachability is essential, and delegation is required for even brief times when a bridge member might be unreachable. The notification procedure itself requires significant preplanning, with the command center knowing alternate reach information, personal schedules, and delegations for all bridge participants. When an incident occurs, all participants whose functions are potentially involved should be notified to join the bridge, and those not required can be dismissed once the situation is understood sufficiently.

It is very helpful to separate those managing the command-and-control structure from those directly involved in fixing a problem. This is for two reasons. First, it enables those fixing a given problem to remain focused on their task of restoring service, without getting distracted by tasks such as crafting communications documents. Second, it is common for those extremely close to a problem to lose sight of the bigger picture, while a neutral oversight function can more easily see the overall view of priorities, service impacts, and response activities. Liaison functions often are established to bridge between those directly involved in repair and the command center.

Once a disaster situation is stabilized, the same command-and-control structure can take the lead in analyzing the event and capturing lessons learned after the fact. Every incident is an opportunity to learn, and lessons can range from prevention of future occurrences to improved response if the event ever happens again. Lessons learned can be translated into permanent improvements, including procedural changes and the creation or enhancement of preparedness checklists. It is helpful for these improvement opportunities to be captured and formalized on an action register, with clear ownership and accountability and time-bound expectations for implementing the identified improvements.
Placing oversight of the implementation of the improvement program under the auspices of the formal command-and-control structure tends to ensure that the improvements are done quickly and completely, which ultimately makes the organization and the network more resilient in the face of future disasters.

It is important to consider resiliency of the functions and tools used for command and control, such as work centers, operational support systems, and conference bridges. They can themselves have single points of failure and be subject to threats. At a minimum, a strong command-and-control structure should have alternate backup notification and communication mechanisms, in case the primary mechanisms fail. Backup arrangements can range from basics such as alternate conference bridges using geographically separated equipment, through to extreme cases such as radio backup in case all commercial telecommunications have failed.

14.7.2 Manage Traffic Congestion

Many disasters involve some form of traffic congestion, which is most simply described as too much traffic trying to go through too little capacity. Network nodes such as routers become congested by overrunning limits such as CPU capacity or memory. Links between nodes become congested when the traffic flowing down them exceeds the bandwidth of the connection. Just like a traffic jam on a highway, network congestion can disable network components that are overloaded, spread outwards to adjacent network components, and ultimately bring a network to its knees if not properly managed.

Traffic congestion is often an immediate after-effect of a disaster that requires immediate action on the part of network managers. The basic approach is to service as much network traffic as possible, even if that means making difficult decisions to impact some traffic to keep the network from failing under the weight of the load.
Traffic controls were developed over decades on traditional circuit-switched networks, and include capabilities such as traffic redirection and bandwidth management. In today’s IP networks, traffic adjustments require changes to traffic filters and routing policies, but can be utilized to achieve a similar goal. In making traffic manipulation decisions, basic traffic management principles should be followed to protect the network:

1. Utilize all available resources. During a congestion situation, all network capacity should be put into service to handle as much traffic as feasibly possible. To do this, traffic managers observe load conditions on nodes and links, simulate configuration changes, and implement routing changes to adjust traffic flows away from congestion bottlenecks and toward underutilized capacity. In the September 11 example cited in Section 14.1, there was no need for traffic traveling from Atlanta to Boston to go directly through New York because it was not originating or terminating there. Because network capacity in New York was congested and there was sufficient capacity outside of New York to handle additional load, network managers redirected this “via” traffic away from the congested area, so that all the network capacity in the New York area was used to serve the traffic that needed to get in or out of New York.

2. Keep all available resources filled with the traffic that has the highest probability of resulting in effective communications. In this principle, “effective communications” simply means that the traffic reaches its destination successfully. If a network manager has information that the destination is not able to receive traffic, there is no use consuming any network capacity to carry the traffic across the network only to fail at the end. Network managers attempt to restrict this traffic as it enters the network, so as to minimize unnecessary consumption of capacity for unsuccessful communications.
In the traditional voice network, network management controls include “cancel-to” controls, which are applied as traffic enters the network and direct the originating switch to fail phone calls destined for a terminating “end office” or terminating edge switch, for example if that switch is known to be down. Such controls do not yet exist in the IP network, other than brute-force application of things like Access Control Lists (ACLs), but they are a future opportunity as discussed in Section 14.9.2.

3. In case of congestion and/or overload, give priority to traffic that makes the most efficient use of network resources. The more network resources used to deliver a unit of traffic, the less overall traffic can be delivered on the available capacity. To apply this principle, network managers like to exercise directional bandwidth controls to enable traffic out of a disaster area, which greatly reduces the traffic demand into the area. In the voice network, these controls can be applied to restrict fixed percentages of traffic headed directionally from one switch to another. This kind of control remains a future opportunity in the IP networks of today.

4. Inhibit traffic congestion and prevent its spread. Ideally, all offered load would be completed, and that is indeed true at low and moderate load levels. At these levels, the entire offered load is easily completed. As the offered load increases further, network performance begins to degrade as the network components become more consumed in administrative tasks required to keep themselves up and running. Hence, the curve of completed load against offered load begins to drop off at high offered load levels, and not all that load is completed successfully. If offered load increases even further, performance continues to suffer gradually until a “break point” is reached, the load simply overwhelms the equipment, and the equipment becomes unable to service any traffic at all.
Once this point is reached on one component of the network, the congestion spreads extremely quickly, and other components are subsequently vulnerable to the same phenomenon. This scenario can quickly get out of hand. The network manager’s task is to deliver traffic along this curve, completing as much traffic as possible on each network component without letting any component “fall off the cliff.” This can mean intentionally failing some traffic so as to protect the broader infrastructure. Because it entails failing some amount of customer service, this measure should be applied as much as is needed but as little as possible. While TCP offers endpoints the ability to throttle traffic based on perceived congestion between them, the underlying IP network operates largely on simple notions of links being up or down, with no ability to gate traffic volume in response to congestion. Congestion-related controls therefore remain an opportunity for development in IP networks.

The US Department of Homeland Security works with network carriers to implement programs that prioritize emergency communications for Government, Defense, emergency responder, and critical infrastructure users. These include:

Telecommunications Service Priority (TSP) is a program that enables individual circuits on a backbone network to be identified as mission critical. This enables the circuit to be prioritized for restoration. Referring back to Fig. 14.1, TSP is an example of prioritization at the facility level.

Government Emergency Telecommunications Service (GETS) has existed for many years on the circuit-switched voice network, and enables individual phone calls to be identified as “emergency” and prioritized above others when capacity is limited. Enhancements to GETS include expansion to Internet Protocol and Voice over Internet Protocol services. In Fig.
14.1, GETS is an example of traffic-level service prioritization, since no circuits are actually restored, but individual traffic sessions are identified and given first treatment within the available capacity limits.

Wireless Priority Service (WPS) operates similarly to GETS, except that the voice calls that it operates on are wireless (cellular) phone calls.

14.7.3 Restore First, Repair Later

When we encounter any type of problem, it is human nature to try to diagnose and fix it. This is not necessarily the best course of action, because diagnosis and repair can take a long time, during which customers are unable to use the network for their communication needs. An experienced network manager will keep in mind that the first priority is to restore customer service as quickly as possible, regardless of whether the root cause of the problem has been found and fixed. In fact, it is often possible to restore customer service using alternate restoration mechanisms much faster than the original problem that caused the service outage can be diagnosed and fixed. Because repairing widespread damage can take a long time, service restoration is almost always the preferred way to bring back a network’s ability to handle traffic immediately after a disaster. Because restoration takes advantage of equipment already deployed and ready to use, it requires little or no manual work, and can often be done remotely by network managers in a work center far from the disaster area. It takes a seasoned network manager to ensure that adequate attention is given to restoration options, whether designed into the network architecture or devised in real time at the time of the disaster. This can proceed in parallel with work on permanent repairs or, if technical expertise and staffing are limited, instead of it.
14.7.4 Replace Damaged Equipment

In the event of a “smoking hole” scenario in which network equipment is damaged and must be replaced, temporary recovery equipment can be an extremely valuable way to restore the network and its services until permanent replacements can be acquired and installed. Since restoration time is often critical, here are some steps to shorten the time it takes to deploy and configure this equipment:

Procure the equipment in advance. This eliminates any time lost in the purchase, manufacture, and shipping of the equipment.

Mount the equipment in mobile deployable units. Large network carriers can deem it worthwhile to invest in equipment built to be transportable and dedicated to the purpose of disaster recovery. Network equipment can be mounted into mobile units that are designed to operate like “data centers on wheels,” complete with self-contained power, cooling, and racks. These can be trailers that are hauled by truck, or fly-away containers that fit into airplanes.

Ensure the recovery equipment is in a constant state of readiness. Check the equipment regularly and perform preventive maintenance on it. If the production network is being upgraded, for example to new software releases, then upgrade the recovery equipment along with the production network equipment, so that it matches the equipment it would be replacing as closely as possible. Restoration equipment can be connected to monitoring systems for ongoing alarm management, so that any problems in the equipment can be detected and remediated, and all of the equipment is known to be active and operable at all times. All this readiness work saves valuable time when a disaster strikes.

Maintain current, complete, and accurate backup files for all equipment configurations.
These backups can be downloaded into temporary restoration equipment, enabling it to take on the identity of the equipment it is replacing without additional time being spent configuring the equipment from scratch.

Preplan network designs. Even if replacement equipment is readily available, precious time can be lost designing the architecture of the replacement solution. Preplanned designs, or software support for ad hoc designs, can reduce this time and shorten the overall recovery.

14.7.5 Open a Customer Service Command Center

Whether or not a disaster impacts a network customer’s service, customers will very often have special needs that must be channeled and prioritized. Sometimes, customers need to report network outages. Other times, even without a network outage, customers have new demands on the network, such as additional bandwidth to handle increased communication needs, or the provisioning of new services to deal with the relocation of customer data centers or administrative offices. A customer service command center can accept all the customer needs and funnel them into appropriate channels, prioritizing as needed, for example to ensure that national security, emergency management, and critical infrastructure needs are handled first. This command center can also serve as a bidirectional communication interface between customer service or sales teams and the network managers, so that customers can obtain timely and accurate information about the restoration and recovery activities taking place.

14.8 Technologies

A wide variety of technologies play an important part in disaster preparedness and management for networks.

14.8.1 Restoring Connectivity

Whether network cabling is terrestrial (i.e., buried underground), aerial (i.e., strung along utility poles), or undersea, all cabling is exposed to many environmental threats. One of the most common types of damage to networks is the loss of physical connectivity due to cable damage.
A number of technologies are available to restore connectivity while the cable is undergoing physical repair. These include:

Cell Sites on Wheels (COWs) and Cell Sites on Light Trucks (COLTs): These are often deployed in disaster areas to bolster a wireless network’s capacity. They operate exactly like a permanent cell tower, except that they are mobile.

Point-to-point radio technologies: A variety of technologies use licensed and unlicensed spectrum, and can be chosen based on distance and speed requirements. Typically, these systems are used to establish a connection to a hop-on point onto the network backbone. An advantage of these technologies is that they can be portable units, easily moved and quickly installed. However, they frequently require line-of-sight, and can be impacted by the same kinds of conditions that impact visibility (e.g., severe fog, foliage). Also, they typically do not provide connectivity over long distances, though in some cases multiple radio links can be connected end-to-end to achieve somewhat longer connections.

Satellite communications: Portable satellite base stations, for example mounted in vans or trucks, can be deployed to establish network connectivity. Because satellite communications can be used across almost any geographic area and can be quickly configured, they are frequently used in the earliest stages of a disaster. However, propagation delay can make satellite communications infeasible for applications that are sensitive to latency.

Free-space optics: This emerging technology involves laser-based optical connectivity. Unlike traditional fiber-optic solutions, however, the laser travels through air instead of glass fiber. Free-space optics offers benefits similar to those of point-to-point radio solutions, with the additional advantage that the optical connection can support much higher data rates. However, connectivity can be interrupted if anything disrupts line-of-sight between the laser transmitter and the receiver.
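The latency caveat for satellite links can be made concrete with a short calculation. The sketch below is illustrative only. It assumes a geostationary satellite (the text does not say which orbit is used) positioned directly overhead, so the results are lower bounds on propagation delay; processing, queuing, and the longer slant path seen from most ground locations all add to them. The function names are hypothetical.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in vacuum
GEO_ALTITUDE_KM = 35_786.0             # assumed orbit: geostationary altitude


def min_one_way_delay_ms(altitude_km: float = GEO_ALTITUDE_KM) -> float:
    """Lower bound on ground -> satellite -> ground propagation delay, in ms.

    Assumes the satellite is directly overhead, so each leg equals the altitude.
    """
    path_km = 2 * altitude_km  # uplink leg plus downlink leg
    return path_km / SPEED_OF_LIGHT_KM_PER_S * 1000.0


def min_round_trip_ms(altitude_km: float = GEO_ALTITUDE_KM) -> float:
    """Lower bound on request-plus-response propagation delay, in ms."""
    return 2 * min_one_way_delay_ms(altitude_km)


if __name__ == "__main__":
    print(f"one-way lower bound:    {min_one_way_delay_ms():.0f} ms")
    print(f"round-trip lower bound: {min_round_trip_ms():.0f} ms")
```

Under these assumptions the round trip alone comes to a little under half a second before any equipment delay is added, which is why interactive voice and other latency-sensitive applications can suffer on such links even when bandwidth is adequate.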
14.8.2 Restoring Operational Support System

Essential to the recovery of operational support systems is the ability to access the data used by these systems. Databases are absolutely critical to network management, used for functions such as recording tickets, maintaining the network inventory, and storing usage statistics for billing purposes. Remote data storage enables the recovery of any data despite the loss of primary data storage mechanisms. A number of technologies are available for remote data storage, ranging from mature technologies such as tape backup through more recent technologies such as disk mirroring and replication for more continuous, real-time remote data storage.

Another dimension of OSS recovery is the recovery of the servers and software applications. This can be accomplished through failover to alternate equipment, ideally installed in data centers that are geographically separated from the primary servers. Failover can be triggered either manually or automatically, with automatic failover typically involving less downtime because of the speed of response. Emerging capabilities such as virtualization support more dynamic allocation of computing resources to applications needing additional computing power, whether due to a surge in extraordinary demand or to physical loss of primary computing power.

14.8.3 Restoring Power

Another very frequent form of damage to networks is the loss of power. Some technologies available to restore power include:

Portable generators: These can range from small units like those homeowners might purchase for personal use, to large units with higher power capability that are hauled by truck.

Permanently installed generators: These are especially suitable when power demands are very high, and the generators themselves are very large.
Some permanently installed generators can operate off natural gas, eliminating the need for refueling as long as the natural gas supply is not interrupted. This is an appealing option because movement of people and supplies can be very difficult in a disaster situation.

Batteries: Since batteries can only provide power until they are discharged, they are normally used to support network equipment only temporarily until generators are able to meet the power requirements. When generators are installed permanently and with autostart capability, batteries are expected to support the equipment for only a few minutes, until the generators are activated. At this point, specialized control equipment can transfer the load from the batteries to the generators without interruption to the equipment.

14.8.4 Enabling a Safe Work Environment

When chemical, biological, or radiological agents are involved in a disaster, a network provider might need to have people working on the network itself, repairing or rebuilding network electronics and cabling, while also cleaning up these dangerous agents. Often, a subset of the network operations team is established as a “Hazmat Team,” trained on how to operate in the presence of hazardous materials, using special gear. Decontamination units focus on protecting people by removing dangerous agents from them. They are dispatched to the “warm zone,” which is defined as the area between the “hot zone” where the agents are prevalent and the “cool zone,” which is safe for people to operate. People enter the unit from the hot zone, go through various stages of being “hosed down,” and leave to the cool zone with replacement clothing.

Fig. 14.6 Chemical, biological, radiological, or nuclear response

A wide variety of protective gear can be used by people who need to operate within an area with environmental hazards.
These can range from simple paper air-filtering masks to fully protective suits and self-sustaining breathing equipment, such as those shown in Fig. 14.6.

14.9 Open Questions/Future Research

14.9.1 Improving Predictability

Looking ahead, the most exciting opportunity in the area of disaster preparedness and resiliency is technology to improve the predictability of disasters. It is always easier to manage a disaster when you know it is coming; a surprise is always harder to manage than a predicted event. Technological advances can take what is a surprise today and make it a predicted event tomorrow.

To see how important predictability can be, one need only look back in history at the evolution of hurricane prediction. Weather forecasting was almost nonexistent 100 years ago, so hurricanes could not be predicted or tracked. This is largely what caused the extreme deadliness of the famous Galveston hurricane of 1900, in which 6,000 people died. Earlier that day, life seemed normal on this small Texas island, and the weather deteriorated so quickly that it was impossible to evacuate or make any preparations before the storm hit. Now, with advancements in weather-forecasting technologies, we can watch hurricanes form from their earliest stages as tropical depressions thousands of miles away from land. We can track their path and predict the location and severity at landfall. This provides precious time to take precautionary action.

Our opportunity space for future research is to consider “surprise” disasters that can happen today, determine leading indicators and signatures of those events, and provide the measurement and alerting capabilities needed to get warning in advance. The possibilities are practically endless, and can include software bugs, cyber attacks, earthquakes, or health pandemics.
In almost all cases, the essence of prediction is complex data mining and correlation to detect underlying patterns, trends, and anomalies in the earliest stages of the incubation of a potential disaster. In the case of cyber attacks, the data sources can include network traffic patterns. Earthquake prediction could involve analysis of seismological data. Health pandemic prediction could involve analysis and correlation of leading medical indicators, such as medication purchases, emergency room visits, and test results.

14.9.2 Managing Traffic

Another exciting research opportunity is in the management of Internet Protocol traffic flows, to help handle the extreme traffic volumes and congestion conditions that arise during disasters. We have learned that when a network is used for lifeline and safety communications as well as for the flow of critical information, it is impossible to engineer the network to handle the extraordinary volumes and patterns that occur at the time of a major disaster. The problem is exacerbated when the same network experiences any loss of or damage to capacity. Network managers need the ability to prioritize traffic and control its flow to avoid congestion conditions. These capabilities are largely “brute forced” in today’s IP network. Technological advancements that provide intelligent traffic routing and congestion management capabilities can dramatically improve network resiliency moving forward.

14.9.3 Other Opportunities

Technology advancements will also continue in areas that shorten the time to design and implement a disaster recovery solution. Software that provides specific engineering designs to address unique scenarios at the time of a disaster can shorten the design time. Configuration management software can improve the ability to maintain real-time configurations offline, which reduces the time it takes to recover service on backup equipment and minimizes any loss of data.
Even improvements in cabling and physical connections can reduce recovery time, by providing quick-connect mechanisms that reduce slow manual processes such as splicing and wiring.

14.10 Conclusion

Disaster recovery starts with disaster preparedness. Achieving and sustaining network resiliency requires significant investment of effort and financial resources, starting well before an identified disaster is imminent. It starts with the earliest phases of technology selection and network design, and continues through installation and ongoing operation. It means understanding risks and vulnerabilities and making sound business choices. It also involves disciplined execution of disaster recovery plans once a disaster strikes. Because of the significant investment of effort and financial resources involved, the message about the importance of business continuity and disaster preparedness needs to start from the very top of the company. Formalizing a governance process, instituting standards and policies, and establishing a dedicated and empowered planning function of business continuity professionals are all important steps toward achieving a resilient network.

14.11 Best Practices

Below are the key messages of this chapter, captured in a brief summary of best practices to keep in mind as you develop your approach to disaster preparedness and resiliency.
Disaster Management and Resiliency “Best Practice” Principles

Understand which assets are truly mission critical, and focus planning efforts accordingly
Prioritize investments using a risk-assessment methodology which quantifies probability, impact, and ability to control outcomes
Institute disciplined command-and-control capabilities
Practice disaster management plans and use the drills to identify improvement opportunities
Provide ongoing life-cycle management attention to disaster-management plans
Strive toward anticipating and predicting disasters wherever possible

References²

1. Caralli, R. A., Stevens, J. F., Wallen, C. M., White, D. W., Wilson, W. R., & Young, L. R. (2007). Introducing the CERT Resiliency Engineering Framework: Improving the Security and Sustainability Processes, from http://www.sei.cmu.edu.
2. Coates, J. (2006). Anticipating disaster from research, or putting the fear of God into top management. Research-Technology Management, 49(1), 6–9.
3. Coutu, D. (2002). How resilience works. Harvard Business Review, 80(3), 46–55.
4. Elliott, D., Swartz, E., & Herbane, B. (2001). Business continuity management: A crisis management approach. Boca Raton, FL: Taylor & Francis.
5. Financial Industry Regulatory Authority. (2008). FINRA Manual, from www.finra.org.
6. Flin, R. (1996). Sitting in the hot seat: Leaders and teams for critical incident management. Chichester/England: Wiley.
7. Hiles, A. (2007). The definitive handbook of business continuity. New York: Wiley.
8. Hollnagel, E., Woods, D., & Leveson, N. (2006). Resilience engineering: Concepts and precepts. Aldershot/England: Ashgate.
9. Keanini, T. (2003). Vulnerability management technology: A powerful alternative to attack management for networks. Computer Technology Review, 23(5), 18–19.
10. McEntire, D. (2001). Triggering agents, vulnerabilities and disaster reduction: Towards a holistic paradigm. Disaster Prevention and Management, 10(3), 189–196.
11. Reinmoeller, P., & van Baardwijk, N. (2005). The link between diversity and resilience. MIT Sloan Management Review, 46(4), 61–65.
12. Sheffi, Y. (2005). The resilient enterprise: Overcoming vulnerability for competitive advantage. Boston, MA: MIT Press.
13. Snedaker, S. (2007). Business continuity and disaster recovery planning for IT professionals. Amsterdam: Elsevier Science & Technology Books.
14. U.S. Department of Homeland Security. (2006). National Infrastructure Protection Plan, from http://www.dhs.gov.
15. van Opstal, D. (2007). Transform: The resilient economy: Integrating competitiveness and security, from http://www.compete.org.
16. Wallace, M., & Webber, L. (2004). The disaster recovery handbook: A step-by-step plan to ensure business continuity and protect vital operations, facilities, and assets. New York: AMACOM.
17. Weichselgartner, J. (2001). Disaster mitigation: The concept of vulnerability revisited. Disaster Prevention and Management, 10(2), 85–94.

² The following references are offered for those who would like to learn more about the subjects of business continuity, disaster management and recovery, and resiliency. They include descriptions of techniques and technologies, summaries of national policy challenges, and insights into the role of leadership before and during a disaster.

Part VII Reliable Application Services

Chapter 15 Building Large-Scale, Reliable Network Services

Alan L. Glasser

15.1 Introduction

This chapter is concerned with a particular class of software: large-scale network services, such as email systems providing service to millions of subscribers or web servers supporting e-commerce services to many customers simultaneously. To set the context of network service software, it may be helpful to understand how such software is similar to or different from a few other classes of software.
Network services are generally expected to be “highly available”; that is, they are expected to be available at any time, 365 days of the year. This is in contrast with many Information Technology (IT) systems that are designed to support a specific business function and may be allowed significant periods of scheduled downtime; for example, they may be down for “maintenance” on weekends. It is also informative to contrast network services software with end-user (or “shrink-wrapped”) software. While end-user software often gets deployed in far higher quantity than network services, such software gets upgraded in a manner that is often known to and under the control of the end user. Microsoft, Apple, and other vendors have conditioned their end-user communities to expect and tolerate upgrades. Such conditioning of end users has not been the case for network services, and expectations are generally much higher for network service availability. Finally, while we present techniques for producing reliable software for large-scale network services, this chapter will not cover “carrier-class” software (the quintessential example being the software running the public switched telephone network). We define carrier-class software as software that, if it fails at all, is typically down for at most 5 minutes per year or, stated alternatively, software that has an availability of at least 99.999%. While there are many other characteristics of carrier-class software, another key one that contrasts somewhat with network services is that carrier-class software rarely requires operator intervention.

A.L. Glasser, Distinguished Member of Technical Staff, AT&T Labs Research, Middletown, NJ, USA, e-mail: aglasser@att.com

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_15, © Springer-Verlag London Limited 2010

The class of software covered in this chapter is not expected to run without any human intervention; in other words, the software need not be completely non-stop and self-healing (as is often the case in “carrier-class” software). It is far more expensive to develop software that needs no (or minimal) human intervention. However, it is important to recognize that with human (or operator) intervention come failures [1] that have their own costs; it is often reported that for highly reliable systems, one-third of the failures are due to hardware failures, another third are due to software failures, and the remaining third are due to human, i.e., operator, error. Within this scope, we will identify the key concepts and techniques that have proven to be valuable in the production of reliable network service software. The overall approach may be characterized as “documentation heavy,” since reliable software requires a very clear, well-understood view of the system. Many of these concepts and techniques are also applicable to the production of other classes of software.

Network services need to be reliable, because the businesses that they support cannot afford the impact of frequent software failures. Reliable software is not defined to be bug-free software; it is software with a particular probability of running without failure in a given environment for a specified period of time. Availability is a measure of the percentage of time that the service is available for end users to use over a period of time, typically one year. For example, a software system may be designed to be available 99.99% of the time over a year. Stated alternatively, the design requirements allow the system to be unavailable for 0.01% of the year, or approximately 53 minutes per year.
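The relationship between an availability target and its yearly downtime budget is simple arithmetic, and it can be useful to have it at hand when comparing targets. The following is a minimal sketch, assuming a 365-day year; the function name is hypothetical and chosen only for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% available -> {budget:.2f} minutes of downtime per year")
```

For the two targets mentioned in this chapter, the calculation gives about 52.6 minutes per year at 99.99% and about 5.3 minutes per year at 99.999%, consistent with the rounded figures quoted in the text.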
Many network services are sold with a Service-Level Agreement that includes, among other terms, a stated mechanism for measuring availability, a contracted level of availability, and a set of financial penalties (including early contract termination) for failure to meet the contracted level of availability. Cost is a key factor in producing reliable software. The fundamental engineering problem to be solved is to make an appropriate trade-off between the cost of producing reliable software and the cost of any business-impacting service failures.

In the remainder of this chapter, we discuss techniques for producing reliable software. Section 15.2 presents an overview of the system development process, Section 15.3 presents the generation of requirements, Section 15.4 presents the architecture deliverables, Section 15.5 presents the design and implementation process, Section 15.6 presents testing, and Section 15.7 presents the support processes.

15.2 System Development Process

In many parts of the software industry, process is considered a panacea. Unfortunately, many IT groups are burdened with rigid formal software development processes that add overhead without contributing much to software reliability. The perspective presented in this chapter is that process is a way of thinking, not a substitute for thinking. This chapter does not cover process definition or process improvement (this material is covered amply elsewhere; see, for example, [2]). Instead, we describe the activities necessary to produce reliable, robust large-scale network service software. While some software is built by a single individual working alone, this chapter addresses the more common case of software developed by a group of people [3]. Development of large-scale network services requires execution of distinct functions: requirements, architecture, design, implementation, test, and support.
While a large organization may be able to assign each staff member to perform only a single function, most organizations will not have that luxury. Care should be taken when assigning staff to carry out more than one function. Requirements are best produced by staff with no other job function; when this is impossible due to resource constraints, an effective second task for the requirements staff is the test function. Architecture is best performed by a single, designated architect. This individual should be accountable for the service and is often supported by a number of subject matter experts and/or experienced designers (more on this in the Architecture section, below). Design and implementation are functions that should be treated effectively as a single function; they should not be assigned to separate staff groups. Test is ideally a dedicated team, but may be combined with requirements. Testing should never be combined with design and implementation, as this defeats the notion of an independent system test. Finally, deployment and support, while ideally staffed as an independent team, may be combined with design and implementation (it should be pointed out that software with a history of production-discovered problems warrants an independent deployment and support team, as this will allow the design and implementation team to better adhere to a project schedule and deliver on development commitments). The most important element for the successful development of production software by a group of people is a common understanding of the project and its details. Groups that lack a common understanding of the project cannot avoid failures which, at best, result in project delays as they are remedied, and at worst, result in project cancellation. The goal of any network service development project is to produce reliable software that satisfies its customers. 
The best way to ensure that all the people involved in a project have a common understanding is for the project to produce and use high-quality documents. Some key document categories include requirements, design, test plans, test cases, and project plans. Drafts of these documents are produced first and reviewed by the project team for completeness, clarity, and accuracy; the goal of the review process is to produce higher quality documents than would otherwise be realized. When a document is deemed complete, it is placed under a formal change control process, which simply means that any future changes to the document require a careful review and approval process that assesses all the impacts on the project that such changes would engender. For example, a significant new feature might be deferred to a future release due to the rework it might require on the current release and the concomitant schedule extension necessary to accommodate the rework.

The author of a piece of software makes certain assumptions in its development. Wherever assumptions are made, either in requirements or architecture documents or anywhere else, they should be clearly documented, and as design, implementation, test, and deployment occur, they should all be continually tested and confirmed. If assumptions made are no longer valid, a re-evaluation may be necessary.

15.3 Requirements

Requirements typically start at a relatively high level, describing the primary business functions of the software. The most important questions that the initial, high-level requirements document should answer are:

What is the problem you are trying to solve?
Who is the customer of the system?
Who are the users of the system?

An initial high-level requirements document would, when complete, be followed by a software architecture document (described in the next section). After the architecture is produced, lower-level requirements are produced for each of the major components of the architecture.
The key consumers of the high-level requirements document are the customer (or sponsor) of the system and the system architect. The customer or sponsor reviews the document to confirm that their needs are accurately captured therein (footnote 1). The high-level requirements document governs and guides the architecture. Two key consumers of lower-level requirements documents are the design and implementation team and the test team. The design and implementation team builds software that implements the requirements, and the test team develops test cases that confirm that the requirements are implemented properly. It is important that requirements be written in a form that allows for the implementation of both code and test cases.

Footnote 1: Gaining concurrence from the customer or sponsor may require more than the production of a high-level requirements document, such as the development of demonstration software.

The requirements should be written in a manner that avoids, as much as is feasible, specifying how the software should be built, and instead should focus on specifying what the software should do. How the software should be built is a design activity, not a requirements activity, and designers resent having design choices dictated as requirements because such requirements usually place unnecessary constraints on possible solutions. Also, mixing design and requirements will usually reduce the clarity of the requirements and may confuse testers. In addition, the requirements must be testable (e.g., a requirement that includes non-quantitative, vague language, such as “the system must gradually shift connections”, is not testable). Writing good requirements is as difficult as, if not more difficult than, writing good code.

Requirements fall into two broad categories: functional and non-functional. For example, an Internet Service Provider (ISP) email platform would include the functions of accepting email from subscribers for forwarding to recipients, accepting email for subscribers from other email systems, allowing subscribers to access their mailbox, and minimizing unwanted/undesired messages. The functional requirements will address what the software must do to provide those functions and specify all the features made available to the service’s end-users. Use cases [4] should be used to document each of the key features. Additionally, the email platform will have non-functional (also known as operational) requirements; for example, a requirement that the email platform must log data about every SMTP session and that each log entry must include various parameters that describe that session, such as the start time, end time, and IP addresses involved. The non-functional requirements address all the capabilities necessary for operating and supporting the service.

The high-level non-functional requirements for network services must spell out the expectations for reliability and availability clearly and specifically, because meeting these expectations will drive architecture and design decisions. For example, the requirements must define the hours of operation of the service. If it is 24 hours per day, 7 days per week, that needs to be stated. If it is something else, that needs to be stated. The requirements for network services typically include a time for upgrades and other maintenance activities that will impact the end-user availability (footnote 2). Such periods are called “maintenance windows”. The expected duration, maximum allowed duration, expected frequency, and maximum allowed frequency of maintenance windows are important requirements that will drive the architecture. These requirements may be as simple as “at most one maintenance window per week of no more than 2 hours duration beginning at 08:00 UTC.” All the reliability and availability requirements will be driven by the service’s sponsor as well as the organizations supporting and operating the service.
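To see how tightly such requirements constrain the design, it can help to turn them into a downtime budget. The following is a minimal sketch, assuming an availability target of 99.99% measured only against time outside scheduled maintenance, and the example window of 2 hours per week quoted above. The function name and its parameters are hypothetical, chosen for illustration.

```python
HOURS_PER_WEEK = 7 * 24


def unplanned_downtime_budget_minutes(availability: float,
                                      maintenance_hours_per_week: float) -> float:
    """Return the weekly unplanned-downtime budget, in minutes.

    Assumption: availability is measured only against time that falls
    outside scheduled maintenance windows.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    if not 0.0 <= maintenance_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("maintenance hours must fit within one week")
    # Time during which the availability target actually applies.
    measured_hours = HOURS_PER_WEEK - maintenance_hours_per_week
    return measured_hours * 60.0 * (1.0 - availability)


if __name__ == "__main__":
    budget = unplanned_downtime_budget_minutes(0.9999, 2.0)
    print(f"{budget:.3f} minutes of unplanned downtime per week")
```

Under these assumptions the budget is just under one minute of unplanned downtime per week, which suggests why such a requirement has to be settled before the architecture is chosen.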
For network services, the non-functional requirements are typically as numerous and complex as the functional requirements. These requirements address manageability, operability, availability, reliability, system capacity, throughput, latency, and other non-functional areas (footnote 3). The high-level requirements document defines these requirements for the overall system. The lower-level requirements documents define these requirements for each individual component. Other important non-functional requirements cover behavior under overload, upgrades, and compatibility. Additionally, four areas of non-functional requirements that must be addressed are:

1. Provisioning: describes features that the software must provide to allow objects (e.g., subscribers, accounts, mailboxes, etc.) that need to be known to the system to be added to, changed, or removed from the system.
2. Operations: describes features that the software must provide to the operations staff to allow them to operate the service. For example, what software commands must be provided to allow an operator to determine if the software is operating normally and, if it is not, what software commands must be provided to allow an operator to restore correct operation.
3. Administration: describes capabilities that the software must provide to allow the system to be configured (or administered) to support the range of configurations needed to support the business. These requirements also typically address the configuration and administration of the underlying operating system and the security requirements for the system.
4. Maintenance: describes the requirements on software needed to support periodic or on-demand maintenance tasks, such as periodic backups or on-demand connectivity tests.

Footnote 2: Overall system availability (e.g., 99.99% availability) excludes such maintenance activities; i.e., availability is measured against all time other than scheduled maintenance activities.

Footnote 3: The sponsor or customer should provide a load forecast to aid in the formulation of the performance requirements.

The non-functional requirements tend to distinguish network services (as well as “carrier-class” software) from other classes of software. Careful, thoughtful generation and review of requirements is expensive in terms of staff effort and calendar time. However, poorly specified, vague requirements rarely, if ever, result in the desired service behavior, and the alternative is even more expensive: rework and additional calendar time, or cancellation of the effort.

15.4 Architecture

Large-scale network services are best developed by first developing a well-thought-out architecture in response to the high-level requirements. A successful architecture must have conceptual integrity. Ideally, the architecture should be produced by a single individual, or, if that is not feasible, by a small team led by “the architect” [3]. The architect can delegate the architecture of each subsystem in the logical architecture (see below) to different team members. Also, in large organizations, where individuals can specialize in relatively narrow technologies, it is advantageous for the architect to consult with these subject experts (e.g., server and storage experts) in producing the physical architecture.

There are three components of architecture: the logical architecture, the physical architecture, and the performance and reliability model. The performance and reliability model is covered in greater detail in Chapter 16 of this book. Here, we simply note that this portion of the system architecture is driven by the logical and physical architecture and needs to include usage and traffic assumptions, demand forecasts, transaction flows (through the physical architecture), capacity and usage forecasts, component resource budgets, analysis of reliability, and establishment of component downtime budgets.
Additionally, this component of the architecture needs to provide “back-of-the-envelope” estimates as to how the architecture meets the throughput and latency requirements. The performance of the production system needs to be measured and reported to ascertain whether it is meeting the architectural and design expectations, and, often more importantly, whether it is meeting any performance criteria set forth in any service-level agreement. The architecture must address how these needs will be met.

As stated earlier, the expected system capacity is an important non-functional requirement of the system. This and related performance requirements (e.g., throughput and latency) drive the architecture to a specific design point that will support those requirements. It is rare indeed that a system architected for a given design point functions cost effectively (or, for that matter, in any way effectively) beyond an order of magnitude above or below the design point. For example, in architecting a system capable of supporting 1,000 transactions per second, it would be rare to find this same architecture effective at less than 100 or more than 10,000 transactions per second. Thus, the architecture should be completely re-thought if the design point needs to change by an order of magnitude or more.

Another crucial aspect of architecture is designing how the system will behave under overload. The system should be engineered to process requests up to a given offered load level within the latencies specified in the requirements. When the offered load exceeds that specified in the requirements, the system should behave as gracefully as possible. When the protocols used to provide service are TCP-based, a simple technique is to set a maximum on the number of connections supported.
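The connection-limit technique can be sketched as a small admission counter that the accept loop consults before committing any further resources to a new connection. This is an illustrative sketch only; the class name ConnectionLimiter and its methods are hypothetical and not taken from any particular server framework.

```python
import threading


class ConnectionLimiter:
    """Caps the number of concurrently served connections (hypothetical helper)."""

    def __init__(self, max_connections: int) -> None:
        if max_connections < 1:
            raise ValueError("max_connections must be at least 1")
        self._max = max_connections
        self._active = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        """Reserve a slot for a new connection; return False when at the limit."""
        with self._lock:
            if self._active >= self._max:
                return False
            self._active += 1
            return True

    def release(self) -> None:
        """Free the slot held by a connection that has finished."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    @property
    def active(self) -> int:
        with self._lock:
            return self._active
```

In an accept loop, a connection for which try_admit() returns False would be closed immediately, before any request parsing, so that rejected work costs as little as possible; release() would be called when an admitted connection ends.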
Additional offered load above the connection limit will consume some network and CPU resources, but existing connections should be serviced reasonably, possibly with longer latencies (due to resource contention). Services that utilize UDP (or other connectionless protocols) present more challenges for handling excessive load. Load-shedding techniques, like simply dropping some fraction of requests, may be warranted. It is clearly desirable to minimize the processing of any work that will ultimately be dropped; in other words, requests should be dropped early, before the system consumes or commits resources. Load-shedding mechanisms should not themselves increase the amount of processing that the system must perform. However, for protocols like SIP (utilizing UDP transport), a more intelligent approach to load-shedding may be applied, such as refusing to establish any new sessions while continuing to service requests for established sessions (i.e., dropping all requests to establish new sessions and dropping no requests related to existing sessions). Stress testing (see Section 15.6, below) should be used to measure the system behavior under overload.

After production of the architecture deliverables, the project should undergo an Architecture Assessment (see Section 16.3). Following Architecture Assessment, the production of lower-level requirements documents occurs, each corresponding to one of the major functional components identified in the architecture. The remainder of this section describes key items that the logical and physical architecture deliverables must address.

15.4.1 Logical System Architecture

The logical system architecture document provides a high-level logical solution to the initial high-level requirements.

The scope of the document needs to be clearly stated; in particular, it is essential to clearly enumerate what is not covered by this document, stating why those items are not being covered and where (in what other documents) those items will be addressed.

The constraints placed on the system need to be covered. These can be constraints placed on component selection (e.g., only software from a particular vendor may be used for database needs). Alternatively, they can be constraints related to integration with other systems (e.g., billing information will need to be in a particular format due to this system’s integration with an existing billing system) or the relationship that this service might have with other services (e.g., network access control constraints on this service due to its bundling with an access service). Finally, the most important set of constraints are those related to the service’s performance and reliability. Performance constraints are typically per-transaction latency requirements (e.g., 99% of the time, a transaction of type XYZZY must be completed within 100 ms) or scalability requirements (e.g., the system must be able to support 10,000 concurrent XYZZY transactions). The reliability constraints are typically specified in terms of the service’s availability (e.g., the system must be available 99.99% of the time).

The architect should distill and clearly document the principles followed in producing the architecture, so that all of the subsequent design efforts can adhere to these principles. Two examples of such principles are where the state is maintained (or where the state is not maintained), and how data is replicated (and the strategy employed for replicating a database master). A key role for the architect following the production of the architecture deliverables is monitoring adherence to the architecture, a key aspect of which is adherence to the architecture principles.
This “policing” role is key to maintaining the conceptual integrity of the architecture as it is implemented.

The design portion of the architecture document should begin with a high-level block diagram (e.g., see Fig. 15.1, which is a high-level block diagram of a service to provide wholesale web or phone access to email and calendar capabilities) and text to describe each of the blocks, their function, and the interfaces that each block presents to the other blocks. Each block in the high-level block diagram typically represents a subsystem of the system being architected. For each of the subsystems deemed core to the project (or that otherwise warrant this level of detail), the single block in the high-level block diagram should be exploded to show the next level of detail, similar to what was done for the high-level block diagram: text to describe each of the blocks, their function, and the interfaces that each block presents to the other blocks.

[Fig. 15.1 Example of high-level block diagram]

Following the subsystem discussion, the document must identify all the existing external systems that the architecture relies on and the details of the exact interfaces used. The various block diagrams should graphically indicate the interfaces to these external systems.

The discussion within the document to this point has been primarily block diagram and interface-related, and the next area to cover is the data architecture used. This would include identification of all databases used in the architecture, and for each database, identification of key data elements, as well as expected queries and updates, with usage profiles (e.g., each XYZZY transaction results in Q queries and U updates).
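A usage profile of this kind translates directly into a back-of-the-envelope estimate of database load. The sketch below is illustrative only: the function name is hypothetical, and the values used for Q and U in the usage line are made-up placeholders rather than figures from any real service.

```python
def database_load(transactions_per_second: float,
                  queries_per_transaction: float,
                  updates_per_transaction: float) -> dict:
    """Estimate steady-state database load from a per-transaction usage profile."""
    if min(transactions_per_second,
           queries_per_transaction,
           updates_per_transaction) < 0:
        raise ValueError("rates and counts must be non-negative")
    return {
        "queries_per_second": transactions_per_second * queries_per_transaction,
        "updates_per_second": transactions_per_second * updates_per_transaction,
    }


if __name__ == "__main__":
    # Hypothetical profile: 1,000 transactions/s, Q = 3 queries, U = 1 update.
    print(database_load(1000, 3, 1))
```

Estimates like this feed the performance and reliability model, where they can be compared against what the chosen database and storage can sustain.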
Following data architecture, the document should address security by identifying the security mechanisms to be employed in the architecture (e.g., use of Access Control Lists, secure transport, or password encryption algorithms) as well as an analysis to ensure that the identified mechanisms are sufficient for this service.

The document should close with a discussion of issues and risks. The architect, having produced this detail about the service, will undoubtedly be aware of a number of issues that have not been addressed in the document, but need to be tracked and resolved to assure the success of the project. Finally, the architect will also be aware of the quantifiable and non-quantifiable risks remaining in the project. These should be documented and the project should be managed with an eye towards mitigating those risks.

15.4.2 Physical System Architecture

This subsection describes the material that needs to be included in the physical system architecture: server, network (footnote 4), and storage engineering as well as system and server management. The subsequent subsections address each of these areas.

The functional blocks of the logical architecture need to be realized on physical servers running in a specified operational environment, and the physical system architecture provides that realization. The physical and logical architectures are often developed together, typically with the physical somewhat lagging behind the logical. While constraints may limit certain choices in developing the physical architecture, the physical architecture should not limit the logical architecture.

Hardware alternatives considered and discarded should be documented, clearly indicating the analysis that led to the discard. This analysis will undoubtedly be of value when the chosen hardware is discontinued by the manufacturer and new hardware must be chosen.
The physical architecture needs to result in a series of engineering drawings that provide sufficient detail to allow all the hardware to be ordered, deployed, and interconnected. Finally, an important output of the physical architecture should be the expected capital and operations costs, normalized per end-user or subscriber.

Footnote 4: Network engineering is covered elsewhere in this book and is not covered here, except for a few recommendations that aid overall service availability.

15.4.2.1 Server Engineering

The primary consideration in server engineering is to design in redundancy to provide reliability.

The assignment of logical functions to individual servers is a key part of server engineering. The key consideration in such an assignment is the data needed by each logical function. When two logical functions always (or almost always) act on the same data, or when one logical function is the producer of a huge volume of data consumed by a second logical function, those functions are excellent candidates to reside on the same server.

Another key consideration is the impact of a server failure on each logical function and what strategies might be employed to minimize the impact of such failures. Logical functions that are effectively or inherently stateless (e.g., a proxy for the POP3 protocol) can fail with minimal impact, while those that are inherently stateful (e.g., an LDAP directory) need a server design that minimizes the impact of a hardware failure. Stateless servers are also easily scalable (often referred to as horizontally scalable), while stateful servers are not easily scalable.

Stateless servers are typically accessed (footnote 5) via a virtual IP (VIP) address that represents a pool of real, physical servers (for details on the network engineering needed to support this, see [5]). The logical functions to be performed on such servers will be accessed via an IP protocol (e.g., HTTP).
At any point in time, some number of those physical servers will be operational, performing the logical functions required of them. Should one fail, all functions “in flight” on it will fail. Typically, clients of this VIP will need to determine whether the function failed (e.g., via an exception like a TCP reset or a timeout) and take appropriate recovery action. A simple, useful technique that is worth attempting prior to more drastic measures is to retry the original request. With appropriate network engineering, the new request will simply be routed to another server that is operational and will succeed.

Stateful servers typically require a sparing strategy. The simplest such strategy is to deploy two physical servers for every stateful server needed. To the extent that the software can support it, this could provide very fast recovery from a server failure. This will require that the internal state of the software be replicated to the spare in a manner that would allow the spare to assume the function of the primary, should the primary fail at any point. Such an arrangement is generally referred to as active-active. The performance cost of state replication can become prohibitive and various optimizations are usually taken to replicate at particular junctures. The replication points are chosen so as not to leave the spare in an unusable state if the primary fails between the points. While the spare will be able to quickly take on the role of the primary, some client functions that are in flight at the time of the primary failure will fail and need to be retried.

A simpler strategy is known as active-standby. In an active-standby approach, the standby server will assume the function of the just-failed active server. All in-memory state will be lost and all in-flight client functions will fail. This approach is useful when the key state of the functions is stored on external storage that can be shared with a spare.
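Under either sparing strategy, and for stateless pools behind a VIP, clients see in-flight requests fail and need to retry them. A minimal client-side sketch follows; the function name, the default of two attempts, and the choice of exception types are assumptions made for illustration, and such a retry is only appropriate for requests that are safe to repeat.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(request: Callable[[], T],
                    attempts: int = 2,
                    delay_s: float = 0.0,
                    retry_on: Tuple[Type[BaseException], ...] = (
                        ConnectionError, TimeoutError),
                    ) -> T:
    """Run request(), retrying when it fails with a connection-level error.

    ConnectionError covers TCP resets; TimeoutError covers timeouts
    (socket timeouts map to TimeoutError on Python 3.10 and later).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: let the caller take stronger action
            time.sleep(delay_s)
    raise AssertionError("unreachable")  # loop always returns or raises
```

A caller would wrap the network operation in a zero-argument function, for example call_with_retry(lambda: fetch_mailbox(user)), where fetch_mailbox stands for whatever hypothetical client call issues the request to the VIP.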
Depending on the external storage chosen, a spare might automatically assume the active role; on the other hand, it may require human intervention to configure equipment to give the spare the necessary identity and access to assume the active role. Another key consideration in an active-standby arrangement is the cost of providing pairs of servers. Again, depending on the external storage chosen, another alternative is to allocate one spare for every N active servers, or more generally, allocate K spare servers for every M active servers.

It is wise to be pessimistic on performance when engineering servers, and thus, have a designed-in safety margin on latency and capacity.

Another important networking consideration is the assignment of at least two public IP (or VIP) addresses to each externally visible TCP/IP service (e.g., for an email service, inbound SMTP to subscribers’ mailboxes would be a service, and outbound SMTP from subscribers would be another service) to avoid disruption in the event of routing configuration errors. Also, the authoritative DNS for the domain needs to return two separate address records, each containing one of the two addresses. These two IP addresses should be allocated from two blocks of addresses that are widely separated (the two blocks are not “near” each other). In other words, it should not be possible to combine the two blocks into a single routing prefix that might be accidentally hijacked, which would prevent customers from accessing the service. Assuming that all clients will make the proper use of the two DNS address records (and not simply rely on the first record), this technique should prevent accidental hijacking of an address block from impacting this service [6, 7].

Footnote 5: Such access may be from an end-user of the service (e.g., via a browser or email client) or from another system, either internal to the service or from a customer or third-party server.
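One rough way to check that two service addresses are not “near” each other is to measure how many leading bits they share: the fewer shared bits, the larger the single prefix that would be needed to cover both. The sketch below handles IPv4 only and uses Python's standard ipaddress module; the function names and the 8-bit threshold are assumptions for illustration, not recommendations from the text, and the sample addresses come from ranges reserved for documentation.

```python
import ipaddress


def common_prefix_length(addr_a: str, addr_b: str) -> int:
    """Number of leading bits shared by two IPv4 addresses (0 to 32)."""
    a = int(ipaddress.IPv4Address(addr_a))
    b = int(ipaddress.IPv4Address(addr_b))
    # XOR leaves a 1 at every bit position where the addresses differ;
    # the highest such position marks the end of the common prefix.
    return 32 - (a ^ b).bit_length()


def widely_separated(addr_a: str, addr_b: str, max_shared_bits: int = 8) -> bool:
    """True when the smallest prefix covering both addresses is shorter than
    max_shared_bits (the threshold is an illustrative assumption)."""
    return common_prefix_length(addr_a, addr_b) < max_shared_bits


if __name__ == "__main__":
    print(common_prefix_length("192.0.2.10", "198.51.100.10"))
    print(widely_separated("192.0.2.10", "192.0.2.200"))
```

Whether a given covering prefix could plausibly be announced in practice depends on routing policy, so a check like this is a sanity test on address allocation rather than a guarantee.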
15.4.2.2 Storage Engineering

This section presents storage reliability considerations and tradeoffs, and the need to practice recovery operations to minimize downtime when a failure occurs.

Data stored on disk presents a number of engineering issues to be worked through as a part of the system architecture. The required performance and capacity of the disk subsystem for particular sets of data as well as the required availability of that data must be identified. Data that has similar performance and availability requirements, whose combined capacity requirements can be satisfied by a given single-storage solution, are candidates for sharing that solution. Some services, like a Content Distribution Network cache, may have very low availability requirements for data that is cached (i.e., the stored data is simply a copy of the authoritative data source that is stored elsewhere). The physical system architecture must specify the storage to be employed in the system. Reliability of disk drives and redundancy in storage engineering are beyond the scope of this chapter; interested readers are referred to [8, 9].

In spite of carefully designed disk subsystems, there could be a catastrophic disk subsystem failure, or with similar effect, a software failure that causes data corruption of all redundant copies of data in the disk subsystem. Such a failure always results in unplanned system downtime; the duration of such an outage needs to be minimized. Recovery from such a failure requires having a recent backup of the data (i.e., a copy of the data on separate secondary storage, typically tape media). In order to achieve recency of the backup data, backups need to be performed regularly at frequent intervals. While backup is a well-known, commonly instituted practice, what is equally important and rarely done is periodic data restoral to spare disk drives.
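A periodic restoral is only useful if its result is checked. One possible check, sketched below, is to record a checksum manifest of the data when the backup is taken and compare the restored files against it. This is an assumption-laden illustration: the function names are hypothetical, the data is assumed to be a plain directory tree, and real backup systems typically provide their own verification facilities.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """SHA-256 digest of one file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_restore(manifest: Dict[str, str], restored_root: Path) -> List[str]:
    """Return a list of problems found; an empty list means the restore matches."""
    restored = build_manifest(restored_root)
    problems = []
    for relative_path, digest in manifest.items():
        if relative_path not in restored:
            problems.append(f"missing: {relative_path}")
        elif restored[relative_path] != digest:
            problems.append(f"corrupt: {relative_path}")
    return problems
```

The manifest would be built at backup time and stored separately from the backup media, and verify_restore would be run after each practice restoral to the spare drives.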
Backup media that is unreadable at the time a restoral is required will become a major system catastrophe that may not have any viable recovery. One possible result is to restore service without the lost data. For an ISP email platform, this might mean restoring service to end-users with new, empty mailboxes. Another possible result is that the lost data can be extracted from other systems, each containing a portion of the needed data, but this will typically be a long-duration process and service cannot be restored until it completes. Backups and regular periodic restoral to spare disk drives need to be specified in the requirements and supported by operations training and documentation, in addition to being covered by the architecture.

Another important consideration in providing reliable storage is the use of multiple data centers for mitigating the effect of a disaster in a single data center (disaster recovery is covered more fully in Chapter 14 of this book). In this section, we are concerned with spreading equipment among multiple data centers so that a failure in a single data center will only cause a partial service outage. An example is an email service supporting millions of end-user mailboxes, where a fraction of the mailboxes are hosted in each data center so that a site failure only impacts a fraction of the end-users. This approach does not achieve true disaster recovery (which can be very expensive), but it will result in higher availability of the service than placing all equipment in a single data center.

If multiple data centers are used to host the equipment required by the service being architected and there is a key database or directory that is a critical resource required for the service to operate, it is advisable to maintain a replica in at least one of the other data centers. A simplified case is a SQL database master with one replica.
In this case, the replication would occur over the WAN (possibly via a tunnel). Such an arrangement should allow the service to continue running if the master data center site experienced a major failure or disaster. In some cases, manual procedures may be necessary to promote the replica to the status of master, and insert, update, and delete transactions will fail until that promotion occurs. The key to achieving success with this “master-slave” approach is to regularly exercise “fail-over”, meaning that the operations organization will, on a regular basis (say once a week at a low traffic point), deliberately stop the master and promote the replica to master. Typically, when the prior master is restored to service, it will run as a replica to the new master until the next regular “fail-over” or true failure. This approach of regular “fail-over” should also be employed for any stateful server schemes. When the operations organization is familiar with the “fail-over” process, a true failure is dealt with as a relatively minor inconvenience (at least as far as familiarity with the steps required to restore service is concerned). Whenever these processes become exceptions that only get executed rarely, they rarely get carried out correctly or well. As with backups and restorals, this fail-over behavior needs to be specified in the requirements and be supported by operations training and documentation, in addition to being covered by the architecture.

15.4.2.3 System and Service Management

This section describes support systems, instrumentation and logging, secure access, and considerations for software installation and upgrade.

The System and Service Management section of the architecture document needs to describe the monitoring and operations principles and mechanisms to be employed to manage the service, which needs to be addressed both from an end-to-end service perspective and for each component.
A key goal of this portion of the architecture is to establish a foolproof and reliable mechanism by which the operations staff can monitor the health of the service. For an enterprise that has an established operations organization, the architect will probably need to find a solution that will fit into existing monitoring mechanisms and tools used by that organization. For example, there may already be an SNMP trap monitoring infrastructure.

Software systems that monitor the health of the network service should never fail in a manner that allows a network service failure to go unnoticed. This places a more stringent reliability requirement on the monitoring systems than on the network service. Some techniques to achieve this are avoiding the sharing of code between the network service and the health monitoring software, using a separate staff to build the health monitoring system, and using off-the-shelf, proven reliable software either from a vendor or from open source (e.g., OpenNMS and Nagios). There is a cost associated with this approach, and for a business developing a single network service, this cost may prove prohibitive.

For an organization that can afford it and needs to support multiple network services, having support systems that are independent of and separate from the network service also presents a cost-savings opportunity: the support systems can be shared across multiple services. Examples of such systems are SNMP management systems for monitoring SNMP traps and trouble ticketing systems for managing and tracking problems (and their resolution) in the network (services as well as other components like routers, switches, etc.). These support systems need to be very reliable as their failure can mask a network service failure; they also need to be managed well (e.g., backed up regularly). The monitoring system must proactively poll the service to determine that it is working correctly and not simply rely on the service reporting faults.
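Proactive polling can be as simple as a probe that connects to the service and checks its greeting. The sketch below is a minimal illustration using only Python's standard socket module; the function name is hypothetical, and a production probe would normally exercise a complete transaction (for example, a test login) rather than only a connection and banner check.

```python
import socket


def probe_tcp_service(host: str, port: int,
                      expected_banner: bytes = b"",
                      timeout_s: float = 3.0) -> bool:
    """Return True if the service accepts a connection and, when
    expected_banner is given, greets with exactly those leading bytes."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            if not expected_banner:
                return True
            received = b""
            # recv() may return fewer bytes than requested, so loop.
            while len(received) < len(expected_banner):
                chunk = conn.recv(len(expected_banner) - len(received))
                if not chunk:
                    break  # peer closed before sending the full banner
                received += chunk
            return received == expected_banner
    except OSError:
        # Covers refused connections, resets, and timeouts alike.
        return False
```

For an SMTP listener, for example, the probe could be called with expected_banner=b"220", since SMTP servers open with a 220 greeting; a False result would be turned into a trap or alarm by the monitoring system, which runs independently of the service itself.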
To provide data for monitoring, one must first address the approach taken to logging key events. Such events may simply be informational (e.g., a typical web server access log), but must also cover faults discovered by the software. Faults are assigned a severity, typically minor, major, or critical. A minor fault is one that needs attention today or within 24 hours, and if ignored, will result in end-users being able to detect service degradation. Examples of minor faults are resource measures, such as disk file system usage or average CPU consumption, exceeding some threshold, say 80%. Other measures, such as the rate of retries, exceeding a threshold may also be treated as minor faults. A major fault is one that requires attention within a prescribed amount of time, typically 2–4 hours. Threshold crossings can also be major faults, with the threshold set at a higher utilization level than for a minor fault, such as 90%. A critical fault is an indication that a component or a major subsystem of the service has failed and requires immediate attention. Examples of critical faults are loss of connectivity to key resources and missing key configuration data needed to provide the service (for which there are no reasonable defaults). An SMTP email Message Transfer Agent (MTA) that has lost all connections to the system’s directory, cannot re-connect after repeated retries, and thus cannot ascertain whether an addressed user, named in an SMTP RCPT (recipient) command, is a subscriber, is an example of a “loss of connectivity” critical fault.

All these events must get logged to disk storage in a reliable manner, as this data is indispensable in troubleshooting in-service problems. One choice for achieving reliable trap processing is to utilize multiple SNMP receivers, requiring all the originators to send to multiple receivers, and to filter out redundant traps downstream, within the management infrastructure.
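The threshold-based severities described above can be expressed as a small classification function. This is a sketch under stated assumptions: it uses the example thresholds of 80% and 90%, takes the upper end of the 2–4 hour range as the response window for a major fault, and uses hypothetical names throughout. Critical faults are deliberately left out, because in the text they signal a failed component rather than a utilization level.

```python
from enum import Enum


class Severity(Enum):
    NONE = "none"
    MINOR = "minor"
    MAJOR = "major"


# Attention windows in hours, following the text's examples
# (the major window uses the upper end of "2-4 hours").
RESPONSE_WINDOW_HOURS = {Severity.MINOR: 24, Severity.MAJOR: 4}


def classify_utilization(percent_used: float,
                         minor_threshold: float = 80.0,
                         major_threshold: float = 90.0) -> Severity:
    """Map a resource utilization percentage to a fault severity."""
    if not 0.0 <= percent_used <= 100.0:
        raise ValueError("percent_used must be between 0 and 100")
    if minor_threshold >= major_threshold:
        raise ValueError("minor_threshold must be below major_threshold")
    if percent_used > major_threshold:
        return Severity.MAJOR
    if percent_used > minor_threshold:
        return Severity.MINOR
    return Severity.NONE
```

A monitoring agent might call classify_utilization on each sampled disk or CPU figure and emit a log entry and trap whenever the result is not Severity.NONE.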
While the architect must provide guidelines to the developers on when to generate a trap, it is often the case that further processing on the received traps is necessary prior to raising an alarm to the operations staff. The system receiving the traps should be configurable to some level so that not every trap results in an operations alarm. It is often the case that multiple traps can be combined, especially when they occur at roughly the same time, into a single alarm event. Correlation of traps across components of the service as well as across services is also desirable. A typical case in which operator intervention is undesirable is a TCP connection attempt to another system that initially fails (this should generate a trap and log entry, but not cause an immediate alarm), but is retried and succeeds on the second attempt. If the second attempt failed, it too would generate a trap and log entry, and the combination of the two traps within a specified time period would be configured to raise an alarm. Another behavior that should be detected by monitoring is continual, repeated failure and restart, as this is often not easily detected by the network service itself.
While seemingly mundane, standardization of log format within the team simplifies specification of alarm correlation rules (see footnote 6) and aids training of all staff: design and implementation, test, support, and operations. The architecture should define the standard log format. A single line of text per entry is recommended. It should always include a date and time in a prescribed format for both elements, a specified resolution for the time (e.g., seconds or milliseconds), and a well-known time zone (Coordinated Universal Time or UTC is an excellent choice). Various other fields that might be a part of the standard are process-id, user-id, client-IP-address, end-user-id, etc.
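The two ideas above, a single-line UTC log entry and a rule that raises an alarm only when failures cluster in time, can be sketched as follows. The field layout (`pid=`, `sev=`) and the function names are assumptions made for illustration; the chapter does not prescribe a concrete format.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable


def format_entry(when: datetime, process_id: int, severity: str, message: str) -> str:
    """Render one single-line entry with a UTC, millisecond-resolution timestamp.

    `when` must be timezone-aware; naive datetimes would be read as local time.
    """
    utc = when.astimezone(timezone.utc)
    stamp = utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"
    # Embedded newlines would break the one-entry-per-line rule.
    text = message.replace("\n", " ")
    return f"{stamp} pid={process_id} sev={severity} {text}"


def should_alarm(failure_times: Iterable[datetime], window: timedelta, count: int = 2) -> bool:
    """Return True if `count` failures fall within any span of length `window`."""
    times = sorted(failure_times)
    for i in range(len(times) - count + 1):
        if times[i + count - 1] - times[i] <= window:
            return True
    return False
```

With `count=2`, a single failed connection attempt that succeeds on retry produces a log entry but no alarm, while two failures inside the window do raise one, which matches the TCP retry example in the text.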
Typically, some fields will be particular to the functionality of the service; e.g., an SMTP MTA will have some unique fields to log compared to a POP process, such as the domain provided on an SMTP command. The architecture sets the principles and standards to be followed. The format of each log needs to be documented as a part of the design process.
An important set of architectural principles to be established in this area is how much or how little should be logged as “informational” entries, and whether the amount of data logged should be fixed or determined by a configurable parameter (e.g., the “log level”). Such a configurable parameter necessitates further principles associating different classes of information with particular values of the parameter, so that there is consistency across the system in how such logging is controlled and occurs. Such “informational” entries are not actionable events, but, at one extreme, provide data about every transaction (e.g., each SMTP session and each message within that session processed by an MTA), and, at the other extreme, provide no data for the “expected” activities (like a successful SMTP session). For maximum flexibility, each subsystem, component, or program should be independently configurable. Also, while it simplifies the operations training and documentation to have common values for such configurable parameters across the subsystems, components, or programs, there are often good arguments to be made for a particular piece of the service to utilize a unique set of values (e.g., bit masks) to control different informational entries.
Footnote 6: To simplify the specification of alarm correlation rules, the system and/or tools used to perform alarm correlation will drive commonality requirements on logging (e.g., common date and time formats, allowing rules to determine multiple failures within a given time interval).
When troubleshooting problems, it is often helpful to have as much information as possible about the system behavior. Also, there may be value to service planners, particularly capacity planners, in capturing a large number of informational entries. On the other hand, such an approach will require careful attention to the performance impact of such logging as well as attention to rolling logs and preserving log data, while avoiding any impact to the service. An occasionally successful alternative to logging for troubleshooting is to utilize a packet sniffer (e.g., tcpdump, Wireshark, or snoop) when the logging would contain a subset of the sniffed data.
Another important consideration is the ability to provide secure access to developers to gather data or otherwise observe the service to help troubleshoot a problem. One proven technique is to provide an entirely separate LAN infrastructure for all operations, administration, and maintenance (OA&M) activities (including developer access), and to provide secure tunnel or VPN access for the developers to reach the OA&M LAN. This general approach also keeps logging and trap traffic separate from service traffic, which is generally a good idea to aid overload troubleshooting (e.g., diagnosing a DDoS attack).
Finally, the architectural principles for software installation should be established. At one extreme, software installation may need to occur on a server that is completely out-of-service and from which any prior version has been first removed. At the other extreme, software installation occurs while the prior version is running and providing service, and only a minimal duration outage is required to stop the old version, likely perform some administrative tasks (e.g., adjust some symbolic links) and start the new version. The former is simpler to implement, and the latter results in minimal down time.
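The "adjust some symbolic links" step of an in-place installation can be sketched as below, assuming a POSIX file system and a layout in which each version is installed in its own directory and a `current` link names the active one. That layout and the name `activate_version` are illustrative assumptions, not something the chapter specifies.

```python
import os


def activate_version(current_link: str, new_version_dir: str) -> None:
    """Point `current_link` at `new_version_dir` using an atomic rename."""
    tmp_link = current_link + ".new"
    # Clear any leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_version_dir, tmp_link)
    # On POSIX, rename replaces the old link in a single step, so there is
    # never a moment at which `current_link` does not exist.
    os.replace(tmp_link, current_link)
```

Stopping the old version and starting the new one are deliberately left out; the sketch covers only the link switch, which is also what makes rolling back to the prior directory a one-call operation.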
In Section 15.7.2, operational concerns related to software installation, upgrades, and deployment are discussed.
15.5 Design and Implementation
15.5.1 Design
There are many texts dedicated exclusively to software design. This section will focus on the aspects of design that are essential to producing reliable software. The design of a software component has two primary classifications: external design and internal design. The external design describes how others can use this component and the internal design describes how the component is constructed.
From an external perspective, each significant functional component of the architecture should have a corresponding design document, which, at a minimum, describes all of the interfaces presented by the component to other components in the system. Thus, the application programming interfaces (or APIs) exposed by each architectural subsystem to the rest of the system must be documented (using, for example, Javadoc or Doxygen). Expected sequences of API calls, along with sample code that implements a demonstration of that portion of the design, should be documented. Data, in the form of database tables, flat files, or anything else, represents a key external interface. Data that this component produces or consumes, for or from other components, should be documented as a part of the external design. The data local to the component would be described in the internal design.
Another important interface that must be documented is the operations interface. This would describe how the component is installed, updated, removed, configured, started, checked (for current operational status), re-started (if applicable), and stopped. While the details of internal-to-the-component data remain in the internal design, the operations staff needs to know all the files, databases and tables, directories, and any other data that the component needs, uses, consumes, and produces.
All this can aid troubleshooting (e.g., when an operator is mistaken about the location of the current directory and inadvertently removes all the files therein). Descriptions of each and every event that the component might log, including sufficient detail to enable troubleshooting, represent an important portion of the operations interface. Additionally, various key performance indicators (KPIs, typically counters) should be included in the design, along with simple mechanisms for operations staff to observe those indicators. Expected value ranges of each KPI should be documented for the operations staff. It is very often the case that such indicators are predictors of problematic trends.
Each user interface should be documented; this is typically done by providing prototype user interface code or “wireframe” figures and text describing navigation. User interface design is beyond the scope of this book (see [10,11]).
The internal design typically focuses on key algorithms chosen, trade-offs made, and performance considerations.
15.5.2 Organization
An effective way to organize the design and implementation staff, given that the system has been decomposed into components, is to assign components to individual staff members. An individual component should never be assigned to multiple staff members. If a single large component is identified, which is undeniably too large to be the responsibility of a single staff member, it should be decomposed, if at all possible, into multiple smaller components. General purpose or utility components should be avoided; items that would be added to such a utility library should instead constitute new, albeit small, specific functional components. This design approach requires identifying a bottom-up component structure for the integration of components. At the bottom layer, each component should, in general, be functionally independent of all other components.
Such components might be generally useful in future projects and should be built in a general fashion to foster reuse. At some point, a component must be built that depends on bottom layer components and so on, up the dependency hierarchy. The bottom-up dependencies drive the project schedule: the bottom layer should be built first and so on, up the hierarchy. Each component will be made available for integration with the other components when it has been fully unit-tested by its developer. It is typically the case that problems that arise are attributable to the most recent components added to the integration area.
This approach results in increased accountability, as there is never a question as to who is responsible for a particular function. It also fosters increased ownership of the software by the staff members, as each staff member knows that they alone have complete responsibility for their components. More than one person needs to be familiar with the software, or else there is a potential software single point of failure (e.g., the one person is unavailable when a bug is discovered). A “buddy system” that clearly identifies a specific backup person for each component works well. This results in an individual being the primary on a number of components and the backup on a different set of components.
Managing a project of many small components, though somewhat tedious, allows for quite accurate estimation of effort. This results in increased predictability as well as visibility of the overall development process, affording earlier identification of problems than might otherwise be the case. Finally, given the availability of good API documentation for each component, this approach decreases developer inter-dependence, as it reduces the need for extensive inter-developer communication and results in fewer and smaller integration delays (but with typically more integration points).
This approach fosters a “test early, test often” environment (see footnote 7), as each integration culminates in integration testing to certify that the integration was successful. It avoids “big bang” integration efforts that, when they do not go smoothly, result in long delays in sorting out which (often many) components have problems.
Footnote 7: “Test early, test often” should be followed in any case; it fosters easier and faster bug detection than waiting.
15.5.3 Configurability
The configurability of network services deserves careful design attention. Any parameter that might change over time should be designed to be a configuration parameter and not a constant in the code. Changing a constant in the code requires a new code delivery and certification, whereas changing a configuration parameter, with appropriate operations documentation, can be done by the operations staff acting alone. It is sometimes not clear why a particular parameter might change in the future; when in doubt, it is best to make it a configuration parameter. Default values, whenever they make sense, should be provided for each parameter. Examples of parameters are directory names for various data (e.g., the directory in which log data should be written), and fully qualified domain names or IP addresses and corresponding ports (and, if relevant, protocol or protocol version) for other systems that this component must communicate with (e.g., SNMP receivers). The design needs to recognize the environmental differences present in different test environments (e.g., unit test and system test) as well as in the production environment, and support these different environments via simple configuration changes. Clearly, changing code to support these different environments is not desirable.
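A minimal sketch of configuration parameters with defaults follows. The parameter names, the default values, and the choice of JSON as the file format are all assumptions made for illustration; the chapter does not prescribe any of them.

```python
import json
from typing import Any, Dict

# Hypothetical defaults; every value can be overridden per environment.
DEFAULTS: Dict[str, Any] = {
    "log_dir": "/var/log/myservice",      # directory for log data
    "snmp_receivers": ["127.0.0.1:162"],  # host:port of each receiver
    "retry_limit": 3,
}


def load_config(text: str) -> Dict[str, Any]:
    """Merge JSON overrides from `text` onto DEFAULTS.

    Unknown parameter names are rejected so that a misspelled key is
    reported instead of being silently ignored.
    """
    overrides = json.loads(text) if text.strip() else {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown configuration parameters: {sorted(unknown)}")
    config = dict(DEFAULTS)
    config.update(overrides)
    return config
```

The same code can then serve unit test, system test, and production simply by supplying a different override file in each environment.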
In addition to allowing for the exact same source code to support various test environments as well as (potentially multiple) production environments, this design for configurability allows for recovery from a class of failures (e.g., failure of an SNMP receiver) via a simple change to the configuration data. Another important consideration is whether a configuration change requires a network service stop and start, or a trigger from an operator instructing the service to re-read its configuration data, or if the service automatically detects and implements configuration changes (e.g., having a thread detect that a configuration file changed, and ingest and process that file). While the simplest implementation is a service stop and start, automatic detection provides the least service impact.
15.5.4 Maintainability and Modularity
An important consideration in producing reliable network service software is the maintainability of the source code. It is almost always the case that modularity and maintainability are positively correlated. Fewer lines of code are preferable to more lines of code, since it is less expensive to completely test and certify fewer lines of code, and less code simplifies enhancement and bug repair because there are fewer places to look for the place in the code to change. While an approach that emphasizes a high level of modularity should drive the internal design of the code, it is often the case that implementers replicate code in multiple methods or functions, failing to notice a lack of modularity. An approach to design that does not emphasize modularity often results in “yank and put” or “copy and paste” replication of code fragments. Such replication throughout a subsystem will clearly suffer from reliability issues when a bug is discovered in such a fragment, and it is only repaired in the section that first exhibits the bug.
Also, as code replication is typically only apparent upon reading the code, it is detected, if at all, during code review. Designing a single common method usable by all callers may be challenging due to a need for similar, but not identical behavior in all cases. In such a case, the designer must include additional parameters whose function is to modify the behavior of the common method to achieve the necessary variations in behavior. In the extreme, there may be good cases to be made for multiple methods. These are important internal design choices. It can be very helpful to future code maintainers to capture these choices in an appropriate document (which can be comments in the code).
To establish that code is being written for human rather than just machine consumption, a process of code reviews (also known as walk-throughs, inspections, or peer reviews) must be instituted. Code inspections are valuable in finding bugs, but the focus deemed most important here is the production of understandable, readable source code. Such inspections can improve code via suggestions to improve modularity, and more generally, to refactor [12] the code. At least one staff member, other than the author, must be adequately familiar with the code to be able to fix a problem should one arise when the author is unavailable. When a “buddy system,” like the one described earlier, is instituted, a buddy code review, where the primary code owner leads the backup person through the component’s code, is a lightweight but effective way to introduce a component to a backup person. Learning the code after a problem arises is often painful and expensive. Again, this represents yet another engineering trade-off: incur the cost of code reviews and establish them as a regular activity to minimize the time to repair a bug versus waiting for a fault to occur and doing “whatever it takes” at the time of the bug to repair it.
When a “whatever it takes” approach is used without the author, it is often the case that an imperfect repair results. In such a repair, the bug is repaired, but other functionality, not exercised in the bug scenario, may no longer work correctly. While it is difficult to compare costs, the availability of a large-scale network service is always higher when problems are fixed correctly without introducing new problems.
Sometimes the demanding performance requirements of network services tempt the designers and implementers to sacrifice modularity and maintainability to obtain high performance. This is rarely necessary: careful design can usually achieve the required performance with modular and maintainable code. The emphasis on performance can also lead to unnecessary, premature optimizations. If the code meets its performance requirements, then further improvements are not needed, and to the extent that such improvements impact the code’s maintainability, they are undesirable. Also, when a performance problem is discovered as a result of measurements of the code via testing, such a problem is often best solved by algorithmic or architectural changes, rather than ad-hoc code changes. When a sophisticated algorithm is the solution, the resulting code may be difficult for the casual reader to follow. Comments in the code referring the reader to a detailed description of the algorithm employed, or if no such description is available, providing that description in the code’s comments, will improve the maintainability of the code.
15.5.5 Implementation
Reliable network service software is software that is commented, tested, and written in a language with ongoing support, using libraries and other resources that are themselves of production quality and well-supported, under source control, with a bug-tracking system.
The amount of effort required to produce reliable network service software for production use is typically much higher than that required to produce prototype or personal-use software. A prototype (or proof-of-concept) of a network service is often used to demonstrate a few key functions to show the value of the service to potential customers or funders. What distinguishes the source code of production software from that of non-production software is typically how it deals with unexpected errors. Non-production software may ignore errors, as they can “never happen” (e.g., an existing TCP connection to a well-known server goes stale owing to a fail-over behind the well-known server’s load balancer). In production, everything that can “never happen” always does. Retry and other recovery strategies need to be carefully designed and implemented. Building in various “fall-backs” to handle some amount of external system unavailability (e.g., queuing requests to unreachable TCP/IP services, using previous query results for an established period of time) is necessary. In terms of easing the effort required to produce a reliable, production quality service, having a working prototype is rarely more than an aid in clarifying requirements; converting prototype code into reliable, production quality code is often more expensive than just re-coding the service with clear production requirements.
15.5.5.1 Commenting
Comments in code should be supportive and accurate; out of date, inaccurate comments are more harmful than no comments at all. Comments such as the infamous “RIP LVB” next to a constant of 1827 [13] do not help an individual unfamiliar with the code to debug a problem. Code needs to be written to be read and understood by other humans; compiling or interpreting with no errors is necessary, but not sufficient.
Many development teams establish coding guidelines (or standards) to be followed to aid the production of code meant to be read and understood by other team members.
15.5.5.2 Unit Testing
Developers need to unit-test their code. A set of unit tests needs to be developed in addition to the code. That test code represents a regression test suite that can be re-run whenever a change is made to the code. Besides straightforward tests of the external APIs, one of the most valuable approaches to unit test code is coverage testing [14]. In coverage testing, one determines which lines of code have been executed by a set of tests, and more importantly, which lines of code are yet to be exercised. There are many tools available to assist in measuring code coverage. Since these tools almost always instrument the code to measure coverage, it is always a good idea to re-run the test cases used to determine coverage with a normal (non-coverage test) build, to make certain that the coverage tool does not hide any latent bugs. Tools are available for managing, developing, and maintaining unit test suites (e.g., JUnit).
15.5.5.3 Development Tools
Reliable software cannot be produced by unreliable tools. New languages with possibly buggy compilers and/or support libraries should be avoided. Debugging one’s own code is challenging enough; determining that the root cause of one’s bug is a compiler bug is adding insult to injury.
Good debugging tools aid the production of reliable software. Tools that detect memory leaks are invaluable in producing software that needs to run continuously (see footnote 8). Enabling and heeding compiler warning messages, or using other code analysis tools, can help to eliminate bugs prior to execution. Strong type-checking helps avoid many errors.
15.5.5.4 Change Management
Change management is a key aspect of producing reliable network service software.
It manifests itself to developers in the form of a source code control system (current popular tools are CVS and Subversion; older tools are the venerable SCCS and RCS). All the source code and project documents must be managed by a source code control system (also known as revision control system or version control system). These tools allow the source code used to create a particular build of a component at a particular point in time to be exactly reconstituted. This is necessary because when a bug is reported on a particular version of the system running in a particular installation, the desired solution is to just fix the cause of the bug and not change any other aspect of the software whatsoever. This allows for high-quality fixes to be deployed. An important aspect of repairing production problems is recreating the production problem in a test environment. This can be difficult if the problem requires environmental conditions and/or loads that are difficult to recreate outside of the production environment. However, not being able to reconstitute the exact source code and rebuild it (with the same tools that built the production instance) is a solvable problem. Introduction of new “features” while fixing an old bug often introduces new bugs; the new “features” are typically far from being thoroughly unit tested, and the interactions of the new “features” with the bug fix will not be carefully considered due to time pressure. Further, as we will see in the next section, independent system test will always focus on certifying the fix to the production problem; any additional testing (e.g., of new features) should not be forced upon the system test organization in the form of a high-priority production bug fix.
A bug-tracking system (e.g., Bugzilla, Trac, or many other such systems) is a necessary tool in the production of reliable network service software. It is used to record problems and track repairs.
Integration with a source code control system may be an attractive feature, but loosely coupled tools can often be coerced to produce any necessary project reports with some simple additional script writing.
Footnote 8: In cases where continuous operation is not a hard requirement, automatic or scheduled process restart (sometimes called process rejuvenation) can be used to get “clean” memory. That said, software that exhibits no leaks is probably always more reliable than software that leaks.
15.5.5.5 Support Needs
In addition to producing high-quality source code, another development deliverable as important as source code is a document for each alarm that might be raised to the operations organization (see footnote 9). The document describes the steps that the operations staff must follow to remedy the situation that caused the alarm to be raised (it might be the case that the action to take is to call the developer, regardless of the hour, but a better approach is to enable the operations staff alone to remedy the situation; see footnote 10). The elements of an alarm document should include: the level (e.g., critical, major, or minor), sample text of such alarms, the software component(s) reporting the alarm, the cause, the effect on the service, and a procedure to remedy the situation. If no such document exists, or, more importantly, if there is no procedure to remedy the situation, then the alarm should not be raised. The test organization needs to be able to cause each alarm and test each procedure. The network service should never cause an alarm to auto-clear from the network monitoring system (see footnote 11); all alarm clearing should require operations intervention (they may be provided with tools to clear multiple alarms at once).
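The elements of an alarm document listed above map naturally onto a small record type. The sketch below is illustrative only: the class `AlarmDocument`, the catalog dictionary, and the `may_raise` check are hypothetical names, introduced to show how the rule "no document or no procedure, no alarm" could be enforced mechanically.

```python
from dataclasses import dataclass
from typing import Dict, List

LEVELS = ("critical", "major", "minor")


@dataclass(frozen=True)
class AlarmDocument:
    level: str             # one of LEVELS
    sample_text: str       # example of the alarm as operations will see it
    components: List[str]  # software component(s) reporting the alarm
    cause: str
    effect: str            # effect on the service
    procedure: str         # steps operations staff follow to remedy it


def may_raise(alarm_name: str, catalog: Dict[str, AlarmDocument]) -> bool:
    """Allow an alarm only if it is documented and has a remedy procedure."""
    doc = catalog.get(alarm_name)
    if doc is None:
        return False
    return doc.level in LEVELS and bool(doc.procedure.strip())
```

A check of this kind could run in the build or in system test, so that an undocumented alarm is caught before it reaches the operations staff.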
These documents as well as training material for the support staff (see Section 15.7, below) are all additional development deliverables, and should utilize the same change-management tools as those used for source code (e.g., CVS). In addition to the obvious software components to be built, support staff often identify the need for various support tools that should be built (e.g., producing a report from the logs that helps identify a particular class of problem). While perhaps stating the obvious, the service has a great dependency on all of the project’s documents and source code. This data should be backed up regularly, and periodically stored remotely from the development environment, to provide some level of insurance against a disaster impacting the development environment.
Footnote 9: This is in addition to the event log documentation described earlier in the chapter.
Footnote 10: A special class of alarms falls into the category of events that “should almost never happen.” For such alarms, directing operations staff to call the developer is acceptable. However, if what “should almost never happen” begins to occur frequently, then the developer should provide more detail on the action to be taken by the operations staff.
Footnote 11: It is a good idea to simply provide an informational log entry when a problem that had been previously reported has been cleared.
15.6 System Test
System test attempts to identify the remaining defects in the software following unit testing. Testing cannot improve the quality of poorly designed or implemented software; this is often stated as “you cannot test in quality; it has to be designed in.” System test staff should be distinct from the software development staff to allow for an independent quality assessment of the software. The software that system test installs is ideally built by a support organization, and should never be built by the software developers. This guarantees that someone other than the software’s author can successfully build the software, something that might prove crucial in troubleshooting and repairing a production problem when the author is unavailable.
The function of system test is primarily to measure the software’s adherence to the requirements. To this end, the first activity typically undertaken is to assure that all functional requirements are met. Failure to meet a requirement will result in the tester creating an entry in the project’s bug-tracking system. All the non-functional requirements must also be certified, although this will often be more difficult than certifying functional requirements. Installation, upgrade, removal, start, restart, and stop of the service must be tested. Each possible alarm should be generated and the developer-documented procedure to alleviate the alarm condition should be tested. The system test organization typically requires a rather expensive hardware environment to accomplish the testing of all latency, throughput, and performance requirements. In addition to mimicking the production environment, this testing requires equipment and tools to simulate the production loads. System test organizations may have a set of developers who concentrate on developing custom test tools.
In addition to certifying that the software meets the requirements, it is also important to design tests of adverse conditions (so called “fault injection”, “failure injection”, or “rainy day” testing), where anything and everything that can go wrong does go wrong (this may include, as an extreme case, cutting cables, such as an Ethernet cable, connected to a system under test: “the technician thought the cable wasn’t being used”). This also extends to entering invalid data in any way in which the service accepts data, which includes the special case of misconfiguring the service, to the extent possible, and for all these cases, observing its behavior.
Another set of important tests is stability or endurance tests, which will typically take place at a prescribed load, not exceeding the engineered load, for an extended period of time (e.g., many days). These stability or endurance tests will often uncover latent bugs, such as memory leaks in the software. It is often helpful, if possible, to have the system test team stress-test the system by overloading it or driving the system past its engineered load for a relatively short period of time (minutes or hours), and then reducing the load back to the engineered load. The system should recover, without operator intervention, when the load is reduced. On the other hand, it might be the case that the system fails entirely at some overload (it is important to test at an overload level that is gradually reached, as well as at an overload level that is reached suddenly, because the system might behave differently to the two approaches to the same overload). It is helpful to establish whether or not it is possible for a value of overload (i.e., larger than the engineered load) to consume all available resources such that no useful work gets done, and if so, to document that value for the project team. While we may hope that the service’s business sponsors accurately forecast demand, it would not be good for the service to be overly popular only to have it crash owing to overload. Also, network services might very well be the target of denial-of-service or distributed denial-of-service attacks; overload testing is one way to determine how the service will react to such attacks. An important output of overload or stress testing is a graph indicating throughput versus offered load (see Fig. 15.2).
[Fig. 15.2 Throughput versus offered load (plot labels: Throughput, Offered Load, Maximum Engineered Load)]
Any bug found in the production environment following system test should be considered a defect against the system test team.
It is not reasonable to hold the developers responsible for all the defects when the function of the system test team is to identify defects following developer turnover.

Whenever a bug is detected in the production environment and a fix is developed, that fix must be tested. This first implies that the production bug can be reproduced in the system test environment, so that the fix can be certified as fixing the bug. It is often a good idea to re-run a number of functional tests to ascertain that the software has not deteriorated as a result of the fix. One valuable technique is to automate much of this regression testing, so that it is easy to determine that the fixed code produces the same results as the previously released code. The standard UNIX command diff can be used to compare the results of multiple regression test runs when the relevant outputs can be captured in files. When this data contains date or time stamps, various techniques (see footnote 12) can be used to eliminate them as differences that require attention.

There are a number of key metrics that the system test team should track, such as the number of bugs found over time and the overall fault density. Sufficient testing with appropriate software reliability engineering [15] should predict the number of faults remaining. Criteria for exiting system test should be established prior to beginning test and should target a specific fault level not to be exceeded, as well as zero critical or major outstanding bugs being found over some period of just-completed testing (see footnote 13).

Footnote 12: One technique is to encapsulate all the date and time stamp string generation in a single project module. That module needs to be configurable (e.g., via a configuration parameter) to always return the same value.
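When the timestamp-generating module cannot be reconfigured as footnote 12 suggests, a complementary approach is to mask timestamps in the captured output before comparing runs. The following is a minimal sketch of that idea using Python's standard difflib module rather than the diff command; the timestamp pattern, the function names, and the sample log lines are illustrative assumptions and would need to be adapted to a project's actual output format.

```python
import difflib
import re

# Matches stamps such as "2009-05-17 13:45:02" or "2009-05-17T13:45:02".
# This pattern is an assumption; adapt it to the project's log format.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")


def mask_timestamps(lines):
    """Replace every timestamp with a fixed token so it cannot cause a diff."""
    return [TIMESTAMP.sub("<TIMESTAMP>", line) for line in lines]


def regression_diff(baseline_lines, candidate_lines):
    """Return unified-diff lines between two runs, ignoring timestamps."""
    return list(difflib.unified_diff(
        mask_timestamps(baseline_lines),
        mask_timestamps(candidate_lines),
        fromfile="baseline", tofile="candidate", lineterm=""))


# Hypothetical output captured from two regression runs.
old = ["2009-05-16 09:00:01 login ok", "2009-05-16 09:00:02 sent 3 messages"]
new = ["2009-05-17 10:30:11 login ok", "2009-05-17 10:30:12 sent 4 messages"]
for line in regression_diff(old, new):
    print(line)
```

An empty result means the two runs agree once timestamps are ignored; any remaining lines are differences that require attention.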
Footnote 13: A critical bug is one that prevents the service or system from functioning; it is sometimes referred to as a “severity 1 problem.” A major bug is one that prevents a portion of the service or system from functioning; it is sometimes referred to as a “severity 2 problem.”

15.7 Support

In this section, we address how operations and support staff may be organized, deployment of hardware and software, and managing service outages.

15.7.1 Organization

The operations and support staff are often organized into four (sometimes more, sometimes fewer) distinct “tiers” of support.

“Tier 1” comprises individuals in the operations organization who first notice that a service has failed. They typically have very little service-specific knowledge, but they know how to distinguish a hardware failure from a software failure, and can follow procedures to contact hardware suppliers to perform hardware repair or maintenance. Staff must be identified and assigned to Tier 1 within the defined hours of operation. Tier 1 staff may be trained to follow some developer-produced operations documents, e.g., documents that describe some classes of problems that might be cleared by a restart of the service or a system reboot.

“Tier 2” staff are the individuals in the operations organization who have a combination of hardware, operating system, and service-specific software knowledge, and can typically diagnose which of those three elements is at fault in an otherwise questionable failure scenario. They have developer-produced documentation for specific faults (i.e., “alarms”) describing the procedure that should be followed to restore service.

“Tier 3” staff are the individuals, often within the development organization, who have extensive service knowledge and can reconfigure the service to get it up and running without a software change.

“Tier 4” staff are the individuals who can diagnose and repair software faults in the service. They are the developers.
Training is required for Tiers 1 and 2, and may be required for Tier 3. Training promotes a common, correct understanding of the network service. Training needs to inform the staff of all the anticipated failures, how they could occur, and how they can be resolved. The amount of training depends on the support staff’s level of service awareness and ability, and on the completeness of the documentation targeted at the Tier.

At a minimum, Tier 1 and Tier 4 are required. Tier 1 can always escalate all but hardware/reboot failures to Tier 4. The system and service management health monitoring must generate Tier 1 observable “alarms” for failure conditions. Many alarms will be owing to the service being down or to difficulty communicating with subtending services; alarm documentation and Tier 1 staff should suffice to get the service restored most of the time. However, continual failures will result in escalation.

A poor substitute for Tier 1 staff, which may be acceptable when the Service-Level Agreement allows for significant down time, is automatic alarming (via email) to Tier 4 staff. Another low-cost approach is to eliminate Tiers 2 and 3, with Tier 1 monitoring the service and escalating any and all events requiring attention to a prioritized list of Tier 4 staff, possibly going up the management chain if no one can be reached. A “group cell phone” can be used, which is passed (following a weekly “on call” list) from one Tier 4 member to another, so that the instructions to the Tier 1 staff are simply to call one phone number to gain support.

When equipment is remote from Tier support staff, arrangements must be made for some staff at the equipment site to support hardware supplier access for repair and possibly for service reboot attempts.
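The low-cost escalation scheme described above (a weekly on-call rotation among Tier 4 staff, followed by the management chain if no one can be reached) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the use of a week number to select the on-call developer, and the staff identifiers are all hypothetical, and the reachability check stands in for whatever paging or phone procedure an organization actually uses.

```python
def escalation_order(on_call_rotation, week_number, management_chain):
    """Return the ordered list of people Tier 1 should try to reach.

    The rotation is walked starting from this week's on-call developer, so
    every Tier 4 member is tried once before the management chain is used.
    """
    start = week_number % len(on_call_rotation)
    developers = on_call_rotation[start:] + on_call_rotation[:start]
    return developers + list(management_chain)


def first_reachable(contacts, is_reachable):
    """Return the first contact for whom is_reachable(contact) is true."""
    for contact in contacts:
        if is_reachable(contact):
            return contact
    return None


rotation = ["dev_a", "dev_b", "dev_c"]      # hypothetical Tier 4 staff
managers = ["manager_1", "director_1"]      # hypothetical management chain
order = escalation_order(rotation, week_number=7, management_chain=managers)
print(order)
print(first_reachable(order, lambda person: person == "dev_a"))
```

Keeping the rotation and the management chain as plain data makes the weekly hand-off a one-line change that can itself be recorded like any other operational change.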
15.7.2 Deployment

Whenever a new service is deployed on new hardware, or new hardware is added to support an existing service, two classes of certification, operational readiness testing and network validation testing, need to be carried out, typically by Tier 2 or 3 staff. This certification testing is done to certify that the hardware, operating system, application software, network connectivity, and the overall operation are functional prior to providing service to the customers. These tests have proven valuable in identifying where a problem exists prior to customer service, thus improving service availability (the alternative is to simply enable the service, hope for no problems, and then perform general troubleshooting if a problem arises).

Operational readiness testing typically consists of the following steps: verify that each element has a valid maintenance contract; perform a hardware stress test/burn-in to verify that there are no hardware problems and that nothing was damaged in shipping; where applicable, verify and, if necessary, configure backup functionality; verify that the crash dump capability is configured and functional; install logins for the production service; and verify the service functionality. Service functionality tests include starting, checking (for current operational status), and stopping the service, as well as rebooting the system and checking that everything that is supposed to start automatically does, and that anything that is not supposed to start automatically does not.

Network validation testing certifies that all network connectivity to or from each new piece of hardware functions correctly. This is often, and best, carried out using software other than the actual service application software, e.g., telnet or nc as a client for testing TCP connectivity to a server.
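The TCP connectivity check that telnet or nc performs by hand can also be scripted so that it is repeatable across many endpoints. The sketch below uses only Python's standard socket module; the function names and the three-second timeout are assumptions for illustration, and the list of endpoints to validate would come from the service's own interface documentation.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def validate_connectivity(endpoints, timeout=3.0):
    """Check each (host, port) pair and return a dict of results."""
    return {(host, port): tcp_port_open(host, port, timeout)
            for host, port in endpoints}
```

A validation run might call validate_connectivity with every (host, port) pair the new hardware is expected to reach and treat any False entry as a failed certification step. Like telnet or nc, this confirms only that a TCP connection can be opened, not that the service behind the port is behaving correctly.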
Another category, network management validation testing (sometimes incorporated into operational readiness testing), tests whether the service can be properly monitored and whether the service can communicate (e.g., via SNMP) with the monitoring systems.

Changes to the software may be minor, such as simple configuration value changes, or major, such as upgrades. Since, following any new problem, the most frequently asked question is always “What changed?”, it is mandatory to have a process whereby every change made by the support staff is recorded (e.g., in a simple file, but on a system quite distinct and separate from the servers supporting the service, since they may be down when questions about “what changed” need to be answered). The items that should be recorded include who, what, where, when, and why for every change.

Deployment of a new version of the software needs to be carefully planned and supported. A flash-cut from one version to the next presents a number of problems. It requires the service to be entirely down during the time it takes to get the new version up and running. If the new version cannot be installed while the previous version is running, the amount of out-of-service time increases. As the reliability of a piece of software is established by having the software operate reliably for an extended period of time in a full production (i.e., not test) environment, a new version is, by definition, not reliable (see footnote 14). A flash-cut, then, presents the operations organization with the problem of supporting a new, unknown, unreliable implementation rather than the previous, known, reliable implementation (see footnote 15). As a result, following a flash-cut, an entirely new version of the software will not have the confidence of the operations organization, and they will suspect the software before suspecting an operator mis-step.
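Returning to the change record described earlier in this section (who, what, where, when, and why for every change), one simple realization is an append-only file of one record per line. The following is a minimal sketch under stated assumptions: the JSON-lines format, the function names, and the field layout are illustrative choices, not a prescribed format, and the log path is expected to point at storage on a system separate from the servers supporting the service.

```python
import json
from datetime import datetime, timezone


def record_change(log_path, who, what, where, why, when=None):
    """Append one change record (who, what, where, when, why) as a JSON line."""
    entry = {
        "who": who,
        "what": what,
        "where": where,
        "when": (when or datetime.now(timezone.utc)).isoformat(),
        "why": why,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def changes_since(log_path, cutoff):
    """Return the records whose timestamp is at or after `cutoff`.

    `cutoff` must be a timezone-aware datetime, matching the stored stamps.
    """
    with open(log_path, encoding="utf-8") as log:
        entries = [json.loads(line) for line in log if line.strip()]
    return [e for e in entries if datetime.fromisoformat(e["when"]) >= cutoff]
```

With such a log, the question “What changed?” becomes a call to changes_since with the time at which the service was last known to be healthy.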
An alternative to a flash-cut, though more expensive, is to maintain compatibility between versions, allowing a new version to be deployed on a single machine at a time. An effective confidence-building approach is to first deploy a new version for a period of time (e.g., a week) on a single machine and, if no critical or major problems arise, deploy on a few more machines, continuing this process every few days until the new version is entirely deployed. When major new features are being introduced, it is important to control them via configuration options that are initially set to disable those features, so that end-users see consistent behavior (e.g., it is undesirable to have two web servers behind a load balancer presenting different user interfaces, one with the old version and one with the new, as this will confuse the end-users and will almost certainly result in a spike in customer complaints). Once the new version is fully deployed, the new features can be enabled via configuration changes (see the discussion on configurability in the next section). This approach requires that each new version, via configuration, maintains backward compatibility with the previous version.

Deployment of a new release of software needs to be carefully planned. Given the desire to gradually phase in new software, all the steps required need to be documented in a deployment plan and reviewed with the developers, testers, and operations staff. Whenever possible, the cut-over should be tested in a non-production environment to work out any problems prior to the actual deployment. It is mandatory that a tested back-out plan accompany the upgrade plan. Objective criteria to determine readiness to begin the deployment, as well as to continue at major steps within the plan, must be established (these are often referred to as “entrance criteria”). The planning for the deployment needs to estimate the duration of each activity. The planning must also determine the worst-case scenario, that is, the set of circumstances that leads to the longest period of service unavailability. This worst-case duration must not violate any service-level or other agreements.

This activity must occur in a “maintenance window.” During a maintenance window, client access to the service is normally disabled in some fashion; e.g., for an end-user web-page-oriented service, the service is configured (perhaps via DNS) to return a special web page indicating that the service is temporarily unavailable owing to maintenance. To determine that the deployment was a success, objective “exit” criteria must be established. Such exit criteria typically take the form of successful results from testing the production system to ascertain that it is functionally performing as expected. Such testing needs to be similar, but not identical, to normal client access, as normal client activity during a maintenance window should never reach the running system. Thus, this “in-window” testing might use specific IP addresses or special domain names rather than the normal domain names. When configuring a single server behind a load balancer, exposing that server directly, via its IP address, is a good way to test it while still in the maintenance window. The worst-case scenario often arrives following the upgrade, when the last of this “sanity” testing seriously fails and the back-out plan needs to be executed.

Footnote 14: Software is like wine: it improves with time.

Footnote 15: This is not to say that the known version is perfect. Operations staff will always prefer the devil that they know to the one that they do not know.

15.7.3 Managing Outages

In an enterprise with multiple lines of business, each with multiple large-scale network services, operations coordination across related services is highly desirable.
It may very well be the case that planned maintenance on one service that is called by a second service will cause a high alarm level on the second service.

Unfortunately, in spite of best intentions, there may be occasions when the support staff cannot quickly identify the cause of a service failure. When such a situation arises, it is best to maintain coordination across the different tier support staffs while providing information to all as new items are discovered. A proven technique to aid this is a conference call (sometimes called an “outage bridge” when used to identify a failure cause), on which all active support staff participate. A helpful addition is an instant messaging tool to allow pairs or larger groups to share data. Managers of the support staff will also need to follow the activity and join the conference call. There is often a separate conference call running in parallel, on which managers participate, where the lower (or lowest) level managers participate on both calls, either sequentially or simultaneously (two phones with appropriate mute and volume controls).

Finally, it may be suspected that the service outage is due to third-party, vendor-supplied software (e.g., the operating system or the database management system). When this is the case, it is often necessary to add the vendor’s support staff to the outage bridge and provide the vendor (or vendors) with the requested data (forced dumps being common) or, in extreme cases, direct access to the failing systems.

15.8 Service Reports

Various data gathered by the service should be combined and reported on a periodic basis to aid in the management of the service. Some data might best be gathered and reported daily, such as a performance summary of activity on the service (see footnote 16). Other data might best be gathered and reported less frequently: weekly, monthly, or quarterly.
Figure 15.3 shows a quarterly availability report for a service; it shows the goals (or targets) as well as the actual achieved levels for availability and a number of related service measurements. Figure 15.4 shows a portion of a daily report for an email service (for a single server); it provides various statistics about messages, recipients, and bytes processed, including an hourly histogram, busy hour and busy minute statistics, and message distribution statistics. Figure 15.5 shows a daily report of account subscriptions for an ISP email service; it shows the distribution of accounts against the number of subscribers.

2009 Goals                         2009 Year   Jan 09   Feb 09   Mar 09   Q1 09    Q1 09
                                   End Target  Actual   Actual   Actual   Target   Actual
Availability                       99.99%      99.98%   99.98%   99.99%   99.99%   99.98%
Reduce MTTR                        160         137      270      132      160      179
Reduce Incidents                   1830        139      153      169      456      461
Reduce Incidents > 60 Mins         1005        74       95       84       249      253
Reduce Caused by Change Outages    257         12       20       26       63       58
Reduce Procedure Caused Outages    330         22       33       26       81       81

Fig. 15.3 Quarterly availability report

Footnote 16: For web-server reporting, webalizer is a useful tool.
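To put the availability percentages in Fig. 15.3 in perspective, it can help to convert a target into a downtime budget. The sketch below assumes the common definition of availability as the fraction of the reporting period during which the service was up; the source does not state how the report's figures were computed, so this definition, the function names, and the 26-minute downtime figure are all illustrative assumptions.

```python
def availability(total_minutes, downtime_minutes):
    """Fraction of the reporting period during which the service was up."""
    return (total_minutes - downtime_minutes) / total_minutes


def allowed_downtime_minutes(target, total_minutes):
    """Downtime budget implied by an availability target (e.g., 0.9999)."""
    return (1.0 - target) * total_minutes


MINUTES_IN_Q1_2009 = (31 + 28 + 31) * 24 * 60   # January to March 2009

budget = allowed_downtime_minutes(0.9999, MINUTES_IN_Q1_2009)
print(f"99.99% over the quarter allows about {budget:.1f} minutes of downtime")

# Hypothetical outage total, chosen only to illustrate the calculation.
print(f"26 minutes down gives {availability(MINUTES_IN_Q1_2009, 26):.4%}")
```

Under this definition, a 99.99% quarterly target leaves a budget of roughly 13 minutes of downtime, which is why a single prolonged incident can be enough to miss it.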
---------- Messages Sent ----------
Transmitted Messages:     3,015,994
Transmitted Recipients:   3,016,044
Total Bytes:              305,851,785,029
Rejected Recipients:      10,159
Deferred Recipients:      12,268
------------ Message Distribution by Hour ------------
00:00   125,101     08:00    61,027     16:00   213,672
01:00   104,898     09:00    65,705     17:00   201,212
02:00    93,829     10:00    83,778     18:00   205,569
03:00    77,064     11:00    97,707     19:00   190,757
04:00    75,120     12:00   126,519     20:00   177,327
05:00    55,289     13:00   149,461     21:00   161,212
06:00    48,108     14:00   169,670     22:00   150,355
07:00    55,568     15:00   198,273     23:00   128,773
-------------------------------------------------------
Busy Hour: 15:39-16:38    Max: 223,715 Messages/Hour
Busy Min:  19:06          Max: 4,776 Messages/Minute
Message Distribution by Size in bytes
  38% are 9,999 or less
  70% are 22,999 or less
  80% are 34,999 or less
  90% are 57,999 or less
  95% are 115,999 or less
  98% are 358,999 or less
  Average is 101,409
  Maximum is 15,725,197
Message Distribution by Recipients per message
  99% are 1 or less
  Average is 1.000017
  Maximum is 2

Fig. 15.4 Sample of an email service report

Active Subscriptions       742951
Suspended Subscriptions    7107
Number of accounts with…
  2 active subscribers             : 119465
  3 active subscribers             : 48932
  4 active subscribers             : 26120
  5 active subscribers             : 13573
  6 active subscribers             : 9035
  7 - 10 active subscribers        : 2728
  11 - 25 active subscribers       : 299
  26 - 50 active subscribers       : 0
  51 or more active subscribers    : 0
Total Active Secondary Email Ids = 416596

Fig. 15.5 Sample of an account report

15.9 Summary

In this chapter, we presented an approach to building large-scale network services. Other approaches undoubtedly exist, and this approach is certainly not the only possible path to success. However, for us, this approach has proven successful in providing many reliable network services. We conclude this chapter with a summary of the “best practice” principles.
Provide accurate, clear, and understandable requirements to ensure that the resultant software behaves as expected.
The architecture of a software system must have conceptual integrity. This is best achieved by designating a single individual as the architect.
Clearly document all interfaces.
Avoid any single-point-of-failure by designing in sufficient redundancy.
Practice recovery operations regularly.
Establish a reliable mechanism to monitor the health of the service.
Establish a standard log format.
Assign exactly one person as the primary developer of each module; each module should have an assigned backup person (to avoid a human single-point-of-failure).
Test early, test often.
Modularity helps to reduce source code size, which improves maintainability.
Write code for human (and machine) consumption.
Production software is considerably more difficult to implement than a prototype: plan accordingly.
Change management allows for deployment of fixes that do not introduce new, unrelated problems.
Software should be built by someone other than the software’s author. This guarantees that the software can be built when the author is unavailable.
Successful deployment of a new version of software requires careful planning.
Plan for service outages. It is rare that such planning will go unused.
Care should be taken when assigning staff to carry out more than one function.

References

1. Oppenheimer, D., Ganapathi, A., Patterson, D.A. (2003). Why do Internet services fail, and what can be done about it? 4th Usenix Symposium on Internet Technologies and Systems.
2. Persse, J. (2006). Process improvement essentials: CMMI, Six Sigma, and ISO 9001. O’Reilly Media, Inc.
3. Brooks, F.P. (1995). The mythical man-month: Essays on software engineering. Reading, MA: Addison-Wesley.
4. Schneider, G., Winters, J.P. (2001). Applying use cases: A practical guide. Reading, MA: Addison-Wesley.
5. Bourke, T. (2001). Server load balancing. O’Reilly Media, Inc.
6. Bono, V.J. (1997). 7007 Explanation and apology. NANOG email of Apr 26, 1997.
7. Zhang, Z., Zhang, Y., Hu, Y.C., Mao, Z.M. (2007). Practical defenses against BGP prefix hijacking. Proceedings of the 2007 ACM CoNEXT conference.
8. Patterson, D.A., Gibson, G., Katz, R.H. (1988). A case for redundant arrays of inexpensive disks (RAID). Proceedings of the 1988 ACM SIGMOD international conference on Management of Data.
9. Schroeder, B., Gibson, G.A. (2007). Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? Proceedings of the 5th USENIX conference on File and Storage Technologies.
10. Tognazzini, B. (1992). Tog on interface. Reading, MA: Addison-Wesley.
11. Spolsky, J. (2001). User interface design for programmers. Berkeley, CA: Apress.
12. Fowler, M. (1999). Refactoring. Reading, MA: Addison-Wesley.
13. Bosworth, E. (2008). The IBM 370 programming environment. Lecture Notes. Department of Computer Science, Columbus State University.
14. Cornett, S. (2009). Code coverage analysis. http://www.bullseye.com/coverage.html. Accessed May 17, 2009.
15. Musa, J.D. (2004). Software reliability engineering: More reliable software faster and cheaper, 2nd edn. Indiana: AuthorHouse.

Chapter 16
Capacity and Performance Engineering for Networked Application Servers: A Case Study in E-mail Platform Planning

Paul Reeser

16.1 Introduction

Proper capacity and performance engineering (C/PE) (see footnote 1) is critical for the success of developing and deploying any complex networked application. All too often, systems and services are rushed to market without proper capacity/performance planning, resulting in myriad problems including costly hardware upgrades and software rework, loss of revenue due to poor quality and late delivery, customer dissatisfaction, and missed market opportunities.
In contrast, by planning for performance and scalability from the earliest stages of product architecture and design, the chances of “doing it right” the first time are greatly improved. Industry studies have proven the positive business case for building C/PE into software development: upfront costs are usually only 1–3% of total project budget, while long-term savings are typically ten times the upfront investment [1]. All mature software development organizations have a well-defined process in place to systematically plan for capacity, performance, and reliability throughout the software development and service deployment life cycle [2].

In this chapter, we discuss the typical capacity, performance, availability, reliability, and scalability engineering activities required to deploy a networked service platform. These activities should begin at the earliest stages, and span the entire platform life cycle: from architecture, design, and development, through service test and deployment, to the ongoing capacity management of a mature service. During the service development life cycle, an iterative, “layered” approach to addressing C/PE is often necessary to meet schedule constraints, wherein more detailed passes are successively made over each assessment area, rather than completing each task before moving to the next. In general, successful C/PE requires staying “one step ahead” of the product life cycle. In the architecture phase, for example, we try to identify improvements that lead to a better design. In the design phase, we try to identify improvements that lead to the development of more efficient software. In the development phase, we try to create an environment that leads to more effective testing, and so on.

Footnote 1: The term “capacity/performance engineering” in the chapter title and throughout this chapter broadly refers to the expansive set of activities required to assess and manage platform capacity, performance, availability, reliability, and scalability.

P. Reeser, Lead Member of Technical Staff, AT&T Labs Research, 200 S. Laurel Avenue, D5-3D26, Middletown, NJ 07748, USA; e-mail: preeser@att.com

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_16, © Springer-Verlag London Limited 2010

The aim of this chapter is not to present an exhaustive C/PE “how to” manual, as there are many books devoted to the topic (cf. [1, 3, 4]). Rather, our goal is to highlight the areas where proper C/PE is especially critical to the successful deployment of a networked service platform. At the highest level, the goal is to ensure that the service meets all performance and reliability requirements in the most cost-effective manner, wherein “cost” encompasses such areas as hardware/software resources, delivery schedule, and scalability.

With this goal in mind, the process (shown in Fig. 16.1a) begins with an understanding of what functionality the platform provides and how users interact with the system (Architecture Assessment), including the flow of critical transactions, the workload placed on the platform elements, and the service-level performance/reliability metrics that the platform must meet (Workload/Metrics Assessment). Next, we develop analytic models and collect measurements to predict how the proposed platform will handle the workload while meeting the requirements (Reliability/Availability Assessment and Capacity/Performance Assessment). Finally, we develop engineering guidelines to size the platform initially (Scalability Assessment) and to maintain service capacity, performance, and reliability post-deployment (Capacity/Performance Management). These C/PE assessment activities are depicted relative to the typical software development and delivery life cycle phases in Fig. 16.1b.
[Fig. 16.1 (a) High-level description of end-to-end C/PE process and (b) timing of C/PE activities relative to platform delivery phases. Panel (a) boxes: Architecture Assessment (elements, transactions, flows); Workload Assessment (workload, requirements, budgeting); Reliability/Availability Assessment (modeling, analysis); Capacity/Performance Assessment (measurement, modeling); Scalability Assessment (demand, projections, engineering); Capacity/Performance Management (monitoring, automation). Panel (b) phases: architecture, design, development, test, deployment, growth.]

16.2 Basic Probability and Queuing Concepts

Prior to introducing the case study, we briefly review some straightforward concepts from elementary probability analysis and queuing theory. These simple concepts and their associated results will be leveraged at various points throughout the C/PE process described in the remainder of this chapter. Readers who are already familiar with such concepts as birth and death models, state transition balance equations, Markovian queuing systems, and Little’s Law may skip this section and proceed to Section 16.3. Readers who desire a more thorough treatment of these concepts may explore any of the countless probability and queuing texts (cf. [5–8]).

From [5], a discrete random variable $X$ is said to be Poisson distributed with parameter $\lambda > 0$ if its probability mass function $p(n)$ is given by

$$p(n) = P\{X = n\} = e^{-\lambda}\,\frac{\lambda^{n}}{n!}, \qquad n = 0, 1, 2, \ldots$$

A continuous random variable $X$ is said to be exponentially distributed with parameter $\lambda > 0$ if its probability density function $f(x)$ and its cumulative distribution function $F(x)$ are, respectively, given by

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad \text{and} \qquad F(x) = P\{X \le x\} = \begin{cases} 1 - e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0. \end{cases}$$

The mean and variance of $X$ are, respectively, given by $\lambda^{-1}$ and $\lambda^{-2}$.
The 95th percentile of $X$ is the value of $x$ such that $F(x) = 0.95$. Solving the above equation for $x$ yields $x = -\ln(0.05)\,\lambda^{-1} \approx 3.00\,\lambda^{-1}$. That is, the 95th percentile of an exponentially distributed random variable is approximately three times the mean. Exponentially distributed random variables are said to be memoryless in that

$$P\{X > s + t \mid X > t\} = P\{X > s\} \qquad \text{for all } s, t \ge 0.$$

A stochastic process $\{N(t),\ t \ge 0\}$ is said to be a counting process if $N(t)$ represents the number of “events” up to time $t$. Furthermore, $N(t)$ is said to be a Poisson process if (among other conditions) the number of events in an interval of length $t$ is Poisson distributed with mean $\lambda t$. That is, for all $t \ge 0$

$$P\{N(t) = n\} = e^{-\lambda t}\,\frac{(\lambda t)^{n}}{n!}, \qquad n = 0, 1, 2, \ldots$$

For a Poisson process $N(t)$, let $T_1$ denote the time of the 1st event, and let $T_n$ for $n > 1$ denote the time between the $(n-1)$st and the $n$th events. The inter-event times $T_n$ are independent and identically distributed (i.i.d.) exponential random variables with mean $\lambda^{-1}$.

The Poisson process is one example of the general class of exponential models known as continuous-time Markov chains. These models are completely characterized at any time by their state at that time (memoryless), and the time between transitions from one state to another is exponentially distributed. Markov chains that always transition from state $n$ to state $n+1$ are called pure birth processes, whereas those that always transition from state $n$ to state $n-1$ are called pure death processes. More generally, a Markov chain that can transition from state $n$ to either state $n+1$ or state $n-1$ (such as the number of jobs in queue) is called a birth and death process.

One example of a birth and death process is the Markovian queuing system. Suppose that jobs arrive at the single server according to a Poisson process with rate $\lambda > 0$ (i.e., inter-arrival times are i.i.d. exponentially distributed with mean $\lambda^{-1}$), and suppose that service times are i.i.d. exponentially distributed with rate $\mu > 0$ (mean $\mu^{-1}$).
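The percentile result and the link between exponential inter-event times and Poisson counts stated above can be checked numerically. The following is a minimal sketch using only Python's standard library; the function names, the rate of 2 events per second, the interval length, and the number of trials are illustrative assumptions.

```python
import math
import random


def exponential_percentile(rate, q):
    """Value x with F(x) = q for an exponential distribution with given rate."""
    return -math.log(1.0 - q) / rate


def simulate_poisson_counts(rate, interval, trials, seed=1):
    """Count events in [0, interval] by summing exponential inter-event times."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        elapsed, events = rng.expovariate(rate), 0
        while elapsed <= interval:
            events += 1
            elapsed += rng.expovariate(rate)
        counts.append(events)
    return counts


rate = 2.0                                   # hypothetical: 2 events per second
p95 = exponential_percentile(rate, 0.95)
print(f"95th percentile = {p95:.4f}, i.e. {p95 * rate:.4f} times the mean")

counts = simulate_poisson_counts(rate, interval=10.0, trials=20000)
mean_count = sum(counts) / len(counts)
print(f"mean count over 10 s = {mean_count:.2f} (theory: {rate * 10.0})")
```

The simulated mean count should lie close to the theoretical value of rate times interval, and the percentile factor is the constant -ln(0.05), slightly below 3.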
Such an exponential queuing system, with Poisson arrivals and a single exponential server, is typically denoted by M($\lambda$)/M($\mu$)/1, or M/M/1 for short. More generally, the notation M($\lambda$)/M($\mu$)/C/K (or M/M/C/K for short) commonly denotes a Markovian queuing system with Poisson arrivals, exponential service times, $1 \le C \le \infty$ servers, and $0 \le K \le \infty$ buffers. The service discipline (i.e., the order in which waiting jobs are served) can vary from the implicit first-in-first-out (FIFO), to last-in-first-out (LIFO), or processor-sharing (PS) wherein each job in the system receives an equal “slice” of service, or random order, or priority order, and so on. Fortunately, most metrics of interest (such as average queue length, average time in system, or average server utilization) are insensitive to the particular service discipline due to work conservation laws. However, service order does impact variances as well as metrics for individual jobs.

Let $X(t)$ denote the number of jobs in an M/M/1 queuing system at time $t$, and let

$$P_n = \lim_{t \to \infty} P\{X(t) = n\}, \qquad n = 0, 1, 2, \ldots$$

denote the steady-state probability that there are $n$ jobs in the system. For each $n \ge 0$, the rate at which the system enters state $n$ equals the rate at which it leaves state $n$ (flow in = flow out). This principle is known as equilibrium, and the resulting set of equations is known as state transition balance equations. For the M/M/1 system, solving these “flow in = flow out” balance equations in terms of $P_0$ yields $P_n = \rho^{n} P_0$, where $\rho = \lambda/\mu$ is the utilization. Note that there are $n-1$ unique equations and $n$ unknowns. The $n$th equation comes from the fact that the probabilities must sum to 1; hence, for $\rho < 1$,

$$1 = \sum_{n=0}^{\infty} P_n = P_0 \sum_{n=0}^{\infty} \rho^{n} = \frac{P_0}{1-\rho}.$$

Thus, $P_0 = 1 - \rho$ and $P_n = \rho^{n}(1-\rho)$, $n \ge 1$. Next, let $L$ denote the average number of jobs in the system, and let $W$ denote the average time spent in the system (delay).
$L$ and $W$ are given by

$$L = \sum_{n=0}^{\infty} n P_n = (1-\rho) \sum_{n=0}^{\infty} n \rho^{n} = \frac{\rho}{1-\rho} \qquad \text{and} \qquad W = \sum_{n=0}^{\infty} \frac{n+1}{\mu}\, P_n = \cdots = \frac{\tau}{1-\rho},$$

where $\tau = \mu^{-1}$ is the average service time. The expression for $W$ is derived by noting that a job arriving to find $n$ jobs already in the system expects to wait for $n+1$ service times (the $n$ jobs ahead of it, plus its own service) (see footnote 2). Comparing the expressions for $L$ and $W$, we see that $L = \lambda W$. This relationship is referred to as Little’s Law [9], which states that the average number of jobs in the system equals the average arrival rate times the average delay in the system, independent of the arrival or service distributions. Although this formula seems deceptively simple, it applies under general (non-Markovian) conditions, and is an extremely powerful “back-of-the-envelope” (BoE) result that we employ often throughout the C/PE process.

Other common BoE results that we often rely on include the expression for $W$, which states that the average delay equals the average service time divided by (1 - utilization), as well as the expression for $L$, which states that the average number of jobs in the system equals the utilization divided by (1 - utilization). Although these expressions are derived assuming Markovian arrival and service distributions, they are generally applicable as a rough estimation for most common queuing situations in stable steady state. (For transient scenarios, such as the rate of queue growth following server failure, comparable fluid-flow approximations can be used.)

As a final note, the M/M/C Markovian queuing system has been studied widely, and has many applications in computer engineering [10]. Other non-Markovian queues that have wide applications include the M/D/C system, wherein service times are deterministic (D), and the B/M/C system, wherein jobs arrive in batches (B).
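The back-of-the-envelope M/M/1 results above are easy to wrap in a small helper for quick sizing calculations. The sketch below is illustrative only: the function names and the example rates are assumptions, and the 95th-percentile estimate additionally assumes a FIFO M/M/1 queue, for which the time in system is itself exponentially distributed (a standard result that is not derived in this section).

```python
import math


def mm1_metrics(arrival_rate, service_rate):
    """Steady-state M/M/1 results: utilization, L, W, and a 95th-percentile
    delay estimate of -ln(0.05) (about 3) times the mean delay."""
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    rho = arrival_rate / service_rate            # utilization
    if rho >= 1.0:
        raise ValueError("unstable queue: utilization must be below 1")
    tau = 1.0 / service_rate                     # average service time
    L = rho / (1.0 - rho)                        # average number in system
    W = tau / (1.0 - rho)                        # average time in system
    return {"utilization": rho, "L": L, "W": W,
            "p95_delay": -math.log(0.05) * W}


def state_probability(n, rho):
    """P_n = rho^n (1 - rho): probability of n jobs in an M/M/1 system."""
    return (rho ** n) * (1.0 - rho)


# Hypothetical server: 80 jobs/s offered, 100 jobs/s service capacity.
m = mm1_metrics(arrival_rate=80.0, service_rate=100.0)
print(m)
print("Little's Law check:", 80.0 * m["W"], "vs", m["L"])
```

For this example the utilization is 0.8, so the average delay is five times the average service time; the same helper shows how quickly delay grows as utilization approaches 1.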
Generally speaking, any variations on the M/M/C exponential system that “smooth” either the arrival process or the service process (such as the M/D/C queue) tend to reduce the coefficient of variation (CV) (footnote 3), while those where either process is more “bursty” (such as the B/M/C system) tend to increase the CV. As a result, the 95th percentile will be less than three times the mean for smoother (than exponential) distributions, and more than three times the mean for burstier distributions. For example, in computer systems in which the CPU executes virtually identical code for each job (e.g., a server that specializes in one function), the service process may appear more deterministic. In this case, the 95th percentile delay will be less than three times the mean. Practical experience suggests that the 95th percentile delay for common systems is typically two to three times the average delay.

Footnote 2: This Markovian property results from the memoryless nature of the exponential distribution, and is referred to as Poisson Arrivals See Time Averages (PASTA).

Footnote 3: The coefficient of variation (CV) is a normalized measure of dispersion of a distribution, defined as the ratio of the standard deviation to the mean ($CV = \sigma/\mu$, where $\mu$ here denotes the mean of the distribution).

16.3 Case Study

Throughout this chapter, we use an Internet Service Provider (ISP) e-mail platform as a unifying case study to illustrate many of the C/PE tasks. Most likely, we all have one or more Internet accounts through which we can send and receive e-mails, maintain a personal web page, access newsgroups and web logs, and participate in a variety of online activities such as chat rooms and gaming. In most cases, we pay an ISP to provide basic Internet access ranging from narrowband dial-up, through broadband DSL and cable, up to wideband FTTH. Features such as e-mail, web page hosting, and news then come “free” with our Internet access subscription.
But have you ever wondered what goes on “behind the scenes” to provide a “free” feature such as e-mail? In reality, the cost and complexity of providing a fast, reliable ISP e-mail platform for millions of subscriber mailboxes is a real C/PE challenge. With many large ISPs offering 1 GB mailboxes, these providers potentially need to provision and maintain many terabytes of online storage, and meet stringent delay requirements while processing many millions of e-mail messages daily.

Figure 16.2 illustrates an example functional architecture for a large ISP e-mail platform. Such a platform typically consists of numerous functional components, each performing specialized tasks and conforming to multiple protocols for sending, receiving, and retrieving e-mails. These software components could all run on the same physical server, and many e-mail platform vendors offer an “all-on-one” configuration. From a performance, reliability, security, and scalability standpoint, however, such a solution has severe limitations. For example, the server capacity of an “all-on-one” configuration is limited by the most stringent performance metric (e.g., message retrieval), resulting in costly over-engineering relative to other metrics (e.g., message delivery). By partitioning functionality across servers, each component can be optimized relative to its own unique metrics. Or, sizing an “all-on-one” solution to meet storage needs may require more disk than one physical server can manage. Or, putting the inbound and outbound mail delivery processes on the same physical server may result in security vulnerabilities such as “mail relay”, where spammers attempt to mask their identity to relay mail through your server (by spoofing the sender as an on-net user so that the server passes the mail through to an off-net recipient).
Thus, architects and C/PE planners must determine the expected market segment that their solution targets, and plan accordingly. Unless the target market is very small, partitioning of software components onto dedicated hardware sized for the component is more cost-effective.

[Fig. 16.2 Example ISP e-mail platform functional architecture. Components shown: Inbound SMTP (GW), Filter Engine (AS/V), Message Store (PO), WebMail (WM), POP Proxy (PP), Outbound SMTP (MR). Link labels: SMTP, POP, HTTPS. Network labels: Global Internet, ISP Intranet.]

With this understanding in mind, we assume throughout the remainder of this chapter that we are delivering an e-mail platform to serve a large ISP. Accordingly, there are typically numerous identical replicas of the functional components in Fig. 16.2, each running on its own physical hardware. Thus, the physical architecture for a large ISP e-mail platform can consist of many hardware servers, and will typically look very much like the functional architecture illustrated in Fig. 16.2. As a result, we will henceforth use the word server interchangeably to refer to either the specialized functional (software) component or the dedicated physical (hardware) element on which it resides (footnote 4).

Referring again to Fig. 16.2, a typical large e-mail platform includes inbound Gateway (GW) servers to receive incoming e-mail from the Internet, running the industry-standard Simple Mail Transfer Protocol (SMTP) on the Internet-facing side. These GWs typically perform a variety of filtering functions to screen out unwanted and threatening e-mails (e.g., spam, viruses, and worms), often employing specialized anti-spam/virus (AS/V) filtering software on outboard servers. Messages that pass filtering are then forwarded to Post Office (PO) servers, where messages are stored in user mailboxes until they are retrieved or deleted.
Unlike the other mail platform components, the POs are usually “stateful” in that a user’s mailbox typically resides on only one PO, and messages destined to a particular user must be routed to a particular PO. (As we will discuss later, this fact is particularly relevant to ongoing C/PE.) Collectively, the GWs and POs constitute the message delivery and storage platform.

Next, users can typically access their e-mail over the ISP Intranet through a number of interfaces. The oldest such access mechanism is the industry-standard Post Office Protocol (POP), where a POP software client residing on the user’s PC connects to a POP proxy (PP) server to retrieve e-mails. The PP server in turn connects to the appropriate PO and typically “drains” all messages in one transaction, downloading them to the user PC and removing them from the PO. The user can then read the messages from their local storage. In addition to POP, another widely used access mechanism is the industry-standard Secure HyperText Transfer Protocol (HTTPS), wherein an HTTP browser residing on the user’s PC connects to a WebMail (WM) server to retrieve the e-mails. The WM server in turn connects to the appropriate PO (possibly via the PP), and typically provides a list of all stored messages. The user can then choose to retrieve and/or delete the messages (usually one at a time) from the PO, resulting in a series of transactions. Typically, messages remain on the PO server until the user explicitly deletes them, or the PO eventually deletes them as a result of optional mail aging policies. (Again, this fact is particularly relevant to ongoing C/PE.) Although POP and HTTPS are the most prevalent consumer access protocols, other options are common, including Internet Message Access Protocol (IMAP) and proprietary mail clients (e.g., MS Outlook). Collectively, the PPs, WMs, and POs constitute the message retrieval platform.

Finally, users can typically send e-mails through a number of interfaces. The oldest egress mechanism is again SMTP, wherein an SMTP client residing on the user’s PC connects to an outbound mail relay (MR) server to send e-mails. These MRs typically perform the same filtering functions as the GWs, again often employing specialized filtering software on outboard AS/V servers. Messages that pass filtering are then forwarded to the recipient ISP’s GW over the Internet, or to the appropriate PO server if the recipient is “on-net” (hosted by the same ISP as the sender). In addition to SMTP, another widely used egress mechanism is again HTTPS, wherein the browser connects to a WM server, which forwards the message to an MR. Collectively, the MRs and WMs constitute the message egress platform.

Footnote 4: In reality, ISPs typically support multiple applications in addition to e-mail (e.g., newsgroups and web hosting). These applications typically share physical resources, either through virtualization, common transactions (e.g., authentication), or shared infrastructure (e.g., LANs). For the purpose of illustrating the C/PE tasks, we assume that all physical resources are dedicated to the single e-mail application. In the case of resource sharing/virtualization, the C/PE analysis must account for the impact of additional workload, reduced resource availability, and contention.

16.4 Architecture Assessment

This section describes the Architecture Assessment activities. These tasks are usually performed during the architecture and design phases of the platform life cycle. The goals at this stage are to
1. Identify critical functional (software) and physical (hardware) elements
2. Identify critical user transactions and develop a descriptive model of the flow of transactions through the platform elements
3. Identify critical element resource limits and potential performance- and scalability-limiting platform bottlenecks (“choke points”)

For example, the critical software elements in this e-mail platform are
(a) Inbound SMTP GW and AS/V filtering processes
(b) PO message delivery, storage, retrieval, and deletion processes
(c) POP and HTTPS message retrieval processes
(d) Outbound SMTP GW, HTTPS, and AS/V filtering processes

Similarly, the critical hardware elements in this e-mail platform are
(a) Inbound SMTP GWs, outbound SMTP MRs, and AS/V servers
(b) PP and WM message retrieval servers
(c) PO message storage servers

Examples of software/hardware elements that may not be considered critical (at least in the first iteration) include databases to store user identities, credentials, and e-mail preferences, secure servers to authenticate HTTPS users against their credentials, directories to map user identities to physical mailbox locations, servers to manage access control lists (ACLs) and spammer “blacklists,” log servers to record transaction access and summarize daily usage volumes, scripts to migrate mailboxes from one PO to another PO for load-balancing, and probe servers to measure transaction reliability and performance. Such noncritical elements can be explicitly considered in successive iterations if their associated transaction volumes or resource consumptions turn out to warrant it.

Once the critical software/hardware platform elements are identified, we next identify the critical user/system transactions. First, these critical “use case” transactions must include all those that will have associated service-level metrics. If we do not explicitly model transactions for which a requirement will be specified, then we will not know in a timely manner if the requirement can be met.
For example, any common user-initiated transactions, such as retrieving a message, must be considered as critical. In addition, critical transactions must include those that may be particularly usage- or resource-intensive. If we do not explicitly model transactions that may consume significant resources, then we will not know in a timely manner if the system will have adequate capacity. For example, if the e-mail service implements a message aging policy, traversing the storage directory to find messages older than N days can be extremely CPU-intensive, even though there is no associated performance metric. For the e-mail platform, some of the critical transactions are
(a) Receive and filter an inbound or outbound SPAM/virus message
(b) Receive, filter, and deliver a safe inbound message
(c) Receive, filter, and deliver a safe outbound message locally
(d) Receive, filter, and deliver a safe outbound message to the Internet
(e) Retrieve a mailbox contents via POP (including moving the contents to a trash bin for subsequent deletion)
(f) Retrieve a mailbox list via HTTPS
(g) Retrieve a single message via HTTPS
(h) Delete a list of messages via HTTPS
(i) Traverse the storage directory to find messages older than N days

Examples of transactions that may not be considered critical (at least in the first iteration) include interactions with a database server to identify a user or update user e-mail preferences, sending a non-delivery notice (NDN) back to the originating ISP when a message recipient is not found, writing transactions to a logging server, updating the AS/V rule set when new spam signatures are identified, migrating mailboxes from one PO to another PO to load-balance the storage levels, running daily scripts to summarize transaction volumes, and so on.
Such non-critical transactions can be explicitly evaluated in successive iterations if their transaction volumes or resource consumptions turn out to be higher than anticipated, or if the involved elements turn out to be bottlenecks.

These critical platform elements, and many of these critical transaction flows, are captured in Fig. 16.2. Given this characterization, we can begin to identify possible resource limits and potential capacity- and performance-limiting platform bottlenecks. Listed below are a number of typical e-mail platform choke points. This list is by no means intended to be exhaustive. Rather, these are a few of the numerous bottlenecks that can be identified early in the platform delivery process:

The ISP does not have direct control over the incoming message arrival process. The source ISP could deliver messages as they are received (one at a time, one SMTP connection per message), or store messages destined for a particular ISP and deliver a batch of many messages at once (in one SMTP connection). As a result, the inbound SMTP process running on the GWs needs to be able to handle a highly variable input stream with highly variable connection times. This in turn suggests that the GWs need to have a large amount of RAM, and a mechanism to commit messages to disk prior to closing the SMTP connection (to ensure message delivery reliability).

The rules governing AS/V filtering are highly dynamic and ever-expanding. Spam is growing exponentially, with volumes doubling every few years. And as a new spam, virus, or worm signature is identified, the filtering rule set must be updated with this new signature. Thus, the GW must be able to keep pace with the ever-growing processing demands of this CPU-intensive function. For this reason, the AS/V function is often moved to an outboard filtering engine such as a high-density, rack-mounted, disk-less blade server, where processing power can be grown in a cost-effective manner.
This in turn allows the GW server to be specialized to its more memory-intensive task.

Mailbox management is of particular concern for any e-mail platform. Without proper policies to control message retention (e.g., an aging policy that deletes unread messages older than 60 days), PO storage needs will grow exponentially. Even so, the POs need to be able to handle huge volumes of data and support large disk subsystems. As a result, PO storage is often moved to expandable NFS-based network attached storage (NAS), or even to a storage area network (SAN).

Finally, the user experience is typically dominated by message retrieval, where stringent performance and reliability metrics are often defined. Hence, the user-facing PP and WM servers must be sized to provide adequate capacity to meet the delay requirements even under failure conditions (e.g., two of N servers are down, or, in the case of redundant sites, half of all servers are unavailable due to, say, site router failure).

16.5 Workload/Metrics Assessment

This section describes the Workload/Metrics Assessment activities. These tasks are usually performed during the design phase of the platform life cycle. The goals at this stage are to
1. Characterize the anticipated critical (usage- or resource-intensive) transaction workload and develop representative workload models (transaction mix) to describe platform usage during typical and extreme scenarios (e.g., under element failure or during peak holiday periods)
2. Characterize the anticipated transaction performance and reliability requirements/metrics
3. Develop software component resource estimation and budgeting models for the representative transaction workload scenarios
4. Identify needs to optimize the platform architecture (e.g., splitting software components across multiple servers) based on budget constraints

16.5.1 Workload Models and Requirements

For an e-mail platform, many of the critical transactions were listed in the previous section. Table 16.1 provides an example transaction mix for the normal and peak scenarios. These workload parameters can be estimated through a variety of channels, including past platform experience, competitive assessments, industry benchmarks, and market research. In addition, sensitivity analyses can be performed to understand the C/PE ramifications of significant changes in the expected workload profile.

Next, we must specify performance and reliability requirements for a subset of the critical transactions. Often, these requirements are driven by what customers are demanding, or by what competitors are offering, or by what the product planning organization thinks will be required to differentiate this product from those offered by competitors. Frequently, these requirements are built into a contractual service-level agreement (SLA) with the customer, including specific penalties (such as a specified credit on the monthly service cost) when an SLA metric is violated.

Performance and reliability metrics can take many forms. Traditionally, these metrics have been specified in terms such as “average delay less than X seconds” or “availability greater than Y %”.
Table 16.1 Representative transaction workload models

Critical transaction                                  Normal rate (tps)   Peak rate (tps)
Receive/filter inbound (IB) spam message              2,000               3,000
Receive/filter outbound (OB) spam message             1,000               1,500
Receive/filter/deliver safe IB message                500                 750
Receive/filter/deliver safe OB message locally        100                 150
Receive/filter/deliver safe OB message to Internet    400                 600
Retrieve mailbox contents via POP                     100                 150
Retrieve mailbox list via HTTPS                       100                 150
Retrieve single message via HTTPS                     200                 300
Delete list of messages via HTTPS                     150                 200
Traverse storage directory to find old messages       0.001               0.001

More recently, many service providers have adopted approaches such as the 6 Sigma methodology [11] to specify these metrics in terms of a unified Defects per Million (DPMs) rate. In a nutshell, a transaction defect can occur in any one of three areas:
1. Accessibility (simply speaking, the availability of the operation)
2. Continuity (the reliability of the operation)
3. Fulfillment (the performance quality of the operation, e.g., latency)

For example, an e-mail transaction can be considered defective if
(a) The transaction fails to complete (i.e., no response is received)
(b) The transaction completes, but an incorrect response is received
(c) The transaction completes, and the correct response is received, but the response time (or other appropriate metric) violates its target

The overall DPM is defined as $10^6 \times$ {the fraction of defective transactions in excess of target}. Specifically, the number of defects (raw count) is the actual number of transactions violating their targets minus the allowable number of transactions violating their targets. For example, for the “deliver a safe inbound message” transaction, a typical target might be “95% of measured delivery times <10 min”. In this case, the DPMs would be $\max\{[(\text{observed fraction} > 10\ \text{min}) - (1 - 0.95)] \times 10^6,\ 0\}$.

Regardless of how the metrics are defined, a number of characteristics must be addressed.
First, they must be specific. Consider, for example, a response time requirement for a “retrieve message” transaction. We must specify any characteristics that impact delay, such as message size, the point at which the stopwatch begins (user clicks on link) and ends (first packet is received), access link speed, and so on.

Second, they must be measurable. A service-level metric is useless to you and your customers if it cannot be accurately measured and verified. As a result, you need to consider how you plan to measure the requirement. Will you need a software client add-on to capture and report measurements? Will you deploy a hardware sniffer at select user end points? Will you subscribe to an outside vendor’s performance verification service? Will you develop a separate measurement platform to launch synthetic transactions into your platform?

Third, they must be controllable. You may have difficulty meeting contractual SLAs if you do not control all components in the critical path of the metric. For instance, defining a response time metric to include rendering by the user’s browser makes you vulnerable to the user’s PC. Or defining a metric to include network transport is dangerous if you do not control the access/egress networks. For example, we may define a metric for “deliver message locally” rather than “deliver message to Internet” because the ISP does not have control over the Internet, or the recipient ISP’s platform.

Using this approach, we can define the DPM components for each critical transaction. For example, the direct measures of quality (DMoQs) associated with the “retrieve message via HTTPS” transaction are shown in Table 16.2. Given these DMoQs, we can tolerate up to $10^6 \times 0.05 = 50{,}000$ DPMs associated with the “retrieve message via HTTPS” transaction: up to $10^6 \times 0.005 = 5{,}000$ accessibility DPMs and up to $(10^6 - 5{,}000) \times 0.002 = 1{,}990$ additional continuity DPMs, with the balance as fulfillment DPMs.
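The DPM definition and the budget split described above can be checked with a short Python sketch. The function names `dpm` and `dpm_budget` and the 7% observed violation rate are hypothetical, chosen only to illustrate the arithmetic; the targets (99.5% availability, 99.8% reliability, 95th percentile) are the ones quoted in the text.

```python
def dpm(observed_defect_fraction, allowed_defect_fraction):
    """DPM = 10^6 x (fraction of defective transactions in excess of target), floored at 0."""
    return max((observed_defect_fraction - allowed_defect_fraction) * 1e6, 0.0)

def dpm_budget(availability_target, reliability_target, percentile_target):
    """Split the tolerable DPMs into accessibility, continuity, and fulfillment parts."""
    total = 1e6 * (1.0 - percentile_target)
    accessibility = 1e6 * (1.0 - availability_target)
    # Continuity defects can only occur among attempts that were accessible.
    continuity = (1e6 - accessibility) * (1.0 - reliability_target)
    fulfillment = total - accessibility - continuity
    return {"total": total, "accessibility": accessibility,
            "continuity": continuity, "fulfillment": fulfillment}

# Targets for the "retrieve message via HTTPS" transaction quoted in the text.
budget = dpm_budget(availability_target=0.995, reliability_target=0.998,
                    percentile_target=0.95)
print(budget)

# Hypothetical measurement: 7% of deliveries exceed 10 min against a 95% target.
print(dpm(observed_defect_fraction=0.07, allowed_defect_fraction=0.05))
```

Under these targets the balance left for fulfillment defects is the total minus the accessibility and continuity allowances, which the sketch computes rather than hard-codes so the targets can be varied in a sensitivity analysis.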
Table 16.2 Example metrics for “retrieve message via HTTPS” transaction

DMoQ                                                  Target   Definition
1. HTTPS read availability                            99.5%    Proportion of attempts that complete prior to time out
2. Read reliability (given 1.)                        99.8%    Proportion of attempts that complete successfully
3. 95th percentile response time (given 1. and 2.)    20 s     Time from clicking on link until contents fully displayed

16.5.2 Resource Estimation and Budgeting Models

Performance modeling must begin as early as possible in the platform development process. Many platform planners assume that useful models cannot be constructed until after the software is developed and tested. Unfortunately, if we wait until performance/scalability problems are uncovered during testing, it is often too late to make architectural changes without costly rework. Early-stage performance models need not be overly complex. In fact, simple “back-of-the-envelope” (BoE) models often provide valuable insights into performance issues. Resource estimation and budgeting is one such modeling effort that can bear significant fruit early in the platform life cycle [12].

Throughout the remainder of this chapter, we attempt to maintain a consistent set of symbolic notation in mathematical formulas wherever possible. We consolidate much of this notation here so that the reader can refer back to one place to refresh their memory.
Let
$R_i$ denote the rate of transaction $i$ (in transactions per second, or tps)
$N_j$ denote the number of instances of component $j$ (parallel servers)
$C_{ij}$ denote CPU consumption of transaction $i$ on component $j$ (s)
$\rho_j = \sum_i R_i C_{ij} / N_j$ = CPU utilization per replica of component $j$
$T_{ij} = C_{ij} / (1 - \rho_j)$ = delay of transaction $i$ on component $j$ (s) (footnote 5)
$T_i = \sum_j T_{ij}$ = end-to-end delay of transaction $i$ (s)
$\bar{T}_i$ denote the end-to-end delay requirement of transaction $i$ (s)
$\bar{T}_{ij}$ denote the delay budget of transaction $i$ on component $j$ (s)

Footnote 5: This expression results from a BoE model for delay $W$ reviewed in Section 16.2.

The process of resource estimation and budgeting is iterative, and varies depending on the stage of platform life cycle. Prior to development, when resource consumptions cannot yet be measured, designers must estimate resource costs based on the detailed component design. Resource estimation is a “bottom-up” approach in that we first estimate the hardware/software component resource consumptions $C_{ij}$ and determine the number of servers $N_j$ required to meet the delay requirements $\bar{T}_i$. Thus, the goal of software component resource estimation is essentially to minimize $N_j$ such that $T_i \le \bar{T}_i$. Resource budgeting is a “top-down” approach in that we first specify delay allocations $\bar{T}_{ij}$ for each component and determine the maximum resource consumptions $C_{ij}$ allowable while still meeting the budgets. Thus, the goal of software component resource budgeting is essentially to maximize $C_{ij}$ given $N_j$ such that $T_{ij} \le \bar{T}_{ij}$. By performing both estimation and budgeting, we can identify gaps in the design and focus development resources on the most critical components. As development proceeds, the results of the estimation and budgeting eventually align.

With this process in mind, consider again the “retrieve message via HTTPS” transaction. The critical path flow of this transaction is Client ↔ Access network ↔ WM server ↔ PO server.
Thus, delay objectives must be budgeted to each component in the critical path. Assume that to meet a 95th percentile delay requirement of 20 s, you must target an average response time of 10 s (footnote 6). Assuming that the performance metric is specified to be from the moment the user clicks on the browser link until the first packet of a 100 kB message is received, the client is essentially removed from the critical path. Assuming that the ISP provides an access network capable of sustaining 1 Mbps, the transmission of a 100 kB message should take no more than 1 s including protocol delays. Of the remaining 9 s, assume that our rough sizing of the workload indicates an initial allocation of 3 s to the WM server and 5 s to the PO as a starting point. We keep the final 1 s in reserve (a “kitty”) to allocate later in the process in the event of minor overruns. This initial allocation is somewhat arbitrary, since we can perform sensitivity analyses around the allocation of time among components to optimize the configuration.

Consider the WM server (the PO server budgeting is similar). The WM server is in the critical path of three critical transactions: “retrieve single message via HTTPS” (transaction 1), “retrieve mailbox list via HTTPS” (transaction 2), and “delete list of messages via HTTPS” (transaction 3). From Table 16.1, the peak transaction rates are $R_1 = 300$ tps, $R_2 = 150$ tps, and $R_3 = 200$ tps. Assume that development has not yet completed, but the designers estimate that the WM CPU consumptions per transaction are $C_1 = 20$ ms, $C_2 = 40$ ms, and $C_3 = 30$ ms. (For simplicity, we drop the subscript $j$ as we only consider a single component.) Then the WM server CPU utilization per replica is $\rho = \sum_i R_i C_i / N = 18/N$, and the average delay of transaction 1 (in seconds) is $T_1 = C_1 / (1 - \rho) = 0.02N / (N - 18)$. Finally, solving the expression $T_1 \le 3$ s for $N$ yields $N \ge 18.1$ WM servers.
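The sizing arithmetic above can be sketched in Python as follows. The helper names `min_replicas` and `average_delay` are hypothetical; the rates, CPU estimates, and the 3 s WM delay budget are the ones given in the text. Rearranging $C_i / (1 - D/N) \le \bar{T}$, where $D = \sum_i R_i C_i$ is the total CPU demand, gives $N \ge D / (1 - C_i/\bar{T})$.

```python
import math

def min_replicas(peak_rates, cpu_seconds, target_index, delay_budget):
    """Smallest N with C_i / (1 - D/N) <= delay_budget, where D = sum(R_i * C_i)."""
    cpu_demand = sum(r * c for r, c in zip(peak_rates, cpu_seconds))
    c = cpu_seconds[target_index]
    if c >= delay_budget:
        raise ValueError("CPU time alone exceeds the delay budget")
    exact = cpu_demand / (1.0 - c / delay_budget)
    return cpu_demand, exact, math.ceil(exact)

def average_delay(peak_rates, cpu_seconds, target_index, replicas):
    """BoE delay T_i = C_i / (1 - rho) with rho = D / N."""
    rho = sum(r * c for r, c in zip(peak_rates, cpu_seconds)) / replicas
    if rho >= 1.0:
        raise ValueError("utilization must be below 1")
    return cpu_seconds[target_index] / (1.0 - rho)

rates = [300.0, 150.0, 200.0]   # peak tps for transactions 1, 2, 3
cpu = [0.020, 0.040, 0.030]     # estimated WM CPU seconds per transaction
demand, exact, servers = min_replicas(rates, cpu, target_index=0, delay_budget=3.0)
print(demand, exact, servers)
print(average_delay(rates, cpu, target_index=0, replicas=servers))
```

Because the per-transaction CPU time is tiny relative to the 3 s budget, the result is driven almost entirely by keeping utilization below 1; rounding up to a whole server leaves the projected average delay well inside the budget.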
Thus, given the peak workload and resource consumption estimates, the current projection is that at least 19 WM servers are required to meet the average delay objective for transaction 1. (This number could increase further once a similar analysis is performed for transactions 2 and 3.) This approach is “best case” in that it ignores contributions to end-to-end delay other than CPU contention. For instance, disk and network I/O time and protocol delays will “eat into” the delay budget, leaving less time for CPU processing. These factors are usually illuminated during the testing phase, and can be captured during the performance modeling. This approach is also “worst case” in that the underlying BoE model $T = C / (1 - \rho)$ assumes a high degree of variability that is frequently not observed when computer systems deterministically execute identical code for each job.

Footnote 6: As discussed in Section 16.2, both analytic modeling and practical experience suggest that the average delay for user-initiated jobs with common code execution is typically one-third to half of 95th percentile delay. As part of the budgeting exercise, we can perform sensitivity analyses around this 95th percentile-to-mean assumption.

This budgeting exercise often sheds light on opportunities to optimize the platform architecture based on the resource constraints. For example, we may find that too many critical transactions are competing for the same resource, resulting in the need to over-engineer that component to meet the most restrictive requirement. By splitting the functionality across specialized servers (say one pool of WM servers to handle transactions 1 and 3, and another pool to handle transaction 2), we may be able to meet all requirements with fewer total servers. Or we may find that one particular design estimate for component CPU consumption leads to an inefficient use of the resources.
By budgeting a smaller target for that component, we may be able to better focus development resources on the most critical components, thus leading to a more efficient product.

16.6 Availability/Reliability Assessment

This section describes the Availability/Reliability Assessment activities. These tasks are usually performed between the design and software development phases of the platform life cycle. The goals at this stage are to
1. Develop reliability block diagram models to quantify long-term (steady-state) service availability, and birth and death models to quantify short-term (transient) platform reliability, and identify reliability-impacting platform bottlenecks (such as single points of failure)
2. Perform reliability sensitivity and failure-mode analyses to identify and quantify the reliability impact of required platform enhancements
3. Propose additional reliability requirements and engineering rules

16.6.1 Availability Modeling

Prior to software development, we can begin to assess platform availability and reliability. To determine the availability of the platform for various activities, we first estimate the availability of all components and identify which components are required to perform the activity. The data required for each component in estimating transaction availability are
• The mean-time-to-failure (MTTF)
• The mean-time-to-repair or restore (MTTR)
• The “K of N” sparing policy (discussed below)
• The software + procedural “scaling factor” (discussed below)

The availability $A$ of each element is given by $A = \text{MTTF} / (\text{MTTF} + \text{MTTR})$, and the downtime $DT$ (in minutes per year) is given by $DT = 525{,}600\,(1 - A)$. The sparing policy depends on whether or not persistent state information is retained. For stateless components such as the GW server, the notation “K of N” denotes that the component is available if at least K of the N replicas are operational.
For stateful components such as the PO, the notation “1 of K + N” (typically 1 of 1 + 1) denotes that there are K primary servers and N warm standbys. For 1 of 1 + 1 sparing, if the primary component fails, the state is re-created on the warm standby, and the subsystem is down for the duration of this failover procedure (given by the MTTR). The availability for a block of elements is given by

$$A_{k \text{ of } n} = \sum_{i=0}^{n-k} \binom{n}{i} A^{n-i} (1 - A)^i.$$

For example, $A_{N-1 \text{ of } N} = N A^{N-1} - (N-1) A^N$; $A_{N-2 \text{ of } N} = \tfrac{1}{2}(N-1)(N-2) A^N - N(N-2) A^{N-1} + \tfrac{1}{2} N(N-1) A^{N-2}$, and so on. Note that this general model includes as limiting cases the “series” (N of N) and “parallel” (1 of N) systems, given by $A_{\text{series}} = A^n$ and $A_{\text{parallel}} = 1 - (1 - A)^n$.

Two approaches are commonly employed to account for the effects of software faults and procedural errors on the platform availability:
• Perform rigorous software reliability analysis to measure and estimate the mean time between faults/errors and MTTR, and explicitly include these components in the reliability critical path
• Scale the hardware availability estimates based on common “rules of thumb” to account for software/procedural impacts

Clearly, the first approach is more accurate and application-specific, provided data can be obtained at the current stage of platform delivery. There are numerous approaches to this analysis, such as software reliability engineering (SRE), fault insertion testing (FIT), and modification request (MR) analysis (cf. [13, 14]). More often than not, however, direct measurements of software/procedure failures are not available until the platform has been through system test and/or deployed for some time. As a result, the second approach is more common at this early stage. One common methodology is to scale the hardware downtime to reflect software/procedural faults. Based on experience [15], the recommended factors are listed in Table 16.3.
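The element availability, “K of N” block availability, and downtime scaling described in this section can be sketched in Python. The function names are hypothetical. The example inputs (MTTF of 5,000 h, MTTR of 8 h, 9 of 10 sparing, scaling factor 25) are those of the GW row of Table 16.4, and the total downtime is taken as the scaling factor times the hardware downtime.

```python
from math import comb

MINUTES_PER_YEAR = 525_600

def element_availability(mttf_hours, mttr_hours):
    """A = MTTF / (MTTF + MTTR) for a single element."""
    return mttf_hours / (mttf_hours + mttr_hours)

def k_of_n_availability(a, k, n):
    """Probability that at least k of n identical, independent replicas are up."""
    return sum(comb(n, i) * a ** (n - i) * (1.0 - a) ** i
               for i in range(n - k + 1))

def downtime_minutes_per_year(availability):
    """DT = 525,600 * (1 - A)."""
    return MINUTES_PER_YEAR * (1.0 - availability)

# GW example: 10 stateless replicas, block is up if at least 9 are operational.
a_gw = element_availability(mttf_hours=5_000, mttr_hours=8)
a_hw = k_of_n_availability(a_gw, k=9, n=10)
dt_hw = downtime_minutes_per_year(a_hw)

scaling = 25                      # hardware-to-total downtime scaling factor
dt_total = scaling * dt_hw        # total downtime, hardware + software + procedural
a_total = 1.0 - scaling * (1.0 - a_hw)

print(round(a_hw, 5), round(dt_hw, 1), round(dt_total), round(a_total, 5))
```

With these inputs the sketch gives a hardware downtime of about 59.8 min/year and a total downtime of about 1,496 min/year, in line with the GW row of Table 16.4. The same functions cover the series (k = n) and parallel (k = 1) limiting cases.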
Let $S$ denote the scaling factor, and let the subscripts $H$ and $T$ denote the hardware (only) and total (hardware + software + procedural) availability measures. Then $S = DT_T/DT_H = (1 - A_T)/(1 - A_H)$. Thus, once we compute the hardware downtime $DT_H$ and availability $A_H$, the total downtime $DT_T$ is given by $S \cdot DT_H$, and the total availability $A_T$ is given by $A_T = 1 - S(1 - A_H)$.

Table 16.3 Recommended hardware-to-total DT scaling factors

Level of component complexity    Platform life-cycle stage
                                 New    Evolving    Mature
Simple, simplex                  15     9           3
Moderate, average                20     12          4
Complex, redundant               25     15          5

Fig. 16.3 Example reliability block diagram [Global Internet, Access (1 of 2), GW (N − 1 of N), AS/V (N − 2 of N), PO with spare (1 of 1 + (N − 1))]

The process usually begins with the construction of reliability block diagram (RBD) models. These models define “blocks” of platform elements along the critical path of transaction flows, where each block has an associated probability of failure. For example, Fig. 16.3 shows a typical RBD model for the “receive/filter/deliver safe IB message” transaction. As can be seen, there are two access links from the Internet to the GWs. In this example, assume that this portion of the path is available if at least one is operational (1 of 2). There are multiple stateless GW servers, any one of which could receive the next connection. Assume that this portion of the path is available (i.e., has sufficient capacity to handle the workload without performance degradation) if no more than one GW is down (hence, N − 1 of N must be up). There are multiple stateless AS/V servers, any one of which could receive the next filtering request. Assume that this portion of the path is available if no more than two AS/V servers are down (hence, N − 2 of N must be up). Finally, there are multiple stateful PO servers, only one of which normally contains the destination mailbox.
Assume that there is one spare PO available in the event that any primary PO fails. Then, this portion of the path is available if either the primary PO for this mailbox is operational, or the spare PO is available to serve this mailbox. The spare PO in turn is available if all other $N - 1$ POs are operational. Hence, 1 of $1 + (N - 1)$ must be up. Further explanation of how failover to the spare PO actually works is given later in Section 16.6.2 when we discuss detailed reliability modeling. (For simplicity, we ignore other components in the critical path, such as access routers, load-balancing switches, and LAN hubs.)

Next, we estimate the MTTF and MTTR parameters for each element in the RBD model. The MTTF estimates are typically based on industry-standard assumptions or vendor analyses of server hardware availability, whereas the MTTR estimates are typically based on knowledge of your data center operations and staffing (e.g., 15 min to detect and reboot a server, 4 h to diagnose and replace a LAN card, and 24 h to ship and replace a component not available on-site). Once the model is constructed, we can easily perform sensitivity analysis of these parameters.

For example, consider the RBD in Fig. 16.3. Table 16.4 shows the results of this availability modeling exercise. Columns two to four provide the assumed MTTFs, MTTRs, and scaling factors for the elements along the critical path. Columns seven to ten show the resulting downtimes and availabilities.

Table 16.4 Example availability analysis

Element        MTTF (h)  MTTR (h)  S   K   N   A_H      DT_H (min/year)  DT_T (min/year)  A_T
Access         40K       24        5   1   2   1.00000  0.2              1                1.00000
GW             5K        8         25  9   10  0.99989  59.8             1496             0.99715
AS/V           10K       4         15  18  20  1.00000  0.0              1                1.00000
PO             5K        4         25  4   5   1.00000  132.7            3319             0.99369
Critical path                                  0.99988  192.8            4807             0.99085

The PO server requires special treatment. The availability $A_H$ given in Table 16.4 only reflects the availability of the primary PO or spare PO (in parallel).
We must also reflect the failover time in the PO downtime $DT_H$. Assume that this procedure requires 1 h to migrate the file system from the failed PO to the spare, and another 15 min to reboot the server. Thus, every 5,000 h (the PO MTTF), we incur a 75-min downtime to restore service, or 131.4 min/year added to $DT_H$.

As can be seen, the estimated total availability $A_T$ for the “deliver safe IB message” transaction is 99.1%. If $A_T$ is less than the target requirement proposed in Section 16.5.1, then we must consider enhancements to the architecture and/or data center operations. For example, the biggest contributors to the downtime are the GW and PO servers. By planning for additional GW servers, we can provide enough capacity to handle two failures (N − 2 of N). If this change does not provide sufficient benefit to meet the requirement, then we can consider alternative storage architectures that could reduce the PO failover time below the 1-h assumption. These and other sensitivity analyses are easily facilitated by this modeling approach.

We must also be careful not to over-simplify the analysis. Otherwise, we may overlook potential single points of failure (SPoFs). For example, this analysis assumes that element failures are independent. In the case of the access links, this assumption implies that the physical links are diversely routed (i.e., each link takes a separate physical path between the data center and the Internet). If this assumption is not true (e.g., the links terminate on the same edge router, or the logical links “ride” on the same higher-capacity physical fiber), then failures are not independent (e.g., a fiber cut can take out both links). As another example, if the AS/V servers are blade servers, then they reside in a blade center chassis. If there are SPoFs in the chassis (e.g., power supply or cooling fan), then we could lose all AS/V servers in the chassis if the chassis fails.
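The availability analysis of Table 16.4 can be approximated with a short script, which also makes sensitivity runs (a different N, MTTR, or scaling factor) easy to try. This is a sketch under stated assumptions, not the author's tool: element failures are taken as independent, the function names are hypothetical, and the PO block is modeled as “primary down and at least one of the other N − 1 POs down”, which is one reading of the 1 of 1 + (N − 1) rule, with the 75-min failover added once per PO failure as described above.

```python
from math import comb

MINUTES_PER_YEAR = 525_600
HOURS_PER_YEAR = 8_760
FAILOVER_MIN = 75  # 1 h file-system migration + 15 min reboot, per the text

# (element, MTTF h, MTTR h, S, K, N), copied from Table 16.4
ELEMENTS = [
    ("Access", 40_000, 24, 5, 1, 2),
    ("GW", 5_000, 8, 25, 9, 10),
    ("AS/V", 10_000, 4, 15, 18, 20),
    ("PO", 5_000, 4, 25, 4, 5),
]


def k_of_n(a, k, n):
    """Availability of a block that needs at least k of n replicas up."""
    return sum(comb(n, i) * a ** (n - i) * (1 - a) ** i
               for i in range(n - k + 1))


def hardware_downtime(name, mttf, mttr, k, n):
    """Hardware-only downtime DT_H in minutes per year."""
    a = mttf / (mttf + mttr)
    if name == "PO":
        # Assumed reading of "1 of 1+(N-1)": the mailbox is down only if its
        # primary is down AND the spare is unusable (some other PO is down).
        unavailable = (1 - a) * (1 - a ** (n - 1))
        failovers_per_year = HOURS_PER_YEAR / mttf
        return MINUTES_PER_YEAR * unavailable + failovers_per_year * FAILOVER_MIN
    return MINUTES_PER_YEAR * (1 - k_of_n(a, k, n))


def analyze(elements=ELEMENTS):
    """Return per-element (DT_H, DT_T, A_T) and the critical-path A_T."""
    rows, path_a_t = {}, 1.0
    for name, mttf, mttr, s, k, n in elements:
        dt_h = hardware_downtime(name, mttf, mttr, k, n)
        dt_t = s * dt_h                    # DT_T = S * DT_H
        a_t = 1 - dt_t / MINUTES_PER_YEAR  # total availability of the block
        rows[name] = (dt_h, dt_t, a_t)
        path_a_t *= a_t                    # blocks in series along the path
    return rows, path_a_t


if __name__ == "__main__":
    rows, path_a_t = analyze()
    for name, (dt_h, dt_t, a_t) in rows.items():
        print(f"{name:7s} DT_H={dt_h:7.1f}  DT_T={dt_t:7.0f}  A_T={a_t:.5f}")
    print(f"critical path A_T = {path_a_t:.5f}")
```

Changing a single tuple in `ELEMENTS` (for example, raising the GW count so that N − 2 of N suffices) is enough to rerun the kind of what-if analysis discussed above.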
Similarly, if the data center does not have battery or diesel power backup, then the loss of commercial power could result in the catastrophic failure of all servers at once.

16.6.2 Reliability Modeling

Once RBDs are constructed for the critical transactions, we can begin to look at element reliability in more detail where warranted. For example, consider again the PO server failover behavior described in the previous section. Assume that the procedure for handling PO outages is as follows:

• If the primary PO serving a given mailbox goes down, the data center staff first tries to reboot the PO. With probability c (referred to as the “coverage” factor to denote that the remedial action – in this case, a reboot – “covers” the failure event), the PO successfully comes back up.
• Otherwise, the PO is considered failed. If the spare PO is available, then state is migrated onto the spare PO (failover), which becomes the new primary PO for the given mailbox. Once the failed PO is repaired, it becomes the new spare.
• Otherwise, if the spare PO is unavailable (i.e., another PO failure occurred and that PO is not yet repaired), then the given mailbox is unavailable.

The PO availability state space can then be described as follows. Let

• A denote the state “primary PO is active”
• D denote the state “primary PO is down”
• F denote the state “primary PO is failed”
• S denote the state “spare PO is available”
• U denote the state “spare PO is unavailable”

Furthermore, let

• $N$ denote the total number of PO servers
• $\lambda$ denote the PO failure rate $= \mathrm{MTTF}^{-1}$
• $\mu_R$ denote the PO repair rate $= \mathrm{MTTR}^{-1}$
• $\mu_B$ denote the PO reboot rate $= (\text{time to reboot})^{-1}$
• $\mu_F$ denote the PO failover rate $= (\text{time to failover to spare})^{-1}$
• $c$ denote the reboot coverage factor

As with most modeling efforts, we take a layered approach to this reliability modeling (starting with the simplest model first, and adding successively more detail until the benefits diminish).
With this approach in mind, the simplest model results from assuming that the spare PO is always available. The state transition model for this case is shown in Fig. 16.4. Hopefully, “A/S” is the predominant state. Transitions from state “A/S” to “D/S” occur at rate $\lambda$ if the primary PO goes down. Transitions from “D/S” back to “A/S” occur at rate $c\mu_B$ if the reboot succeeds, whereas transitions from “D/S” to “F/S” occur at rate $(1 - c)\mu_B$ if the reboot fails. Finally, transitions from “F/S” to “A/S” occur at rate $\mu_F$ if the primary PO fails over to the spare. Service is available if the PO is in state “A/S”. As discussed in Section 16.2, this state transition diagram describes a birth and death model.

Fig. 16.4 Simple PO server state transition diagram

Solving the resulting equilibrium balance equations yields

$$P(A/S) = \frac{\mu_B \mu_F}{\mu_F(\lambda + \mu_B) + (1 - c)\lambda\mu_B}.$$

For $N = 5$, $\lambda = 1/5{,}000$, $\mu_B = 4$, $\mu_F = 1$, and $c = 0.5$, the probability that the mailbox is available $P(A/S) = 99.985\%$, resulting in a PO hardware downtime $DT_H$ of 78.8 min/year. Thus, reflecting the detailed data center operations procedure (attempting a reboot before failing over to the spare PO) results in a reduction in our PO hardware downtime estimate from 131.4 to 78.8 min/year. In fact, we can see now that the original estimate of 131.4 min/year is an upper bound, since it essentially assumes that $c = 0$ (i.e., every failure results in failover). In contrast, the new estimate of 78.8 min/year is a lower bound, as it assumes that we only ever have one concurrent PO failure.

This observation illustrates the point that much of the early C/PE analysis involves developing simple models to provide upper and lower bounds on the true answer. If the bounds are tight, then we can often move on to the next problem without the need to develop a more detailed model. In this case, the bounds are not tight.
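The closed-form result for the simple three-state model is easy to evaluate directly, which helps when checking the two bounds quoted above. The sketch below assumes rates expressed per hour and uses hypothetical function names; it simply evaluates the expression for P(A/S) with c = 0.5 and with c = 0.

```python
MINUTES_PER_YEAR = 525_600


def simple_po_availability(lam: float, mu_b: float, mu_f: float, c: float) -> float:
    """P(A/S) for the three-state chain A/S -> D/S -> F/S -> A/S.

    lam  : PO failure rate (1/MTTF, per hour)
    mu_b : reboot rate, mu_f : failover rate (per hour)
    c    : reboot coverage factor
    """
    return (mu_b * mu_f) / (mu_f * (lam + mu_b) + (1 - c) * lam * mu_b)


def downtime_min_per_year(p_available: float) -> float:
    """Convert an availability probability into minutes of downtime per year."""
    return MINUTES_PER_YEAR * (1 - p_available)


if __name__ == "__main__":
    lam, mu_b, mu_f = 1 / 5_000, 4.0, 1.0
    for c in (0.5, 0.0):
        p = simple_po_availability(lam, mu_b, mu_f, c)
        print(f"c={c:.1f}: P(A/S)={p:.6f}, DT_H={downtime_min_per_year(p):.1f} min/year")
```

With c = 0 every outage goes through both the reboot attempt and the failover, which is why that case approaches the earlier 131.4 min/year estimate.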
So adding the next layer of detail, we assume that the spare PO is sometimes unavailable due to the failure of another PO. Furthermore, we assume that at most one other PO has failed at any moment in time. This state transition model is shown in Fig. 16.5. Again, “A/S” (shown in the upper left corner) is hopefully the predominant state. Transitions from “A/S” to “A/U”, or from “D/S” to “D/U”, occur at rate $(1 - c)(N - 1)\lambda$ in the event that one of the other $N - 1$ POs fails (thus making the spare PO unavailable), whereas transitions from the bottom row state “*/U” to the top row state “*/S” occur at rate $\mu_R$ in the event that the failed PO is repaired. Transitions from “A/*” to “D/*” occur at rate $\lambda$ if the primary PO goes down. Transitions from “D/*” back to “A/*” occur at rate $c\mu_B$ if the reboot succeeds, whereas transitions from “D/*” to “F/*” occur at rate $(1 - c)\mu_B$ if the reboot fails. Finally, transitions from “F/S” to “A/U” occur at rate $\mu_F$ if the primary PO fails over to the spare PO. Service is available if the platform is in either of states “A/S” or “A/U”.

Fig. 16.5 Detailed PO server state transition diagram

Solving the (much more complicated) equilibrium balance equations for $N = 5$, $\lambda = 1/5{,}000$, $\mu_R = 1/4$, $\mu_B = 4$, $\mu_F = 1$, and $c = 0.5$ suggests that the probability that the mailbox is available $P(A/S) + P(A/U) = 99.9849\%$, resulting in a PO hardware downtime $DT_H$ of 79.3 min/year. Thus, adding this next layer of detail results in a further refinement in our PO hardware downtime estimate from 78.8 to 79.3 min/year.

Fig. 16.6 Sensitivity of PO availability to coverage [plot titled “Probability of PO Availability and HW Downtime vs Coverage”: P(A*) and DT (min/yr) versus coverage c from 0.0 to 1.0]
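The six-state model has no convenient closed form, but its equilibrium balance equations can be solved numerically. The sketch below is one way to do so in plain Python: it builds the generator matrix from the transitions listed above (as read from the text, so the exact rate set is an assumption), replaces one balance equation with the normalization condition, and solves the linear system by Gaussian elimination. The function names are hypothetical. Under these assumptions the computed downtime should land within a fraction of a minute of the figure quoted above.

```python
MINUTES_PER_YEAR = 525_600
STATES = ["A/S", "A/U", "D/S", "D/U", "F/S", "F/U"]


def transition_rates(n, lam, mu_r, mu_b, mu_f, c):
    """Transition rates (per hour) of the six-state model, as read from the text."""
    other_po_fails = (1 - c) * (n - 1) * lam
    return {
        ("A/S", "A/U"): other_po_fails, ("D/S", "D/U"): other_po_fails,
        ("A/U", "A/S"): mu_r, ("D/U", "D/S"): mu_r, ("F/U", "F/S"): mu_r,
        ("A/S", "D/S"): lam, ("A/U", "D/U"): lam,
        ("D/S", "A/S"): c * mu_b, ("D/U", "A/U"): c * mu_b,
        ("D/S", "F/S"): (1 - c) * mu_b, ("D/U", "F/U"): (1 - c) * mu_b,
        ("F/S", "A/U"): mu_f,
    }


def steady_state(rates, states=STATES):
    """Solve pi * Q = 0 with sum(pi) = 1 by Gaussian elimination."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    q = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rates.items():
        q[idx[src]][idx[dst]] += rate
        q[idx[src]][idx[src]] -= rate
    # Row j of `a` is the balance equation for state j (transpose of Q);
    # the last balance equation is replaced by the normalization sum(pi) = 1.
    a = [[q[i][j] for i in range(n)] for j in range(n)]
    b = [0.0] * n
    a[n - 1], b[n - 1] = [1.0] * n, 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= f * a[col][k]
            b[r] -= f * b[col]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        pi[r] = (b[r] - sum(a[r][k] * pi[k] for k in range(r + 1, n))) / a[r][r]
    return dict(zip(states, pi))


def po_downtime(c, n=5, lam=1 / 5_000, mu_r=0.25, mu_b=4.0, mu_f=1.0):
    """PO hardware downtime (min/year); service is up in A/S and A/U only."""
    pi = steady_state(transition_rates(n, lam, mu_r, mu_b, mu_f, c))
    return MINUTES_PER_YEAR * (1 - pi["A/S"] - pi["A/U"])


if __name__ == "__main__":
    for c in (0.0, 0.25, 0.5, 0.75, 1.0):  # coverage sweep, in the spirit of Fig. 16.6
        print(f"c={c:.2f}: DT_H={po_downtime(c):.1f} min/year")
```

The same solver works for any small continuous-time Markov chain expressed as a dictionary of rates, so additional layers of detail can be explored by editing `transition_rates` alone.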
As with the RBD modeling, sensitivity analyses are easily facilitated by this modeling approach. For example, Fig. 16.6 shows the probability that the PO is available $P(A/*)$ and the PO hardware downtime as a function of the coverage factor $c$. From this type of analysis, we can then set development targets for recovery times and coverage factors.

Additional layers of detail are possible. For example, we can model the totality of failure possibilities for all $N$ POs. Here, the states can be represented by $(i, j)$, where $i = 0, \ldots, N$ is the number of active POs, and $j = 0, 1$ is the number of available spares. However, given the minimal change achieved by the last refinement, further detail is not warranted in this case.

16.6.3 Failure Modes and Effects Analysis

Failure modes and effects analysis (FMEA) is a proactive, systematic software quality assurance methodology, utilized during the design phase to help identify and correct weak points in the platform design, thereby addressing potential reliability problems prior to deployment. FMEA identifies both remedies to avoid outages, and mitigations (e.g., alarming and recovery procedures) to reduce recovery time. FMEA is powerful yet easy-to-use (one brainstorming session can yield significant improvements), and it typically pays for itself if just one failure is averted in the field. More specifically, FMEA is a disciplined design review technique intended to

• Recognize and evaluate potential failure modes of a system and their effects on user-perceived performance and reliability
• Assess the criticality of the potential failure mode in terms of its frequency of occurrence and its severity of impact
• Identify actions that could reduce or eliminate the likelihood that the potential failure mode occurs
• Identify alarming, alarm handling, and recovery procedures that focus on minimizing the time to restore service
• Develop a prioritized set of recommendations to achieve the greatest potential “bang for the buck” in terms of quality and TTR

The FMEA process is as follows: First, we decompose the system into functional elements (autonomous code modules with well-defined interfaces) and construct functional decomposition block diagrams to illustrate how the different subsystems are interconnected. Next, we identify the key transaction flows among the elements based on the operational workload profile and critical use cases. (This information is readily available from the Architecture and Workload assessments described previously.) Now, for each element/interface, we identify the possible failure modes and their likely effects. This step is typically accomplished during a brainstorming session with the platform architects, system engineers, lead developers, and lead testers, where the team addresses such questions as: What happens if interface X is slow, or hangs, or times out? What happens if external system Y is down for an extended period? What happens if the response is malformed, or inappropriate? Next, for each failure mode, we populate an FMEA spreadsheet with the following information:

• Failure mode (what can go wrong?)
• Failure effects (what are the impacts?)
• Frequency (how often does it occur?)
• Severity (how critical is it?)
• Detection (how is it recognized?)
• Root cause (what is the underlying event?)
• Remedies (what can be done to avoid the failure mode?)
• Mitigations (what can be done to alarm/recover quickly?)
• Effort (how costly – staff, capital – is it to do?)
Remedies are proactive approaches that result in outage avoidance by eliminating the underlying root cause (e.g., fix the software design, or add redundancy to remove SPoFs). Mitigations are proactive approaches that result in outage minimization by reducing their impact (e.g., alarming to provide early warning/detection, alarm handling procedures to facilitate detection, or recovery procedures to expedite short-term restoral/repair). Once a robust list of failure modes has been compiled, we review the spreadsheet to identify “low-hanging fruit,” such as

• Failure modes for which the remedy is trivial to implement (even if the criticality is low, eliminating these failure incidents is beneficial)
• Failure modes for which the effects are catastrophic (even if the effort is high, these failures must be eliminated through design improvements or minimized through alarming/recovery procedures)
• Failure modes for which the frequency is high

Finally, we prioritize the identified failure modes, and develop recommended action plans to address them. The priority of each failure mode is based on three factors: frequency of failure, severity of impact, and effort to remedy. One approach to prioritization is simply to assign numerical values to each factor, and use the product (or sum) of the values to determine priorities. The lower the product (or sum), the higher the priority (e.g., a product of 1 is number 1 priority).
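The product rule for prioritization can be sketched as follows. This is an illustrative sketch with hypothetical class and function names; the scores are the ones given for the three sample e-mail platform scenarios later in this section, with a range such as “1–2” entered as its midpoint 1.5, which is the convention those examples appear to use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    frequency: float  # 1 (about 100/year) to 4 (rarely)
    severity: float   # 1 (catastrophic) to 4 (minor)
    effort: float     # 1 (trivial) to 4 (major)

    @property
    def priority(self) -> float:
        """Product of the three factors; a lower value means higher priority."""
        return self.frequency * self.severity * self.effort


def rank(modes):
    """Order failure modes from highest priority (lowest product) to lowest."""
    return sorted(modes, key=lambda m: m.priority)


SAMPLE_MODES = [
    FailureMode("GW to AS/V communication disrupted", 2, 3, 3),
    FailureMode("PO physical disk storage exhaust", 4, 1.5, 1.5),
    FailureMode("User authentication services disrupted", 3, 1, 4),
]

if __name__ == "__main__":
    for position, mode in enumerate(rank(SAMPLE_MODES), start=1):
        print(f"{position}. {mode.name}: priority {mode.priority:g}")
```

Swapping the product for a sum only requires changing the `priority` property, which makes it easy to compare how sensitive the ranking is to the scoring rule.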
Example assignments are shown below:

Frequency
– Level 1 (100/year): error response code, time out, core dump
– Level 2 (10/year): CPU board failure, database corruption
– Level 3 (1/year): hard disk crash, commercial power outage
– Level 4 (rarely): lightning strike, flood, locust, alien attack

Severity
– Sev 1 (catastrophic): complete loss of system or major function
– Sev 2 (critical): severe reduction in functionality or performance
– Sev 3 (major): significant functionality/performance degradation
– Sev 4 (minor): slight degradation affecting limited population

Effort
– Level 1 (trivial): no change to code base, no regression testing
– Level 2 (minor): one staff-day of rework, little retesting
– Level 3 (moderate): one staff-week of rework, some retesting
– Level 4 (major): complete redesign of critical module

As an illustration of applying the FMEA methodology, a few sample FMEA scenarios for the e-mail platform are listed below:

1. Failure mode: Communication disrupted between the GWs and AS/Vs
– Effects: Messages back up in GW memory, queues overflow, further SMTP connections denied, message integrity possibly compromised
– Detection: Alarms monitoring GW memory usage, queue volumes
– Root causes: OB GW and/or IB AS/V process died? LAN failure?
– Remedies: Redundant LANs, multiple virtual IP addresses (VIPs)
– Mitigations: Alarm on queue thresholds, throttle SMTP connections, write IB messages to GW disk to guarantee integrity/avoid loss
– Frequency: 2 (e.g., once per month)
– Severity: 3 (i.e., significant functionality/performance degradation)
– Effort: 3 (e.g., 2 weeks to design/code/test write-to-disk capability)
– Priority: 18 (i.e., 2*3*3)

2. Failure mode: PO physical disk storage exhaust
– Effects: Inability to store/send messages, message corruption/loss
– Detection: Alarms monitoring PO disk usage, GW queue volumes
– Root causes: Disruption in PP/WM message deletion or PO garbage collection/disk clean-up? Unusual spike in volume or SPAM attack?
– Remedies: Rate-limit message ingestion, adequate spare PO storage
– Mitigations: Throttle message ingestion, off-load messages to tape
– Frequency: 4 (e.g., rarely with proper C/PE planning)
– Severity: 1–2 (i.e., possible loss of service to many users)
– Effort: 1–2 (e.g., small effort to proactively monitor storage levels)
– Priority: 9 (4*1.5*1.5)

3. Failure mode: User authentication services disrupted
– Effects: Inability to access mailbox, retrieve/send messages
– Detection: User complaints to customer care, DMoQ probe failures
– Root causes: Authentication process failure or database corruption? Intranet connectivity disruption? Denial-of-service attack?
– Remedies: Dual active–active authentication DBs, process monitors
– Mitigations: Auto-restart process, roll back DB, or restore from tape
– Frequency: 3 (e.g., once per year)
– Severity: 1 (i.e., possible loss of service to all users)
– Effort: 4 (e.g., major redesign of DB integrity/backup architecture)
– Priority: 12 (3*1*4)

Among these failure modes, “PO physical disk storage exhaust” is the highest priority at 9, followed by “user authentication services disrupted” at 12, then “communication disrupted between GWs and AS/Vs” at 18.

16.7 Capacity/Performance Assessment

This section describes the Capacity/Performance Assessment activities. These tasks are usually performed during the software development and testing phases of the platform life cycle. The goals at this stage are to

1. Identify ongoing transaction usage measurement/monitoring requirements and develop a unified measurement architecture for performance data collection, storage, distribution, reporting, and visualization
2. Measure per-transaction server resource consumptions (e.g., CPU, memory, disk, and I/O) and performance, under normal and overload conditions, and quantify maximum system throughput and system capacity
3. Develop element/platform performance models to identify performance-limiting bottlenecks, and perform sensitivity analyses to identify and quantify the performance impact of required enhancements
4. Identify necessary overload controls to prevent performance degradation at high load and propose engineering rules to avoid overload

16.7.1 Performance Data Measurement Architecture

Once software development has begun, we next turn our attention to planning for an effective capacity/performance measurement environment that will facilitate upcoming software testing as well as post-deployment ongoing scalability planning. As discussed previously, the ability to accurately measure transaction performance is critical for offering and supporting customer SLAs. A reliable performance measurement architecture is often one of the most overlooked aspects of software development, resulting in costly rework to build it back into the platform “after the fact.” Yet, it is one of the most important components of any successful platform. The foundation for this measurement architecture is often laid during this phase, as we prepare to measure performance in the laboratory.

C/PE is responsible for thoroughly identifying all relevant application traffic/workload and resource consumption/usage measurements. Each platform element must be instrumented to measure its own application workload and resource utilization. Each software component is responsible for “application-aware” measurements, while the server OS is responsible for basic, hardware-level measurements. For example, the GW application must be instrumented to track such application-level workload metrics as the number of SMTP connections, the number of received messages, the number of AS/V filtered messages, the number of messages transmitted to the POs, and so on.
Each such measurement should be reported at various levels of aggregation (e.g., total per day, during each hour, and during the busiest 5 min (B5M) or the busiest 1 min). The GW server (OS) must track resource utilizations, such as 5-min samples of CPU and memory utilization, disk and network I/O operations, and so on. (Fortunately, most OSs routinely track these system activity metrics, so special measurement code development is typically not required.) This measurement architecture is discussed in more detail in Section 16.9.

16.7.2 Performance Testing

Software testing typically consists of three parts. First, during “unit” testing, individual software components are tested in isolation to ensure that they function as designed. Second, during “system” testing, components are collectively tested in an end-to-end manner to ensure that interfaces are operating properly and transactions are processed as expected. Third, during “load/soak” testing, collective components are stressed by generating multiple concurrent transactions to assess system behavior under expected load and under overload [16]. C/PE plays a role in all three parts.

During the unit test phase, the actual transaction CPU consumptions $C_{ij}$ and “no-load” transaction delays $T_{ij}$ are collected. This allows us to iteratively update the software estimation/budgeting analysis performed earlier, and to begin to gather the parameters and insights required to build element performance models. For instance, the budgeting formula reduces to $T_{ij} = C_{ij}$ at no-load ($\rho \approx 0$). If $T_{ij}$ is significantly larger than $C_{ij}$ at no-load, then we know that other contributions to end-to-end delay (e.g., disk I/O time) must be explicitly captured during upcoming performance modeling. In the meantime, the budgeting analysis can be updated to reflect these additional delays.
For example, if the unit testing reveals that the additional delay is associated with an operation that is not highly dependent on the load on this component (such as waiting for a timer to expire, looking up a record in a database dominated by other transactions, or session protocol handshake delay), then this additional delay can be treated as a fixed cost added to the transaction delays $T_{ij}$ during the budgeting calculation. On the other hand, if the unit testing reveals that the additional delay is associated with another server resource such as disk I/O, then the expression for the delay $T_{ij}$ of transaction $i$ on component $j$ must be expanded to include another load-dependent component, the disk controller. Thus, we now have $T_{ij} = C_{ij}/(1 - \rho_{Cj}) + D_{ij}/(1 - \rho_{Dj})$, where $\rho_{Cj}$ and $\rho_{Dj}$, respectively, denote the CPU and disk utilizations per replica of component $j$, and $D_{ij}$ denotes the disk consumption of transaction $i$ on component $j$ (which must be measured along with $C_{ij}$ during unit testing).

Continuing the earlier analysis in Section 16.5.2, assume for simplicity that the actual measured $C_i$'s match the design estimates for transactions 1–3, and assume that the disk consumptions per-transaction are measured at $D_1 = 40$ ms, $D_2 = 100$ ms, and $D_3 = 50$ ms. Then, the WM server disk utilization per replica $\rho_D = \sum_i R_i D_i / N = 37/N$, and the average delay of transaction 1 (in seconds) is $T_1 = C_1/(1 - \rho_C) + D_1/(1 - \rho_D) = 0.02N/(N - 18) + 0.04N/(N - 37)$. Now, solving the (quadratic) expression $T_1 \le 3$ s for $N$ yields $N \ge 37.5$ WM servers. Thus, at least 38 WM servers are now required to meet the average delay objective for transaction 1 due to the disk I/O bottleneck.

At this stage, we can also begin to identify automated workload generation tools capable of generating a production-level workload in the system and stress test environments.
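The WM server sizing for transaction 1 derived above can be checked with a simple search instead of solving the quadratic by hand. The sketch below assumes the aggregate CPU and disk loads of 18 and 37 server-equivalents implied by the utilizations 18/N and 37/N, and uses hypothetical function names; it returns the smallest integer N for which the two-resource delay expression meets the target.

```python
def transaction_delay(n: int, cpu_s: float, disk_s: float,
                      cpu_load: float, disk_load: float) -> float:
    """Average delay T = C/(1 - rho_C) + D/(1 - rho_D) with n replicas.

    cpu_load and disk_load are the aggregate offered loads (sum of R_i*C_i
    and R_i*D_i), so the per-replica utilizations are load / n.
    """
    rho_c, rho_d = cpu_load / n, disk_load / n
    if rho_c >= 1 or rho_d >= 1:
        return float("inf")  # a saturated resource gives unbounded delay
    return cpu_s / (1 - rho_c) + disk_s / (1 - rho_d)


def min_servers(cpu_s: float, disk_s: float, cpu_load: float,
                disk_load: float, target_s: float, max_n: int = 10_000) -> int:
    """Smallest replica count whose average delay meets the target."""
    for n in range(1, max_n + 1):
        if transaction_delay(n, cpu_s, disk_s, cpu_load, disk_load) <= target_s:
            return n
    raise ValueError("no replica count up to max_n meets the target")


if __name__ == "__main__":
    n = min_servers(cpu_s=0.02, disk_s=0.04, cpu_load=18, disk_load=37, target_s=3.0)
    delay = transaction_delay(n, 0.02, 0.04, 18, 37)
    print(f"servers needed: {n}, average delay at that size: {delay:.3f} s")
```

Because delay falls steeply just above the saturation point, the result is dominated by the disk load of 37 rather than by the 3-s target, which matches the bottleneck interpretation in the text.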
Depending on the application protocols in use, a commercial load testing tool may be available that emulates multiple users generating common transactions (e.g., delivering SMTP messages and retrieving/deleting messages via POP or HTTP). Otherwise, custom tools must be specified and developed to generate load. In either case, the load generation platform must be designed with scalability in mind, since a stress test environment that cannot drive the application servers into overload is of limited value. This often requires numerous user-emulation servers, together with a collection server capable of integrating the separate performance measurements. (Again, this stress testing infrastructure is itself a service platform, requiring its own C/PE effort.)

During the system test phase, components are tested in a pair-wise manner to ensure that interfaces are operating properly, and eventually “strung” together to ensure that transactions are processed as expected end-to-end. At this point, we begin to get a clear picture of server resource consumptions along the entire critical path, and can evaluate the best-case (no-load) transaction performance. The first step is to verify that the sum of the delays observed during unit testing matches the end-to-end delay observed during system testing. If not, then we must break down the end-to-end delay to determine the source of the discrepancy. Are there unexpected interactions between platform components impacting performance? For example, to unit test the WM server processing a “retrieve message via HTTPS” transaction, we likely had to “hairpin” the request across the PO server interface and immediately return the response to the WM server (the requested message). During system testing, the WM server has to wait for the PO to return the message.
Does the act of putting the WM process to “sleep” waiting for the response cause additional resource consumption, adversely impacting performance? Any learning from this exercise contributes valuable insights required to build element performance models.

During the load/soak test phase, end-to-end components are stressed to assess system behavior under expected load and under overload. Unlike in the unit and system test phases, where the C/PE role is more passive (gleaning data and insights as a by-product of the test effort), we take an active role in driving the stress testing. The first step is to develop a comprehensive load/soak test plan specifying how load will be generated and performance will be measured, the characteristics of the workload to be generated (transaction mix and volumes, message sizes and mix, mailbox sizes, and so on), and the specific performance metrics to be captured (usr, sys, and wio CPU consumption, and disk and network I/O rates).

There are numerous goals of this stress testing. One goal is to produce a so-called “load–service curve” characterizing the service performance (e.g., transaction delay) as a function of offered load (tps). A typical load–service curve is shown in Fig. 16.7. As can be seen, the delay starts at the best-case (no-load) value and remains relatively flat at low loads. Eventually, the transaction delay begins to rise rapidly as the bottleneck resource begins to saturate, approaching a vertical asymptote corresponding to a bottleneck resource utilization of 100%. This load level is often referred to as the “maximum throughput,” the highest level of load that the system can handle without replication of the bottleneck component. Also shown in Fig. 16.7 is an example delay requirement for this transaction.

Fig. 16.7 Maximum throughput versus system capacity [plot of delay versus load; the point where the delay curve meets the delay requirement marks the capacity, and the vertical asymptote marks the maximum throughput]
The point at which the delay curve hits the requirement is often referred to as the “system capacity,” the highest level of load that the system can handle without violating the requirement. Note that system capacity < maximum throughput (always), and often system capacity << maximum throughput.

Although the load–service curve depicted in Fig. 16.7 is typical of a well-behaved system, other abnormal behavior is frequently observed. It is critical to pay attention to any observed abnormalities, as these are opportunities for performance improvement. For example, we sometimes see a load–service curve that increases linearly with load (i.e., delay at 2 tps = delay at 1 tps + X, delay at 3 tps = delay at 1 tps + 2X, and so on). This behavior is indicative of serialization within a process, where each request is single-threaded through a portion of the code (e.g., a synchronization lock). Other times, we see the maximum throughput asymptote occur at a point where resource utilizations are << 100%. This behavior is indicative of a software bottleneck (e.g., connection table entries and file descriptors). Frequently, we see a load–service curve that approaches the maximum throughput asymptote and then bends backward. This behavior is indicative of a “concurrency penalty” that occurs at high bottleneck utilizations (e.g., excessive memory paging and context switching). In all cases, these abnormalities are indications of performance problems that must be addressed.

Note that for platforms such as our e-mail system, in which most components are stateless and multiple replicas can exist, stress testing is iterative. For example, say that in the first round of testing we found that the WM server was the throughput-limiting bottleneck for the HTTPS transactions, while the PO server’s highest-utilization resource only reached 40%. Then in the second round of testing, we should deploy two WM servers and verify that maximum throughput doubles.
If so, then in the third round of testing, we should deploy three WM servers. At this point, the bottleneck resource should shift to the PO, and maximum throughput should increase to no more than 2.5 times the original throughput (the PO reached only 40% utilization when a single WM server saturated, so it saturates at 1/0.4 = 2.5 times that load). In this manner, we can begin to determine the appropriate balance of machines required in our deployment configuration (e.g., 2.5 WM servers for each PO server). Furthermore, if the system capacity achieved during this third round of testing is, say, one-tenth of the expected peak workload, then we can also begin to determine the appropriate number of machines required in our deployment configuration (e.g., 25 WM servers and 10 PO servers).

Another goal of stress testing is to identify engineering rules for the platform components. As discussed above, the capacity of a component is the level of utilization above which one or more transaction delay requirements are violated. Many planners bypass the “load–service” analysis and assume that system capacity is tied to a particular utilization level, say 80% or 90% utilization of the bottleneck resource. In reality, system capacity is dictated by the requirements, as depicted in Fig. 16.7, and not by the utilization level. Through stress testing, we can determine the component utilization levels above which performance is adversely impacted. Based on these levels, we can then define engineering rules required for ongoing capacity/performance planning. For example, we may discover that to meet transaction requirements, user-facing components such as the WM or PP servers must be engineered to, say, 50% utilization, while system-facing components such as the GW or AS/V servers can be engineered to, say, 80% utilization. Once the platform is deployed, we can then monitor service-level metrics and adjust these engineering rules based on actual field performance under actual field workloads (discussed later).
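The replica-balancing arithmetic from the iterative stress test (a saturated WM server with the PO at 40% utilization, and a tested capacity of one-tenth of the expected peak) can be sketched in Python as follows. This is only an illustration under the assumption of perfectly linear scaling; the function names are our own and are not part of any tool described in this chapter.

```python
import math

def wm_servers_per_po(po_util_when_wm_saturates):
    """WM servers needed to drive one PO to saturation, assuming linear scaling."""
    return 1.0 / po_util_when_wm_saturates

def deployment_size(po_util_when_wm_saturates, peak_over_tested_capacity):
    """Return (WM servers, PO servers) needed to carry the expected peak.

    peak_over_tested_capacity is the expected peak workload divided by the
    capacity measured with one PO and its balanced share of WM servers.
    """
    ratio = wm_servers_per_po(po_util_when_wm_saturates)
    po_count = math.ceil(peak_over_tested_capacity)
    # round() guards against floating-point noise before taking the ceiling
    wm_count = math.ceil(round(po_count * ratio, 9))
    return wm_count, po_count

# Figures from the example in the text: PO at 40%, peak = 10x tested capacity.
print(wm_servers_per_po(0.4))    # WM:PO balance
print(deployment_size(0.4, 10))  # (WM servers, PO servers)
```

In practice the balance should be confirmed by measurement at each round, since real components rarely scale perfectly linearly.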
Other factors, such as observed historical volatility, or the impact of server/site failures, may also affect these engineering rules.

16.7.3 Performance Modeling

Frequently, the ability to adequately measure component behavior during stress testing is limited by a number of factors, such as schedule constraints (time and personnel), equipment availability, workload generator capabilities, and so on. One of the most common factors is that the laboratory environment is not equipped with machines of the same caliber as those planned for deployment in the field. As a result, performance models may be required to estimate field performance. These models can range from simple BoE formulas, to detailed queuing models and simulations.

As reviewed in Section 16.2, one common BoE model results from Little’s Law [9], which states that the average number $N$ of jobs in the system equals the average arrival rate $\lambda$ times the average delay $D$, or $N = \lambda D$. Given the expected transaction rates, message sizes, and estimated or measured service times, we can easily predict the average number of concurrent inbound SMTP connections, the average number of messages in process in the GW memory, the average throughput of messages over the egress network, and so on. Through this simple analysis, we can identify potential required platform enhancements such as faster processors or more memory in the servers, bigger links between the data center and the Internet, and so on.

Another common BoE formula from Section 16.2 states that average delay $D$ equals the average service time $T$ divided by $(1 - \rho)$, or $D = T/(1 - \rho)$, where utilization $\rho = \lambda T$. (This model was employed in the estimation and budgeting analysis in Section 16.5.2.) Also from Section 16.2, the average number of jobs in the system $N = \rho/(1 - \rho)$. These simple formulae can be used to deduce numerous powerful observations about our platform.
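These back-of-the-envelope relations are easy to script. The Python sketch below is illustrative only: the function name `boe_estimates` is our own, and the formulas are the single-server results quoted above, which hold only below saturation. The usage lines evaluate the GW figures (80 tps, 10 ms of CPU per transaction) used in the example in this section.

```python
def boe_estimates(arrival_rate_tps, service_time_s):
    """Return (utilization rho, mean jobs in system N, mean delay D in seconds).

    rho = lambda * T, N = rho / (1 - rho), D = T / (1 - rho).
    """
    rho = arrival_rate_tps * service_time_s
    if not 0 <= rho < 1:
        raise ValueError("formulas apply only for utilization in [0, 1)")
    jobs = rho / (1 - rho)
    delay = service_time_s / (1 - rho)
    return rho, jobs, delay

rho, jobs, delay = boe_estimates(arrival_rate_tps=80, service_time_s=0.010)
print(f"utilization={rho:.0%}  jobs={jobs:.1f}  delay={delay * 1000:.0f} ms")

# Consistency check with Little's Law: N = lambda * D
assert abs(jobs - 80 * delay) < 1e-9
```

Because the delay grows as $1/(1 - \rho)$, rerunning the sketch with a slightly higher arrival rate gives a quick feel for how sharply delay rises near saturation.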
For example, suppose that we expect each GW to handle 80 tps, and we measure CPU consumption per-transaction to be 10 ms. Then the expected GW CPU utilization $\rho = \lambda T$ is 80%, the expected number of SMTP connections in progress $N = \rho/(1 - \rho)$ is 4, and the expected transaction delay for an arriving request $D = T/(1 - \rho)$ is 50 ms (i.e., the service times for the four connections in progress plus that of the arriving request).

One reason to develop detailed queuing models is to explain unexpected behavior observed during testing. As mentioned previously, abnormal behavior is frequently observed during stress testing, and performance models can often help the development team to know where to look during detailed code examination, by providing possible explanations for the observed behavior. Another reason to develop performance models is to perform parametric sensitivity analyses around the usage assumptions (transaction workload mix and message/mailbox size) to quantify the performance impact of deviations from the assumptions, and to identify required platform enhancements. Yet another reason is to further evaluate system behavior under overload, and recommend overload controls to prevent performance degradation at high load. For example, we can identify appropriate limits on the number of simultaneous SMTP connections to the GW servers, thereby deferring load to nonpeak periods and avoiding GW overload. Finally, another reason is to evaluate the impact of proposed features and enhancements that have not been developed (and hence cannot be measured), allowing the development team to prioritize future features based on which gives the biggest performance “bang for the buck.”

As an illustration, consider the following actual stress test results [16]. (The remainder of this section is extracted from Ref. [16] with express permission. In particular, we reuse Figs. 16.4 through 16.8 and associated text from Sections 16.3.2 through 16.3.7 on pp. 291–298.)
Figure 16.8 shows the measured delay $D$ and CPU utilization $\rho$ as a function of the number of concurrent simulated users $N$. Each user submits a transaction, waits for the response, submits another transaction, waits, and so on. As can be seen, the delay increases as more “load” (in the form of concurrent users) is offered to the server, but is not yet exhibiting the classic “hockey stick” behavior that is observed when the bottleneck resource saturates. Yet, the CPU utilization curve has leveled off at 65–70%, well below expectation if the CPU were the bottleneck. Although this CPU behavior seemed odd, it was not sufficient to get the attention of the development team. (After all, the system is handling more and more concurrent “load,” right?)

Fig. 16.8 Concurrency–service curve for a closed system (response time in seconds and CPU utilization in percent versus number of concurrent users)

Fig. 16.9 System saturation in a closed system (response time in seconds versus number of concurrent users, with the linear asymptote)

So to better understand and explain what is happening, we can turn to queuing theory and modeling. First, we note that stress testing open systems by emulating users submitting transactions serially in a loop creates a closed system. That is, there are a fixed number of transactions in the system, corresponding to the number of concurrent users. Contrary to expectation for open systems, increasing the concurrency level does not necessarily result in increasing the load level. Thus, this delay curve is not a traditional load–service curve.
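Because each simulated user has exactly one transaction outstanding at all times, Little’s Law gives the load actually offered at concurrency level $N$ as $N/D$. The Python sketch below shows that conversion and a simple way to spot where throughput stops growing. The sample points are invented for illustration (they are not the measured data behind Fig. 16.8), and the function names and the 5% tolerance are our own assumptions.

```python
def offered_load(samples):
    """Convert (concurrent_users, mean_delay_s) pairs to (users, throughput_tps).

    Assumes a closed loop with no think time, so throughput = N / D.
    """
    return [(users, users / delay) for users, delay in samples]

def saturation_concurrency(samples, tolerance=0.05):
    """Smallest concurrency whose throughput is within `tolerance` of the peak."""
    loads = offered_load(samples)
    peak = max(tps for _, tps in loads)
    for users, tps in loads:
        if tps >= (1 - tolerance) * peak:
            return users

# Hypothetical measurements: (concurrent users, mean response time in seconds)
SAMPLES = [(1, 1.0), (3, 2.0), (5, 2.8), (7, 3.5), (10, 5.0), (15, 8.0), (20, 11.0)]

for users, tps in offered_load(SAMPLES):
    print(f"{users:2d} users -> {tps:.2f} tps")
print("throughput levels off at about", saturation_concurrency(SAMPLES), "users")
```

A throughput that falls again at high concurrency, as in the last two invented points, is the signature of a concurrency penalty rather than simple saturation.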
In fact, for closed systems, it can be shown that as the bottleneck resource saturates, the relationship between $N$ and $D$ approaches a linear asymptote with a slope equal to the bottleneck resource holding time (rather than a vertical asymptote, as expected in open systems), and the transaction arrival rate $\lambda = N/D$ stops increasing and levels off. As can be seen in Fig. 16.9, we have already hit the asymptote at a concurrency level of seven simulated users. In other words, the (unknown) bottleneck resource has already saturated, the arrival rate has leveled off, and the “concurrency–delay” curve is riding along its asymptote. In fact, it appears that the curve actually starts to diverge from the asymptote at higher levels of concurrency.

By plotting delay $D$ as a function of load $\lambda = N/D$, we can translate these stress test results into equivalent load test results. As shown in Fig. 16.10, maximum throughput peaks at 2 tps, then decreases. That is, as $N$ increases beyond seven simulated concurrent users, there is actually a drop in capacity (“concurrency penalty”), likely due to context switching, object/thread management, garbage collection, and so on. With this knowledge in hand, the developers finally began to believe that there could be a problem.

Fig. 16.10 Equivalent load–service curve for a closed system (response time in seconds versus offered load in requests/sec)

As Fig. 16.8 shows, the CPU was not the bottleneck (nor were any other hardware resources), so some unknown “soft” resource bottleneck was preventing full utilization of CPU resources. Candidate “soft” resources include OS/application threads, file descriptors, TCP transaction control blocks, I/O buffers, virtual memory, object/code locks, and so on. After discussions with the development team, we theorized that transactions were being serialized through a lock on one or more significant synchronized code regions (e.g., some Java-related kernel system call).

Fig. 16.11 Model of a closed system with software bottleneck (front-end sub-system: 4 CPUs; software bottleneck (code lock): 1 server; back-end sub-system: ∞ servers)

To test this theory, we developed a simple queuing model (shown in Fig. 16.11) consisting of three serial queues: a CPU node (modeled as a four-server queue to represent the four CPUs in our test machine), a software bottleneck node (modeled as a single-server queue to represent the theorized code lock), and a fixed delay node (modeled as an infinite-server queue to reflect any load-independent components in the measured delays). The modeling results are shown in Fig. 16.12. As can be seen, this simple model produces a good fit to the test results, providing significant insights into the performance of the system. In particular, the approach characterized the system performance limitations, identified a significant software bottleneck in the code that prevented the system from fully utilizing the CPU resources, and exposed the system’s behavior under overload.

Fig. 16.12 Model fit to test results (measured response time and model fit, in seconds, versus number of concurrent users)

The development team then ran a number of profiling tools to uncover the source of the code serialization. Two culprits became evident: first (at the application level), writes to the log file were synchronized, and second (at the kernel level), the createObject method was single-threaded. To address the first source, the logging method was rewritten to allow concurrent writes to the transaction log. To resolve the second source, many objects were moved from transaction (request) scope to application scope to avoid much of the object creation, de-referencing, and garbage collection activity.
Collectively, these changes eliminated the software bottleneck and resulted in a dramatic improvement in the software performance.

16.8 Scalability Assessment

This section describes the Scalability Assessment activities. These tasks are usually performed between the development and deployment phases of the platform life cycle. The goals at this stage are to

1. Develop platform-wide scalability models that reflect aggregate resource consumption (e.g., storage)
2. Identify scalability-limiting bottlenecks that impact the platform’s ability to scale linearly with usage, and quantify the scalability impact of proposed architectural enhancements
3. Provide capacity–deployment (“facility–demand”) projections comparing platform capacity to forecasted usage growth over time
4. Provide additional scalability requirements and engineering rules

Once we have a handle on the predicted “current” performance (i.e., the behavior expected at service launch), we next take a look at the likely long-term platform performance/scalability as the workload grows beyond the near-term engineering limits. Simply speaking, a service platform can be called scalable if a 2× increase in load requires no more than a 2× increase in resources. For new deployments, we can expect that doubling the load requires much less than twice as much equipment, since we can hope to achieve some economies of scale as resources are utilized more efficiently. In addition, as the load grows, the peak:average ratio tends to drop due to a number of factors (e.g., source ISP or Internet congestion smooths out the traffic bursts). Thus, the average load may double, but the peak load (that we engineer for) may only increase by, say, 1.5 times. But at some point in the service growth, all economies and efficiencies of scale have been fully realized, and the best we can hope for is a linear scaling of equipment with load.
At that point, we need to consider whether or not there are any resources that will begin to exhibit nonlinear scaling (i.e., doubling the load requires more than double the resources). There are many common but often overlooked examples. For instance, data center floor space and power are frequently limited, especially if you choose to locate your platform in a mature data center, if you are hosted in another entity’s facilities, or if you require contiguous space for growth. Solutions to these scalability limitations (e.g., relocating to another data center, or establishing a second site, or upgrading the power distribution plant) are often extremely costly, so advance planning is critical if you expect platform demand to grow rapidly. And even if the data center itself is not a scalability bottleneck, connectivity to the Internet can be. Access links and edge router ports can eventually be exhausted, requiring costly and time-consuming link/router upgrades. Advance planning for scale (e.g., budgeting for one 10 GigE link rather than hoping to add multiple 1 GigE links “on the fly”) can save cost and headaches in the long run. And some components may scale in “blocks,” requiring infrequent but costly upgrades to grow. For example, blade servers reside in blade centers, so adding another blade may require adding a new blade center chassis. Or adding storage to a NAS or SAN may require upgrading the storage appliance.

One approach to planning for scalability is to determine the number of components required per unit of “demand.” In the case of our e-mail platform, the most effective unit of demand is a mailbox. Thus, we want to determine the number of GWs or POs required per (say) one million mailboxes.
This is a two-stage process: first, we determine the measure of usage to which each component’s capacity is most sensitive, and measure/estimate the component capacity in terms of that metric; then, we determine what a typical mailbox generates in terms of each of these measures of usage (often referred to as a usage profile). Integrating these results yields the example set of scalability engineering rules illustrated in Table 16.5.

Table 16.5 Example e-mail platform scalability engineering rules (m/s = messages per second)

Server      Usage metric         Server capacity   Engineering   Peak usage    Servers per one
component                        @100%             limit         per mailbox   million mailboxes
GW          IB messages          100 m/s           70%           2/h           7.9
MR          OB messages          80 m/s            60%           1/h           5.8
AS/V        I/O messages         60 m/s            80%           3/h           17.4
PO          Storage              5 TB              80%           10 MB         2.5
PP          Retrieved messages   100 m/s           60%           1/h           4.6
WM          Retrieved messages   50 m/s            60%           2/h           18.5

Fig. 16.13 Example mailbox number, size, and storage growth (number of mailboxes, average mailbox size, and total persistent storage over time)

Consider again our ISP e-mail platform. As mentioned earlier, storage of messages is stateful (i.e., persistent messages for a particular mailbox reside on a particular PO). As a result, persistent storage is of critical concern from a scalability standpoint. Total storage growth has two distinct components: growth in the number of mailboxes, and growth in the size of mailboxes. So even if the number $N$ of mailboxes and the size $S$ of mailboxes both grow linearly over time, the total storage $NS$ increases super-linearly with time, since the product of two linearly growing quantities grows quadratically (as illustrated in Fig. 16.13). An added wrinkle results from the fact that different mailboxes grow at different rates. Thus, the particular mailboxes on one (stateful) PO may collectively grow at a different aggregate rate than those on another PO.
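Returning to Table 16.5, the “servers per one million mailboxes” column follows directly from the other columns: peak demand per million mailboxes divided by the engineered capacity of one server. The Python sketch below reproduces that arithmetic. It assumes that rates are in messages per second, that 1 TB = 1,000,000 MB, and that results are rounded to one decimal place; the function and variable names are ours.

```python
SECONDS_PER_HOUR = 3600
MAILBOXES = 1_000_000

# component: (capacity in messages/s at 100%, engineering limit, peak messages/h per mailbox)
RATE_RULES = {
    "GW": (100, 0.70, 2),
    "MR": (80, 0.60, 1),
    "AS/V": (60, 0.80, 3),
    "PP": (100, 0.60, 1),
    "WM": (50, 0.60, 2),
}
# PO is storage-limited: (capacity in MB, engineering limit, MB per mailbox)
PO_RULE = (5_000_000, 0.80, 10)

def servers_for_rate(capacity_per_s, limit, per_mailbox_per_hour, mailboxes=MAILBOXES):
    """Servers needed when the usage metric is a message rate."""
    demand_per_s = mailboxes * per_mailbox_per_hour / SECONDS_PER_HOUR
    return demand_per_s / (capacity_per_s * limit)

def servers_for_storage(capacity_mb, limit, mb_per_mailbox, mailboxes=MAILBOXES):
    """Servers needed when the usage metric is stored volume."""
    return mailboxes * mb_per_mailbox / (capacity_mb * limit)

for name, rule in RATE_RULES.items():
    print(f"{name:5s} {servers_for_rate(*rule):5.1f}")
print(f"{'PO':5s} {servers_for_storage(*PO_RULE):5.1f}")
```

Keeping the rules in a small table like this makes it straightforward to re-derive the column whenever a measured capacity, engineering limit, or usage profile changes.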
These differing growth rates in turn lead to C/PE issues associated with load-balancing (discussed in detail in Section 16.9).

The result of this compounded growth is that PO storage does not scale linearly. Over time, our PO scalability engineering rule in Table 16.5 will decrease due to growth in the average mailbox size. (In fact, this result is true for most other components, as changes in such factors as message sizes or filtering rules or usage profile will impact the engineering rules.) In the case of storage, however, this concern is particularly relevant, as many service-level metrics are sensitive to mailbox size. So even if we de-load a PO to account for increasing per-mailbox size, at some point the size of a mailbox can become too large to meet delay requirements. For example, consider the “retrieve mailbox list via HTTPS” transaction from Section 16.5. Eventually, mailboxes may become so large that we cannot process the “list” command in the allotted time.

By considering these scalability bottlenecks early in the process, we can identify required architectural changes. For instance, in the case of mailbox size, we can consider tiered storage, where messages that have not been accessed in N months are moved to slower secondary storage. A static list of such messages then needs to be compiled only when the contents of secondary storage change (thus avoiding the delay of real-time compilation every time the user performs a “list” command). Or we can consider implementing an e-mail aging policy (discussed previously) to manage mailbox growth.

Fig. 16.14 Example PO storage “facility–demand” projection (projected persistent storage against 100% and 80% of deployed capacity over time, with times T-90 and T marked)

Another component of the scalability assessment is the development of so-called “facility–demand” projections.
(The term originated in teletraffic engineering, where there is a need to forecast the provisioning timeline for new “facilities,” that is, voice and data trunks, as a function of projected growth in demand.) Consider again the growth in PO storage, and assume that the persistent storage curve in Fig. 16.13 represents a forecast of storage growth over time. Figure 16.14 shows the resulting facility–demand projection for PO storage. The dashed line represents 100% of the total deployed storage, whereas the dotted line represents 80% of deployed storage (the desired engineering limit). As can be seen, each time the projected persistent storage level hits the engineering limit of currently deployed storage, a new PO must be deployed. Thus, if the PO provisioning lead time is, say, 90 days (including a cushion to absorb variability in demand), then this facility–demand projection provides a forecast for when in the future new PO servers must be ordered. These capacity–deployment projections play an essential role in the ongoing capacity/performance management once the platform is deployed (discussed in Section 16.9).

16.9 Capacity/Performance Management

This section describes the Capacity/Performance Management activities. These tasks are performed during the deployment and growth phases of the platform life cycle. The goals at this stage are to

1. Implement a measurement architecture capable of collecting, warehousing, and reporting/visualizing all performance (usage and resource consumption) and reliability (failure and outage) data
2. Implement a monitoring architecture capable of measuring, tracking, and reporting all platform quality metrics (DMoQs) and transaction/service-level objectives (SLAs/SLOs)
3. Implement capabilities required to analyze usage trends against service metrics to project platform usage growth and predict platform capacity augments required to maintain acceptable service levels
4. Identify/reflect unique characteristics of the service environment that impact C/PE (e.g., seasonality, shifts in traffic mix, and cyber attacks)
5. Automate performance management and capacity-deployment planning activities where appropriate

16.9.1 Measurement/Monitoring Infrastructure

Congratulations! Your service platform has been deployed. But your C/PE work is far from over. As discussed previously, a reliable measurement and monitoring architecture is one of the most important components of any successful platform, and it is critical to any post-deployment C/PE activity. Hopefully, the foundation was laid during the testing phase. Now we need to ensure that the measurement platform is robust and scalable. It must be capable of collecting, storing, processing, and reporting all relevant performance (usage and resource consumption) and reliability (failure and outage) data across all platform elements.

In particular, C/PE is responsible for specifying consistent, reliable mechanisms for performance data collection, storage, processing, distribution, reporting, and visualization. Data collection mechanisms may include native OS utilities, Simple Network Management Protocol (SNMP)-based MIBs, off-the-shelf “measureware” agents, and custom-developed code (often required for application-level data). The data storage architecture must be designed with extreme scalability and longevity in mind, typically consisting of a number of polling servers, flat-file and relational database servers, and reporting servers. It is all too common to see a poorly designed data storage architecture that “runs out of steam” soon after platform introduction. (This data storage infrastructure is itself a service platform, often requiring its own C/PE effort.)
In addition to performance data, this warehouse should also include a comprehensive database of platform topology and server configuration data, including hardware profiles, software versions, and connectivity maps. The data distribution, reporting, and visualization architecture must be carefully designed to ensure that data is readily available in the appropriate formats. For example, “canned” reports and graphics may be required for executive dashboards, whereas raw data feeds may be required for capacity planning tools and ad hoc analyses. Finally, the data push (or pull) from platform elements to data warehouse to capacity management tools/users should be fully automated where possible, taking into consideration such security issues as firewalls between production sites and back-office systems. (For example, will FTP or e-mail work between a GW server on a secure production LAN and a collection server on a management LAN?)

Like the measurement architecture, the monitoring architecture must be designed for extreme scalability and robustness. C/PE is responsible for specifying consistent, reliable mechanisms for collection of any performance and availability data that are required to monitor and validate all service-level metrics. As discussed previously, these metrics should be specific, measurable, and controllable. A contractual service-level metric is useless if it cannot be accurately measured and verified. As a result, you need to consider how you plan to monitor the requirement. A number of approaches are possible, including client software add-ons, end-point hardware sniffers, outside vendor services, and parallel monitoring platforms.

For example, consider the last approach of a separate measurement and monitoring platform. First, such a monitoring architecture typically consists of a number of probe servers to launch synthetic transactions.
These probe servers emulate an end-user performing typical transactions, such as sending, listing, retrieving, and deleting messages. It is important to locate these servers so that the entire controllable transaction path is exercised. For instance, placing these servers in the same data center as the e-mail platform unnecessarily bypasses much of the ISP Intranet infrastructure, while placing them in an off-net data center introduces Internet and possibly peering connectivity issues that are out of the ISP’s control. Second, the monitoring platform typically consists of a number of probe mailboxes, distributed evenly across all POs in the data center. Third, the monitoring architecture typically consists of a well-defined set of synthetic user transactions that comprehensively cover all service-level metrics. For instance, if SLAs are defined for receive safe message, send safe message via SMTP, retrieve message via HTTPS, and retrieve message via POP, then the synthetic transactions must mimic these operations. As an example, assume that one such service-level metric is “receive safe 100 kB message within 15 min of sending 95% of the time.” The probe server thread can send a safe 100 kB e-mail message to a probe mailbox, sleep 15 min, and attempt to retrieve the message via HTTPS then via POP (thus removing the message). The probe server can keep track of successes and failures, and compute the 95th percentile over time. This single synthetic transaction allows us to monitor all SLAs defined above (send, receive, and retrieve) simultaneously. Finally, the monitoring platform typically consists of database and reporting servers, capable of providing “canned” reports and supporting ad hoc analyses. And of course, the SLA monitoring and verification process should be fully automated. 
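The synthetic “send, wait, retrieve” probe and its compliance bookkeeping can be sketched in Python as shown below. This is a minimal illustration, not the monitoring platform’s actual implementation: the `send` and retrieval callables are hypothetical stand-ins for the real SMTP, HTTPS, and POP client code, and the sleep function is injected so that the sketch runs without waiting 15 minutes.

```python
import time

def run_probe(send, retrievers, sleep=time.sleep, wait_s=15 * 60):
    """One synthetic transaction: send a probe message, wait, then try each retrieval path.

    `send` returns a message token; each retriever takes that token and returns
    True if the message was found. All retrieval paths are always attempted.
    """
    token = send()
    sleep(wait_s)
    outcomes = [retrieve(token) for retrieve in retrievers]
    return all(outcomes)

def sla_compliance(probe_results):
    """Fraction of probes that succeeded within the deadline."""
    if not probe_results:
        raise ValueError("no probe results to evaluate")
    return sum(1 for ok in probe_results if ok) / len(probe_results)

def sla_met(probe_results, target=0.95):
    """True if the observed success fraction meets the service-level target."""
    return sla_compliance(probe_results) >= target

# Hypothetical stand-ins so the sketch runs without a mail platform.
mailbox = set()
def fake_send():
    mailbox.add("probe-1")
    return "probe-1"
def fake_https(token):
    return token in mailbox
def fake_pop(token):
    found = token in mailbox
    mailbox.discard(token)      # POP retrieval removes the message
    return found

ok = run_probe(fake_send, [fake_https, fake_pop], sleep=lambda s: None)
print("probe succeeded:", ok)
print("SLA met:", sla_met([True] * 19 + [False]))
```

Injecting the transport functions keeps the compliance logic testable on its own, which matters because the monitoring platform itself must be reliable.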
16.9.2 Resource Growth Projections

One of the primary post-deployment capacity/performance management roles is to monitor the growth in consumption of critical platform resources, project when resources are likely to exhaust, and determine when resource augments must be scheduled based on deployment lead times. As introduced in Section 16.8 and illustrated for the PO storage resource, this task is facilitated by development and maintenance of “facility–demand” projections. Besides PO storage, another key e-mail platform resource is server CPU utilization. A typical process for developing “facility–demand” projections of server CPU utilization consists of the following (performed separately for each element type GW, MR, AS/V, PP, WM, and PO):

1. First, collect 5-min samples of CPU utilization of all active servers
2. Compute the time-consistent 5-min average across all active servers
3. Compute the daily average busy hour (BH) server CPU utilization
   (a) This is the maximum rolling 1-h average of 5-min averages.
   (b) Other measures of peak utilization are possible, including busiest 5 min (B5M) and 95th percentile of 5-min samples (for typical daily traffic profiles, BH and 95th percentile values are similar).
4. Compute the weekly peak BH (or B5M or 95%) server CPU utilization
   (a) This is the maximum rolling 7-day peak of BH values.
5. Compute the linear trend through the series of weekly peak values
6. Compute the headroom threshold (HT) based on engineering limits
   (a) The HT = (engineering limit)$^{-1}$. For example, if the GW engineering limit is 60%, then the GW HT is 1/0.6 = 1.67.
   (b) This headroom is intended to provide sufficient spare capacity to absorb historic volatility without suffering degraded performance.
7. Compute the CPU “consumption” trend = HT × {CPU utilization trend}
   (a) When utilization hits the engineering limit, consumption hits 100%.
8. Finally, project the CPU consumption trend into the future (say 1 year)
   (a) Server augment is required when the consumption trend hits 100%.
   (b) With each augment, the utilization and consumption trends “step down” by $N/(N + 1)$, where $N$ is the number of active servers prior to augment. For example, if $N = 3$ then the consumption trend will be reduced from 100% to 75% when the fourth server is deployed.

A typical set of server engineering limits and resulting HTs is given in Table 16.6 (including other platform infrastructure elements as well).

Table 16.6 Typical engineering limits and headroom thresholds

Engineering limit         Platform elements          HT
(weekly peak BH CPU)
33%                       DNS, databases             3×
50%                       Network infrastructure     2×
60%                       GW, MR, PP, WM             1.67×
75%                       AS/V, PO                   1.33×

Fig. 16.15 Example GW CPU “facility–demand” curve (GW BH CPU consumption as a percentage of current capacity, by month; series: average daily BH CPU, weekly peak BH CPU, and HT 1.67× weekly peak)

Fig. 16.16 Example GW CPU “facility–demand” projection (same axes and series as Fig. 16.15, extended into the future)

As an illustration of the “facility–demand” projection process, consider the GW CPU utilization curves shown in Fig. 16.15. (This chart represents an actual GW component for a large cable ISP.) The thin solid curve shows the daily BH CPU utilization (the results of steps 1–3 above). The bold solid curve shows the weekly peak BH CPU utilization trend (steps 4 and 5). The dotted line shows the CPU consumption trend based on a 60% engineering limit, and thus a 1.67× HT (steps 6 and 7). As can be seen, the daily BH curve exhibits weekday peaks as well as weekend troughs. As such, any trend through this daily data would be unduly skewed by the weekend data.
In contrast, the linear regression through the weekly peak data (shown as a dashed line) captures only the weekday behavior, thus providing a more realistic basis for a trend of peak CPU utilization. Finally, Fig. 16.16 shows a projection of the CPU consumption trend five quarters into the future (step 8). As can be seen, CPU consumption is projected to hit 100% in March, signaling the need to deploy a new GW. (Equivalently, the CPU utilization is projected to hit the 60% engineering limit of currently deployed capacity.) Thus, if the GW provisioning lead time is 60 days, then a new GW server must be ordered in January.

Fig. 16.17 Example “server consumption per mailbox” trends (number of GW, MR, and PP servers per one million active mailboxes, including the headroom factor, by month)

Another benefit of tracking CPU consumption is the ability to project resource needs per unit of “demand.” In the case of an e-mail platform, it is valuable to project CPU consumption per mailbox for each platform element. With this knowledge in hand, the capacity/performance planner can then readily assess the impact of service growth. For example, assume that the ISP is planning to acquire a new market area (through the acquisition of another ISP, or the common swap of markets between ISPs to consolidate geographic footprints). Given the number of mailboxes to be added to the platform, you can quickly determine how many new servers must be deployed to accommodate those mailboxes. As an illustration, Fig. 16.17 shows example trends for the number of GW, MR, and PP servers required per one million new mailboxes, including associated headroom levels. (This chart represents actual server components for a large DSL ISP.)
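The eight-step CPU “facility–demand” procedure described in this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions: utilization is expressed as a fraction, the weekly peak series has one value per day, the trend is an ordinary least-squares line, and the function names, sample data, and 60-day lead time are our own choices for the example.

```python
from statistics import mean

def cross_server_average(samples_by_server):
    """Step 2: average the time-aligned 5-min samples across all active servers."""
    return [mean(slot) for slot in zip(*samples_by_server)]

def busy_hour(avg_5min, window=12):
    """Step 3: maximum rolling 1-h average (12 consecutive 5-min averages)."""
    return max(mean(avg_5min[i:i + window])
               for i in range(len(avg_5min) - window + 1))

def rolling_weekly_peak(daily_bh, days=7):
    """Step 4: maximum BH value over each trailing 7-day window."""
    return [max(daily_bh[i - days + 1:i + 1])
            for i in range(days - 1, len(daily_bh))]

def linear_trend(ys):
    """Step 5: least-squares (slope, intercept) for points x = 0, 1, 2, ..."""
    n = len(ys)
    x_bar, y_bar = (n - 1) / 2, mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def days_until_augment(weekly_peaks, engineering_limit):
    """Steps 6-8: days past the last sample until consumption reaches 100%."""
    headroom = 1.0 / engineering_limit                 # step 6: HT
    slope, intercept = linear_trend(weekly_peaks)
    c_slope, c_intercept = headroom * slope, headroom * intercept  # step 7
    if c_slope <= 0:
        return None                                    # flat or falling trend
    day_at_100 = (1.0 - c_intercept) / c_slope         # step 8(a)
    return max(0.0, day_at_100 - (len(weekly_peaks) - 1))

def step_down(consumption, servers_before):
    """Step 8(b): consumption after adding one server to N active servers."""
    return consumption * servers_before / (servers_before + 1)

# Hypothetical data: weekly peak BH utilization rising 0.1 point per day from 40%.
peaks = [0.40 + 0.001 * day for day in range(100)]
remaining = days_until_augment(peaks, engineering_limit=0.60)
order_in = remaining - 60   # assumed 60-day provisioning lead time
print(f"augment needed in {remaining:.0f} days; order within {order_in:.0f} days")
print(f"consumption after augmenting 3 servers to 4: {step_down(1.0, 3):.0%}")
```

A production version would also need to handle server additions and removals within the history window, which shift the utilization series and would otherwise distort the fitted trend.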
16.9.3 Traffic Growth Projections

Another primary post-deployment capacity/performance management role is to monitor the growth in traffic/usage (demand) of critical platform transactions, and to reflect any unique characteristics of the service environment that impact capacity/performance engineering. These unique characteristics include seasonality, session/state management, load-balancing, off-site backups, shifts in traffic mix, and cyber attacks.

Of particular interest in e-mail platform C/PE (or any end-user-driven service platform) is the impact of seasonality. As discussed previously, capacity planning must reflect daily periodicity (by engineering based on BH or 95th percentile) as well as weekly periodicity (by engineering based on weekday peaks). In addition, capacity planning must reflect yearly periodicity (seasonality). Consider the seasonal growth in e-mail storage. Figure 16.18 shows an example of growth in average mailbox size over a 3-year period. (This chart represents actual mailbox growth for a very large cable ISP.)

Fig. 16.18 Example mailbox storage growth and seasonality (average mailbox size and its projection, by quarter over 3 years)

As can be seen, storage utilization exhibits strong seasonality. Storage levels surge around holidays (specifically, Valentine's Day, Halloween, and the December religious celebrations). In the case of consumer e-mail, these surges are due largely to the popularity of digital greeting cards, digital holiday photos, and holiday-themed animated executables. Peak utilization exceeds the trend (shown as a solid straight line) by as much as 10% during holidays. Thus, if we planned PO storage capacity based on average storage growth, then we could experience a serious capacity shortfall (and significant negative publicity) during holiday periods of peak demand. Also shown in Fig.
16.18 is a projection of mailbox size during the upcoming holiday period (shown as a thinner solid curve during the final quarter). As can be seen, this projection is not simply a linear trend, but rather mimics the year-end behavior observed during the previous year. There are a number of possible approaches to developing such a fluid-flow model of storage growth, most of which essentially involve "replaying" the previous year's traffic behavior scaled by year-over-year volume changes. To illustrate at a high level, let $S_t$ denote the stored volume at the beginning of day $t$, let $I_t$ and $D_t$, respectively, denote the incoming and deleted volumes during day $t$, and let $F_x$ denote the year-over-year scaling factor in volume $x$ (where $x$ = storage $s$, incoming $i$, and deleted $d$). One approach is to replay the scaled daily storage change: $S_{t+1} = S_t + F_s (S_{t-364} - S_{t-365})$. Another approach (utilized for the projection in Fig. 16.18) is to replay the scaled daily incoming and deleted volumes: $S_{t+1} = S_t + F_i I_{t-364} - F_d D_{t-364}$. Regardless of the approach used, developing such a seasonality-based projection allows for more accurate capacity planning during peak periods.

As another example of seasonality, consider the seasonal variability in data center access link utilization. Figure 16.19 shows a real example of daily IB traffic variability over the course of a year (normalized by June volume).

Fig. 16.19 Example seasonal variability in link utilization (traffic IB to DC, normalized, June through the following June)

As can be seen, bandwidth utilization also exhibits strong seasonality, with traffic levels surging around Halloween, Christmas, and Valentine's Day. Overall, the traffic volume increased by 20% over the year.
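Seasonal projections such as the mailbox storage projection described earlier in this section follow directly from the two replay recurrences. The sketch below is a minimal illustration under stated assumptions, not the method used to produce Fig. 16.18: the function names are hypothetical, every series is a list of daily values ordered oldest first, and the year-over-year scaling factors are supplied by the caller because the text does not say how they are estimated.

```python
def replay_storage_change(stored_history, f_s, days):
    """Project stored volume by replaying last year's daily storage change:
        S[t+1] = S[t] + F_s * (S[t-364] - S[t-365])
    stored_history holds S for consecutive days and must cover at least
    366 days so that S[t-365] exists for the first projected step."""
    if len(stored_history) < 366:
        raise ValueError("need at least 366 days of stored-volume history")
    s = list(stored_history)
    for _ in range(days):
        t = len(s) - 1  # index of the most recent (known or projected) day
        s.append(s[t] + f_s * (s[t - 364] - s[t - 365]))
    return s[len(stored_history):]


def replay_in_out_volumes(s_now, incoming, deleted, f_i, f_d, days):
    """Project stored volume by replaying last year's scaled daily flows:
        S[t+1] = S[t] + F_i * I[t-364] - F_d * D[t-364]
    s_now is S at the start of today; incoming and deleted end with
    yesterday's volumes. The horizon is capped at 364 days so that every
    replayed value is an observed one (an assumption of this sketch)."""
    if days > 364:
        raise ValueError("horizon limited to 364 days in this sketch")
    if len(incoming) < 364 or len(deleted) < 364:
        raise ValueError("need at least 364 days of incoming/deleted history")
    projected = []
    s = s_now
    for k in range(days):
        # Day (today + k - 364) sits at offset k - 364 from the end.
        s = s + f_i * incoming[k - 364] - f_d * deleted[k - 364]
        projected.append(s)
    return projected
```

The 364-day lag equals 52 whole weeks, so each replayed day falls on the same weekday as the day being projected.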
Even more striking is the 1.7x difference in traffic volumes between the year-end lull (83% of June volume) and the Halloween surge (144% of June volume). Again, the lesson is that such seasonality must be reflected in any capacity planning projections to avoid serious capacity shortfalls during holiday periods of peak traffic.

Next, load-balancing is of particular concern for stateful components such as the PO server. As mentioned previously, different mailboxes grow at different rates. Thus, the particular mailboxes on one (stateful) PO may collectively grow at a different aggregate rate than those on another PO. Operational procedures must therefore be developed to monitor and balance the storage growth on individual POs (in addition to aggregate storage growth) to ensure that particular POs do not prematurely exhaust their storage.

Finally, as an example of shifting user behavior and traffic mix, consider the mix of message retrieval between HTTPS and POP. Figure 16.20 shows a real example of the percentage of users accessing their mailbox via HTTPS (WebMail) instead of POP over a 3-year period. As can be seen, HTTPS penetration increased steadily from 35% to 50% over the first two years and then leveled off. From an e-mail platform C/PE perspective, this HTTPS saturation is good news, since WM users are far more expensive to support than POP users (due to increased PO storage and server CPU consumption). Note also that the HTTPS:POP mix exhibits strong weekly periodicity, with higher WM penetration during weekdays (indicating that many users of this consumer ISP service retrieve their personal e-mail via HTTPS from their workplace computer during the week, and then use POP from their home computer over the weekend).

Fig.
16.20 Example shift in message retrieval behavior 16.10 C/PE “Best Practice” Principles We conclude this chapter with a summary of C/PE “best practice” principles to guide you in your next effort. Develop and maintain a business-relevant transaction workload profile for use during initial platform sizing and ongoing new feature testing Define realistic, measurable service-level objectives tied to workload Specify comprehensive engineering rules for relevant service elements based on sound capacity/performance/reliability modeling and analysis Implement a single, consistent, comprehensive, authoritative database for platform topology and configuration data Thoroughly identify relevant usage/resource consumption metrics – BH server resource utilizations and 95th percentile traffic metrics Implement a consistent, reliable, scalable architecture for performance data collection, storage, and distribution (leveraging SNMP MIBs on switches and lightweight resource “measureware” agents on servers) Develop a highly-scalable warehouse for performance data, providing automated push/pull from elements to data warehouse to tools/users Implement a consistent, reliable mechanism for usage/performance data reporting and visualization, minimizing the required number of capacity/performance management tools/interfaces to be maintained Develop tools to provide historical and projected views of data, with – Flexible aggregation capabilities (by server type, technology, and so on) – Trending on peak (e.g., 95th percentile) values, not averages – Ability to reflect anticipated future events in trending/projections 16 Capacity and Performance Engineering for Networked Application Servers 625 Provide automated triggers to determine required capacity augments based on defined engineering rules/metrics Implement a well-defined deployment process with known lead times Acronyms ACL AS/V BH B5M BoE C/PE DMoQ DPM DSL DT FIFO FIT FMEA FTP FTTH GW HT HTTP HTTPS HW IMAP IB i.i.d. 
I/O ISP LAN LIFO MIB MR MRA access control list anti-spam/virus filtering server busy hour busy 5 min. back-of-the-envelope capacity/performance engineering direct measure of quality defect per million digital subscriber line downtime first-in-first-out fault insertion testing failure modes and effects analysis File Transfer Protocol fiber-to-the-home IB SMTP Gateway server headroom threshold Hyper-Text Transfer Protocol Secure HTTP hardware Internet Message Access Protocol inbound independent identically distributed input/output Internet service provider local area network last-in-first-out management information base OB Mail Relay server modification request analysis 626 MTTF MTTR NAS NFS OB PO POP PP PS RBD SAN SLA SLO SNMP SPoF SRE SMTP tps VIP WM P. Reeser mean-time-to-failure mean-time-to-restore network attached storage network file system outbound Post Office server Post Office Protocol POP Proxy server processor-sharing reliability block diagram storage area network service-level agreement service-level objective Simple Network Management Protocol single point of failure software reliability engineering Simple Mail Transfer Protocol transactions per second virtual IP address (aka VLAN) WebMail server References 1. Smith, C., & Williams, L. (2002). Performance solutions – a practical guide to creating responsive, scalable software. Reading, MA: Addison-Wesley. 2. Chrissis, M., Konrad, M., & Shrum, S. (2003). CMMI: Guidelines for process integration and product improvement. Reading, MA: Addison-Wesley. 3. Jain, R. (1991). The art of computer systems performance analysis: Techniques for experimental design, measurement, simulation, and modeling. New York: Wiley-Interactive. 4. Menasce, D., Almeida, V., & Dowdy, L. (2004). Performance by design – computer capacity planning by example. Upper Saddle River, NJ: Prentice Hall PTR. 5. Ross, S. (1972). Introduction to probability models. New York: Academic. 6. Cooper, R. (1981). 
Introduction to queueing theory (2nd ed.). New York: North Holland. 7. Lazowska, E., Zahorjan, J., Graham, G., & Sevcik, K. (1984). Quantitative system performance – computer system analysis using queueing network models. Upper Saddle River, NJ: Prentice-Hall. 8. Kleinrock, L. (1975). Queueing systems, volume 1: theory. New York: Wiley-Interscience. 9. Little, J. (1961). A proof of the queueing formula L D W. Operations Research 9, 383–387. 10. Hennessy, J., & Patterson, D. (2007). Computer architecture: a quantitative approach (4th ed.). Boston, MA: Elsevier-Morgan Kaufman. 11. Snee, R. (1990). Statistical thinking and its contribution to total quality. American Statistician, 44(2), 116–121. 12. Smith, C. (1990). Performance engineering of software systems. Reading, MA: AddisonWesley. 13. Musa, J. (1999). Software reliability engineering. New York: McGraw-Hill. 16 Capacity and Performance Engineering for Networked Application Servers 627 14. Billington, R., & Allan, R. (1992). Reliability evaluation of engineering systems (2nd ed.). New York: Plenum. 15. Reeser, P. (1996). Predicting system reliability in a client/server application hosting environment. Proceedings, Joint AT&T/Lucent Reliability Info Forum. 16. Huebner, F., Meier-Hellstern, K., & Reeser, P. (2001). Performance testing for IP services and systems. In Dumke, R, Rautenstrauch, C., Schmietendorf, A., & Scholz, A. (Eds.), Performance engineering – state of the art and current trends. Heidelberg: Springer-Verlag. Index A Access control, 6, 58, 150, 267, 282, 289–293, 463–465, 467, 534, 589 Access control list (ACL), 6, 264, 267–269, 272, 273, 282, 289, 291–293, 463, 465–467, 534, 555, 589 Access network, reliability modeling of, 12–13, 98, 102–104, 414, 552, 597, 599–601, 624 Access port, 105–106 Access router. 
See also Customer edge remote access router (RAR), 34–37, 39, 42, 74 Accountability Internet service model, 357 relationship to security, 450 in software development process, 16, 285, 548, 549, 581, 595, 604, 605 Accountable, 449, 460, 462, 549 Accounting, 231, 281, 322, 327, 330, 332, 465 ACL. See Access control list Active measurement, 14, 320–322, 342, 343, 345–351, 405, 406, 425 Adaptive sampling, 335 Add-drop multiplexer (ADM), 27, 29–32, 34, 36–41, 63–65, 74–75, 102 Address management, 262, 267 Address scanning, 480, 482 Advanced Research Projects Agency Network (ARPANET), 3 Aggregate link, 37, 38, 45, 85 Aggregation, 36, 49, 110, 130–131, 140, 144, 145, 147, 153, 161–163, 184, 323, 325–331, 334–336, 338, 339, 350, 351, 520, 529, 605, 624 Alarm. See also Event management correlation, 561 detection, 486 Anomaly detection, 156, 322, 427, 431, 486, 494 Another tool for language recognition database (ANTLR), 286–288, 297, 307 Application profile/mix, 150–151, 161, 323 Application programming interface (API), 261, 267, 268, 270, 285, 296, 313, 562, 567 configuration management, 265 documentation, 564 Arithmetic quantifier-free form, 279, 296, 297, 306, 312 AS PATH, 45, 46, 181, 185–188, 190, 191, 194, 197, 198, 201–203, 205, 211–213, 215, 218, 360, 361, 369, 371, 466 Assets, 259–261, 267, 268, 473, 474, 506, 521, 523–525, 527, 529, 530, 542 Assignment, 51, 59, 113, 121, 259, 260, 265–268, 456, 531, 556, 557, 603 port, 151, 270, 272, 340 Asynchronous transfer mode (ATM), 25, 27, 32, 33, 49, 54, 56, 58, 184, 344 Attack-resilient services, 243–244 Audit, 10, 260, 262, 264, 265, 268, 271, 274, 330, 339, 368, 369, 374, 406, 422, 439, 465, 468 configuration, 14, 257, 259, 261, 262, 269, 273, 463 regulatory, 463 Authentication, 225, 244, 265, 291, 324, 330, 450, 454, 457, 463, 465, 467, 588, 604 Authorization and accounting (AAA), 465 Autonomous system (AS), 42, 181, 242, 258, 282, 360, 453, 587 relationships, 184, 197–199, 204, 216, 311, 366 629 630 Availability, 8, 34, 
130, 153, 194, 237, 286, 325, 397, 449, 519, 547, 581 access, 110 assessment, 16, 582, 595–604 average, 105, 107 backbone, 110 end-to-end, 98, 110–111 modeling, 12, 98, 425, 595–598 requirements, 551, 558 B Backscatter, 471, 481, 482 Backtracking, 287, 299 Backup, 62, 98, 128, 151, 184, 262, 289, 322, 417, 522, 552, 598 Bad gadget, 196 Balance equation, 583, 584, 600 Baseline. See Anomaly detection Best practices, 10, 15, 16, 142, 274, 278, 285, 291, 399, 441–442, 452, 461, 542, 578, 624–625 Bidirectional line switched ring (BLSR), 62–64 Binary decision diagrams (BDDs), 279, 296, 297, 306, 312 Birth and death model, 583, 595, 600 BitTorrent, 245–248 Black hole, 6, 289, 310, 398, 406, 445, 467, 487, 496, 497 Blacklist, 589 real time, 498–499 Blaster worm, 483 Block distribution problem, 245–246 Border gateway protocol (BGP), 13, 42, 99, 117, 141, 221, 258, 279, 408, 453 beacon, 205–207 CA, 212–213 community attribute, 187 convergence, 182, 185, 190–191, 196, 211–214, 359, 370, 371 eBGP, 45–47, 51, 53, 185–188, 190, 201, 214–217, 361–362, 365 flaps, 433–435 iBGP, 45–47, 185–188, 190, 199, 201, 202, 204, 214–217, 289, 309–311, 361–362, 367 local preference, 181, 187–191, 196, 199, 201, 212, 215, 361 multihoming, 182, 191–196, 201 path protection, 211, 214–217 path selection, 187–188, 211, 224 peer, 45, 46, 100, 185–187, 204, 215, 281, 365, 385, 390, 426 Index persistent route oscillation, 182, 196–200 policy, 46, 141, 181–183, 187–191, 196–197, 201, 215, 216, 221, 224, 309, 310, 313, 364, 365 prefix hijacking, 466 route reflector, 47, 361 routing loop, 182, 196, 211, 218 security, 53, 456 transient routing failure, 182, 200–204, 216, 218 YouTube incident, 452, 467 Border router, 43, 44, 192, 193, 217, 290, 295, 362–363 Botnet, 341, 455, 470, 471, 473–476, 481, 482, 491 detection, 488, 489, 492–494 recruiting, 458, 493, 494, 511 tracking, 492–494 Bug tracking, 566, 568, 570 Bundled link, 38, 65, 118, 119, 133 Business continuity, 518–527, 529, 542 Business impact 
analysis, 521–522, 524 Business support systems (BSS), 259–261, 265 C CA. See Consistency assertions Capacity assessment, 16, 582, 604–613 engineering, 16, 581–625 Capacity planning. See also Performance management green field, 152, 153 incremental, 152–157 Case management, 504, 507–508 Case study, 13, 20, 116, 125, 127–133, 229–232, 234–237, 244, 297, 299, 358, 372–374, 583, 585–588 email platform, 581–625 IPTV network, 12, 20, 85 Cell sites on light trucks (COLTs), 538 Cell sites on wheels (COWs), 538 Central offices (COs), 21–24, 27, 29, 30, 34–41, 62, 65, 74, 76, 243, 520 Certification authority, 231, 457 Certification testing, 266, 463, 573. See also Lab testing; Software testing Change, management, 568, 569, 578 Channelized, 27, 34, 39–40, 105, 107, 257 Chord, 40, 226–228, 230–232, 244 Cisco IOS, 190 Class of service (CoS), 7, 53, 54, 56, 58, 61, 71, 72, 78, 375 Index Clear channel, 28 Closest egress routing, 361 Code review, 565–566 Collaboration network, 14, 278–283 Collection infrastructure, 321, 323, 326, 329, 331, 441 Command and control, 493, 519 Community attribute, 187 Community of interest, 280, 300 Component, 5, 20, 98, 115, 137, 221, 256, 277, 322, 357, 406, 459, 518, 550, 586 binary, 119–121, 123 failure mode, 77, 115, 119–121 multi-mode, 119–120, 123 Composite link, 38, 321 Compromised, 53, 56, 225, 230, 451, 453, 454, 456, 458, 465, 466, 474, 475, 479, 603 ConfigAssure, 294, 297, 302, 303, 306, 308, 313 Configuration acquisition system, 278, 283–288, 312 application programming interface, 260, 261, 265, 267, 268, 270, 285, 296, 313, 562 auditing, 261, 262, 264, 269, 273, 463 consistency, 278, 279, 312 data, 102, 256, 259, 268, 273, 311, 330, 525, 560, 565, 617, 624 database, 257–263, 266–270, 272–274, 284, 286, 293, 294, 296–300, 302, 303, 307, 309, 312 download, 271, 273, 274, 531, 537 error, 10, 14, 273, 277–280, 282, 289–290, 295, 302, 309, 311 evaluation system, 278, 279, 283–288, 312 graphical user interface, 563 management, 9, 13–14, 
255–275, 308, 520, 522, 531, 542 mediation layer, 260, 264 options, 264, 574 parameter, 278, 282, 288 repair, 10, 278–279, 285, 295, 297, 299, 311–313, 422 reports, 256, 259, 260, 268, 269, 273, 307, 312 validation, 7, 14, 170–171, 269–270, 273, 277–313, 442, 463 visualization, 278, 279, 284–286, 294–296, 312 Configuration logical structure integrity, 312, 313 BGP full-mesh, 289, 294, 361 BGP route-reflector, 47, 361 631 HSRP cluster, 288–290, 292, 294 MPLS tunnel, 217, 289 Configuration requirements connectivity, 193, 194, 278–281, 285, 286, 289–291, 300, 301, 312, 538, 560, 617 library, 278, 283–286, 288–292, 312, 313 performance, 278–283, 286, 291, 309, 312 proactive evaluation, 294 security, 278, 279, 281, 285, 286, 290, 291, 297, 299, 311, 312 Configuration specification language, 279, 283–286, 292, 295–297, 299–302, 307, 308, 310–313 Configuration validation logic, 277–278, 282, 285, 286, 288–289, 294–306, 311 Boolean logic, 279, 296, 297, 305, 306, 308 computational tree logic, 308 datalog, 279, 296, 297, 301, 305, 312 first-order logic, 279, 296, 297, 308, 312 MulVAL, 297, 301, 302, 305–306, 313 Prolog, 295–302, 304, 305, 308, 312 SLD-resolution, 296 Configuration validation model checking, 279 Connection-less, 27, 33, 49, 453 Connection-oriented, 22, 25, 27, 33, 45, 58, 242 Consistency assertions (CA), 212–213 Constraint-based shortest path first (CSPF), 55–56, 58 Constraint solver, 294, 302 Alloy, 308 Kodkod, 279, 296, 297, 306, 308, 312 Content delivery network (CDN), 4, 223, 225, 237–244, 248, 467 CDN servers, 237–238 DNS ‘hidden-load’ problem, 242–243 DNS ‘originator’ problem, 241, 242, 560 DNS outsourcing, 238–240 edge servers, 237–243 URL re-writing, 238, 240 use of IP anycast in, 241, 242, 381 Content filtering, 498 Continuity of government (COG), 519 Continuity of operations (COOP), 519 Continuous-time Markov chain, 583–584 Control bridge, 532 Control, network, 14, 20, 33, 151, 425, 464–467 device access, 464–465, 467 operational availability, 464, 
467 passive router, 464–467 traffic flow, 464, 467 632 Coolstreaming, 233, 236–237 Cool zone, 539–540 Coverage factor, 599, 601 Critical infrastructure, 3, 450, 457, 511, 519, 535, 537 cyber-security, 459–460 Customer edge (CE) configuration of, 49, 50, 52, 105, 257, 264 router, 6, 34, 47, 104–105, 257, 263, 264 Customer support, 341 Cyber security economic incentives, 198, 199, 458, 499 relationship to critical infrastructure, 459–460 D Darkspace, 482 Database AAA, 465 common network management database, 431 inventory, 257–263, 266, 268, 272–274, 419 Data discords, 257, 258, 274 Data integration, 429–432, 442 Deep packet inspection (DPI), 322–324, 405, 474, 477 applications, 338–341, 474 privacy, 338, 339 systems, 339 Defects-per-million (DPM), 10, 11, 109, 110, 424, 592 Deming, Edward, 4 Denial of service (DOS), 13, 53, 148, 222, 238, 243, 244, 248, 447, 450, 464, 471, 479, 480, 487, 502, 521, 570, 604. See also Distributed denial of service (DDoS) Dense Wavelength Division Multiplexing (DWDM), 20, 25–32, 35, 36, 38, 62–64, 66, 74, 75, 77–79, 82, 85, 118, 130, 132 Department of Homeland Security (DHS), 459 Design point, 553 Detection, 9, 43, 100, 124, 144, 195, 231, 259, 299, 319, 367, 398, 448, 537, 560, 597 Digital cross-connect system (DCS), 25, 27, 31, 32, 34, 39, 40, 62, 63, 65 Digital Signal-x (DSX), 31 Disaster, 15, 436, 517–542, 558 diversity, 528 preparedness, 15, 517–542 Index recovery, 15, 517, 518, 522, 525, 527, 531, 532, 536, 541, 542, 559 service impact, 440, 524, 530, 533, 565 World Trade Center, 517–518 Disaster recovery institute, 517 Disaster recovery plans, 15, 518, 522, 542 Disk, 242, 269, 538, 558, 586, 590, 594, 603–607 logging to, 559–562 subsystem reliability, 552, 554, 558, 560–562, 565, 590, 596, 602 Distance constraint/reach constraint, 29–30 Distance-vector routing, 184, 359 Distributed denial of service (DDoS), 225, 447, 448, 450, 456, 459, 462, 467, 468, 471, 472, 474, 477, 480–482, 485, 487, 491–497, 499–500, 506, 510, 562 flow 
characteristics of, 480–481 mitigation, 494–497, 499–500 Distributed hash table (DHT), 224–232, 236, 248 primitives, 230 resiliency, 226–232 scalability, 226, 230 security against malicious users, 230–231 structured, 226–229 unstructured, 226, 228, 229 Distributed Management Task Force (DMTF), 286 Diversity, network, 98 Domain name system (DNS) outsourcing, 238–240 redirection, 238–239 security, 453, 455–457, 475 Do Not Fragment (DNF) bit, 281, 291 Doubling time, 155 Downtime (DT), 5, 76, 97, 98, 101–105, 522, 525, 539, 547, 552, 558, 596–598, 600, 601 DRI International, 517 Dual homed, 34, 105 E eBGP. See External BGP Edge switches, 49, 518, 534 Emergency management, 518, 519, 531, 532, 537 End offices, 518, 534 Engineering rules, 161, 256, 265–267, 272, 595, 604, 608, 609, 613–615, 624, 625 Entrance/exit criteria, 575 Index Ethernet, 23, 25–28, 32–34, 40–42, 54, 58, 60, 65, 76, 81, 162, 184, 258, 270, 286, 287, 298, 299, 372, 373, 434, 481, 570 layer, 23, 25, 32–34, 41, 65, 76, 81, 434 services (VPLS, Ethernet private line), 32, 33, 41, 42, 58, 60, 65 Event network event, 6, 15, 200, 211, 366, 367, 373, 374, 399, 400, 402, 404, 406, 408, 410, 423, 424, 428, 430, 434, 438, 441, 464, 471, 473, 479, 502, 504, 506, 508 Event correlation, 409–413 codebook, 410–411 cross-layer correlation, 412–413 rules-based, 410 Event detection. 
See also Instrumentation thresholding, 407 Event management, 10, 400–405, 407–413, 419, 420, 422, 423, 425–427 best practices, 441–442 chronic conditions, 427 EMC SMARTs, 410 event correlation, 409–413 event detection, 402–404, 406–408 event filtering, 336 event management system, 408–413 event notification, 401–404, 407–412, 422, 427 HP OpenView, 410 HP Operations Center, 410 IBM Tivoli, 410 impact, 404–406, 409 process automation, 419–422, 440 real-time event management, 424 ticketing, 413–414 troubleshooting, 414–416 Event notification, 401–404, 422, 427 alarms, 407–412 alerts, 407–412 Exploit trials, 482, 483 Exploratory data mining (EDM), 10, 424, 439, 442 data integration, 442 root cause analysis, 429–430 statistical correlation testing, 433 Export policy, 190, 198 Exterior gateway protocol (EGP), 360 External BGP (eBGP), 45–47, 51, 53, 185–188, 190, 201, 214–217, 361–362, 365 633 F Fail-over, 99–101, 108, 110, 172, 195, 196, 201, 206–212, 216, 217, 322, 527, 529–531, 538, 539, 559, 567, 596–600 Failure, 3, 20, 98, 113, 139, 222, 259, 278, 319, 367, 398, 465, 521, 548, 585 duration, 16, 106, 109, 210, 211, 596 mapping, 120, 122 mode, 9, 12, 74, 76, 77, 85, 115, 119, 120, 130, 412, 415, 418, 426, 435, 436, 595, 601–604 probability, 101, 118, 121, 122, 597 scenario, 8, 109, 110, 167, 172, 200, 204, 212, 572 state, 122 transient routing failure, 182, 183, 200–204–211, 215, 216, 218 types, 108, 109 Failure modes and effects analysis (FMEA), 601–604 Fault, 3, 26, 98, 113, 195, 229, 259, 290, 367, 397–441, 596 critical, 560 major, 560 minor, 560 Fault insertion testing (FIT), 596 Fault management. 
See Event management Federal emergency management agency (FEMA), 519 Federal Information Security Management Act (FISMA), 14, 291, 292, 294 general accountability office, 291–292 Feeder network, 21, 22, 64 Fiber cut, 75 File-sharing networks, 244, 245 block under-reporting, 248 bulk-data distribution, 492 free-riders problem, 246–248 proportional-share response, 247 unchoking algorithm, 246–247 Filtering, 140, 330, 336 Financial Industry Regulatory Authority (FINRA), 528 Finger table, 227 Firewall, 293, 306, 311 First field application (FFA), 7, 8 Flash crowd, 222–223, 481–482 Flash-cut, 574 Flow, 140, 141, 320, 323, 328–338, 467, 474, 480–482, 495 key, 326 records, 326–327 statistics, 326 634 Flow measurement, 330, 332, 333, 338 and address scanning, 480, 482 collection infrastructure, 321, 323, 326, 329, 331 and DDoS attacks, 474, 480–481 measurement standards, 323, 326, 327 mediation, 328–329 NetFlow, 327–329, 331, 334–336 Focused overload, 518 Forwarding equivalence class (FEC), 26, 54, 71, 72, 375 Forwarding information base (FIB), 98, 100, 358, 359, 364, 365, 375, 377–379, 383, 385–388 Forwarding loops, 196, 209, 211, 217, 310, 359 Forwarding path, 51, 125, 128, 307, 358, 366 Free riders, 245–248 Free-space optics, 538 G Ghost flushing, 213 Global Network Operations Center (GNOC), 531, 532 Gossip protocols, 237 Government Emergency Telecommunication Services (GETS), 518, 535 Graceful restart (GR), 100, 101 Graphical model of network layers, 22–24 Graphical user interface (GUI) configuration management, 265 Gravity model, 148–150, 158 Greyspace, 482 H Hazmat Team, 539 Hierarchical routing, 54, 360 High availability (HA), 100, 109, 281 Honeypots and honeynets, 471, 475, 476, 499, 510 Hot potato routing, 149, 189, 203, 371, 374 Hot standby routing protocol (HSRP) cluster, 288–290, 292, 294 Hot zone, 539 I iBGP. 
See Internal BGP Import policy, 187, 188, 190, 191 Indefeasible Right Of Use (IROU), 27 In-service software upgrade (ISSU), 101, 110 Index Instrumentation, 9, 111, 357, 364–367, 401, 402, 559. See also Measurement DPI, 405 end-to-end monitoring, 405–406 faults, 402–404 monitoring infrastructure, 7, 408, 425, 428, 559, 617–618 NetFlow, 323, 327–329, 331, 334–336 performance measurement, 404 route monitoring, 403–404 service monitoring, 405–406 SNMP polling, 7, 141, 324, 325, 330, 385 SNMP traps, informs, 407, 408, 412 syslogs, 403 traffic measurement, 405 Intelligent optical switch (IOS), 27, 32, 40, 62–65, 74, 75, 77, 190, 286, 297–299, 307 Interdomain routing, 13, 152, 181–218, 371, 372, 474 Interior gateway protocol (IGP), 20, 42, 44, 46, 51, 52, 57, 65, 73, 74, 76, 82, 83, 188, 189, 199, 200, 203, 217, 224, 360, 361, 375, 417, 438 convergence, 171 IS-IS, 43, 169, 181, 360, 375 LSA, 43–45, 66, 73, 184 OSPF, 20, 43, 48, 49, 56, 66, 169, 181, 184, 404, 474 OSPF-TE, 45, 56 Intermediate system-intermediate system (IS-IS), 43, 169, 181, 360, 375 Internal BGP (iBGP), 45–47, 185–188, 190, 199, 201, 202, 204, 214–217, 289, 309–311, 361–362, 367 International Telecommunication Union-Telecommunications Standardization Sector (ITU-T), 26, 343, 344 Internet assigned numbers authority (IANA), 150, 340, 466 Internet group management protocol (IGMP), 47, 48, 377, 381 Internet protocol goals of, 181, 534 Internet Protocol TeleVision (IPTV), 12, 13, 20, 48, 55, 61, 72, 79–85, 97, 116, 223, 377, 390, 392, 397, 406 performability analysis of IPTV distribution network, 127–130, 425 Index Internet route free core, 53 Internet service provider (ISP), 12, 19–86, 97, 98, 101, 102, 105, 107, 110, 130, 146, 158, 205, 207, 222, 225, 243, 244, 309, 585 transit, 183, 184, 206 Intradomain routing, 184, 187, 369 Inventory, 6, 9, 255, 256, 265, 269, 271, 493, 538 database, 257–263, 266, 268, 272–274 logical, 257–260, 267–268, 272, 274 physical, 257–261, 266–267, 272, 274 IP Assure, 14, 278, 
286–295, 307 IP backbone, 11, 19–34, 39, 41, 42, 51–53, 56, 61, 62, 64–79, 85, 133, 280, 530 IP flow information export (IPFIX), 327–329, 332, 336 IP multicast. See Multicast routing ISP packet filters, 244 K Key performance indicator (KPI), 424–428, 439, 440, 442, 445, 563 L Label distribution protocol (LDP), 54, 60, 375, 376, 408 Label switched path (LSP), 41, 50–53, 57–60, 70, 72, 73, 375, 376 Lab testing, 429, 436, 440 Last mile. See Local loop Layer 2 VPN, 58, 60, 65 Layer 3 VPN, 58–60, 85 Line card failure, 101, 106, 407, 410–413, 417, 418, 420, 423. See also Line card outage Line card outage, 101, 105. See also Line card failure Link management protocol (LMP), 28 Link state advertisement (LSA), 43–45, 48, 58, 66–70, 73–75, 184, 363, 369, 372–374, 385 Link-state routing, 43, 184, 359, 360, 362 Link utilization statistics, 320 Little’s Law, 583, 585, 609 Load, 5, 35, 137, 182, 222, 267, 320, 366, 398, 467, 518, 551, 582 carried, 157, 158 offered, 157, 158, 535, 553, 571, 607, 612 Load balancing, 35, 36, 38, 76, 137, 168, 182, 195, 203, 243, 589, 597, 615, 621, 623 635 Load shedding, 553 Local loop, 21 local minima search (LMS), 228–230 Local preference, 181, 187–191, 193, 194, 196, 197, 199–201, 205, 212, 215 Logging, 330, 368, 559, 561, 562, 589, 612 fault, 560 format, 412, 561, 578 informational, 421, 499, 561 Longest prefix match, 359, 375 Loopback interface, 193 LSA aggregator (LSAG), 372–374 OSPF network topology model, 373–374 LSA reflector (LSAR), 373, 374 full adjacency, 372 host mode, 372 partial adjacency, 372 M MA. 
See Moving average Maintenance software, 61 window, 6, 100, 102, 108, 153, 270, 369, 551, 575 Make-before-break, 84 Management information base (MIB), 9, 141, 258, 323–325, 330, 348, 365, 385, 386, 389, 390, 404, 407, 411, 429, 617, 624 Management silos, 7, 430 Markovian queuing system, 583–585 Maximum transmission unit (MTU), 281, 282, 291 MBone, 377, 384, 385 Mean time between failures (MTBF), 78, 105, 118, 119, 126, 127 line card (LC), 98, 103–104, 108 route processor (RP), 98, 99, 103–104 Mean-time-to-failure (MTTF), 595–599 Mean time to repair (MTTR), 78, 118, 119, 126, 127, 595–599 line card (LC), 103 route processor (RP), 103 Mean-time-to-restore (MTTR), 596–599 Measurement, 43, 116, 137, 242, 319–351, 357–392, 401, 471, 525, 566, 582. See also Instrumentation active, 14, 320–322, 342, 343, 345–351, 405, 406, 425 architecture, 351, 604–605, 616, 618 challenges, 4, 7–8, 12, 16, 320–323, 325, 326, 330, 342, 351 636 end-to-end, 12, 13, 98, 110–111, 182, 183, 211, 218, 384, 402, 405, 406, 425, 426, 432 flow, 326–335, 337–338 infrastructure, 140, 206, 320–321–323, 326, 328–330, 335, 342, 347, 350–351, 428, 429, 617–618 methods, 14, 183, 320–321, 327, 342 passive, 14, 320–324, 338, 343, 350, 351, 405, 425 performance, 171, 321, 325, 338, 342, 350, 351, 404, 405, 605 routing, 171, 207, 211, 338, 404, 466, 467, 474 SNMP, 141, 323–325, 405 traffic, 14, 141, 146, 153, 158, 161, 171, 319, 323–324, 338, 405 Measurement consistent, 125, 200, 212, 320, 322 Measurement data reduction methods aggregation, 330 filtering, 330–331 sampling, 330–331 Measurement estimation error, 334–335 MED. See Multiple exit discriminator Mediation layer, 264 Memory leak, 404, 427, 439, 568, 570 Minimum route advertisement interval (MRAI), 182, 185, 191, 201, 213–215 MIRO. 
See Multipath interdomain routing Mission critical, 397, 518–520, 523, 524, 530, 535, 542 Mitigation, 9, 324, 440, 448, 459, 467, 472, 473, 476, 485, 488, 490, 494–497, 499, 500, 504, 507, 508, 510, 522, 523, 525–527, 601–604 Modification request analysis (MRA), 596 Modularity, 565–566, 578 Monitoring. See also Measurement multicast monitoring, 386–390 performance monitoring, 7, 14, 337, 390, 402, 405, 409 platform, 618 route monitoring, 357, 364–374, 382, 385, 390, 402, 415, 474 security, 324, 338, 468, 470, 472–477 Moving average (MA), 154, 485 SMA, 155 MPLS. See Multi-protocol label switching MRA. See Modification request analysis MRAI. See Minimum route advertisement interval MSDP. See Multiple source discovery protocol Index MSO. See Multiple system operator Multicast, 376–390 IGMP, 47 in IPTV distribution, 79–80 PIM, 47 PRM, 234–236 Multicast monitoring, 7, 14, 358, 378, 382, 386–390 challenges, 388, 390 routing tables, 386–387 SNMP-based, 386 tree discovery, 388–389 Multicast packet delivery, 377 Multicast routing, 48, 125, 357, 383–387, 390, 391 overview, 377–382 protocols, 377–382 shared tree, 378 source trees, 378–382 Multicast routing protocol, 377–382 Multihoming, 98, 182, 191–196, 201 Multipath interdomain routing (MIRO), 217 Multipath routing, 211, 218 Multiple exit discriminator (MED), 187, 188, 194, 199, 200, 361, 369 Multiple source discovery protocol (MSDP), 358, 378, 381–382 monitoring, 385, 390 Multiple system operator (MSO), 21 Multi-protocol label switching (MPLS), 23, 41, 42, 46, 49–60, 76, 83, 85, 97, 100, 217, 258, 264, 272, 289, 292, 321, 323, 324, 327, 369, 375–376, 391, 392, 401, 402, 407, 408, 410–412, 416–419, 425, 434, 436–439, 441, 442, 462 CSPF, 55, 56, 58 explicit path, 51, 56–58 FIB, 375 FRR, 12, 62, 65, 70–74, 81, 86, 128, 215, 375, 395 layer 3 virtual private network, 58 layer 2 virtual private networks, 33 TE, 20, 45, 52, 54–58, 168–170, 172, 358, 375 Multi-tree delivery, 236 N Nachi worm, 484, 485 NeighborhoodWatch, 230–232 
NetFlow, 323, 327–329, 331, 334–336 Network as a database, 259, 420 Network coding, 246 Network design patterns, 285, 528 Network layer reachability information (NLRI), 45 Network layers, 5, 7, 12, 19–25, 27, 32, 33, 36, 38, 40, 45, 47, 54, 61, 62, 65, 66, 74, 76–79, 85, 116, 120, 319, 321, 323, 376, 377, 384, 401, 429, 430, 437, 442 Network merger, 152 Network operations, 7, 8, 10, 15, 66, 222, 261, 398, 399, 401, 415, 427, 437, 439–441, 444, 448, 458, 466–468, 470, 471, 474, 493, 494, 501–502, 505, 507–509, 532, 539 organization, 437 roles, 474 Network operations center (NOC), 336, 466, 467 Network planning, 76, 137–174, 322, 323, 340, 407 process, 137 NLRI. See Network layer reachability information Non-invasive testing by analyzing configuration, 278 Non-stop forwarding (NSF), 100, 101 Non-stop routing (NSR), 101 Normalization, 9, 11, 73, 335, 431, 432 NSF. See Non-stop forwarding NSR. See Non-stop routing O OpenDHT, 229–230 Open Shortest Path First (OSPF), 12, 20, 100, 117, 141, 181, 271, 281, 358, 404, 474, 522 area, 43, 44, 141, 282, 284, 289, 290, 294, 296, 362, 363, 373, 374 backbone area, 43, 362 border router, 43, 44, 290, 362, 363 AS Border Router (ASBR), 44, 363 flooding, 43, 44, 66, 68, 363 Link State Advertisement (LSA), 43–45, 48, 58, 66–70, 363, 369, 372–374 link-state database, 44, 45, 363, 373 SPF computation, 43, 44, 362 SPF tree, 43, 45, 66, 362 Open shortest path first (OSPF), 20, 43, 48, 49, 56, 66, 169, 181, 184, 404, 474 Open shortest path first-traffic engineering (OSPF-TE), 45, 56 Operational Readiness Test (ORT), 8, 573 Operations, 4, 19, 98, 119, 138, 222, 261, 278, 323, 358, 398, 448, 517, 551, 592 network, 7, 19, 166, 222, 261, 278, 398–402, 408, 420, 422, 427, 436, 438–441, 448, 518, 605 security, 15, 307, 477–511, 552 Operations support systems (OSS), 8, 10, 65, 260, 261, 468, 518, 520, 521, 533, 538–539 Optical transponder (OT), 28, 37, 63, 75, 118 Optical transport network (OTN), 26, 32, 116 Originator and hidden load 
problem, 242, 243 OSPFScan, 372–374 canonical form, 374 Outage, 5, 6, 11, 20, 66, 82, 84, 101–103, 105, 110, 166, 170, 182, 183, 211, 214, 224, 256, 265, 277, 322, 381, 397, 399, 406, 424, 425, 439, 459, 506, 508, 525, 532, 536, 537, 558, 559, 562, 572, 575–576, 599, 601–603, 616, 617 duration, 102, 105, 110, 425, 439, 525, 558, 562 impact, 105, 110, 183, 399, 406, 425, 439, 525, 537, 559, 602 planned, 101, 102, 166, 575 unplanned, 11, 101–103, 558 Overcast, 233 Overlay network, 13, 21, 79, 138, 221–248, 280 resilience, 13, 221–226, 230, 243 Overload behavior under overload, 551, 553, 605, 607, 610, 612 Overload control origin, 243 (See also load shedding) P Packet Sampling (PSAMP) standard, 332, 336–337 Packet selection filtering, 336, 337 hash based, 336 primitive, 336 standards, 336, 337 Pareto analysis, 428 Path diversity, 201, 211, 218 Path-vector routing, 181, 185, 211, 308, 359, 360, 371 Peer-assisted delivery, 223 Peer-to-peer file sharing, 224, 244–248, 492 block distribution problem, 245–246 free rider, 245–247 Performability, 12, 110, 113–133 analysis, 115, 124, 125, 128, 133 evaluation, 12, 110, 113–133 (See also Performability analysis) guarantees, 114, 115, 122 nperf analyzer, 12, 116, 117, 119, 121, 125–127 Performability analysis, 115, 124, 125, 128, 133 Performability evaluation, 12, 110, 113–133 approximate, 122 exact, 121 most probable states, 123, 125 state generation methods, 12, 115, 121, 123, 125, 126 Performance, 4, 20, 110, 113, 137, 181, 224, 255, 278, 319–351, 357–392, 397–442, 468, 517, 552, 581–625 assessment, 16, 370–371, 581, 582, 604 key performance indicator (KPI), 424–428, 439, 440, 442, 563 metrics, 11, 14, 16, 114, 160, 322, 342–348, 370, 424–426, 442, 582, 584, 586, 589–594, 605, 607, 609, 614, 615, 617, 618, 624, 625 (See also Performance measure) modeling, 11, 327, 346, 374, 593, 595, 598, 601, 606, 609–613, 624 monitoring, 7, 9, 14, 320, 323–325, 328, 330, 337–339, 341, 349, 357, 358, 364–369, 372–374, 376–378, 382–385, 
387, 388, 390–392, 398, 401, 402, 405, 409, 439, 440, 442, 471, 604, 609, 618, 621, 623 reports, 321–325, 327–330, 332, 333, 336–338, 342, 343, 345, 346, 349, 386, 404, 409, 430, 592, 604, 605, 616–618, 624 requirements, 282, 291, 321, 332, 340, 344–346, 366, 369, 391, 582, 586, 589–595, 598, 604, 607–609, 613, 615, 618 testing, 345, 349, 582, 593, 596, 603–609, 611, 617, 624 Performance management, 9, 10, 15, 397–442, 582, 616–618, 621, 624. See also Event management “best practices,” 16, 399, 441, 442, 624–625 BGP example, 426, 433–435 correlation testing, 431, 433–435 data integration, 429–432, 442 “data silos,” 7, 430 defects per million (DPM), 10–11, 424 exploratory data mining (EDM), 424, 429, 431, 433, 436, 442 hardware/software analysis, 429 key performance indicators (KPIs), 424–428, 438, 439, 442 lab testing, 429, 436, 440 Pareto analysis, 428 resource growth projection, 618 root cause analysis, 410, 428–436, 441, 442 trending, 424–427, 431, 440 Performance measure, 115, 117, 121, 123–126, 171, 174, 321, 325, 338, 342, 350, 351, 404, 405, 605 algebraic bound, 123 distribution, 114, 116, 127, 132, 133, 328, 332, 333, 344, 348, 375, 377, 387, 583, 585, 604, 614, 617, 624 expected value, 113, 121 loss in network bandwidth, 124 lost traffic, 115, 124, 129 mean time between access disconnection, 130, 132 standardization of metric, 323, 326, 342–344 statistical bound, 123 steady-state, 114, 122, 125, 584, 595 Persistent route oscillation, 182, 196–199 Phishing, 448, 453, 472, 475, 476, 498, 499 PIM. 
See Protocol independent multicast Planned maintenance, 6, 15, 66, 102, 153, 397–442, 575 Planning horizon, 153, 156, 160, 163, 168, 174 Platform lifecycle, 581, 588, 590, 593, 595, 604, 613, 616 architecture phase, 582 deployment phase, 613 design phase, 582, 588, 590 development phase, 582, 595 growth phase, 616 test phase, 604 Point of presence (POP), 21, 22, 39–41, 51, 59, 137, 143, 144, 147, 151, 160, 162–166, 186 inter-POP design, 160, 164–168 intra-POP design, 160, 162–164, 166 Poisson process, 583–584 Policy management, 262, 263, 466, 468 POP. See Point of presence; Post Office Protocol Port number, 102, 140, 150, 160, 326, 340 Post Office Protocol (POP), 556, 561, 587–589, 606, 618, 623, 626 Power, electrical, 76, 521, 528 Predictability, 468, 472, 473, 502, 540–541, 564 Prediction, 13, 138, 139, 146, 148, 152–161, 163, 168, 170, 172–174, 224, 325, 541 failure, 139, 157, 160, 170, 174, 224 traffic, 137–139, 146, 148, 152–161, 163, 168, 170, 173, 174, 325, 541 Prefix hijacking, 453, 466, 474, 558 PRM. See Probabilistic resilient multicast Probabilistic resilient multicast (PRM), 233–236 Process automation, 15, 399, 418–422, 440, 441 adaptive maintenance, 422 example, 420–422 expert system, 420, 422 role in event management, 419–422 rule, 420, 421 Proof of unsolvability, 296, 297, 302–304 Protocol independent multicast (PIM), 7, 42, 47–48, 80, 84, 85, 358, 378–383, 385–388, 390, 403, 408 bidirectional (PIM-Bidir), 378 Dense Mode (PIM-DM), 48, 378 source specific (PIM-SSM), 48, 378, 380 sparse mode (PIM-SM), 48, 378 Provider edge (PE) router, 6, 34, 47, 49, 50, 52, 58, 60, 98, 101, 102, 104, 105, 257, 263, 418, 433–435, 438, 466 configuration of, 6, 101, 257, 263 Q QoS. See Quality-of-Service Quality-of-Service (QoS), 61, 77–80, 150, 151, 258, 259, 262–264, 266, 267, 272, 273, 286, 292 R Randomized forwarding, 234, 235 RBD. 
See Reliability block diagram Reachability analysis, 293, 307–308, 311 Reconfigurable optical add-drop multiplexer (ROADM), 27, 29–31, 36–38, 63, 74 Recovery point objective (RPO), 525 Recovery time objective (RTO), 525 Redundancy, 12, 97–99, 101–104, 108, 130, 163, 165, 182, 191, 195, 201, 224, 279, 293, 295, 297, 306, 401, 416–418, 436, 438, 439 Reliability. See also Performability analysis, 12 assessment, 16 metrics, 12 modeling, 12–13 requirements, 3–4, 16 Reliability block diagram (RBD), 595, 597 Remote monitoring (RMON), 325 Reporting, 16, 132, 259, 268, 321–323, 325, 327–330, 332, 337, 342, 345, 346, 349, 406, 431, 486, 494, 506, 560, 569, 604, 616–618, 624 configuration, 336 in event management, 403–405, 407, 409, 410, 412, 413, 426, 427 performance, 346, 604, 616 security, 479 traffic, 256 Requirements, 8, 20, 97, 140, 222, 256, 277, 321, 366, 397, 463, 537, 548, 582 functional, 551, 570 non-functional, 551, 552 Resource budgeting, 593–595 estimation, 593–595 Resource reservation protocol (RSVP), 54, 56–58, 71, 375, 376 Restoration, 35, 40, 43, 57, 61–66, 70, 72–82, 84–86, 97, 99, 101, 115, 117, 119, 120, 122–125, 127, 130, 133, 340, 401, 440, 523, 531, 532, 535–537 one-by-one (1:1), 62 one-plus-one (1+1), 62 shared, 64 Reverse path forwarding, 48, 128 Risk management, 519, 521 Risk matrix, 524 RMON. See Remote monitoring ROADM. See Reconfigurable optical add-drop multiplexer Robust planning, 13, 138, 139, 170–174 Root cause analysis, 410, 428–436, 441, 442, 445, 504, 506, 508 Route discriminator. 
See Route distinguisher (RD) Route distinguisher (RD), 59, 258, 260, 267 Route monitoring multicast, 376–390 and security, 473–477 unicast, 358–376 Route monitors information, 142, 367–368 utility, 368–370 Route processor (RP), 98, 257, 424 Router hardware architecture, 98–99 LC, 66, 76, 81, 101, 410 reliability modeling, 102–104, 599–601 RP, 98, 257, 424 Route reflector, 47, 361 Router farm, 102 Route target (RT), 59, 258, 267, 270 Routing distance-vector, 184, 359, 360, 365–367, 371 high-availability (HA) protocol extensions, 100 interdomain, 183–191 intradomain, 184, 187, 215, 360, 368, 409 link-state, 43, 184, 215, 359, 360, 362, 365–367, 372, 373 loop, 101, 182, 183, 196, 204–205, 208–211, 218, 338 matrix, 141, 142, 156, 157, 165, 167, 169 multicast, 48, 125, 357, 377–387, 390, 391 NSR, 101 optimal, 125 path-vector, 185, 308, 359, 360, 371 policy, 182, 183, 186–190, 196–201, 203, 204, 216, 309–311 re-convergence, 110, 168 uncapacitated shortest path, 125 unicast, 48, 73, 358–363, 377, 378, 381, 383, 385, 391 Routing information base (RIB), 46, 100. See also Forwarding information base (FIB) Routing message collection CLI, 365 routing session, 365 SNMP MIBs, 365 splitters, 365 Routing protocol performance convergence, 370–371 stability, 370, 371 RP. See Route processor RPO. See Recovery point objective RSVP. See Resource reservation protocol RT. See Route target RTO. 
See Recovery time objective S Safe operating point, 161 Sampling consistent, 337–338 estimation error, 334 PSAMP standard, 336–337 statistical impact, 334–335 Sampling method adaptive sampling, 335 flow record sampling, 332–334 priority sampling, 333–334 random packet sampled flows, 331–332 smart sampling, 333, 334 stateful packet sampling, 335 stepping method, 335 threshold sampling, 333, 334 trajectory sampling, 337–338 SAT Solver, 296, 297, 306, 308 minimum-cost symbolic SAT solver, 279, 306 ZChaff, 279, 296, 297, 312 Scalability, 16, 43, 44, 184, 186, 226, 230, 241, 309, 328, 362, 367, 368, 370, 391, 448, 468, 477–479, 510–511, 554, 581, 582, 586, 588, 593, 605, 606, 617, 618 Scalability assessment, 16, 582, 613–616 Schema, 259–261, 268, 274, 286, 288 SDH. See Synchronous digital hierarchy Seasonal moving average (SMA), 155 Secure overlay services (SOS), 244 Security framework security, 460, 509 and network controls, 15 (See also Control, network) operations, 448, 493, 501–508 requirements, for configuration validation, 281, 291 services, 459, 487–501 Self-healing rings, 32, 63 Sensitivity analysis robust optimization, 173 Server engineering server sparing, 557 Server selection, 240–242 Service impact, 441, 524, 530, 533, 565 Service Level Agreement (SLA), 20, 61, 111, 322, 342, 345–347, 548, 552 process for developing, 345–347 requirements, 346 Service level objective (SLO), 8, 9, 617, 624 Service provisioning, 271–274 Shared risk link group (SRLG), 65 Shortest path first (SPF), 43 Silent failure, 406, 444, 445 Simple Network Management Protocol (SNMP), 323, 404, 465 MIB, 141, 323 SNMP informs, 407 SNMP traps, 407 Simplex condition, 530 Single point of failure (SPOF), 204, 278, 282, 289, 290, 293, 295, 312, 506, 528, 533, 578, 595, 598 SLA. See Service Level Agreement Slammer worm, 340, 459, 483, 484 SLO. See Service level objective SMA. See Seasonal moving average Smart sampling, 333–335 SNMP. 
See Simple Network Management Protocol Snort, 339, 341 Software availability, 595–598 defects, 5, 106, 109, 294, 569, 571 deployment, 451, 572, 574 installation, 341, 559, 562 integration, 4 reliability, 548, 549, 562, 566–568, 571, 574 upgrade, 101, 110, 117–119, 439, 441, 530, 559 Software architecture, 550 assessment, 553–562 constraints, 554, 556 principles, 554, 559, 561, 562 secure, 455–457 Software change control, 530 Software component, 7, 562, 569, 586, 587, 591, 593, 594, 605 Software configurability, 564–565 Software defects, 5, 106, 109, 294, 569, 571 Software deployment, 451, 572, 574 Software design external design, 562–563 internal design, 562–563 review, 565, 566 Software Reliability Engineering (SRE), 596 Software testing endurance, 570 FIT, 596 network management validation, 573 network validation, 573 operational readiness, 573 rainy day, 570 regression, 567, 571 stress, 570, 571, 573 unit, 567 SONET. See Synchronous optical network SOS. See Secure overlay services Source code control, 568 Spam, 448, 457, 471, 472, 476, 487, 488, 491–493, 495, 587, 589–591, 604 mitigation, 497–499, 510 SPF. See Shortest path first Splitstream, 233, 236 SPOF. See Single point of failure SQL. See Structured query language SRE. See Software Reliability Engineering SRLG. 
See Shared risk link group Standards IPFIX, 327–329, 332, 336 IPPM, 14, 342–344 PSAMP, 332, 336, 337 Y.1540, 343, 344 Stateful packet sampling, 335 State generation, 12, 115, 121, 123, 125, 126 State space size of network state space, 122 State transition, 206, 307, 583, 584, 599, 600 Statistical impact, 334–335 Stepping method, 335 Storage engineering, 556, 558–559 Streaming media overlay hybrid, 233 mesh-based, 233 multi-tree system, 233 single tree system, 232–233 swarming, 233 Structured query language (SQL), 292–295 Support systems, 414, 462, 559, 560 Swarming, 225, 233 Symbolic model checking, 279 Synchronous digital hierarchy (SDH), 26, 27, 31, 32, 63, 115, 162 Synchronous optical network (SONET), 5, 25–28, 31, 32, 34, 36, 39, 40, 62–64, 74, 75, 81, 101, 102, 133, 162, 171, 401, 408, 430, 522, 530 Syslog, 400, 403, 408, 415, 419, 429, 430, 434, 469 T Table top exercise, 527 TACACS+. See Terminal Access Controller Access-Control System Plus TDI. See Tie Down Information TDM. See Time Division Multiplexing TE. 
See Traffic engineering Telecommunications Service Priority (TSP), 535 Template management, 263–264 Terminal Access Controller Access-Control System Plus (TACACS+), 465 Test and turn-up, 270–271 Threat model, 449–452 Threats, to the network, 520–521 Threshold sampling, 333, 334 Ticketing system, 261, 268, 274, 400, 401, 413–414, 419, 420, 441, 444, 521, 560 Tie Down Information (TDI), 265, 272 Tier support, 573, 575 Time Division Multiplexing (TDM), 25, 26, 31–34, 39–42, 65, 79, 85 Tit-for-tat, 247 Tivoli, 410 Tomography network performance, 159, 351 traffic matrix, 159 Topology information, 43, 211, 224, 359–360, 488 learning, 359–360 Traffic congestion, 432, 533–535 Traffic controls, 193, 518 Traffic engineering (TE), 20, 45, 52–58, 85, 127, 137, 139, 157, 160, 168–170, 172–174, 182, 242, 358, 375 Traffic management, 398, 464, 518, 534 Traffic matrix ingress/egress, 143 inverse from link aggregate, 156–157 origin/destination, 143 Traffic patterns spatial, 148–150 temporal, 144–148 Trajectory sampling, 337 Transaction delay, 605, 607, 608 load, 605, 606 Transient routing failure, 182, 183, 200–205, 208, 215, 216 Trap, 259, 408, 476, 559–562 Triggered NAK, 235 Troubleshooting hardware/software analysis, 416 lab testing, 416 repair, 416–418 restore, 416–418 Trouble tickets, 5, 256, 560 Trust model, 452–455 TSP. See Telecommunications Service Priority Type inference, 279, 307, 312 PADS/ML, 307 U Unidirectional Path Switched Ring (UPSR), 63, 64 Uniform resource locator (URL) rewriting, 238, 240 Unplanned maintenance, 102, 103, 438 UPSR. 
See Unidirectional Path Switched Ring User request reroute, 238 V Validation, configuration, 14, 170–171, 277–313 Validation products, 286–295, 311 IP Assure, 286–295 NetDoctor, 311 Netsys, 311 WANDL, 311 Virtual IP (VIP) address, 557 Virtual Private LAN Service (VPLS), 33, 41, 42, 58, 60, 65, 358, 375 Virtual Private Network (VPN), 7, 12, 20, 36, 39, 42, 46, 51–54, 58–61, 65, 85, 262, 264, 266, 272, 273, 279, 299, 345, 349, 358, 369, 375–376, 462, 494, 501, 562 provider-based, 377 Virtual Private Wire Service (VPWS), 58, 60 VPLS. See Virtual Private LAN Service VPN. See Virtual Private Network VPWS. See Virtual Private Wire Service Vulnerabilities, 6, 14, 139, 271, 278, 279, 341, 439, 451, 453, 455, 458, 461, 463, 465, 467, 470, 471, 475, 482, 483, 489, 491, 492, 501, 502, 504–506, 508, 522–525, 528, 542, 586 W Walk-through, 527, 565 WANDL. See Wide Area Network Design Laboratory Warehousing, 329, 616 Warm zone, 539 Wavelength continuity, 29 Wavelength Division Multiplexing (WDM), 75, 141, 164, 171 WDM. See Wavelength Division Multiplexing Wide Area Network Design Laboratory (WANDL), 311 Wireless Priority Service (WPS), 535 Work center, 518, 520–523, 525, 527, 529, 530, 533. See also Operations Workload, 590–595 characterization, 590–591 modeling, 591–595 World Trade Center, 517, 518 Worm propagation, 483–485 WPS. See Wireless Priority Service Y Y.1540, 343, 344
Source Exif Data:
File Type : PDF
File Type Extension : pdf
MIME Type : application/pdf
PDF Version : 1.6
Linearized : No
Author : Charles R. Kalmanek, Sudip Misra, Yang (Richard) Yang
Create Date : 2010:06:14 08:53:14Z
Keywords : 1848828276, 9781848828278
Modify Date : 2010:06:14 08:53:50+02:00
Subject : Springer
XMP Toolkit : Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04
Format : application/pdf
Creator : Charles R. Kalmanek, Sudip Misra, Yang (Richard) Yang
Description : Springer
Title : Guide to Reliable Internet Services and Applications (Computer Communications and Networks)
Creator Tool : Adobe Acrobat Pro Extended 9.3.2
Metadata Date : 2010:06:14 08:53:50+02:00
Producer : Adobe Acrobat Pro Extended 9.3.2
Document ID : uuid:ce2d250c-047f-4c1f-90f9-675eabf74843
Instance ID : uuid:ffa68b3c-8503-4f34-87e9-c163ab8c3b5e
Page Layout : SinglePage
Page Mode : UseOutlines
Page Count : 637