Guide To Reliable Internet Services And Applications (Computer Communications Networks)
Page Count: 637
Computer Communications and Networks

For other titles published in this series, go to www.springer.com/series/4198

The Computer Communications and Networks series is a range of textbooks, monographs and handbooks. It sets out to provide students, researchers and non-specialists alike with a sure grounding in current knowledge, together with comprehensible access to the latest developments in computer communications and networking. Emphasis is placed on clear and explanatory styles that support a tutorial approach, so that even the most complex of topics is presented in a lucid and intelligible manner.

Charles R. Kalmanek • Y. Richard Yang • Sudip Misra
Editors

Guide to Reliable Internet Services and Applications

Editors

Charles R. Kalmanek
AT&T Labs Research
180 Park Ave.
Florham Park NJ 07932
USA
crk@research.att.com

Y. Richard Yang
Yale University
Dept. of Computer Science
51 Prospect St.
New Haven CT 06511
USA
yry@cs.yale.edu

Sudip Misra
Indian Institute of Technology Kharagpur
School of Information Technology
Kharagpur-721302, India
smisra.editor@gmail.com

Series Editor

Professor A.J. Sammes, BSc, MPhil, PhD, FBCS, CEng
Centre for Forensic Computing
Cranfield University
DCMT, Shrivenham
Swindon SN6 8LA
UK

ISSN 1617-7975
ISBN 978-1-84882-827-8
e-ISBN 978-1-84882-828-5
DOI 10.1007/978-1-84882-828-5
Springer London Dordrecht Heidelberg New York

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Control Number: 2010921296

© Springer-Verlag London Limited 2010

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Cover design: SPi Publisher Services
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Foreword

An oft-repeated adage among telecommunication providers goes, “There are five things that matter: reliability, reliability, reliability, time to market, and cost. If you can’t do all five, at least do the first three.” Yet, designing and operating reliable networks and services is a Herculean task.
Building truly reliable components is unacceptably expensive, forcing us to construct reliable systems out of unreliable components. The resulting systems are inherently complex, consisting of many different kinds of components running a variety of different protocols that interact in subtle ways. Inter-networks such as the Internet span multiple regions of administrative control, from campus and corporate networks to Internet Service Providers, making good end-to-end performance a shared responsibility borne by sometimes uncooperative parties. Moreover, these networks consist not only of routers, but also lower-layer devices such as optical switches and higher-layer components such as firewalls and proxies. And, these components are highly configurable, leaving ample room for operator error and buggy software. As if that were not difficult enough, end users understandably care about the performance of their higher-level applications, which has a complicated relationship with the behavior of the underlying network. Despite these challenges, researchers and practitioners alike have made tremendous strides in improving the reliability of modern networks and services. Their efforts have laid the groundwork for the Internet to evolve into a worldwide communications infrastructure – one of the most impressive engineering artifacts ever built. Yet, much of the amassed wisdom of how to design and run reliable networks has been spread across a variety of papers and presentations in a diverse array of venues, in tools and best-common practices for managing networks, and sometimes only in the minds of the many engineers who design networking equipment and operate large networks. This brings us to this book, which captures the state-of-the-art for building reliable networks and services. 
Like the topic of reliability itself, the book is broad, ranging from reliability modeling and planning, to network monitoring and network configuration, to disaster preparedness and reliable applications. A diverse collection of experts, from both industry and the academe, have come together to distill the collective wisdom. The book is both grounded in practical challenges and forward looking to put the design and operation of reliable networks on a strong foundation. As such, the book can help us build more reliable networks and services today, and face the many challenges of achieving even greater reliability in the years ahead.

Jennifer Rexford
Princeton University

Preface

Overview

This book arose from a conversation at the Internet Network Management workshop (INM) in 2007. INM’07 was subtitled “The Five Nine’s Workshop” because it focused on raising the availability of Internet services to “Five Nine’s” or 99.999%, an availability metric traditionally associated with the telephone network. During our conversation, we talked about and vehemently agreed that there was a need for a comprehensive book on reliable Internet services and applications – a guide that would collect in one volume the accumulated wisdom of leading researchers and practitioners in the field.

Networks and networked application services using the Internet Protocol have become a critical part of society. Service disruptions can have significant impact on people’s lives and business. In fact, as the Internet has grown, application requirements have become more demanding. In the early days of the Internet, the typical applications were non-real-time applications, where packet retransmission and application layer retry would hide underlying transient network disruptions. Today, applications such as online stock trading, online gaming, Voice over IP (VoIP), and video are much more sensitive to small perturbations in the network.
For example, following one undersea cable failure in the Pacific, AT&T restored the service on an alternate route, which introduced 5 ms of additional packet delay. This seemingly small additional delay was sufficient to cause problems for an enterprise customer that operated an application between a call center in India and a data center in Canada. This problem led to subsequent re-engineering of the customer’s end-to-end connection.

In addition, networked application services have become an increasingly important part of people’s lives. The Internet and virtual private networks support many mission critical business services. Ten years ago, it would have been just an inconvenience if someone lost their IP service. Today, people and businesses depend on Internet applications. Online stock trading companies are not in business if people cannot implement their trades. The Department of Defense cannot operate their information-based programs if their information infrastructure is not operating. Call centers with VoIP services cannot serve their customers without their IP network.

Although we started work on this book with a focus on network reliability, it should be obvious from the preceding description that it is important to consider both reliability and performance, and to consider both networks and networked application services. Examples of networked applications include email, VoIP, search engines, e-commerce sites, news sites, or content delivery networks.

Features

This book has a number of features that make it a unique and valuable guide to reliable Internet services and applications.

Systematic, interdisciplinary approach: Building and operating reliable network services and applications requires a systematic approach.
This book provides comprehensive, systematic, and interdisciplinary coverage of the important technical topics, including areas such as networking; performance and reliability modeling; network measurement; configuration, fault, and security management; and software systems. The book provides an introduction to all of the topics, while at the same time going into enough depth for interested readers who already understand the basics. Specifically, the book is divided into seven parts. Part I provides an introduction to the challenges of building reliable networks and applications, and presents an overview of the structure of a large Internet Service Provider (ISP) network. Part II introduces reliability modeling and network capacity planning. Part III extends the discussion beyond a single network administrative domain, covering interdomain reliability and overlay networks. Part IV provides an introduction to an important aspect of reliability: configuration management. Part V introduces network measurements, which provide the underpinning of network management. Part VI covers network and security management, and disaster preparedness. Part VII describes techniques for building application services, and provides a comprehensive overview of capacity and performance engineering for these services. Taken in total, the book provides a comprehensive introduction to an important topic.

Coverage of pragmatic problems arising in real, operational deployments: Building and operating reliable networks and applications require an understanding of the pragmatic challenges that arise in an operational setting. This book is written by leading practitioners and researchers, and provides a unique perspective on the subject matter arising from their experience. Several chapters provide valuable “best practices” to help readers translate ideas into practice.

Content and structure allow reference reading: Although the book can be read from cover to cover, each chapter is designed to be largely self-contained, allowing readers to jump to specific topics that they may be interested in. The necessary overlap across a few of the chapters is minimal.

Audience

The goal of this book is to present a comprehensive guide to reliable Internet services and applications in a form that will be of broad interest to educators and researchers. The material is covered in a level of detail that would be suitable for an advanced undergraduate or graduate course in computer science. It can be used as the basis or supplemental material for a one- or two-semester course, providing a solid grounding in both theory and practice. The book will also be valuable to researchers seeking to understand the challenges faced by service providers and to identify areas that are ripe for research. The book is also intended to be useful to practitioners who want to broaden their understanding of the field, and/or to deepen their knowledge of the fundamentals.

By focusing our attention on a large ISP network and associated application services, we consider a problem that is large enough to expose the real challenges and yet broad enough to expose guidelines and best practices that will be applicable in other domains. For example, though the book does not discuss access or wireless networks, we believe that the principles and approaches to reliability that are presented in this book apply to them and are, in fact, broadly applicable to any large network or networked application. We hope that you will find the book to be informative and useful.

Charles R. Kalmanek, Florham Park, NJ
Sudip Misra, India
Y. Richard Yang, New Haven, CT

Acknowledgments

The credit for this book goes first and foremost to the authors of the individual chapters.
It takes a great deal of effort to crystallize one’s understanding of a topic into an overview that is self-contained, technically deep, and interesting. The authors of this volume have done an outstanding job. The editors acknowledge the contributions of many reviewers, whose comments clearly improved the quality of the chapters. Simon Rees and Wayne Wheeler, our editors at Springer, have been helpful and supportive. The editors also acknowledge the support that they have been given by their families and loved ones during the long evenings and weekends spent developing this book.

Contents

Part I Introduction and Reliable Network Design
1 The Challenges of Building Reliable Networks and Networked Application Services
  Charles R. Kalmanek and Y. Richard Yang
2 Structural Overview of ISP Networks
  Robert D. Doverspike, K.K. Ramakrishnan, and Chris Chase

Part II Reliability Modeling and Network Planning
3 Reliability Metrics for Routers in IP Networks
  Yaakov Kogan
4 Network Performability Evaluation
  Kostas N. Oikonomou
5 Robust Network Planning
  Matthew Roughan

Part III Interdomain Reliability and Overlay Networks
6 Interdomain Routing and Reliability
  Feng Wang and Lixin Gao
7 Overlay Networking and Resiliency
  Bobby Bhattacharjee and Michael Rabinovich

Part IV Configuration Management
8 Network Configuration Management
  Brian D. Freeman
9 Network Configuration Validation
  Sanjai Narain, Rajesh Talpade, and Gary Levin

Part V Network Measurement
10 Measurements of Data Plane Reliability and Performance
  Nick Duffield and Al Morton
11 Measurements of Control Plane Reliability and Performance
  Lee Breslau and Aman Shaikh

Part VI Network and Security Management, and Disaster Preparedness
12 Network Management: Fault Management, Performance Management, and Planned Maintenance
  Jennifer M. Yates and Zihui Ge
13 Network Security – A Service Provider View
  Brian Rexroad and Jacobus Van der Merwe
14 Disaster Preparedness and Resiliency
  Susan R. Bailey

Part VII Reliable Application Services
15 Building Large-Scale, Reliable Network Services
  Alan L. Glasser
16 Capacity and Performance Engineering for Networked Application Servers: A Case Study in E-mail Platform Planning
  Paul Reeser

Index

Part I
Introduction and Reliable Network Design

Chapter 1
The Challenges of Building Reliable Networks and Networked Application Services

Charles R. Kalmanek and Y.
Richard Yang

C.R. Kalmanek, AT&T Labs, 180 Park Ave., 07932, Florham Park, NJ, USA; e-mail: crk@research.att.com
Y.R. Yang, Yale University, 51 Prospect Street, New Haven, CT, USA; e-mail: yry@cs.yale.edu
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_1, © Springer-Verlag London Limited 2010

1.1 Introduction

In the decades since the ARPANET interconnected four research labs in 1969 [1], computer networks have become a critical infrastructure supporting our information-based society. Our dependence on this infrastructure is similar to our dependence on other basic infrastructures such as the world’s power grids and the global transportation systems. Failures of the network infrastructure or major applications running on top of it can have an enormous financial and social cost with serious consequences to the organizations and consumers that depend on these services.

Given the importance of this communications and applications infrastructure to the economy and society as a whole, reliability is a major concern of network and service providers. After a survey of major network carriers including AT&T, BT, and NTT, Telemark [7] concludes that, “The three elements which carriers are most concerned about when deploying communication services are network reliability, network usability, and network fault processing capabilities. The top three elements all belong to the reliability category.”

Unfortunately, the challenges associated with running reliable, large-scale networks are not well documented in the research literature. Moreover, while networking and software-educational curricula provide a good theoretical foundation, there is little training in the techniques used by experienced practitioners to address reliability challenges. Another issue is that while traditional telecommunications vendors gained extensive experience in building reliable software, the pace of change has accelerated as the Internet has grown and Internet system vendors do not meet the level of reliability traditionally associated with “carrier grade” systems. Newer vendors accustomed to building consumer software are entering the service provider market, but they do not have a culture that focuses on the higher level of required reliability. This places a greater burden on service providers who integrate their software to help these vendors “raise the bar” on reliability to offer reliable services.

Although we emphasize network reliability in the foregoing section, it is important to consider both reliability and performance and to consider both networks and networked application services. Users are interested in the performance of an end-to-end service. When a user is unable to access his e-mail, he does not particularly care whether the network or the application is at fault. Examples of network applications include e-mail, Voice over IP, search engines, e-commerce sites, news sites, or content delivery networks.

1.2 Why Is Reliability Hard?

Supporting reliable networks and networked application services involves some of the most complex engineering and operational challenges that are dealt with in any industry. Much of this complexity is intentionally transparent to the end users, who expect things to “just work.” Moreover, the end users are typically not exposed to the root causes of network or service problems when their service is degraded or interrupted. As a result, it is natural for end users to assume that network and service reliability are not hard. In part, users get this impression because most service providers and Internet-facing web services operate at very high levels of reliability.
Though it may look easy, this level of reliability is a result of solid engineering and “constant vigilance.” The best service providers engage in a process of continuous improvement, similar to the Japanese “Kaizen” philosophy that was popularized by Deming [2]. In this book, we address the challenges faced by service providers and the approaches that they use to deliver reliable services to their users.

Before delving into the solution, we ask ourselves, why is it so hard to build highly reliable networks and networked application services? We can characterize the difficulty as resulting from three primary causes. The first challenge is scale and complexity; the second is that the services operate in the presence of constant change. These challenges are inherent to large-scale networks. The third challenge is less fundamental but still important. It relates to challenges with measurement and data.

1.2.1 Scale and Complexity Challenges

Scale and complexity challenges are fundamental to any large network or service infrastructure. As Steve Bellovin remarked, “Things break. Complex systems break in complex ways” [8]. In particular, large service provider networks contain hundreds of thousands of network elements distributed around the world, and tens of thousands of different models of equipment. These network elements are interconnected and must interoperate correctly to offer services to the network users. Failures in one part of the network can impact other parts of the network. Even if we consider only the infrastructure needed to provide basic IP connectivity services, it consists of a vast number of complex building blocks: routers, multiplexers, transmission equipment, servers, systems software, load balancers, storage, firewalls, application software, etc.
At any given point in time, some network elements have failed, have been taken out of service, or will be operating at a degraded performance level. The preceding description only hints at the challenges.

Despite the careful engineering and modeling that is done through all stages of the service life cycle, if we look at the service infrastructure as a system, we note that the system does not always behave as expected. There are many reasons for this, including:

• Software defects in network elements
• Inadequate modeling of dependencies
• Complex software-support systems

The vast majority of the elements involved in providing a network service contain software, which can be buggy, particularly when the software function is complex. If a bug is triggered, a piece of equipment can behave in unexpected ways. Even though the correct operation of router software is critical to service, we have seen design flaws in the way that the router operating system handles resource management and scheduling, which manifest themselves as latent outages. The history of the telephone network contains examples of major network outages caused by software faults, such as the famous “crash” of the AT&T long-distance telephone network in 1990 [3]. Similarly, the network elements that make up the IP network infrastructure contain complex control-plane software implementing distributed protocols that must interoperate properly for the network to work. When compared to the telephone switching software, control-plane software of IP networks changes more frequently and is far more likely to be subject to undetected software faults. These faults occasionally result in unexpected behaviors that can lead to outages or degraded performance.

In a large complex infrastructure, operators do not have a comprehensive model of all of the dependencies between systems supporting a given service: they rely on simplifying abstractions such as network layering and administrative separation of concerns.
These abstractions can break down in unexpected ways. For example, there are complex interactions between network layers, such as the transport and IP layers, that affect reliability. Consider a link between two routers that is transported over a SONET ring. Networks are typically designed so that protection switching at the SONET layer is transparent to the IP layer. However, several years ago, AT&T experienced problems in the field, whereby a SONET “protection switching event” triggered a router-software bug that caused several minutes of unexpected customer downtime. Since the protection switch occurred correctly, the problem did not trigger an alarm and was only uncovered by correlating customer trouble tickets with network event data. This cross-layer interaction is an example of the kinds of dependency that can be difficult to anticipate and troubleshoot.

In addition to the scale of the network and the complexity of the network equipment, correct operation depends on the operation of complex software systems that manage the network and support customer care. Router-configuration files contain a large number of parameters that must be configured correctly. Incorrect configuration of an access control list can create security vulnerabilities, or alternatively, can cause traffic to be “blackholed” by blocking legitimate traffic. If there is a mismatch between the Quality of Service settings on a customer-edge router and those on the provider-edge router that it connects, some applications may experience performance problems under heavy load. An inconsistency between the network inventory database and the running network can lead to stranded network capacity, service degradations, network outages, etc.
These problems sometimes manifest themselves weeks or months after the inconsistency appeared – for this reason, they are sometimes referred to as “time bombs.”

1.2.2 Constant Change

The second challenge relates to the fact that any large-scale service infrastructure undergoes constant change. Maintenance and customer-provisioning activities in a large global network are ongoing, spanning multiple time zones. On a typical workday, new customers are being provisioned, service for departing customers is being turned down, and change orders to change some service characteristic are being processed for existing customers. Capacity augmentation and traffic grooming, whereby private-line connections are rearranged to use network resources more efficiently, take place daily. Routine maintenance activities such as software upgrades also take place during predefined maintenance “windows.”

More complex maintenance activities, such as network migrations, also occur periodically. Examples of network migration include moving a customer connection from one access router to another, replacing a backbone router, or consolidating all of a regional network’s traffic onto a national backbone network in order to retire an older backbone. Replacing a backbone router in a service provider network requires careful planning and execution of a sequence of moves of the “uplinks” from access routers in order to minimize the amount of traffic that is dropped. Decision-support tools are used to model the traffic that impinges on all of the affected links at every step of the move to ensure that links are not congested.

In the midst of these day-to-day changes, network failures can occur at any time. The network is designed to automatically restore service after a failure. However, during planned maintenance activities, it is possible that some network capacity has been removed from service temporarily, potentially leaving the network more vulnerable to specific failures.
Under normal conditions, maintenance to repair the failed network element is scheduled to occur later at a convenient time, after which the network traffic may revert back to its original path.

Finally, in addition to the day-to-day changes of new customers, or the occasional changes that come from major network migrations, there are also architectural changes. These changes might result from the introduction of new features and services, or new protocols. An example might be the addition of a new “class of service” in the backbone. Another example might be turning up support for multicast services in MPLS-based VPNs. The first example (class of service) involves configuration changes that may touch every router in the network. The second example involves introducing a new architectural element (i.e., a PIM rendezvous point), enabling a new protocol (i.e., PIM), validating the operation of multicast monitoring tools, etc. All of these changes would have been tested in the lab prior to the First Field Application (FFA), which is typically the first time that everything comes together in an operational network carrying live customer traffic. If there are problems during the FFA with the new feature that is being deployed, network operations will execute procedures to gracefully back out of the change until the root cause of the problem is analyzed and corrected.

1.2.3 Measurement and Data Challenges

The third challenge in building reliable networks is associated with measurement and data. Vendor products deployed by service providers often suffer from an inadequate implementation of basic telemetry functions that are necessary to monitor and manage the equipment. In addition, because of the complexity of the operating environment described earlier, there are many, diverse data sources, with highly variable data quality. We present two examples.
Despite the maturity of SNMP [4], AT&T has seen an implementation of a commercial SNMP poller that did not correctly handle the data impacts of router reboots or loss of data in transit. Ideally, problems like this are discovered in the lab, but occasionally they are not discovered until the equipment is deployed and supporting live service.

Data problems are not limited to network layer equipment: vendor-developed software components running on servers may not support monitoring agents that export the data necessary to implement a comprehensive performance-monitoring infrastructure. When these software components are combined in a complex, multitiered application, the workflow and dependencies among the components may not be fully understood even by the vendor. When such a system is deployed, even with well-designed server instrumentation, it may be difficult to determine exactly which component is the bottleneck limiting system throughput.

Another issue is that data are often “locked up” in management system “silos.” This can result from selecting a vendor’s proprietary element-management system. Typically, proprietary systems are not designed to make data export easy, since the vendor seeks to lock the service provider into a complete “solution.” Data silos can also result from internal implementations. These often result from organizational silos: a management system is specified and built to address a specific set of functions, without the involvement of subject matter experts from other domains. Whatever the cause, the end result is that the data necessary to monitor and manage the infrastructure may not exist or may be difficult to access by analysts who are trying to understand the system.

1.3 Toward Network and Service Reliability

The examples in Section 1.2 give only a glimpse into the complex challenges faced by service providers who seek to provide reliable services.
Despite these complexities, the vast majority of users receive good service. How is this achieved? At the highest level, network and service reliability involve both good engineering design and good operational practices. These practices are inextricably linked: no matter how good the operations team is, good operation practices cannot make up for a poorly thought out design. Likewise, a good design that is implemented or operated poorly will not result in reliable service.

It should be obvious that reliable services start with good design and engineering. The service design process relies on extensive domain knowledge and a good understanding of the business and service-level objectives. Network engineers develop detailed requirements for each network element in light of the end-to-end objectives for reliability, availability, and operability. Network elements are selected carefully. After a detailed paper and lab evaluation, an engineering team selects a specific product to meet a particular need. Once the product is selected, it enters a change control process where differences between the requirements and the product’s capabilities are managed by the service provider in conjunction with the vendor. The service designers, working closely with test engineers, develop comprehensive engineering rules for each of the network elements, including safe operating limits for resources such as bandwidth or CPU utilization. Detailed engineering documents are developed that describe how the network element is to be used, its engineering limits, etc. Network management requirements for the new network element are developed in conjunction with operations personnel and delivered to the IT team responsible for the operations-support systems (OSSs). Before the FFA of the new element, the element and the OSSs undergo an Operations Readiness Test (ORT), which verifies that the element and the associated OSSs work as expected, and can be managed by network operations.
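
To make the idea of engineering rules with safe operating limits more concrete, the following is a minimal sketch in Python. It is not taken from any service provider's tooling: the `EngineeringRule` class, the `find_violations` function, the resource names, and the 80% and 60% limits are all hypothetical, chosen only to show how measured utilization might be compared against documented limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineeringRule:
    """Safe operating limit for one resource (hypothetical structure)."""
    resource: str        # e.g. "link_utilization" or "cpu_utilization"
    max_fraction: float  # upper bound as a fraction of capacity, 0.0 to 1.0


def find_violations(rules, measurements):
    """Return (element, resource, value, limit) for each value above its limit.

    measurements maps element name -> {resource name -> observed fraction}.
    Resources that have no rule are ignored in this sketch.
    """
    limits = {rule.resource: rule.max_fraction for rule in rules}
    violations = []
    for element, observed in sorted(measurements.items()):
        for resource, value in sorted(observed.items()):
            limit = limits.get(resource)
            if limit is not None and value > limit:
                violations.append((element, resource, value, limit))
    return violations


# Illustrative limits only; real limits would come from lab evaluation.
RULES = [
    EngineeringRule("link_utilization", 0.80),
    EngineeringRule("cpu_utilization", 0.60),
]

if __name__ == "__main__":
    sample = {
        "router-a": {"link_utilization": 0.72, "cpu_utilization": 0.65},
        "router-b": {"link_utilization": 0.91, "cpu_utilization": 0.30},
    }
    for element, resource, value, limit in find_violations(RULES, sample):
        print(f"{element}: {resource} {value:.0%} exceeds limit {limit:.0%}")
```

Keeping the limits as data rather than code mirrors the text's point that engineering rules are documented per network element and can be revised as testing or field experience changes them.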
The preceding paragraph gives a brief overview of some of the engineering “best practices” involved in building a reliable network. In addition, reliability and capacity modeling must be done for the network as a whole. The network architecture includes the appropriate recovery mechanisms to address potential failures. Reliability modeling tools are used to model the impact on the network of failures in light of both current and forecast demands. Where possible, the tools model cross-layer dependencies between IP layer links and the underlying transport or physical layer network, such as the existence of “shared risk groups” – links or elements that may be subject to simultaneous failure. By simulating all possible failure scenarios, these tools allow the network designers to trade off network cost against survivability. The network design also includes a comprehensive security design that considers the important threats to the network and its customers, and implements appropriate access controls and other security detection and mitigation strategies.

An operations organization is typically responsible for managing the network or service on a day-to-day basis. The operations team is supported by the operations-support systems mentioned earlier. These include configuration-management systems responsible for maintaining network inventory data and configuring the network elements, and service assurance systems that collect telemetry data from the network to support fault and performance management functions. The fault and performance management systems are the “eyes” of the operations team into the service infrastructure, used to figure out, in the case of problems, what needs to be repaired. We can consider fault and performance management systems as involving three areas: the instrumentation layer, the data management layer, and the management application layer.
We start thinking about the instrumentation layer by asking what telemetry or measurement data need to be collected to validate that the service is meeting its service-level objectives (or to troubleshoot problems if it is not). Standardized router MIB data provide a base level of information, but additional instrumentation is needed to manage large networks supporting complex applications. Passive monitoring techniques support collection of data directly from network elements and dedicated passive monitoring devices, but active monitoring, involving the injection and monitoring of synthetic traffic, is also required and is commonly used. Since the correct operation of the IP forwarding layer (data plane) critically depends on the correct operation of the IP control plane, both data plane and the control-plane monitoring are important. In software-based application services, the telemetry frequently does not adequately capture “soft” failure modes, such as transaction timeouts between devices or errors in software settings and parameters. Both the servers supporting application software and the applications themselves need to be instrumented and monitored for both faults and key performance parameters. Large service providers typically have a significant number of data sources that are relevant to service management, and the data management layer needs to be able to handle large volumes of telemetry and alarm data. As a result, the data-collection and data-management infrastructure presents challenging systems design problems. A good design allows data-source-specific collectors to be easily integrated. It also provides a framework for data normalization, so that common fields such as timestamps, router names, etc., can be normalized to a common key during data ingest so that application developers are spared some of the complexity of understanding details of the raw data streams. 
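The normalization step described above can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and does not come from the book: the two feed formats (one carrying epoch seconds, one carrying ISO 8601 timestamps with a UTC offset), the field names, the router names, and the `.example.net` domain suffix are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

DOMAIN_SUFFIX = ".example.net"  # hypothetical domain, stripped to form the common key


@dataclass(frozen=True)
class Event:
    timestamp: datetime  # always timezone-aware UTC after normalization
    router: str          # canonical router key shared by all feeds
    source: str          # which collector produced the record
    message: str


def normalize_router(name: str) -> str:
    """Map the spellings used by different feeds onto one canonical key."""
    key = name.strip().lower().rstrip(".")
    if key.endswith(DOMAIN_SUFFIX):
        key = key[: -len(DOMAIN_SUFFIX)]
    return key


def from_poller(record: dict) -> Event:
    """Hypothetical poller feed: epoch seconds and a 'device' field."""
    ts = datetime.fromtimestamp(record["epoch"], tz=timezone.utc)
    return Event(ts, normalize_router(record["device"]), "poller", record["text"])


def from_syslog(record: dict) -> Event:
    """Hypothetical syslog feed: ISO 8601 time with offset and a 'host' field."""
    ts = datetime.fromisoformat(record["time"]).astimezone(timezone.utc)
    return Event(ts, normalize_router(record["host"]), "syslog", record["msg"])


# Two raw records that describe the same router at the same instant.
a = from_poller({"epoch": 1243857600, "device": " nyc-cr1 ", "text": "ifInErrors rising"})
b = from_syslog({"time": "2009-06-01T08:00:00-04:00",
                 "host": "NYC-CR1.example.net.", "msg": "link flap"})
print(a.router == b.router, a.timestamp == b.timestamp)
```

Once both records carry the same router key and UTC timestamp, an application can join them without knowing how each collector spelled the hostname or which time zone it logged in, which is the complexity the text says developers should be spared.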
Ideally, the design of the data management layer supports a common real-time and archival data store that is accessed by a range of applications.

The management applications built on top of the data management layer support routine operations functions such as fault and performance management, in addition to supporting more complex analyses. Given the vast quantity of event data that is generated by the network, the event management system must appropriately filter the information that must be acted upon by the operations team to avoid flooding them with spurious information. The impact of alarm storms (and the importance of alarm filtering) can be illustrated by the story of Three Mile Island, in which the computer system noted 700 distinct error conditions within the first minute of the problem, followed by thousands of error reports and updates [5]. The operators were drowning in a sea of information at a time when they needed a small number of actionable items to work on.

Management applications also enable operations personnel to control the network, including performing routine tasks such as resetting a line card on a router as well as more complex tasks. Standard tasks are handled through an operations interface to an operations-support system. Ad hoc tasks that involve a complex workflow may require operations staff to use a scripting language that accesses the network inventory database and sends commands to network elements or element-management systems. Ideally, the operations-support systems automate most of the routine tasks, audit the results of these tasks, and back them out if there are problems.

It is useful to note that operations personnel are typically organized in multiple response tiers. The lower tiers of operations staff work on immediate problems, following established procedures.
The tools that they use have constrained functionality, targeted at the functions that they are expected to perform. The highest tier of operations personnel consists of senior operations staff charged with diagnosing complex problems in real-time or performing postmortem analysis of complex, unresolved problems that occurred in the past. These investigations may take more time than lower-tier operations staff can afford to spend on a specific problem. When there are serious problems affecting major customers or the network as a whole, engineers from the network engineering team are also called upon to assist. In these cases, one or more analysts do exploratory data mining (EDM) using data exploration tools [6] that support data drill down, statistical data analysis, and data visualization. Well-designed data exploration tools can make a huge difference when analysts are faced with the “needle in the haystack” problem – trying to sort through huge quantities of telemetry data to draw meaningful conclusions. When analysts uncover the root cause of a particular problem, this information can be used to eliminate the problem, e.g., by pressing a vendor to fix a software bug, by repairing a configuration error, etc.

As we mentioned in Section 1.2, a broad goal of both the network designers and network operations is to maintain and continuously improve network reliability, availability, and performance, despite the challenges. “Holding the gains” or staying flat on network performance is insufficient to meet increasingly tight customer and application requirements. There is evidence that the principles and best practices presented in this book yield results. Figure 1.1 shows measured Defects-per-Million (DPM) on the AT&T IP Backbone since the AT&T Managed Internet Service was first offered in 1999. This chart plots the total number of minutes of port outages during a year (i.e., the number of minutes each customer port was out of service), divided by the number of port minutes in that year (i.e., the number of ports times the number of minutes each was in service), times a normalization factor of 1,000,000. The points are measured data; the smooth curve resembles a classic improvement curve. Over the first 2 years of the service, DPM was reduced significantly as vendor problems were addressed, architectural improvements were put in place, and operations processes were matured. Further improvements continue to be achieved. While DPM is only one of the many fault and performance metrics that must be tracked and managed, this chart illustrates how good design and good operations pay off.

Fig. 1.1 Unplanned DPM for AT&T IP Backbone (vertical axis: unplanned DPM, linear scale; horizontal axis: year, 1999–2008)

The principles that underlie design and operation of reliable networks are also critical to the design and operation of reliable application services. However, there are also many differences between these two domains, including wide differences in the domain knowledge of the typical network engineer and the typical software developer. The life cycle of reliable software starts with understanding the requirements, and involves every step of the development process, including field support and application monitoring. As in networks, capacity and performance engineering of application services rely on both modeling and data collection.

This section has described some of the design and network management practices that are performed by large service providers that run reliable networks and services. In Section 1.4, we provide an overview of the material that is covered in the book.
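The DPM definition given for Fig. 1.1 reduces to a one-line calculation, sketched below in Python. The port count and outage total are hypothetical numbers chosen only to make the arithmetic easy to follow; they are not AT&T measurements, and the function name is an assumption introduced for this illustration.

```python
def defects_per_million(outage_port_minutes, in_service_port_minutes):
    """DPM = outage port-minutes / in-service port-minutes x 1,000,000."""
    if in_service_port_minutes <= 0:
        raise ValueError("in-service port minutes must be positive")
    if not 0 <= outage_port_minutes <= in_service_port_minutes:
        raise ValueError("outage minutes must lie between 0 and the in-service total")
    return 1_000_000 * outage_port_minutes / in_service_port_minutes


# Hypothetical figures, not measured data.
MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600 minutes in a non-leap year
ports = 10_000                            # assumed number of customer ports
port_minutes = ports * MINUTES_PER_YEAR   # 5,256,000,000 port-minutes in service
outage_minutes = 52_560                   # assumed total outage summed over all ports

dpm = defects_per_million(outage_minutes, port_minutes)
availability = 1 - dpm / 1_000_000        # the same data expressed as a fraction
print(f"DPM = {dpm:.1f}, availability = {availability:.5%}")
```

With these assumed inputs the outage total is one hundred-thousandth of the port-minutes in service, so the DPM is 10, which corresponds to 99.999% port availability. Expressing the metric per million makes small year-over-year improvements easier to read on a chart than a string of nines would be.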
1.4 A Bird’s Eye View of the Book

The book consists of seven parts, covering both reliable networks and reliable network application services.

1.4.1 Part I: Reliable Network Design

Part I introduces the challenges of building reliable networks and services, and provides background for the rest of the book. Following this chapter, Chapter 2 presents an overview of the structure of a large ISP backbone network. Since IP network reliability is tied intimately to the underlying transport network layers, this chapter presents an overview of these technologies. Section 2.4 provides an overview of the IP control plane, and introduces Multi-Protocol Label Switching (MPLS), a routing and forwarding technology that is used by most large ISPs to support Internet and Virtual Private Network (VPN) services on a shared backbone network. Section 2.5 introduces network restoration, which allows the network to rapidly recover from failures. This section provides a performance analysis of the limitations of OSPF failure detection and recovery to motivate the deployment of MPLS Fast Reroute. The chapter concludes with a case study of an IP network supporting IPTV services that links together many of the concepts.

1.4.2 Part II: Reliability Modeling and Network Planning

Part II of the book covers network reliability modeling, and its close cousin, network planning. Chapter 3 starts with an overview of the main router elements (e.g., routing processors, line cards, switching fabric, power supply, and cooling system), and their failure modes. Section 3.2 introduces redundancy mechanisms for router elements, as they are important for availability modeling. Section 3.3 shows how to compute the reliability metrics of a single router with and without redundancy mechanisms. Section 3.4 extends the reliability model from a single router to a large network of edge routers and presents reliability metrics that consider device heterogeneity.
The chapter also provides an overview of the challenges in measuring end-to-end availability, which is the focus of Chapter 4.

Chapter 4 provides a theoretical grounding in performance and reliability (performability) modeling in the context of a large-scale network. A fundamental challenge is that the size of the state space is exponential in the number of network elements. Section 4.2 presents a hierarchical network model used for performability modeling. Section 4.3 discusses the performability evaluation problem in general and presents the state-generation approach. The chapter also introduces the nperf network performability analyzer, a software package developed at AT&T Labs Research. Section 4.4 concludes by presenting two case studies that illustrate the material of this chapter, the first involving an IPTV distribution network, and the second dealing with architecture choices for network access.

Chapter 5 focuses on network planning. Since capacity planning depends on utilization and traffic data, the chapter takes a systems view. Because network measurements are of varying quality, the modeling process must be robust to data-quality problems while still giving estimates that are useful for planning: “Essentially, all models are wrong, but some are useful.” This chapter is organized around the key steps in network planning. Sections 5.2 and 5.3 cover measurements, analysis, and modeling of network traffic. Section 5.4 covers prediction, including both incremental planning and green-field planning. Section 5.5 presents optimal network planning. Section 5.6 covers robust planning.

1.4.3 Part III: Interdomain Reliability and Overlay Networks

Part III extends beyond the design of a large backbone network to interdomain and overlay networks. Chapter 6 provides an overview of interdomain routing. Section 6.3 highlights the limitations of the BGP routing protocol.
For example, the protocol design does not guarantee that routing will converge to a stable route. Section 6.4 presents measurement results that quantify the impact of interdomain routing impairments on end-to-end path performance. Section 6.5 presents a detailed overview of the existing solutions to achieve reliable interdomain routing, and Section 6.6 points out possible future research directions.

Overlay networks are discussed in Chapter 7 as a way of providing end-to-end reliability at the application or service layer. The overlay topology can be tailored to application requirements; overlay routing may choose application-specific policies; and overlay networks can emulate functionality not supported by the underlying network. This chapter surveys overlay applications with a focus on how they are used to increase network resilience. The chapter considers how overlay networks can make a distributed application more resilient to flash crowds, to component failures and churn, to network failures and congestion, and to denial-of-service attacks.

1.4.4 Part IV: Configuration Management

Network design is just one part of building a reliable network or service infrastructure; configuration management is another critical function. Part IV discusses this topic. Chapter 8 discusses network configuration management, presenting a high-level view of the software system involved in managing a large network of routers in support of carrier class services. Section 8.2 reviews key concepts to structure the types of data items that the system must deal with. Section 8.3 describes the subcomponents of the system and the requirements of each subcomponent. This section also discusses two approaches that are commonly used for router configuration, policy-based and template-based, and highlights the different requirements associated with provisioning consumer and enterprise services.
Section 8.4 gives an overview of one of the key challenges in designing a configuration-management system, which is handling changes. Finally, the chapter presents a step-by-step overview of the subscriber provisioning process. While a well-designed configuration-management system does configuration auditing, Chapter 9 looks at auditing from a different perspective, describing the need for bottom-up, network-wide configuration validation. Section 9.2 provides a case study of the challenges of configuring a multi-organization “collaboration network,” the types of vulnerabilities caused by configuration errors, the reasons these arise, and the benefits derived from using a configuration validation system. Section 9.3 abstracts from experience and proposes a reference design of a validation system. Section 9.4 discusses the IPAssure system and the design choices it has made to realize this design. Section 9.5 surveys related technologies for realizing this design. Section 9.6 discusses the experience with using IPAssure to assist a US government agency with compliance with FISMA requirements. 1.4.5 Part V: Network Measurement While measurement was not a priority in the original design of the Internet, the complexity of networks, traffic, and the protocols that mediate them now require detailed measurements to manage the network, to verify that performance meets the required goals, and to diagnose performance degradations when they occur. Part V covers network measurement, with a focus on reliability and performance monitoring. Chapter 10 covers data plane measurements. Sections 10.2–10.5 describe a spectrum of passive traffic measurement methods that are currently employed in provider networks, and also describe some newer approaches that have been proposed or may even be deployed in the medium term. Section 10.6 covers active measurement tools. Sections 10.7–10.8 review IP performance metrics and their usage in service-level agreements. 
Section 10.9 presents multiple approaches to deploy active measurement systems.

The control plane in an IP network controls the overall flow of traffic in the network, and is critical to its operation. Chapter 11 covers control-plane measurements. Section 11.2 gives an overview of the key protocols that make up the “unicast” control plane (OSPF and BGP), describes how they are monitored, and surveys key applications of the measurement data. Section 11.3 presents the additional challenges that arise in performing multicast monitoring.

1.4.6 Part VI: Network and Security Management, and Disaster Preparedness

Chapter 12 focuses on the network management systems and the tasks involved in supporting the day-to-day operations of an IP network. The goal of network operations is to keep the network up and running, and performing at or above designed levels of service performance. Section 12.2 covers fault and performance management – detecting, troubleshooting, and repairing network faults and performance impairments. Section 12.3 examines how process automation is incorporated in fault and performance management to automate many of the tasks that were originally executed by humans. Process automation is the key ingredient that enables a relatively small Operations group to manage a rapidly expanding number of network elements and customer ports, along with growing complexity. Section 12.4 discusses tracking and managing network availability and performance over time, looking across larger numbers of network events to identify opportunities for performance improvements. Section 12.5 then focuses on planned maintenance. The chapter also presents areas for innovation and a set of best practices.

Chapter 13 presents a service provider’s view of network security. Section 13.2 provides an exposition of the network security threats and their causes.
A fundamental concern is that in the area of network security, the economic balance is heavily skewed in favor of bad actors. Section 13.3 presents a framework for network security, including the means of detecting security incidents. Section 13.4 deals with the importance of developing good network security intelligence. Section 13.5 presents a number of operational network security systems used for the detection and mitigation of security threats. Finally, Section 13.6 summarizes important insights and then briefly considers important new and developing directions and concerns in network security as an indication of where resources should be focused both tactically and strategically.

Chapter 14 discusses disaster preparedness as the critical factor that determines an operator’s ability to recover from a network disaster. For network operators to effectively recover from a disaster, a significant investment must be made to prepare before the disaster occurs, so that network operations are prepared to act quickly and efficiently. This chapter describes the creation, exercise, and management of disaster recovery plans. With good disaster preparedness, disaster recovery becomes the disciplined management of the execution of disaster recovery plans.

1.4.7 Part VII: Reliable Application Services

Large-scale networks exist to connect users to applications. Part VII expands the scope of the book to the software and servers that support network applications. Chapter 15 presents an approach to the design and development of reliable network application software. This chapter presents the entire life cycle of what it takes to build reliable network applications, including the software development process, requirements development, architecture, design and implementation, testing methodology, support, and reporting.
This chapter also discusses techniques that aid in troubleshooting failed systems as well as techniques that tend to minimize the duration of a failure. The chapter presents best practices for building reliable network applications.

Chapter 16 provides a comprehensive overview of capacity and performance engineering (C/PE), which is especially critical to the successful deployment of a networked service platform. At the highest level, the goal is to ensure that the service meets all performance and reliability requirements in the most cost-effective manner, where “cost” encompasses such areas as hardware/software resources, delivery schedule, and scalability. The chapter uses e-mail as an illustrative example. Section 16.4 covers the architecture assessment phase of the C/PE process, including the flow of critical transactions. Section 16.5 covers the workload/metric assessment phase, including the workload placed on platform elements and the service-level performance/reliability metrics that the platform must meet. Sections 16.6 and 16.7 develop analytic models to predict how a proposed platform will handle the workload while meeting the requirements (reliability/availability assessment and capacity/performance assessment). Sections 16.8 and 16.9 develop engineering guidelines to size the platform initially (scalability assessment) and to maintain service capacity, performance, and reliability post deployment (capacity/performance management). Best practices of C/PE are given at the end of the chapter.

1.5 Conclusion

With our society’s increasing dependence on networks and networked application services, the importance of reliability and performance engineering has never been greater. Unfortunately, large-scale networks and services present significant challenges: scale and complexity, the need for correct operation in the presence of constant change, as well as measurement and data challenges.
Addressing these challenges requires good design and sound operational practices. Network and service engineers start with a firm understanding of the design objectives, the technology, and the operational environment for the service; follow a comprehensive service design process; and develop capacity and performance engineering models. Network and service management rely on a well-thought-out measurement design, a data collection and storage infrastructure, and a suite of management tools and applications. When done right, the end result is a network or service that works well. As customers and applications become more demanding, this “raises the bar” for reliability and performance, ensuring that this field will continue to provide opportunities for research and improvements in practice.

References

1. A History of the ARPANET. Bolt, Beranek, and Newman, 1981.
2. Deming, W. E. (2000). The new economics for government, industry and education (2nd ed.). Cambridge, MA: MIT Press. ISBN 0-262-54116-5.
3. AT&T statement (1990). The Risks Digest, 9(63).
4. Wilson, A. M. (1998). Alarm management and its importance in ensuring safety, Best practices in alarm management, Digest 1998/279.
5. Stallings, W. (1999). SNMP, SNMPv2, SNMPv3, and RMON 1 and 2 (3rd ed.). Reading, MA: Addison-Wesley.
6. Mahimkar, A., Yates, J., Zhang, Y., Shaikh, A., Wang, J., Ge, Z., et al. (December 2008). Troubleshooting chronic conditions in large IP networks. Proceedings of the 4th ACM international conference on emerging Networking Experiments and Technologies (CoNEXT).
7. Telemark Survey. http://www.telemarkservices.com/
8. Schwartz, J. (2007). Who needs hackers? New York Times, September 12, 2007.

Chapter 2
Structural Overview of ISP Networks

Robert D. Doverspike, K.K.
Ramakrishnan, and Chris Chase

R.D. Doverspike, Executive Director, Network Evolution Research, AT&T Labs Research, 200 S. Laurel Ave, Middletown, NJ 07748, USA, e-mail: rdd@research.att.com
K.K. Ramakrishnan, Distinguished Member of Technical Staff, Networking Research, AT&T Labs Research, Shannon Labs, 180 Park Avenue, Florham Park, NJ 07932, USA
C. Chase, AT&T Labs, 9505 Arboretum Blvd, Austin, TX 78759, USA, e-mail: chase@labs.att.com

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_2, © Springer-Verlag London Limited 2010

2.1 Introduction

An Internet Service Provider (ISP) is a telecommunications company that offers its customers access to the Internet. This chapter specifically covers the design of a large Tier 1 ISP that provides services to both residential and enterprise customers. Our primary focus is on a large IP backbone network in the continental USA, though similarities arise in smaller networks operated by telecommunication providers in other parts of the world. This chapter is principally motivated by the observation that in large carrier networks, the IP backbone is not a self-contained entity; it co-exists with numerous access and transport networks operated by the same or other service providers. In fact, how the IP backbone interacts with its neighboring networks and the transport layers is fundamental to understanding its structure, operation, and planning.

This chapter is a hands-on description of the practical structure and implementation of IP backbone networks. Our goal is complicated by the complexity of the different network layers, each of which has its own nomenclature and concepts. Therefore, one of our first tasks is to define the nomenclature we will use, classifying the network into layers and segments. Once this partitioning is accomplished, we identify where the IP backbone fits and describe its key surrounding layers and networks.

This chapter is motivated by three aspects of the design of large IP networks. The first aspect is that the design of an IP backbone is strongly influenced by the details of the underlying network layers. We will illustrate how the evolution of customer access through the metro network has influenced the design of the backbone. We also show how the evolution of the Dense Wavelength-Division Multiplexing (DWDM) layer has influenced core backbone design. The second aspect is the use of Multiprotocol Label Switching (MPLS) in large ISP networks. The separation of routing and forwarding provided by MPLS allows carriers to support Virtual Private Networks (VPNs) and Traffic Engineering (TE) on their backbones much more simply than with traditional IP forwarding. The third aspect is how network outages manifest in multiple network layers and how the network layers are designed to respond to such disruptions, usually through a set of processes called network restoration. This is of prime importance because a major objective of large ISPs is to provide a known level of quality of service to their customers through Service Level Agreements (SLAs). Network disruptions occur from two major sources: failure of network components and maintenance activity. Network restoration is accomplished through preplanned network design processes and real-time network control processes, as provided by an Interior Gateway Protocol (IGP) such as Open Shortest Path First (OSPF). We present an overview of OSPF reconvergence and the factors that affect its performance. As customers and applications place more stringent requirements on restoration performance in large ISPs, the assessment of OSPF reconvergence motivates the use of MPLS Fast Reroute (FRR).

Beyond the motivations described above, the concepts defined in this chapter lay useful groundwork for the succeeding chapters.
Section 2.2 provides a structural basis by providing a high-level picture of the network layers and segments of a typical, large nationwide terrestrial carrier. It also provides nomenclature and technical background about the equipment and network structure of some of the layers that have the largest impact on the IP backbone. Section 2.3 provides more details about the architecture, network topology, and operation of the IP backbone (the IP layer) and how it interacts with the key network layers identified in Section 2.2. Section 2.4 discusses routing and control protocols and their application in the IP backbone, such as MPLS. The background and concepts introduced in Sections 2.2–2.4 are utilized in Section 2.5, where we describe network restoration and planning. Finally, Section 2.6 describes a “case study” of an IPTV backbone. This section unifies many of the concepts presented in the earlier sections and shows how they come together to allow network operators to meet their network performance objectives. Section 2.7 provides a summary, followed by a reference list, and a glossary of acronyms and key terms.

2.2 The IP Backbone Network in Its Broader Network Context

2.2.1 Background and Nomenclature

From the standpoint of large telecommunication carriers, the USA and most large countries are organized into metropolitan areas, which are colloquially referred to as metros. Large intrametro carriers place their transmission and switching equipment in buildings called Central Offices (COs). Business and residential customers typically obtain telecommunication services by connecting to a designated first CO called a serving central office. This connection occurs over a feeder network that extends from the CO toward the customer plus a local loop (or last mile) segment that connects from the last equipment node of the feeder network to the customer premise.
Equipment in the feeder network is usually housed in above-ground huts, on poles, or in vaults. The feeder and last-mile segments usually consist of copper, optical fiber, coaxial cable, or some combination thereof. Coaxial cable is typical of a cable company, also called a Multiple System Operator (MSO). While we will not discuss metro networks in detail in this chapter, it is important to discuss the aspects of them that affect the IP backbone. However, the metro networks we describe coincide mostly with those carriers whose origins are from large telephone companies (sometimes called “Telcos”).

Almost all central offices today are interconnected by optical fiber. Once a customer’s data or voice enters the serving central office, if it is destined outside that serving central office, it is routed to other central offices in the same metro area. If the service is bound for another metro, it is routed to one or more gateway COs. If it is bound for another country, it eventually routes to an international gateway. A metro gateway CO is often called a Point of Presence (POP). While POPs were originally defined for telephone service, they have evolved to serve as intermetro gateways for almost all telecommunication services. Large intermetro carriers have one or more POPs in every large city.

Given this background, we now employ some visualization aids. Networks are organized into network layers, which we depict in Fig. 2.1 as two network graphs stacked vertically on top of one another. Each of the network layers can be considered to be an overlay network with respect to the network below.

Fig. 2.1 Conceptual network layers and segmentation (an inter-metro network connecting Metro 1 through Metro 5)

We can further organize these layers into access, metro, and core network segments. Figure 2.1 shows the core segment connected to multiple metro segments.
Each metro segment represents the network layers of the equipment located in the central offices of a given metropolitan area. The access segment represents the feeder network and loop network associated with a given metro segment. The core segment represents the equipment in the POPs and network structures that connect them for intermetro transport and switching. In this chapter, we focus on the ISP backbone network, which is primarily associated with the core segment. We refer only briefly to access architectures and will discuss portions of the metro segment to the extent to which they interact and connect to the core segment. Also, in this chapter we will not discuss broader telecommunication contexts, such as international networks (including undersea links), satellite, and wireless networks. More detail on the various network segments and their network layers and a historical description of how they arose can be found in [11]. Unfortunately, there is a wide variety of terminology used in the industry, which presents a challenge for this chapter because of our broad scope. Some of the terminology is local to an organization, application, or network layer and, thus, when used in a broader context can be confused with other applications or layers. Within the context of network-layering descriptions, we will use the term IP layer. However, we use the term “IP backbone” interchangeably with “IP layer” in the context of the core network segment. The terms Local Area Network (LAN), Metropolitan Area Network (MAN), and Wide Area Network (WAN) are also sometimes used and correlate roughly with the access, metro, and core segments defined earlier; however, LAN, MAN, and WAN are usually applied only in the context of packet-based networks. Therefore, in this chapter, we will use the terms access, metro, and core, since they apply to a broader context of different network technologies and layers. 
Other common terms for the various layers within the core segment are long-distance and long-haul networks.

2.2.2 Simple Graphical Model of Network Layers

The following simple graph-oriented model is helpful when modeling routing and network design algorithms, to understand how network layers interact and, in particular, how to classify and analyze the impact of potential network disruptions. This model applies to most connection-oriented networks and, thus, will apply to some higher-layer protocols that sit on top of the IP layer. The IP layer itself is connectionless and does not fit exactly in this model. However, this model is particularly helpful to understand how lower network layers and neighboring network layers interact.

In the layered model, a network layer consists of nodes, links (also called edges), and connections. The nodes represent types of switches or cross-connect equipment that exchange data in either digital or analog form via the links that connect them. Note that at the lowest layer (such as fiber) nodes represent equipment, such as fiber-optic patch panels, in which connections are switched manually by cross-connecting fiber patch cords from one interface to another. Links can be modeled as directed (unidirectional) or undirected (bidirectional). Connections are cross-connected (or switched) by the nodes onto the links, and thus form paths over the nodes and links of the graph. Note that the term connection often has different names at different layers and segments. For example, in most telecommunication carriers, a connection (or portions thereof) is called a circuit in many of the lower network layers, often referred to as transport layers. Connections can be point-to-point (unidirectional or bidirectional), point-to-multipoint or, more rarely, multipoint-to-multipoint. Generally, connections arise from two sources.
First, telecommunication services can arise “horizontally” (relative to our conceptual picture of Fig. 2.1) from a neighboring network segment. Second, connections in a given layer can originate from edges of a higher network layer. In this way, each layer provides a connection “service” for the layer immediately above it to provide connectivity. Sometimes, a “client/server” model is referenced, such as the User-Network Interface (UNI) model [29] of the Optical Internetworking Forum (OIF), wherein the links of higher-layer networks are “clients” and the connections of lower-layer networks are “servers”. For example, see G.7713.2 [19] for more discussion of connection management in lower-layer transport networks.

Recall that the technology layers we define are differentiated by the nodes, which represent actual switching or cross-connect equipment, rather than more abstract entities, such as protocols within each of these technology layers that can create multiple protocol sublayers. An early manifestation of protocol layering is the OSI model developed by the ISO standards organization [37] and the resulting classification of packet layering, such as Layer 1, Layer 2, Layer 3, which subsequently emerged in the industry. Although these layering definitions can be somewhat strained in usage, the industry generally associates IP with Layer 3 and MPLS or Ethernet VLANs with Layer 2 (which will be described later in the chapter). Layer 1, or the Physical Layer (PHY layer) of the OSI stack, covers multiple technology layers that we will cover in the next section.

We illustrate this graphical network-layering model in Fig. 2.2, which depicts two layers. Note that for simplicity, we depict the edges in Fig. 2.2 as undirected. The cross-connect equipment represented by the nodes of Layer U (“upper layer”) connects to its counterpart nodes in Layer L (“lower layer”) by interlayer links, depicted as lightly dashed vertical lines.
While this model has no specific geographical correlation, we note that the switching or cross-connect equipment represented in Layer U is usually colocated in the same buildings/locations (central offices in carrier networks) as its lower-layer counterparts in Layer L. In such representations, the interlayer links are called intra-office links. The links of Layer U are transported as connections in lower Layer L. For example, Fig. 2.2 highlights a link between nodes 1 and 6 of Layer U. This link is transported via a connection between nodes 1 and 6 of Layer L. The path of this connection is shown through nodes (1, 2, 3, 4, 5, 6) at Layer L. Another example is given by the link between nodes 3 and 5 of Layer U. This routes over nodes (3, 4, 5) in Layer L.

[Fig. 2.2 Example of network layering. Nodes of Layer U and Layer L are co-located (same central office); each Layer-U link is transported as a connection in Layer L]

As this layered model illustrates, the concept of a “link” is a logical construct, even in lower “physical layer(s)”. Along these lines, we identify some interesting observations in Fig. 2.2:

1. There are more nodes in Layer L than in Layer U.
2. When viewed as separate abstract graphs, the degree of logical connectivity in Layer L is less than that for Layer U. For example, there are at most three edge-diverse paths between nodes 1 and 6 in Layer U. However, there are at most two edge-diverse paths between the corresponding pair of nodes in Layer L.
3. When we project the links of Layer U onto their connection paths in Layer L, we see some overlap. For example, the two logical links highlighted in Layer U overlap on links (3, 4) and (4, 5) of Layer L.

These observations generalize to the network layers associated with the IP backbone and affect how network layers are designed and how network failures at various layers affect higher-layer networks.
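The projection of Layer U links onto Layer L paths can be sketched in a few lines of Python. This is only an illustration of the graph model: the two paths are the highlighted examples of Fig. 2.2, while the function and variable names are assumptions introduced here and do not come from the chapter.

```python
from itertools import combinations

# Layer-U links mapped to the Layer-L node paths that carry them.
# The two entries are the highlighted links of Fig. 2.2.
UPPER_LINK_PATHS = {
    (1, 6): [1, 2, 3, 4, 5, 6],
    (3, 5): [3, 4, 5],
}


def lower_links(path):
    """Return the set of undirected Layer-L links traversed by a node path."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def shared_lower_links(upper_paths):
    """Map each pair of Layer-U links to the Layer-L links they have in common."""
    shared = {}
    for (u1, p1), (u2, p2) in combinations(sorted(upper_paths.items()), 2):
        common = lower_links(p1) & lower_links(p2)
        if common:
            shared[(u1, u2)] = sorted(tuple(sorted(link)) for link in common)
    return shared


def failed_upper_links(upper_paths, failed_lower_link):
    """List the Layer-U links whose connection crosses the failed Layer-L link."""
    failed = frozenset(failed_lower_link)
    return sorted(u for u, p in upper_paths.items() if failed in lower_links(p))


if __name__ == "__main__":
    print(shared_lower_links(UPPER_LINK_PATHS))
    print(failed_upper_links(UPPER_LINK_PATHS, (3, 4)))
```

Storing each Layer-L link as a `frozenset` makes the comparison independent of direction, which matches the undirected edges of Fig. 2.2. The same mapping that exposes the overlap also identifies which Layer U links are affected when a single Layer L link is cut.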
The second observation says that while the logical topology of an upper-layer network, such as the IP layer, looks like it has many alternate paths to accommodate network disruptions, this can be deceiving unless one incorporates the lower-layer dependencies. For example, if link 3–4 of Layer L fails, then both links 1–6 and 3–5 of Layer U fail. Put more generally, failures of links of lower-layer networks usually cause multiple link failures in higher-layer networks. Specific examples will be described in Section 2.3.2.

2.2.3 Snapshot of Today’s Core Network Layers

Figure 2.3 provides a representation of the set of services that might be provided by a large US-based carrier, and how these services map onto different network layers in the core segment. This figure is borrowed from [11] and depicts a mixture of legacy network layers (i.e., older technologies slowly being phased out) and current or emerging network layers. For a connection-oriented network layer (call it layer L), demand for connections comes from two sources: (1) links of higher network layers that route over layer L and (2) demand for telecommunications services provided by layer L but which originate outside layer L’s network segment. The second source of demand is depicted by rounded rectangles in Fig. 2.3. Note that Fig. 2.3 is a significant simplification of reality; however, it does capture most predominant layers and principal interlayer relationships relevant to our objectives. An important observation in Fig. 2.3 is that links of a given layer can be spread over multiple lower layers, including “skipping” over intermediate lower layers.

Before we describe these layers, we provide some preliminary background on Time Division Multiplexing (TDM), whose signals are often used to transport links of the IP layer. Table 2.1 summarizes the most common TDM transmission rates.
[Fig. 2.3 Example of core-segment network layers. Layers shown: Ethernet, IP, ATM, Circuit-Switched, W-DCS, DCS-3/3, Intelligent Optical Switch (IOS), SONET Ring, ROADM / Pt-to-pt DWDM, Pre-SONET Transmission, and Fiber. The key distinguishes services, layer-to-layer service connections, network layers, and legacy layers.]

Table 2.1 Time division multiplexing (TDM) digital hierarchy (partial list)

| Approximate rate | DS-n | Plesiochronous | SONET | SDH | OTN wrapper |
|---|---|---|---|---|---|
| 64 Kb/s | DS-0 | E-0 | | | |
| 1.5 Mb/s | DS-1 | | | | |
| 2.0 Mb/s | | E-1 | | | |
| 34 Mb/s | | E-3 | | | |
| 45 Mb/s | DS-3 | | | | |
| 51.84 Mb/s | | | STS-1 | VC-3 | |
| 155.5 Mb/s | | | OC-3 | STM-1 | |
| 622 Mb/s | | | OC-12 | STM-4 | |
| 2.5 Gb/s | | | OC-48 | STM-16 | ODU-1 |
| 10 Gb/s | | | OC-192 | STM-64 | ODU-2 |
| 40 Gb/s | | | OC-768 | STM-256 | ODU-3 |
| 100 Gb/s | | | | | ODU-4 |

Kb/s = kilobits per second; Mb/s = megabits per second; Gb/s = gigabits per second. OTN line rates are higher than payload. ODU-2 includes 10 GigE and ODU-3 includes 40 GigE (under development). ODU-4 only includes 100 GigE.

The Synchronous Optical Network (SONET) digital-signal standard [35], pioneered by Bellcore (now Telcordia) in the early 1990s, is shown in the fourth column of Table 2.1. SONET is the existing higher-rate digital-signal hierarchy of North America. Synchronous Digital Hierarchy (SDH) is a similar digital-signal standard later pioneered by the International Telecommunication Union (ITU-T) and adopted by most of the rest of the world. The DS-n column represents the North American pre-SONET digital-signal rates, most of which originated in the Bell System. The Plesiochronous column represents the pre-SDH rates used mostly in Europe. However, after nearly 30 years, both DS-n and Plesiochronous signals are still quite abundant and their related private-line services are still sold actively.
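The SONET rates in Table 2.1 follow a regular pattern: an OC-n (or STS-n) signal runs at n times the 51.84 Mb/s STS-1 rate, and an SDH STM-m signal corresponds to OC-3m. The short sketch below reproduces that arithmetic. The function names are illustrative assumptions and are not taken from any standard or library.

```python
STS1_MBPS = 51.84  # STS-1 line rate in Mb/s (Table 2.1)


def oc_rate_mbps(n):
    """Line rate of a SONET OC-n / STS-n signal in Mb/s."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n * STS1_MBPS


def sdh_equivalent(n):
    """Name of the SDH STM level matching OC-n (defined when n is a multiple of 3)."""
    if n < 3 or n % 3 != 0:
        raise ValueError("STM levels correspond to OC-n with n divisible by 3")
    return f"STM-{n // 3}"


if __name__ == "__main__":
    for n in (3, 12, 48, 192, 768):
        print(f"OC-{n}: {oc_rate_mbps(n):.2f} Mb/s, {sdh_equivalent(n)}")
```

For example, `oc_rate_mbps(48)` gives 2488.32 Mb/s, which the table rounds to 2.5 Gb/s.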
Finally, in the last column, we show the more recent Optical Transport Network (OTN) signals, also standardized by the ITU-T [18]. Development of the OTN signal standards was originally motivated by the need for a more robust standard to achieve very high bit rates in DWDM technologies; for example, it was needed to incorporate and standardize various bit-error recovery techniques, such as Forward Error Correction (FEC). As such, the OTN rates were originally termed “digital wrappers” to contain high-rate SONET, SDH, or Ethernet signals, plus provide the extra fault notification information needed to reliably transport the high rates. Although there are many protocol layers in OTN, we just show the Optical channel Data Unit (ODU) rates in Table 2.1. To minimize confusion, in the rest of this chapter, we will mostly give examples in terms of DS-n and SONET rates.

Referring back to the layered network model of the previous section, Table 2.2 gives some examples of the nodes, links, and connections in Fig. 2.3. We only list those layers that have relevance to the IP layer. We will briefly describe these layers in the following sections.

Table 2.2 Examples of nodes, links, and connections for network layers of Fig. 2.3

| Core layer | Typical node | Typical link | Typical connection |
|---|---|---|---|
| IP | Router | SONET OC-n, 1/10 gigabit Ethernet, ODU-n | IP is connection-less |
| Ethernet | Ethernet switch or router with Ethernet functionality | 1/10 Gigabit Ethernet or rate-limited Ethernet private line | Ethernet can refer to both connection-less and connection-oriented Ethernet services |
| Asynchronous transfer mode (ATM) | ATM switch | SONET OC-12/48 | Permanent virtual circuit (PVC), Switched virtual circuit (SVC) |
| W-DCS | Wideband digital cross-connect system (DCS) | SONET STS-1 (channelized) | DS1 |
| SONET Ring | SONET add-drop multiplexer (ADM) | SONET OC-48/192 | SONET STS-n, DS-3 |
| IOS | Intelligent optical switch (IOS) or broadband digital cross-connect system (DCS) | SONET OC-48/192 | SONET STS-n |
| DWDM | Point-to-point DWDM terminal or reconfigurable optical add-drop multiplexer (ROADM) | DWDM signal | SONET, SDH, or 1/10/100 gigabit Ethernet |
| Fiber | Fiber patch panel or cross-connect | Fiber optic strand | DWDM signal or SONET, SDH, or Ethernet signal |

2.2.4 Fiber Layer

The commercial intercity fiber layer of the USA is privately owned by multiple carriers. In addition to owning fiber, carriers lease bundles of fiber from one another using various long-term Indefeasible Right of Use (IROU) contracts to cover needed connectivity in their networks. Fiber networks differ significantly between metro and rural areas. In particular, in carrier metro networks, optical fiber cables are usually placed inside PVC pipes, which are in turn placed inside concrete conduits. Additionally, fiber for core networks is often corouted in conduit or along rights-of-way with metro fiber. Generally, in metro areas, optical cables are routed and spliced between central offices. In the central office, most carriers prefer to connect the fibers to a fiber patch panel. Equipment that uses (or will eventually use) the interoffice fibers is also cross-connected into the patch panels.
This gives the carrier flexibility to connect equipment by simply connecting fiber patch cords on the patch panels. Rural areas differ in that there are often long distances between central offices and, as such, intermediate huts are used to splice fibers and place equipment, such as optical amplifiers.

2.2.5 DWDM Layer

Although many varieties of DWDM systems exist, we show a simplified view of a (one-way) point-to-point DWDM system in Fig. 2.4. Here, Optical Transponders (OTs) are Optical-to-Electrical-to-Optical (O-E-O) converters that input optical digital signals from routers, switches, or other transmission equipment using a receive device, such as a photodiode, on the add/drop side of the OT. The input signal has a standard intra-office wavelength, denoted by λ0. The OT converts the signal to electrical form. Various other physical layer protocols may be applied at this point, such as incorporating various handshaking called Link Management Protocols (LMPs) between the transmitting equipment and the receiving OT. A transponder is in clear channel mode if it does not change the transport protocols of the signal that it receives and essentially remains invisible to the equipment connecting to it. For example, Gigabit Ethernet (GigE) protocols from some routers or switches sometimes incorporate signaling messages to the far-end switch in the interframe gaps. If clear channel transmission is employed by the OT, such messages will be preserved as they are routed over the DWDM layer. After conversion to electrical form, the signal is retransmitted using a laser on the network or line-side of the OT. However, typical of traditional point-to-point systems, the wavelength of the laser is fixed to correspond to the wavelength assigned to a specific channel of the DWDM system, λk.
The output light pulses from multiple OTs at different wavelengths are then multiplexed into a single fiber by sending them through an optical multiplexer, such as an Arrayed Waveguide Grating (AWG) or similar device. If the distance between the DWDM terminals is sufficiently long, optical amplifiers are used to boost the power of the signal. However, power balancing among the DWDM channels is a major concern of the design of the DWDM system, as are other potential optical impairments. These topics are beyond the scope of this chapter.

[Fig. 2.4 Simplified view of point-to-point DWDM system. Client signals (SONET, Ethernet) enter the Optical Transponders (OTs) at the standard intra-office wavelength λ0; each OT electrically regenerates the signal and outputs a specific wavelength λk for long-distance transport over channel k. The optical multiplexer, which can be implemented with an optical grating, combines input optical signals with different wavelengths λ1, λ2, ..., λn (from one optical fiber each) onto a single optical fiber, which passes through an optical amplifier. After the optical demultiplexer, each OT inputs λk, electrically regenerates the signal, and outputs λ0.]

On the right side of Fig. 2.4, typically, the same (or similar) optical multiplexer is used in reverse, in which case, it becomes an optical demultiplexer. The OTs on the right side (the receive direction of the DWDM system) basically work in reverse to the transmit direction described above, by receiving the specific interoffice wavelength, λk, converting to electrical, and then using a laser to generate the intra-office wavelength, λ0.

Carrier-based DWDM systems are usually deployed in bidirectional configurations. To see this, the reader can visually reproduce the entire system in Fig. 2.4 and then flip it (mirror it) right to left. The multiplexed DWDM signal in the opposite direction is transmitted over a separate fiber.
Therefore, even though the electronics and lasers of the one-way DWDM system in the reverse direction operate separately from the shown direction, they are coupled operationally. For example, the two fiber ports (receive and transmit) of the OT are usually deployed on the same line card and arranged next to one another.

Optical amplification is used to extend the distance between terminals of a DWDM system. However, multiple systems are required to traverse the continental USA. Connections can be established between different point-to-point DWDM systems in an intermediate CO via an intermediate-regenerator OT (not pictured in Fig. 2.4). An intermediate-regenerator OT has the same effect on a signal as back-to-back OTs. Since the signal does not have to be cross-connected elsewhere in the intermediate central office, cost savings can be achieved by omitting the intermediate lasers and receivers of back-to-back OTs. However, we note that most core DWDM networks have many vintages of point-to-point systems from different equipment suppliers. Typically, an intermediate-regenerator OT can only be used to connect between DWDM systems of the same equipment supplier.

A difficulty with deploying point-to-point DWDM systems is that in central offices that interface multiple fiber spans (i.e., the node in the fiber layer has degree greater than 2), all connections demultiplex in that office and pass through OTs. OTs are typically expensive and it is advantageous to avoid their deployment where possible. A better solution is the Reconfigurable Optical Add-Drop Multiplexer (ROADM). We show a simplified diagram of a ROADM in Fig. 2.5. The ROADM allows for multiple interoffice fibers to connect to the DWDM system. Appropriately, it is often called a multidegree ROADM or n-degree ROADM. As Fig.
2.5 illustrates, the ROADM is able to optically (i.e., without use of OTs) cross-connect channel k (transmitting at wavelength λk) arriving on one fiber to channel k (wavelength λk) outgoing on another fiber. Note that the same wavelength must be used on the two fibers. This is called the wavelength continuity constraint. The ROADM can also be configured to terminate (or “drop”) a connection at that location, in which case it is cross-connected to an OT to connect to routers, switches, or transmission equipment. A “dropped” connection is illustrated by λ2 on the second fiber from the top on the left in Fig. 2.5 and an “added” connection is illustrated by λn on the bottom fiber on the left.

[Fig. 2.5 Simplified view of Reconfigurable Optical Add-Drop Multiplexer (ROADM). Optical Transponders (OTs) are also provided in bidirectional mode for regeneration at intermediate nodes.]

As with the point-to-point DWDM system, optical properties of the system impose distance (also called reach) constraints. Many transmission technologies, including optical amplification, are used to extend the distance between the optical add/drop points of a DWDM system. Today, this separation is designed to be about 1,500 km for a long-distance DWDM system, as a trade-off between cost and the all-optical distance for a US-wide network. Longer connections have to regenerate their signals, usually with an intermediate-regenerator OT. As with point-to-point DWDM systems, connections crossing ROADMs from different equipment suppliers usually must add/drop and connect through OTs.

We illustrate a representative ROADM layer for the continental USA in Fig. 2.6. The links represent fiber spans between ROADMs. As described above, routing a connection over the network of Fig. 2.6 may require points of regeneration.
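The two routing constraints described above, wavelength continuity and optical reach, can be sketched as follows. This is a simplified illustration under stated assumptions: the span names, channel numbers, and span lengths used with it are hypothetical, the 1,500 km default is the approximate design figure quoted in the text, and placing regenerators as late as possible is a simple greedy choice made here for brevity rather than a method attributed to any carrier.

```python
def continuous_wavelengths(path_spans, free_channels):
    """Return the channels that are free on every fiber span of the path.

    path_spans    -- ordered list of fiber-span identifiers along the route
    free_channels -- dict mapping each span to the set of unused channel indices
    """
    if not path_spans:
        return set()
    common = set(free_channels[path_spans[0]])
    for span in path_spans[1:]:
        common &= free_channels[span]
    return common


def regeneration_points(span_lengths_km, reach_km=1500.0):
    """Return the node indices where the signal is regenerated.

    Node i is the node at which span i begins, so node 0 is the source.
    The signal is carried as far as the reach allows before regenerating.
    """
    regen_nodes = []
    travelled = 0.0
    for i, length in enumerate(span_lengths_km):
        if length > reach_km:
            raise ValueError(f"span {i} alone exceeds the optical reach")
        if travelled + length > reach_km:
            regen_nodes.append(i)  # regenerate before entering span i
            travelled = 0.0
        travelled += length
    return regen_nodes


if __name__ == "__main__":
    free = {"A-B": {1, 2, 5}, "B-C": {2, 5, 7}, "C-D": {5, 9}}  # hypothetical
    print(continuous_wavelengths(["A-B", "B-C", "C-D"], free))
    print(regeneration_points([600, 700, 500, 900, 400]))
```

An empty result from `continuous_wavelengths` means that no single channel is available end to end, so the connection would have to be dropped and re-added through OTs at some intermediate node.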
We also note, though, that today’s core transport carriers usually have many vintages of DWDM technology and, thus, there may be several ROADM networks from different equipment suppliers, plus several point-to-point DWDM networks. All this complexity must be managed when routing higher-layer links, such as those of the IP backbone, over the DWDM layer.

[Fig. 2.6 Example of ROADM Layer topology, with ROADM nodes at cities including Seattle, Portland, Salt Lake City, and Chicago. Note: This figure is a simplified illustration. It does not represent the specific design of any commercial carrier.]

We finish this introduction of the DWDM layer with a few observations. While most large carriers have DWDM technology covering their core networks, this is not generally true in the metro segment. The metro segment typically consists of a mixture of DWDM spans and fiber spans (i.e., spans with no DWDM). In fact, in metro areas usually only a fraction of central office fiber spans have DWDM technology routed over them. This affects how customers interface to the IP backbone network for higher-rate interfaces. Finally, we note that while most of the connections for the core DWDM layer arise from links of the IP layer, many of the connections come from what many colloquially call “wavelength services” (denoted by the rounded rectangle in Fig. 2.3). These come from high-rate private-line connections emanating from outside the core DWDM layer. Examples are links between switches of large enterprise customers that are connected by leased-line services.

2.2.6 TDM Cross-Connect Layers

In this section, we will briefly describe the TDM cross-connect layers. TDM cross-connect equipment can be basically categorized into two common types: a SONET/SDH Add-Drop Multiplexer (ADM) or a Digital Cross-Connect System (DCS).
Consistent with our earlier remark about the use of terminology, the latter often goes by a variety of colloquial or outmoded model names of equipment suppliers, such as DCS-3/1, DCS-3/3, DACS, and DSX. A TDM cross-connect device interfaces multiple high-rate digital signals, each of which uses time division multiplexing to break the signal into lower-rate channels. These channels carry lower-rate TDM connections and the TDM cross-connect device cross-connects the lower-rate signals among the channels of the different high-rate signals. Typically, an ADM only interfaces two high-rate signals, while a DCS interfaces many. However, over time these distinctions have blurred. Telcordia classified DCSs into three layers: a narrowband DCS (N-DCS) cross-connects at the DS-0 rate, a wideband DCS (W-DCS) cross-connects at the DS-1 rate, and a broadband DCS (B-DCS) cross-connects at the DS-3 rate or higher. ADMs are usually deployed in SONET/SDH self-healing rings. The IOS and SONET Ring layers are shown in Fig. 2.3, encircled by the (broader) ellipse that represents the TDM cross-connect devices. More details on these technologies can be found in [11]. Self-healing rings and DCSs will be relevant when we illustrate how services access the wide-area ISP network layer later in this chapter.

Despite the word “optical” in its name, an Intelligent Optical Switch (IOS) is a type of B-DCS. Examples can be found in [6, 34]. The major differentiator of the IOS over older B-DCS models is its advanced control plane. An IOS network can route connection requests under distributed control, usually instigated by the source node. This requires mechanisms for distributing topology updates and internodal messaging to set up connections. Furthermore, an IOS usually can restore failed connections by automatically rerouting them around failed links. More detail is given when we discuss restoration methods.
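The idea of rerouting a connection around a failed link can be illustrated with a breadth-first search over the surviving links. This is only a sketch under stated assumptions: the topology and node names are hypothetical, the search is centralized and minimizes hop count for brevity, and the distributed topology updates and signaling that an actual IOS control plane relies on are not modeled.

```python
from collections import deque


def shortest_path(links, src, dst, failed=frozenset()):
    """Fewest-hop path from src to dst over undirected links, skipping failed ones.

    links  -- iterable of (node_a, node_b) pairs
    failed -- set of frozenset({node_a, node_b}) entries marking failed links
    Returns the list of nodes on the path, or None if dst is unreachable.
    """
    adjacency = {}
    for a, b in links:
        if frozenset((a, b)) in failed:
            continue
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


if __name__ == "__main__":
    # Hypothetical five-node network with two routes between A and C.
    LINKS = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "E"), ("E", "C")]
    print(shortest_path(LINKS, "A", "C"))
    print(shortest_path(LINKS, "A", "C", failed={frozenset(("B", "C"))}))
```

When no alternate route survives, the function returns `None`, which corresponds to a connection that cannot be restored at this layer.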
Many of the connections for the core TDM cross-connect layers (ring layers, DCS layers, IOS layer) come from higher layers of the core network. For example, many connections of the IOS layer are links between W-DCSs, ATM networks, or lower-rate portions of IP layer networks. However, much of their demand for connections comes from subwavelength private-line services, shown by the rounded rectangle in Fig. 2.3. A portion of this private-line demand is in the form of Ethernet Private Line (EPL) services. These services usually represent links between Ethernet switches or routers of large enterprise customers. For example, the Gigabit Ethernet signal from an enterprise customer’s switch is transported over the metro network and then interfaces an Ethernet card either residing on the IOS itself or on an ADM that interfaces directly onto the IOS. The Ethernet card encapsulates the Ethernet frames inside n concatenated STS-1 signals that are transported over the IOS layer. The customer can choose the rate of transport, and hence the value of n, to purchase. The ADM Ethernet card polices the incoming Ethernet frames to the transport rate of n STS-1.

2.2.7 IP Layer

The nodes of the IP layer shown in Fig. 2.3 represent routers that transport packets among metro area segments. IP generally defines pairwise adjacencies between ports of the routers. In the IP backbone, these adjacencies are typically configured over SONET, SDH, Ethernet, or OTN interfaces on the routers. As described above, these links are then transported as connections over the interoffice lower-layer networks shown in Fig. 2.3. Note that different links can be carried in different lower-layer networks. For example, lower-rate links may be carried over the TDM cross-connect layers (IOS or SONET Ring), while higher-rate links may be carried directly over the DWDM layer, thus “skipping” the TDM cross-connect layers. We will describe the IP layer in more detail in subsequent sections.
2.2.8 Ethernet Layer

The Ethernet layer in Fig. 2.3 refers to several applications of Ethernet technology. For example, Ethernet supports a number of physical layer standards that can be used for Layer 1 transport. Ethernet also refers to connection-oriented Layer 2 pseudowire services [16] and connection-less transparent LAN services. For example, intra-office links between routers often use an Ethernet physical layer riding on optical fiber.

An important application of Ethernet today is providing wide-area Layer 2 Virtual Private Network (VPN) services for enterprise customers. Although many variations exist, these services generally support enterprise customers that have Ethernet LANs at multiple locations and need to interconnect their LANs within a metro area or across the wide area. Most large carriers provide these services as an overlay on their IP layer, which is why we show the layered design in Fig. 2.3. Prior to the ability to provide such services over the IP layer, Ethernet private lines were supported by TDM cross-connect layers (i.e., Ethernet frames encapsulated over Layer 1 TDM private lines as described in Section 2.2.6). However, analogous to why wide-area Frame Relay displaced wide-area DS-0 private lines in the 1990s, wide-area packet networks are often more efficient than private lines to connect LANs of enterprise customers.

The principal approach that intermetro carriers use to provide wide-area Ethernet private network services is Virtual Private LAN Service (VPLS) [24, 25]. In this approach, carriers provide such Ethernet services with routers augmented with appropriate Ethernet capabilities. The reason for this approach is to provide the robust carrier-grade network capabilities provided by routers. With wide-area VPLS, the enterprise customer is connected via the metro network to the edge routers on the edge of the core IP layer.
We describe how the metro network connects to the core IP layer network in the next section. The VPLS architecture is described in more detail in Section 2.4.2 when we describe MPLS.

We conclude this section with the comment that standards organizations and industry forums (e.g., IEEE, IETF, and Metro Ethernet Forum) have explored the use of Ethernet switches with upgraded carrier-grade network control protocols rather than using routers as nodes in the IP layer. For example, see Provider Backbone Transport (PBT) [27] and Provider Backbone Bridge – Traffic Engineering (PBB-TE) [15]. However, most large ISPs are deploying MPLS-based solutions. Therefore, we concentrate on the layering architecture shown in Fig. 2.3 in the remainder of this chapter.

2.2.9 Miscellaneous/Legacy Layers

For completeness, we depict other “legacy” network layers with dashed ovals in Fig. 2.3. These technologies have been around for decades in most carrier-based core networks. They include network layers whose nodes represent ATM switches, Frame-Relay switches, DCS-3/3s (a B-DCS that cross-connects DS3s), voice switches (DS-0 circuit switches), and pre-SONET ADMs. Most of these layers are not material to the spirit of this chapter and we do not discuss them here.

2.3 Structure of Today’s Core IP Layer

2.3.1 Hierarchical Structure and Topology

In this chapter, we further break the IP layer into Access Routers (ARs) and Backbone Routers (BRs). Customer equipment homes to access routers, which in turn home onto backbone routers. An AR is either colocated with its backbone routers or not; the latter is called a Remote Access Router (RAR). Of course, there are alternate terminologies. For example, the IETF defines similar concepts to customer equipment, access routers, and backbone routers with its definitions, respectively, of Customer-Edge (CE) equipment, Provider-Edge (PE) routers, and Provider (P) routers.
A simplified picture of a typical central office containing both ARs and BRs is shown in Fig. 2.7. Access routers are dual-homed to two backbone routers to enable higher levels of service availability. The links between routers in the same office are typically Ethernet links over intra-office fiber. While we show only two ARs in Fig. 2.7, note that typically there are many ARs in large offices. Also, due to scaling and sizing limitations, there may be more than two backbone routers or switches per central office used to further aggregate AR traffic before it enters the BRs. Moreover, we show a remote access router that homes to one of the BRs.

Fig. 2.7 Legacy central office interconnection diagram (Layer 3). BR = Backbone Router; (R)AR = (Remote) Access Router. The figure shows DS1 access circuits multiplexed over a channelized OC-12 interface, m-GigE intra-office links, and SONET OC-n (e.g., n = 768) links into the core ROADM layer network.

Fig. 2.8 Example of IP layer switching hierarchy (core routers, intra-building access/edge routers, and remote access/edge routers). Note: This figure is a simplified illustration. It does not represent the specific design of any commercial carrier.

Figure 2.8 illustrates this homing arrangement in a broader network example, where small circles represent ARs, diamonds represent RARs, and large squares represent BRs. Note that remote ARs are homed to BRs in different offices. Homing remote ARs to BRs in different central offices raises network availability.
However, a stronger motivation for doing this is that RAR–BR links are usually routed over the DWDM layer, which generally does not offer automatic restoration. The dual-homing therefore serves two purposes: (1) protecting against BR failure or maintenance activity and (2) protecting against failure or maintenance of a RAR–BR link.

While the homing scheme described here is typical of large ISPs, other variations exist. For example, there are dual-homing architectures where (nonremote) ARs are homed to a BR colocated in the same central office and to a second BR in a different central office. While this latter architecture provides a slightly higher level of network availability against broader central office failure, it can be more costly owing to the need to transport the second AR–BR link. However, the latter architecture allows more load balancing across BRs because of the extra flexibility in homing ARs. Improved load balancing can offer other advantages, including lower BR costs. Also, for ISPs with many scattered locations, but less total traffic, this latter architecture may be more cost-effective than colocating two BRs in each BR-office.

The right side of Fig. 2.7 also shows the metro/access network-layer clouds that connect customer equipment to the ARs. In particular, we illustrate DS1 customer interfaces. The left side of Fig. 2.7 also shows the lower-layer DWDM clouds that carry the interoffice links between BRs. We will expand these clouds in the next sections.

The reasons for segregating the IP topology into access and backbone routers are manifold:

• Access routers aggregate lower-rate interfaces from various customers or other carriers. This function requires significant equipment footprint and processor resources for customer-related protocols. As a result, major central offices contain many access routers to accommodate the low-rate customer interfaces. Without the aggregation function of the backbone router, each such office would be a myriad of tie links between access routers and interoffice links.

• Access routers are often segregated by different services or functions. For example, general residential ISP service can be segregated from high-priority enterprise private VPN service. As another example, some access routers are segregated to be peering points with other carriers.

• Backbone routers are primarily designed to be IP-transport switches equipped only with the highest speed interfaces. This segregation allows the backbone routers to be optimally configured for interoffice IP forwarding and transport.

2.3.2 Interoffice Topology

Figure 2.9 expands the lower core ROADM Layer cloud of Fig. 2.7. It shows ports of interoffice links between BRs connecting to ports on ROADMs. These links are transported as connections in the ROADM network. For example, today these links go up to 40 gigabits per second (Gb/s) or SONET OC-768. These connections are routed optically through intermediate ROADMs and regenerated where needed, as described in Section 2.2.5. Also, we note that the links between remote ARs and BRs are routed over the same ROADM network, although these RAR–BR links may run at a lower rate, such as 10 Gb/s.

Figure 2.10 shows a network-wide example of the IP layer interoffice topology. There are some network-layering principles illustrated in Fig. 2.10 that we will describe. First, if we compare the IP layer topology of Fig. 2.8 with that of the DWDM layer (ROADM layer) of Fig. 2.10, we note that there is more connectivity in the IP layer graph than the DWDM layer. The reason for this is the existence of what many IP layer planners call express links. If we examine the link labeled “direct link” between Seattle and Portland, we find that when we route this link over the DWDM layer topology, there are no intermediate ROADMs. In fact, there are two types of direct links.
The first type connects through no intermediate ROADMs, as illustrated by the Seattle–Portland link. The second type connects through intermediate ROADMs, but encounters no BRs in those intermediate central offices, as illustrated by the Seattle–Chicago link. In contrast, if we examine the express link between Portland and Salt Lake City, we find that any path in the DWDM layer connecting the routers in that city pair bypasses routers in at least one of its intermediate central offices.

Fig. 2.9 Core ROADM Layer diagram. ROADM = Reconfigurable Optical Add-Drop Multiplexer; OT = ROADM Optical Transponder (for transport of links of the IOS layer or high-rate private-line service); (R)AR = (Remote) Access Router; BR = Backbone Router; CO = Central Office.

Fig. 2.10 Example of IP layer interbackbone topology (direct, express, and aggregate links between core routers; labeled cities are Seattle, Portland, Chicago, and Salt Lake City). Note: This figure is a simplified illustration. It does not represent the specific design of any commercial carrier.

Express links are primarily placed to minimize network costs. For example, it is more efficient to place express links between well-chosen router pairs with high network traffic (enough to raise the link utilization above a threshold level); otherwise the traffic will traverse multiple routers. Router interfaces can be the most expensive single component in a multilayered ISP network; therefore, costs can usually be minimized by optimal placement of express links.

It is also important to consider the impact of network layering on network reliability. Referring to the generic layering example of Fig. 2.2, we note that the placement of express links can cause a single DWDM link to be shared by different IP layer links.
This gives rise to complex network disruption scenarios, which must be modeled using sophisticated network survivability modeling tools. This is covered in more detail in Section 2.5.3.

Returning to Fig. 2.10, we also note the use of aggregate links. Aggregate links also go by other names, such as bundled links and composite links. An aggregate link bundles multiple physical links between a pair of routers into a single virtual link from the point of view of the routers. For example, an aggregate link could be composed of five OC-192 (or 10 GigE) links. Such an aggregate link would appear as one link with 50 Gb/s of capacity between the two routers. Generally, aggregate links are implemented by a load-balancing algorithm that transparently switches packets among the individual links. Usually, to reduce jitter or packet reordering, packets of a given IP flow are routed over the same component link.

Aggregate links are advantageous because IP networks, as they grow large, tend to contain many lower-speed links between a pair of routers; aggregating all these links into one simplifies routing and topology protocols. If one of the component links of an aggregate link fails, the aggregate link remains up; consequently, the number of topology updates due to failure is reduced and network rerouting (called reconvergence) is less frequent. Network operators seek to achieve network stability, and therefore try to avoid frequent network reconvergence events; aggregate links result in fewer of them.

On the downside, if only one link of a (multiple link) aggregate link fails, the aggregate link remains “up”, but with reduced capacity. Since many network routing protocols are capacity-insensitive, packet congestion could occur over the aggregate link. To avoid this situation, router software is designed with capacity thresholds for aggregate links that the network operator can set.
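The aggregate-link behavior discussed here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any vendor's router software: the `AggregateLink` class, its `min_capacity_gbps` threshold parameter, and the CRC-based flow hash are hypothetical choices made for the example. The sketch sums the capacity of the surviving component links, pins each flow to one component link by hashing its identifying fields, and reports the bundle as out of service once the surviving capacity falls below the operator-set threshold.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class AggregateLink:
    """Bundle of component links that routing sees as one virtual link."""
    component_gbps: list                     # capacity of each component link
    min_capacity_gbps: float                 # operator-set threshold (hypothetical knob)
    up: list = field(default_factory=list)   # per-component up/down state

    def __post_init__(self):
        if not self.up:
            self.up = [True] * len(self.component_gbps)

    def capacity(self):
        """Total capacity of the component links that are still up."""
        return sum(c for c, ok in zip(self.component_gbps, self.up) if ok)

    def in_service(self):
        """The whole bundle is taken down if capacity drops below the threshold."""
        return self.capacity() >= self.min_capacity_gbps

    def pick_component(self, flow):
        """Hash the flow identifier so every packet of a flow uses one component."""
        live = [i for i, ok in enumerate(self.up) if ok]
        if not live or not self.in_service():
            return None
        key = "|".join(str(part) for part in flow).encode()
        return live[zlib.crc32(key) % len(live)]
```

With five 10 Gb/s components and an assumed threshold of 30 Gb/s, the bundle advertises 50 Gb/s, stays in service at 40 Gb/s after one failure, and is taken out of service once only 20 Gb/s survives.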
If the aggregate capacity falls below the threshold, the entire aggregate link is taken out of service. While the network “loses” the capacity of the surviving links in the bundle when the aggregate link is taken out of service, the alternative is potentially significant packet loss due to congestion on the remaining links.

2.3.3 Interface with Metro Network Segment

Figure 2.11 is a blowup of the clouds on the right side of Fig. 2.7. It provides a simplified example of how three business ISP customers gain access to the IP backbone. These could be enterprise customers with multiple branches who subscribe to a VPN service. Each access method consists of a DS1 link encapsulating IP packets that is transported across the metro segment. In carrier vernacular, using packet/TDM links to access the IP backbone is often called TDM backhaul. We do not show the inner details of the metro network here. Detailed examples can be found in [11].

Even suppressing the details of the complex metro network, the TDM backhaul is clearly a complicated architecture. To aid understanding, we suggest that the reader refer back to the TDM hierarchy shown in Table 2.1. The customer’s DS-1 (which carries encapsulated IP packets) interfaces to a low-speed multiplexer located in the customer building, such as a small SONET ADM. This ADM typically serves as one node of a SONET ring (usually a 2-node ring). Each link of the ring is routed over diverse fiber, usually at OC-3 or OC-12 rate. Eventually, the DS-1 is routed to a SONET OC-48 or OC-192 ring that has one of its ADMs in the POP. The DS-1 is transported inside an STS-1 signal that is divided into 28 time slots called channels (a channelized STS-1), as specified by the SONET standard. The ADM routes all the SONET STS-1s carrying DS-1 traffic bound for the core carrier to a metro W-DCS.
Note that there are often multiple core carriers in a POP, and hence, the metro W-DCS cross-connects all the DS-1s destined for a given core carrier into channelized STS-1s and hands them off to the core W-DCS(s) of that core carrier. However, note that this handoff does not occur directly between the two W-DCSs, but rather passes through a higher-rate B-DCS, in this case the Intelligent Optical Switch (IOS) introduced in Section 2.2.6. The IOS cross-connects most of the STS-1s (multiplexed into OC-n interfaces) in a central office. Also, notice that the IOS is fronted with Multi-Service Platforms (MSPs). An MSP is basically an advanced form of SONET ADM that gathers many types of lower-speed TDM interfaces and multiplexes them up to OC-48 or OC-192 for the IOS. It usually also has Ethernet interfaces that encapsulate IP packets into TDM signals (e.g., for Ethernet private line discussed earlier). The purpose of such a configuration is to minimize the cost and scale of the IOS by avoiding using its interface bay capacity for low-speed interfaces.

Fig. 2.11 Legacy central office interconnection diagram (intra-office TDM layers). (R)AR = (Remote) Access Router; MSP = Multi-service Platform (multiplexes low-rate TDM circuits); W-DCS = Wideband Digital Cross-Connect System; ADM = Add-Drop Multiplexer. The figure shows an example of three DS1 access circuits carried from customer locations over metro ADMs and W-DCSs to a channelized OC-12 interface on the AR.

Finally, the core W-DCS cross-connects the DS1s destined for the access routers in the central office onto channelized STS-1s. Again, these STS-1s are routed to the AR via the IOS and its MSPs. The DS-1s finally reach a channelized SONET card on the AR (typically OC-12).
This card on the AR de-multiplexes the DS-1s from the STS-1, de-encapsulates the packets, and creates a virtual interface for each of our three example customer access links in Fig. 2.11. The channelized SONET card is colloquially called a CHOC card (CHannelized OC-n).

Note that the core and metro carriers depicted in Fig. 2.11 may be parts of the same corporation. However, this complex architecture arose from the decomposition of long-distance and local carriers that was dictated by US courts and the Federal Communications Commission (FCC) at the breakup of the Bell System in 1984. It persists to this day.

If we reexamine the above TDM metro access descriptions, we find that there are many restoration mechanisms, such as dual homing of the ARs to the BRs and SONET rings in the metro network. However, there is one salient point of potential failure. If an AR customer-facing line card or entire AR fails or is taken out of service for maintenance in Fig. 2.11, then the customer’s service is also down. Carriers offer service options to protect against this. The most common option provides two TDM backhaul connections to the customer’s equipment, often called Customer Premise Equipment (CPE), each of which terminates on a different access router. This architecture significantly raises the availability of the service, but does incur additional cost. An example of such a service is given in [1].

For accuracy, we make a final technical comment on the example of Fig. 2.11. Although we show direct fiber connections between the various TDM and packet equipment, in fact, most of these connections are usually made via a fiber patch panel. This enables a craftsperson to connect the equipment via a simple (and well-organized) patch cord or cross-connect. This minimizes expense, simplifies complex wiring, and expedites provisioning work orders in the CO.

Figure 2.12 depicts how customers access the AR via emerging metro packet network layers instead of TDM.
Here, instead of the traditional TDM network, the customer accesses the packet core via Ethernet. The most salient difference is the substantially simplified architecture. Although many different types of services are possible, we describe two fundamental types of Ethernet service: Ethernet virtual circuits and Ethernet VPLS. Most enterprise customers will use both types of services.

Fig. 2.12 Central office interconnection diagram (metro Ethernet interface). AR = Access Router; RE = Router | Ethernet Switch; NTE = Network Terminating Equipment. The figure shows an Ethernet virtual private line (a combination of VLAN and pseudowire) from the customer location, via FE | GigE access and n-GigE metro links, to a core RE acting in a dual role as access router and Ethernet switch.

There are three basic types of connectivity for Ethernet virtual circuits: (1) intrametro, (2) ISP access via establishment of Ethernet virtual circuits between the customer location and IP backbone, and (3) intermetro. Since our main focus is the core IP backbone, we discuss the latter two varieties. For ISP access, in the example of Fig. 2.12, the customer’s CPE interfaces to the metro network via Fast Ethernet (FE) or GigE into a small Ethernet switch placed by the metro carrier called Network Terminating Equipment (NTE). The NTE is the packet analog of the small ADM in the TDM access model in Fig. 2.11. For most metro Ethernet services, the customer can usually choose the policed access rate to purchase in increments of 1 Mb/s or similar. For example, a customer may choose 100 Mb/s for the Committed Information Rate (CIR) and various options for the Excess Information Rate (EIR). The EIR options control how the customer’s bandwidth bursts are handled/shared when they exceed the CIR.
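To make the relationship between CIR and EIR concrete, the short sketch below splits an offered traffic rate into a committed part, an excess part, and a remainder that conforms to neither. It is a steady-state simplification under our own assumptions: real policers typically meter bursts with token buckets, and what happens to excess and non-conforming traffic depends on the service option purchased. The function and parameter names are hypothetical.

```python
def classify_rate(offered_mbps, cir_mbps, eir_mbps):
    """Split an offered rate into (committed, excess, non_conforming) in Mb/s.

    committed      : traffic within the Committed Information Rate
    excess         : traffic above the CIR but within the Excess Information Rate
    non_conforming : traffic above CIR + EIR (assumed here to be discarded)
    """
    committed = min(offered_mbps, cir_mbps)
    excess = min(max(offered_mbps - cir_mbps, 0), eir_mbps)
    non_conforming = offered_mbps - committed - excess
    return committed, excess, non_conforming


# Example: a 100 Mb/s CIR with an assumed 20 Mb/s EIR and a 130 Mb/s burst.
committed, excess, non_conforming = classify_rate(130, 100, 20)
```

In this example 100 Mb/s is within the commitment, 20 Mb/s is carried as excess, and the remaining 10 Mb/s conforms to neither rate.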
The metro packet network uses Virtual Local Area Network (VLAN) identifiers [14] and pseudowires or MPLS LSPs to route the customer’s Ethernet virtual circuit to the metro Ethernet switch/router in the POP, as shown in Fig. 2.12. VLANs can also be used to segregate a particular customer’s services, such as the two fundamental services (VPLS vs Internet access) described here. The metro Ethernet switch/router has high-speed links (such as 10 Gb/s) to the core Ethernet switch/router. The core Ethernet switch/router is fundamentally an access router with the features and configurations needed to provide Ethernet and VPLS, and thus homes to backbone routers as any other access router does. Thus, the customer’s virtual circuit is mapped to a virtual port on the core AR/Ethernet-Switch and from that point onward is treated similarly to the TDM DS-1 virtual port in Fig. 2.11. If an intermetro Ethernet virtual circuit is needed, then an appropriate pseudowire or tunnel can be created between the ARs in different metros. Such a service can eventually substitute for traditional private-line service as metro packet networks are deployed.

The second basic type of Ethernet service is generally provided through the VPLS model described in Section 2.2.8. For example, the customer might have two LANs in metro-1, one LAN in metro-2, and another LAN in metro-3. Wide-area VPLS interconnects these LANs into a large transparent LAN. This is achieved using pseudowires (tunnels) between the ARs in metros-1, 2, and 3. Since the core access router has a dual role as access router and Ethernet VPLS switch, it can route customer Ethernet frames over the pseudowires among the remote access routers.

Besides enterprise Ethernet services, connection of cellular base stations to the IP backbone network is another important application of Ethernet metro access.
Until recently, this was achieved by installing DS-1s from cell sites to circuit switches in Mobile Telephone Switching Offices (MTSOs) to provide voice service. However, with the advent and rapid growth of cellular services based on 3G or 4G technology, there is a growing need for high-speed packet-based transport from cell sites to the IP backbone. The metro Ethernet structure for this is similar to that of the enterprise customer access shown in Fig. 2.12. The major differences occur in the equipment at the cell site, the equipment at the MTSO, and then how this equipment connects to the access router/Ethernet switch of the IP backbone.

2.4 Routing and Control in ISP Networks

2.4.1 IP Network Routing

The IP/MPLS routing protocols are an essential part of the architecture of the IP backbone, and are key to achieving network reliability. This section introduces these control protocols. An Interior Gateway Protocol (IGP) disseminates routing and topology information within an Autonomous System (AS). A large ISP will typically segment its IP network into multiple autonomous systems. In addition, an ISP’s network interconnects with its customers and with other ISPs. The Border Gateway Protocol (BGP) is used to exchange global reachability information with ASs operated by the same ISP, by different ISPs, and by customers. In addition, IP multicast is becoming more widely deployed in ISP networks, using one of several variants of the Protocol-Independent Multicast (PIM) routing protocol.

2.4.1.1 Routing with Interior Gateway Protocols

As described earlier, Interior Gateway Protocols are used to disseminate routing and topology information within an AS. Since IGPs disseminate information about topology changes, they play a critical role in network restoration after a link or node failure. Because of the importance of restoration to the theme of this chapter, we discuss this further in Section 2.5.2.
The two types of IGPs are distance vector and link-state protocols. In link-state routing [32], each router in the AS maintains a view of the entire AS topology and computes routes over it using a Shortest Path First (SPF) algorithm. Since link-state routing protocols such as Open Shortest Path First (OSPF) [26] and Intermediate System–Intermediate System (IS–IS) [30] are the most commonly used IGPs among large ISPs, we will not discuss distance vector protocols further. For the purposes of this chapter, which focuses on network restoration, the functionality of OSPF and IS–IS is similar. We will use OSPF to illustrate how IGPs handle failure detection and recovery.

The view of network topology maintained by OSPF is conceptually a directed graph. Each router represents a vertex in the topology graph and each link between neighboring routers represents a unidirectional edge. Each link also has an associated weight (also called cost) that is administratively assigned in the configuration file of the router. Using the weighted topology graph, each router computes a shortest path tree (SPT) with itself as the root, and applies the results to build its forwarding table. This assures that packets are forwarded along the shortest paths in terms of link weights to their destinations [26]. We will refer to the computation of the shortest path tree as an SPF computation, and the resultant tree as an SPF tree.

As illustrated in Fig. 2.13, the OSPF topology may be divided into areas, typically resulting in a two-level hierarchy. Area 0, known as the “backbone area”, resides at the top level of the hierarchy and provides connectivity to the nonbackbone areas (numbered 1, 2, etc.). OSPF typically assigns a link to exactly one area. Links may, however, be in multiple areas; multi-area links are addressed in more detail in Chapter 11 (Measurements of Control Plane Reliability and Performance by Aman Shaikh and Lee Breslau). Routers that have links to multiple areas are called border routers.
For example, routers E, F and I are border routers in Fig. 2.13. Every router maintains its own copy of the topology graph for each area to which it is connected. The router performs an SPF computation on the topology graph for each area and thereby knows how to reach nodes in all the areas to which it connects. To improve scalability, OSPF was designed so that routers do not need to learn the entire topology of remote areas. Instead, routers only need to learn the total weight of the path from one or more area border routers to each node in the remote area. Thus, after computing the SPF tree for the area it is in, the router knows which border router to use as an intermediate node for reaching each remote node.

Every router running OSPF is responsible for describing its local connectivity in a Link-State Advertisement (LSA). These LSAs are flooded reliably to other routers in the network, which allows them to build their local view of the topology. The flooding is made reliable by each router acknowledging the receipt of every LSA it receives from its neighbors. The flooding is hop-by-hop and hence does not depend on routing. The set of LSAs in a router’s memory is called a Link-State Database (LSDB) and conceptually forms the topology graph for the router.

Fig. 2.13 OSPF topology: areas and hierarchy (Area 0, Area 1, and Area 2, with internal IGP routers, border routers between OSPF areas, and AS border routers; link weights are shown on the edges)

OSPF uses several types of LSAs for describing different parts of topology. Every router describes links to all its neighbor routers in a given area in a Router LSA. Router LSAs are flooded only within an area and thus are said to have an area-level flooding scope. Thus, a border router originates a separate Router LSA for every area to which it is connected.
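As an illustration of how a link-state database can drive route computation, the following minimal sketch treats a set of Router LSAs as a directed, weighted graph and runs an SPF (Dijkstra) computation from one router. The `lsdb` dictionary layout, the `spf` and `forwarding_table` functions, and the four-router topology are hypothetical names and data chosen for illustration; they are not taken from Fig. 2.13 or from any OSPF implementation, and areas, Summary LSAs, and equal-cost paths are ignored.

```python
import heapq


def spf(lsdb, root):
    """Shortest path first from `root` over a single-area topology graph.

    lsdb maps each router to {neighbor: link_weight}, i.e. one directed
    edge per entry, as a Router LSA would describe it.
    Returns {destination: (total_cost, first_hop)}.
    """
    best = {root: (0, None)}
    heap = [(0, root)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best[node][0]:
            continue  # stale heap entry; a cheaper path was already found
        for nbr, weight in lsdb.get(node, {}).items():
            new_cost = cost + weight
            # The first hop is the neighbor itself when leaving the root,
            # otherwise it is inherited from the node we came through.
            first_hop = nbr if node == root else best[node][1]
            if nbr not in best or new_cost < best[nbr][0]:
                best[nbr] = (new_cost, first_hop)
                heapq.heappush(heap, (new_cost, nbr))
    return best


def forwarding_table(lsdb, root):
    """Keep only the next hop for each destination other than the root."""
    return {dest: hop for dest, (_, hop) in spf(lsdb, root).items() if hop}


# Toy topology with administratively assigned weights (illustrative only).
lsdb = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1, "D": 4},
    "C": {"A": 5, "B": 1, "D": 1},
    "D": {"B": 4, "C": 1},
}
table_at_a = forwarding_table(lsdb, "A")
```

In this toy topology router A reaches every destination through B, because the path A-B-C-D has total weight 3 while the direct A-C link alone has weight 5.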
Border routers summarize information about one area and distribute this information to adjacent areas by originating Summary LSAs. It is through Summary LSAs that other routers learn about nodes in the remote areas. Summary LSAs have an area-level flooding scope like Router LSAs.

OSPF also allows routing information to be imported from other routing protocols, such as BGP. The router that imports routing information from other protocols into OSPF is called an AS Border Router (ASBR). Routers A and B are ASBRs in Fig. 2.13. An ASBR originates External LSAs to describe the external routing information. The External LSAs are flooded in the entire AS irrespective of area boundaries, and hence have an AS-level flooding scope. While the capability exists to import external routing information from protocols such as BGP, the number of such routes that may be imported may be very large. As a result, this can lead to overheads both in communication (flooding the external LSAs) as well as computation (SPF computation scales with the number of routes). As a consequence of the scalability problems they pose, the importing of external routes is rarely utilized.

Two routers that are neighbor routers have link-level connectivity between each other. Neighbor routers form an adjacency so that they can exchange routing information with each other. OSPF allows a link between the neighbor routers to be used for forwarding only if these routers have the same view of the topology, i.e., the same link-state database. This ensures that forwarding data packets over the link does not create loops. Thus, two neighbors have to make sure that their link-state databases are synchronized, and they do so by exchanging parts of their link-state databases when they establish an adjacency. The adjacency between a pair of routers is said to be “full” once they have synchronized their link-state databases.
While sending LSAs to a neighbor, a router bundles them together into a Link-State Update packet. We will re-examine the OSPF reconvergence process in more detail when we discuss network disruptions in Section 2.5.2.1.

Although elegant and simple, basic OSPF is insensitive to network capacity and routes packets hop-by-hop along the SPF tree. As mentioned in Section 2.3.2, this has some potential shortcomings when applied to aggregate links. While aggregate-link capacity thresholds can be tuned to minimize this potentially negative effect, a better approach may be to use capacity-sensitive routing protocols, often called Traffic Engineering (TE) protocols, such as OSPF-TE [21]. Alternatively, one may use routing protocols with a greater degree of routing control, such as MPLS-based protocols. Traffic Engineering and MPLS are discussed later in this chapter.

2.4.1.2 Border Gateway Protocol

The Border Gateway Protocol is used to exchange routing information between autonomous systems, for example, between ISPs or between an ISP and its large enterprise customers. When BGP is used between ASs, it is referred to as Exterior BGP (eBGP). When BGP is used within an AS to distribute external reachability information, it is referred to as Interior BGP (iBGP). This section provides a brief summary of BGP. It is covered in much greater detail in Chapters 6 and 11.

BGP is a connection-oriented protocol that uses TCP for reliable delivery. A router advertises Network Layer Reachability Information (NLRI) consisting of an IP address prefix, a prefix length, and a BGP next hop, along with path attributes, to its BGP peer. Packets matching the route will be forwarded toward the BGP next hop. Each route announcement can also have various attributes that can affect how the peer will prioritize its selection of the best route to use in its routing table. One example is the AS_PATH attribute, which is a list of ASes through which the route has been relayed.
Withdrawal messages are sent to remove NLRI that are no longer valid. For example, in Fig. 2.14, A|z denotes an advertisement of NLRI for IP prefix z, and W|s,r denotes that routes s and r are being withdrawn and should be removed from the routing table. If an attribute of the route changes, the originating router announces it again, replacing the previous announcement. Because BGP is connection-oriented, there are no refreshes or reflooding of routes during the lifetime of the BGP connection, which makes BGP simpler than a protocol like OSPF. However, like OSPF, BGP has various timers affecting behavior like hold-offs on route installation and route advertisement.

Fig. 2.14 BGP message exchange (routers R1 and R2, each with a BGP process and RIB, exchange the messages W|s,r and A|z over a BGP adjacency)

BGP maintains tables referred to as Routing Information Bases (RIBs) containing BGP routes and their attributes. The Loc-RIB table contains the router’s definitive view of external routing information. Besides routes that enter the RIB from BGP itself, routes enter the RIB via distribution from other sources, such as static or directly connected routes or routing protocols such as OSPF. While the notion of a “route” in BGP originally meant an IPv4 prefix, with the standardization of Multiprotocol BGP (MP-BGP) it can represent other kinds of reachability information, referred to as address families. For example, a BGP route can be an IPv6 prefix or an IPv4 prefix within a VPN.

External routes advertised in BGP must be distributed to every router in an AS. The hop-by-hop forwarding nature of IP requires that a packet address be looked up and matched against a route at each router hop. Because the address information may match external networks that are only known in BGP, every router must have the BGP information. However, we describe later how MPLS removes the need for every interior router to have external BGP route state.
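The announce, replace, and withdraw behavior described in this section can be sketched with a toy route table. This is an illustration under our own assumptions rather than a model of a real BGP implementation: the `Rib` class and its method names are hypothetical, the prefixes, addresses, and AS numbers come from documentation ranges, and best-route selection is reduced to a single tiebreaker (shortest AS_PATH, then peer name for determinism), whereas real BGP applies a longer sequence of tiebreakers.

```python
class Rib:
    """Toy route store: prefix -> {peer: route attributes}."""

    def __init__(self):
        self.routes = {}

    def announce(self, peer, prefix, as_path, next_hop):
        # A re-announcement from the same peer replaces the earlier one,
        # so changed attributes simply overwrite the stored route.
        self.routes.setdefault(prefix, {})[peer] = {
            "as_path": list(as_path),
            "next_hop": next_hop,
        }

    def withdraw(self, peer, prefix):
        # A withdrawal removes only the route learned from that peer.
        paths = self.routes.get(prefix, {})
        paths.pop(peer, None)
        if not paths:
            self.routes.pop(prefix, None)

    def best(self, prefix):
        """Return (peer, attributes) of the preferred route, or None."""
        paths = self.routes.get(prefix)
        if not paths:
            return None
        # Simplified selection: shortest AS_PATH wins, peer name breaks ties.
        peer = min(paths, key=lambda p: (len(paths[p]["as_path"]), p))
        return peer, paths[peer]


rib = Rib()
rib.announce("R1", "203.0.113.0/24", [64500, 64501], "192.0.2.1")
rib.announce("R3", "203.0.113.0/24", [64502], "192.0.2.3")
preferred_peer = rib.best("203.0.113.0/24")[0]
```

Because routes are stored per peer, no periodic refresh is modeled: a route stays in the table until the peer that sent it replaces or withdraws it.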
Within an AS, the BGP next hop will be the IP address of the exit router or exit link from the AS through which the packet must route, and BGP is used by the exit router to distribute the routes throughout the AS. To avoid creating a full mesh of iBGP sessions among the edge and interior routers, BGP can use a hierarchy of Route Reflectors (RR). Figure 2.15 illustrates how BGP connections are constructed using a Route Reflector.

BGP routes may have their attributes manipulated when received and before sending to peers, according to policy design decisions of the operator. Of the BGP routes received by a BGP router, BGP first determines the validity of a route (e.g., whether the BGP next hop is reachable) and then chooses the best route among valid duplicates with different paths. The best route is decided by a hierarchy of tiebreakers among route attributes such as IGP metric to the next hop and BGP path attributes such as AS_PATH length. The best route is then relayed to all peers except the originating one. One variation of this relay behavior is that a router that is not a route reflector does not relay a route received from one iBGP peer to any other iBGP peer.

Fig. 2.15 BGP connections in an ISP with Route Reflectors (RR). PE = Provider Edge router (Access Router); CE = Customer Edge router; RR = Route Reflector; iBGP = Interior BGP; eBGP = Exterior BGP

2.4.1.3 Protocol-Independent Multicast

IP Multicast is very efficient when a source sends data to multiple receivers. By using multicast at the network layer, a packet traverses a link only once, and therefore the network bandwidth is utilized optimally. In addition, the processing at routers (forwarding load) as well as at the end-hosts (discarding unwanted packets) is reduced.
Multicast applications generally use UDP as the underlying transport protocol, since there is no unique context for the feedback received from the various receivers for congestion control purposes. We provide a brief overview of IP Multicast in this section; it is covered in greater detail in Chapter 11.

IP Multicast uses group addresses from the Class “D” address space (in the context of IPv4). The range of IP addresses used for IP Multicast group addresses is 224.0.0.0 to 239.255.255.255. When a source sends a packet to an IP Multicast group, all the receivers that have joined that group receive it. The typical protocol used between the end-hosts and routers is the Internet Group Management Protocol (IGMP). Receivers (end-hosts) announce their presence (join a multicast group) by sending an IGMP report. From the first router, the indication of the end-host’s intent to join the multicast group is forwarded through routers upward along the shortest path to the root of the multicast tree. The root of an IP Multicast tree can be a source in a source-based distribution tree, or it may be a “rendezvous point” when the tree is a shared distribution tree.

The routing protocol used in conjunction with IP multicast is called Protocol-Independent Multicast (PIM). PIM has several variants, which differ in how they form the multicast tree that forwards traffic from a source (or sources) to the receivers. A router forwards a multicast packet only if it was received on the upstream interface to the source or to a rendezvous point (in a shared tree). Thus, a packet sent by a source follows the distribution tree. To avoid loops, if a packet arrives on an interface that is not on the shortest path toward the source or rendezvous point, the packet is discarded (and thus not forwarded). This is called Reverse Path Forwarding (RPF), a critical aspect of multicast routing. RPF avoids loops by not forwarding duplicate packets.
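The RPF check described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function name rpf_forward, the interface names, and the dictionary standing in for the unicast routing table are hypothetical, and real routers keep per-group forwarding state that is omitted here.

```python
def rpf_forward(unicast_next_iface, root, arrival_iface, outgoing_ifaces):
    """Return the interfaces a multicast packet is copied to (empty if dropped).

    unicast_next_iface maps a tree root (a source or a rendezvous point) to
    the interface that unicast routing would use to reach that root.
    """
    expected = unicast_next_iface.get(root)
    if expected is None or arrival_iface != expected:
        return []  # RPF check failed: discard the packet
    # RPF check passed: copy the packet to the downstream interfaces.
    return [iface for iface in outgoing_ifaces if iface != arrival_iface]


# Hypothetical router: unicast routing reaches the source via eth0.
routes = {"10.1.1.1": "eth0"}
downstream = ["eth1", "eth2"]

print(rpf_forward(routes, "10.1.1.1", "eth0", downstream))  # ['eth1', 'eth2']
print(rpf_forward(routes, "10.1.1.1", "eth2", downstream))  # []
```

The second call models a copy of the packet that looped back on a downstream interface; it fails the check and is not forwarded again.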
PIM relies on the SPT created by the traditional routing protocols such as OSPF to find the path back to the multicast source using RPF. IP Multicast uses soft-state to keep the multicast forwarding state at the routers in the network. There are two broad approaches for maintaining multicast state. The first is termed PIM-Dense Mode, wherein traffic is first flooded throughout the network, and the tree is “pruned” back along branches where the traffic is not wanted. The underlying assumption is that there are multicast receivers for this group at most locations, and hence flooding is appropriate. The flood and prune behavior is repeated, in principle, once every 3 min. However, this results in considerable overhead (as the traffic would be flooded until it is pruned back) each time. Every router also ends up keeping state for the multicast group. To avoid this, the router downstream of a source periodically sends a “state refresh” message that is propagated hop-by-hop down the tree. When a router receives the state refresh message on the RPF interface, it refreshes the prune state, so that it does not forward traffic received subsequently, until a receiver joins downstream on an interface. While PIM-Dense Mode is desirable in certain situations (e.g., when receivers are likely to exist downstream of each of the routers – densely populated groups – hence the name), PIM-Sparse Mode (PIM-SM) is more appropriate for wide-scale deployment of IP multicast for both densely and sparsely populated groups. With PIM-SM, traffic is sent only where it is requested, and receivers are required to explicitly join a multicast group to receive traffic. While PIM-SM uses both a shared tree (with a rendezvous point, to allow for multiple senders) as well as a per-source tree, we describe a particular mode, PIM-Source Specific Multicast (PIM-SSM), which is more commonly used for IPTV distribution. 
More details regarding PIM-SM, including PIM using a shared tree, are described in Chapter 11. PIM-SSM is adopted when the end-hosts know exactly which source and group, typically denoted (S,G), to join to receive the multicast transmissions from that source. In fact, because receivers must signal the combination of source and group to join, different sources can share the same group address without interfering with each other. Using PIM-SSM, a receiver transmits an IGMP join message for the (S,G), and the first-hop router sends an (S,G) join message directly along the shortest path toward the source. The shortest path tree is rooted at the source.

One of the key properties of IP Multicast is that multicast routing operates somewhat independently of the IGP routing. Changes to the network topology are reflected in the unicast routing using updates that operate on short time scales (e.g., the transmission of LSAs in OSPF reflects a link or node failure immediately). However, IP Multicast routing reflects the changed topology only when the multicast state is refreshed. For example, with PIM-SSM, the updated topology is reflected only when the receiver periodically reissues the join to refresh the state (which can take up to a minute or more). We will examine the consequence of this for wide-area IPTV distribution later in this chapter.

2.4.2 Multiprotocol Label Switching

2.4.2.1 Overview of MPLS

Multiprotocol Label Switching (MPLS) is a technology developed in the late 1990s that added new capabilities and services to IP networks. It was the culmination of various IP switching technology efforts such as multiprotocol over ATM, Ipsilon’s IP Switching, and Cisco’s tag switching [7,20]. The key benefits provided by MPLS to an ISP network are:

1. Separation of routing (the selection of paths through the network) from forwarding/switching via IP address header lookup
2. An abstract hierarchy of aggregation

To understand these concepts, we first consider how normal IP routing in an ISP network functions. In an IP network without MPLS, there is a topology hierarchy with edge and backbone routers. There is also a routing hierarchy, with BGP carrying external reachability information and an IGP like OSPF carrying internal reachability information. BGP carries the information about which exit router (BGP next hop) is used to reach external address space. OSPF picks the paths across the network between the edges (see Fig. 2.16). It is important to note that every OSPF router knows the complete path to reach all the edges. The internal paths that OSPF picks and the exit routers from BGP are determined before the first packet is forwarded. The connectionless, hop-by-hop forwarding behavior of IP routing requires that every router have this internal and external routing information present.

[Figure: a provider router network of P and PE routers with attached CEs; a packet addressed to A.1 is forwarded toward external network A using a hop-by-hop route lookup, with routes chosen using OSPF interior routing protocols. Legend: P = Provider router (Backbone Router); PE = Provider Edge router (Access Router); CE = Customer Edge switch]
Fig. 2.16 Traditional IP routing with external routes distributed throughout backbone

Consider the example in Fig. 2.16, where a packet with destination address A.1 enters on the left, destined for the external network A on the upper right. When the first packet arrives, the receiving provider edge router (PE) looks up the destination IP address. From BGP, it learns that the exit router for that address is the upper right PE. From OSPF, the path to reach that exit PE is determined. Even though the ingress PE knows the complete path to reach the exit PE, it simply forwards the packet to the next-hop backbone router, labeled as a P-router (P) in the figure.
The backbone router then repeats the process: using the packet IP address, it determines the exit from BGP and the path to the exit from OSPF, and forwards the packet to the next-hop backbone router (BR). The process repeats until the packet reaches the exit PE.

The repeated lookup of the packet destination to find the external exit and internal path appears to be unnecessary. The lookup operation itself is not expensive; the issue is the unnecessary state and binding information that must be carried inside the network. The ingress router knows the path to reach the exit. If the packet could somehow be bound to the path itself, then the successive next-hop routers would only need to know the path for the packet and not its actual destination. This is what MPLS accomplishes.

Consider Fig. 2.17, where MPLS sets up an end-to-end Label Switched Path (LSP) by assigning labels to the interior paths to reach exits in the network. The LSP might look like the one shown in Fig. 2.18. The backbone routers are now called Label Switch Routers (LSR). Via MPLS signaling protocols, the LSR knows how to forward a packet carrying an incoming label for an LSP to an outgoing interface and outgoing label; this is called a “swap” operation. The PE router also acts as an LSR, but is usually at the head (start) or end (tail) of the LSP where, respectively, the initial label is “pushed” onto the data or “popped” (removed) from the data.

[Figure: a network of LSR and PE routers with attached CEs; a packet addressed to A.1 follows an LSP toward external network A, with the route looked up once and the associated label assigned to the packet, and routes chosen using OSPF interior routing protocols. Legend: LSR = Label Switch Router; PE = Provider Edge router (Access Router); CE = Customer Edge router]
Fig. 2.17 Routing with MPLS creates Label Switched Paths (LSP) for routes across the network

[Figure: a Label Switched Path from its “head end” to its “tail end”; the data is labeled by a PUSH at the head end, relabeled by SWAP operations at intermediate hops, and unlabeled by a POP at the tail end. The label values shown are 233, 666, and 417]
Fig. 2.18 Within an LSP, labels are assigned at each hop by the downstream router

In the example of Fig. 2.17, external BGP routing information such as routes to network A is only needed at the edges of the network. The interior LSRs only need to know the interior paths among the edges as determined by OSPF. When the packet with address A.1 arrives at the ingress PE, the same lookup operation is done as previously: the egress PE is determined from BGP and the interior path to reach the egress is found from OSPF. But this time the packet is given a label for the LSP matching the OSPF path to the egress. The internal LSRs now forward the packet hop-by-hop based on the labels alone. At the exit PE, the label is removed and the packet is forwarded toward its external destination.

In this example, the binding of a packet to a path through the network is only done once – at the entrance to the network. The assignment of a packet to a path through the network is separated from the actual forwarding of the packet through the network (this is the first benefit identified above). Further, a hierarchy of forwarding information is created: the external routes are only kept at the edge of the network, while the interior routers only know about interior paths. At the ingress router, all received packets that need to exit at the same point of the network receive the same label and follow the same LSP.

MPLS takes these concepts and generalizes them further. For example, the LSP to the exit router could be chosen differently from the IGP shortest path. IPv4 provides a method for explicit path forwarding in the IP header, but it is very inefficient. With MPLS, explicit routing becomes very efficient and is the primary tool for traffic engineering in IP backbones. In the previous example, if an interior link were heavily utilized, the operator might wish to divert some traffic around that link by taking a longer path, as shown in Fig. 2.19.
Normal IP shortest path forwarding does not allow for this kind of traffic placement.

The forwarding hierarchy can be used to create provider-based VPNs. This is illustrated in Fig. 2.20. Virtual private routing contexts are created at the PEs, one per customer VPN. The core of the network does not need to maintain state information about individual VPN routes. The same LSPs for reaching the exits of the network are used, but additional labels are assigned to separate the different VPN states.

[Figure: a network of LSR and PE routers with attached CEs; an LSP toward external network A takes a path other than the one chosen using OSPF interior routing protocols. Legend: LSR = Label Switch Router; PE = Provider Edge router (Access Router); CE = Customer Edge router]
Fig. 2.19 MPLS with Traffic Engineering can use an alternative to the IGP shortest path

[Figure: a network of LSR and PE routers with attached CEs, with PEs interconnected via LSPs. Legend: LSR = Label Switch Router; PE = Provider Edge router; CE = Customer Edge router]
Fig. 2.20 MPLS VPNs support separated virtual routing contexts in PEs interconnected via LSPs

In summary, the advantages to the IP backbone of decoupling routing and forwarding are:

- It achieves efficient explicit routing.
- Interior routers do not need any external reachability information.
- Packet header information is only processed at the head of the LSP (e.g., at the edges of the network).
- It is easy to implement nested or hierarchical identification (such as with VPNs).

2.4.2.2 Internet Route Free Core

The ability of MPLS to remove the external BGP information plus Layer 3 address lookup from the interior of the IP backbone is sometimes referred to as an Internet Route Free Core. The “interior” of the IP backbone starts at the left-side (BR-side) port of the access routers in Fig. 2.7. Some of the advantages of an Internet Route Free Core include:

- Traffic engineering using BGP is much easier.
- Route reflectors no longer need to be in the forwarding plane, and thus can be dedicated to IP layer control plane functions or even placed on a server separate from the routers.
- Denial of Service (DoS) attacks and security holes are better controlled because BGP routing decisions only occur at the edges of the IP backbone.
- Enterprise VPN and other priority services can be better isolated from the “Public Internet”.

We provide more clarification for the last advantage. Many enterprise customers, such as financial companies or government agencies, are concerned about mixing their priority traffic with that of the public Internet. Of course, all packets are mixed on links between backbone routers; however, VPN traffic can be functionally segregated via LSPs. In particular, since denial of service attacks from compromised hosts on the public Internet rely on reachability from the Internet, the private MPLS VPN address space isolates VPN customers from this threat. Further, enterprise premium VPN customers are sometimes clustered onto access routers dedicated to the VPN service. Furthermore, better performance (such as lower packet loss or latency) for premium VPN services can be provided by implementing priority queueing or by providing them with bandwidth-sensitive LSPs (discussed later). A similar approach can be used to provide other performance-sensitive services, such as Voice-over-IP (VoIP).

2.4.2.3 Protocol Basics

MPLS encapsulates IP packets in an MPLS header consisting of one or more MPLS labels, known as a label stack. Figure 2.21 shows the most commonly used MPLS encapsulation type. The first 20 bits are the actual numerical label. There are three bits for inband signaling of class of service type, followed by an End-of-Stack bit (described later) and a time-to-live field, which serves the same function as an IP packet time-to-live field.
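The field layout just described (20-bit label, 3-bit class of service, 1-bit End-of-Stack, 8-bit time-to-live) adds up to one 32-bit word per label. A minimal sketch of packing and unpacking one such entry follows; the function names pack_entry and unpack_entry and the sample values are illustrative choices, and the fields are assumed to be packed in the order listed, in network byte order.

```python
import struct


def pack_entry(label, cos, end_of_stack, ttl):
    """Pack one label stack entry into 4 bytes: label | CoS | S | TTL."""
    if not (0 <= label < 2**20 and 0 <= cos < 2**3 and 0 <= ttl < 2**8):
        raise ValueError("field out of range")
    word = (label << 12) | (cos << 9) | (int(bool(end_of_stack)) << 8) | ttl
    return struct.pack("!I", word)  # "!I" = big-endian unsigned 32-bit


def unpack_entry(data):
    """Return (label, cos, end_of_stack, ttl) from a 4-byte entry."""
    (word,) = struct.unpack("!I", data)
    return (word >> 12, (word >> 9) & 0x7, bool((word >> 8) & 0x1), word & 0xFF)


entry = pack_entry(label=233, cos=0, end_of_stack=True, ttl=64)
print(entry.hex())          # 000e9140
print(unpack_entry(entry))  # (233, 0, True, 64)
```

Because every entry has the same fixed size, a router can read the top label with a single shift and needs no knowledge of the payload type.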
MPLS encapsulation does not define a framing mechanism to determine the beginning and end of packets; it relies on existing underlying link-layer technologies.

[Figure: Layer 2 Header | PID | MPLS Label 1 | MPLS Label 2 | … | MPLS Label n | Layer 3 Packet, where each MPLS label entry consists of Label (20 bits) | CoS (3 bits) | Stack (1 bit) | TTL (8 bits)]
Fig. 2.21 Generic MPLS encapsulation and header fields

Existing protocols such as Ethernet, Point-to-Point Protocol (PPP), ATM, and Frame Relay have been given new protocol IDs or new link-layer control fields to allow them to directly encapsulate MPLS-labeled packets. Also, MPLS does not have a protocol ID field to indicate the type of packet encapsulated, such as IPv4, IPv6, Ethernet, etc. Instead, the protocol type of the encapsulated packet is implied by the label and communicated by the signaling protocol when the label is allocated.

MPLS defines the notion of a Forwarding Equivalence Class (FEC) (not to be confused with Forward Error Correction (FEC) in lower network layers defined earlier). All packets with the same forwarding requirements, such as path and priority queuing treatment, can belong to the same FEC. Each FEC is assigned a label. Many FEC types have been defined by the MPLS standards: IPv4 unicast route, VPN IPv4 unicast route, IPv6 unicast route, Frame Relay permanent virtual circuit, ATM virtual circuit, Ethernet VLAN, etc.

Labels can be stacked, with the last label in the stack marked by the end-of-stack bit. This allows hierarchical nesting of FECs, which permits VPNs, traffic engineering, and hierarchical routing to be created simultaneously in the same network. Consider the previous VPN example, where an outer label may represent the interior path to reach an exit and an inner label may represent a VPN context.
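The nesting described above, together with the push, swap, and pop operations introduced earlier, can be traced with a small sketch. It assumes a two-label stack (an outer label for the interior path and an inner label for a VPN context); the helper names and all label values, including VPN_LABEL, are arbitrary illustrations rather than values any protocol would assign.

```python
def push(stack, label):
    """Add a label on top of the stack (index 0 is the top)."""
    return [label] + stack


def swap(stack, new_label):
    """Replace the top label; labels beneath it are untouched."""
    return [new_label] + stack[1:]


def pop(stack):
    """Remove the top label."""
    return stack[1:]


def end_of_stack_bits(stack):
    """The end-of-stack bit is set only on the last (bottom) label."""
    return [i == len(stack) - 1 for i in range(len(stack))]


VPN_LABEL = 9001  # hypothetical inner label identifying a VPN context

# Ingress PE: push the inner label, then the outer label for the LSP.
stack = push(push([], VPN_LABEL), 233)
trace = [stack]

# Interior LSRs swap only the outer label.
for new_outer in (666, 417):
    stack = swap(stack, new_outer)
    trace.append(stack)

# End of the LSP: the outer label is popped, exposing the inner label.
stack = pop(stack)
trace.append(stack)

print(trace)                            # [[233, 9001], [666, 9001], [417, 9001], [9001]]
print(end_of_stack_bits([233, 9001]))   # [False, True]
```

The interior routers never inspect the inner label, which is why the core needs no per-VPN state.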
MPLS is called “multiprotocol” because it can be carried over almost any transport as mentioned above (ironically, even IP itself) and because it can carry the payload for many different packet types – all the FEC types mentioned above.

Signaling of MPLS FECs and their associated labels among routers and switches can be done using many different protocols. A new protocol, the Label Distribution Protocol (LDP), was defined specifically for MPLS signaling. However, existing protocols have also been extended to signal FECs and labels: Resource Reservation Protocol (RSVP) [3] and BGP, for example.

2.4.2.4 IP Traffic Engineering and MPLS

The purpose of IP traffic engineering is to enable efficient use of backbone capacity, that is, to ensure both that links and routers in the network are not congested and that they are not underutilized. Traffic engineering may also mean ensuring that certain performance parameters such as latency or minimum bandwidth are met.

To understand how MPLS traffic engineering plays a role in ISP networks, we first explain the generic problem to be solved – the multicommodity flow problem – and how it was traditionally solved in IP networks versus how MPLS can solve the problem. Consider an abstract network topology with traffic demands among nodes. There are:

- Demands d(i, j) from node i to node j
- Constraints – link capacity b(i, j) between nodes
- Link costs C(i, j)
- Path p(k) or route for each demand

The traffic engineering problem is to find paths for the demands that fit the link constraints. The problem can be specified at different levels of difficulty:

1. Find any feasible solution, regardless of the path costs.
2. Find a solution that minimizes the costs for the paths.
3. Find a feasible or a minimum cost solution after deleting one or more nodes and/or links.
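To make the problem statement concrete, the sketch below checks whether a given assignment of paths to demands is feasible and computes its cost. It only evaluates candidate solutions; it does not search for one. The four-node topology, capacities, costs, and demand names are hypothetical, and the dictionaries demands, capacity, cost, and paths stand in for d(i, j), b(i, j), C(i, j), and p(k), respectively.

```python
from collections import defaultdict


def link_loads(demands, paths):
    """Sum the bandwidth each directed link carries under a path assignment."""
    loads = defaultdict(float)
    for name, bandwidth in demands.items():
        nodes = paths[name]
        for link in zip(nodes, nodes[1:]):
            loads[link] += bandwidth
    return dict(loads)


def is_feasible(demands, paths, capacity):
    """Level 1: does every link stay within its capacity?"""
    loads = link_loads(demands, paths)
    return all(load <= capacity[link] for link, load in loads.items())


def total_cost(paths, cost):
    """Level 2 objective: sum of link costs over all chosen paths."""
    return sum(cost[link] for nodes in paths.values()
               for link in zip(nodes, nodes[1:]))


# Hypothetical topology: two routes from A to C, each link of capacity 1.
capacity = {("A", "D"): 1.0, ("D", "C"): 1.0, ("A", "B"): 1.0, ("B", "C"): 1.0}
cost = {("A", "D"): 1, ("D", "C"): 1, ("A", "B"): 2, ("B", "C"): 2}
demands = {"k1": 0.75, "k2": 0.4, "k3": 0.4}

all_shortest = {k: ["A", "D", "C"] for k in demands}
split = {"k1": ["A", "D", "C"], "k2": ["A", "B", "C"], "k3": ["A", "B", "C"]}

print(is_feasible(demands, all_shortest, capacity), total_cost(all_shortest, cost))  # False 6
print(is_feasible(demands, split, capacity), total_cost(split, cost))                # True 10
```

The cheapest assignment overloads the A-D-C links, while the feasible one costs more, which is the tension the three difficulty levels above describe.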
Traffic Engineering an IP Network

In an IP network, the capacities represent link bandwidths between routers and the costs might represent delay across the links. Sometimes, we only want to find a feasible solution, such as in a multicast IPTV service. Sometimes, we want to minimize the maximum path delay, such as in a Voice-over-IP service. And sometimes, we want to ensure a design that is survivable (meaning it is still feasible to carry the traffic) for any single- or dual-link failure.

Consider how a normal ISP without traffic engineering might try to solve the problem. The tools available on a normal IP network are:

- Metric manipulation, i.e., picking OSPF weights to create a feasible solution.
- Simple topology or link augmentation: this tends to overengineer the network and restrict the possible topology.
- Source or policy routes using the IPv4 header option or router-based source routes. Source routes are very inefficient, resulting in tremendously lower router capacity, and they are not robust, making the network very difficult to operate.

Figure 2.22 illustrates a network with a set of demands and an example of the way that particular demands might be routed using OSPF. Although the network has sufficient total capacity to carry the demands, it is not possible to find a feasible solution (with no congested links) by only setting OSPF weights. A small ISP facing this situation without technology like MPLS would probably resort to installing more link capacity on the A-D-C node path.

[Figure: routers A, B, C, and D with attached endpoints 1, 2, 3, and 4. All link capacities = 1 unit, except C-3 = 2 units. Demand (2,3) = 0.75 units; Demand (1,3) = 0.4 units; Demand (1,4) = 0.4 units]
Fig. 2.22 IP routing is limited in its ability to meet resource demands. It cannot successfully route the demands within the link bandwidths in this example

The generic solution to an arbitrary traffic engineering problem requires specifying the explicit route (path) for each demand. This is a complex problem that can take an indeterminate time to solve. But there are other approaches that can solve a large subset of problems. One suboptimal approach is Constraint-based Shortest Path First (CSPF). CSPF has been implemented in networks with ATM Private Network-to-Network Interface (P-NNI) and IP MPLS. For currently defined MPLS protocols, the constraints can be bandwidths per class of service for each link. Also, links can be assigned a set of binary values, which can be used to include or exclude the links from routing a given demand. CSPF is implemented in a distributed fashion where all nodes have full knowledge of network resource allocation. Then, each node routes its demands independently by:

1. Pruning the network to only feasible paths
2. Picking the shortest of the feasible paths on the pruned network

Although CSPF routing is suboptimal when compared with a theoretical multicommodity flow solution, it is a reasonable compromise for solving many traffic engineering problems in which the nodes route their demands independently of each other. For more complex situations where CSPF is inadequate, network planners must use explicit paths computed by an offline system. The next section discusses explicit routing in more detail.

Traffic Engineering Using MPLS

The main problems with traffic engineering an IP backbone with only a Layer 3 IGP routing protocol (such as OSPF) are (1) lack of knowledge of resource allocation and (2) no efficient explicit routing. The previous example of Fig. 2.22 shows how OSPF would route all demands onto a link that does not have the necessary capacity. Another example problem arises when a direct link is needed for a small demand between nodes to meet certain delay requirements: OSPF cannot prevent other traffic demands from routing over this smaller link and causing congestion.
MPLS solves these problems with extensions to OSPF (OSPF-TE) [21] to provide resource allocation knowledge and RSVP-TE [2] for efficient signaling of explicit routes to use those resources. See Fig. 2.23 for a simple example of how an explicit path is created. RSVP-TE can create an explicit hop-by-hop path in the PATH message downstream. The PATH message can request resources such as bandwidth. The return message is an RESV, which contains the label that the upstream node should use at each link hop. In this example, a traffic-engineered LSP is created along path A-B-C for 0.4 Mb/s.

[Figure: routers A, B, C, and D with attached endpoints; a PATH message requesting 0.4 Mbps travels downstream and an RESV message with labels returns upstream]
Fig. 2.23 RSVP messaging to set up explicit paths

These LSPs are referred to as traffic engineering tunnels. Tunnels can be created and differentiated for many purposes (including restoration, to be defined in later sections). But in general, primary (service route) tunnels can be considered as a routing mechanism for all packets of a given FEC between a given pair of routers or router interfaces. Using this machinery, Fig. 2.24 illustrates how MPLS-TE can be used to solve the capacity overload problem in the network shown in Fig. 2.22.

[Figure: routers A, B, C, and D with attached endpoints 1, 2, 3, and 4. All link capacities = 1 unit, except C-3 = 2 units. Demand (2,3) = 0.75 units; Demand (1,3) = 0.4 units; Demand (1,4) = 0.4 units]
Fig. 2.24 MPLS-TE enables efficient capacity usage through traffic engineering to solve the example in Fig. 2.22

The explicit path used in RSVP-TE signaling can be computed by an offline system and automatically configured in the edge routers, or the routers themselves can compute the path. In the latter case, the edge routers must be configured with the IP prefixes and their associated bandwidth reservations that are to be traffic-engineered to other edges of the network.
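When routers compute the path themselves, the two CSPF steps given earlier (prune, then pick the shortest remaining path) can be sketched as below. This is an illustrative sketch, not a router implementation: the topology is hypothetical and only loosely inspired by Fig. 2.22 (its links, costs, and demands are assumptions, not read from the figure), bandwidth is the only constraint modeled, and the reserve function merely stands in for the bookkeeping that follows tunnel setup.

```python
import heapq


def cspf(links, src, dst, bandwidth):
    """Return the lowest-cost path with enough available bandwidth, or None.

    links maps a directed link (u, v) to {"cost": ..., "avail": ...}.
    """
    # Step 1: prune links that cannot carry the requested bandwidth.
    adjacency = {}
    for (u, v), attrs in links.items():
        if attrs["avail"] >= bandwidth:
            adjacency.setdefault(u, []).append((v, attrs["cost"]))

    # Step 2: shortest path (Dijkstra) on the pruned topology.
    heap = [(0, src, [src])]
    visited = set()
    while heap:
        dist, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt, cost in adjacency.get(node, []):
            if nxt not in visited:
                heapq.heappush(heap, (dist + cost, nxt, path + [nxt]))
    return None


def reserve(links, path, bandwidth):
    """Reduce available bandwidth on each link of an established tunnel."""
    for link in zip(path, path[1:]):
        links[link]["avail"] -= bandwidth


links = {
    ("A", "D"): {"cost": 1, "avail": 1.0},
    ("D", "C"): {"cost": 1, "avail": 1.0},
    ("A", "B"): {"cost": 2, "avail": 1.0},
    ("B", "C"): {"cost": 2, "avail": 1.0},
}

chosen = []
for bandwidth in (0.75, 0.4, 0.4):
    path = cspf(links, "A", "C", bandwidth)
    reserve(links, path, bandwidth)
    chosen.append(path)

print(chosen)  # [['A', 'D', 'C'], ['A', 'B', 'C'], ['A', 'B', 'C']]
```

The first demand takes the shortest path; the later demands are steered onto A-B-C because pruning removes the links that no longer have room. The result depends on the order in which demands are routed, which is one reason CSPF is suboptimal compared with a global solution.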
Because each router computes its paths without knowledge of the other demands being routed in the network, the routers must receive periodic updates about bandwidth allocations in the network.

OSPF-TE provides a set of extensions to OSPF to advertise traffic engineering resources in the network. For example, bandwidth resources per class of service can be allocated to a link. Also, a link can be assigned binary attributes, which can be used for excluding or including a link for routing an LSP. These resources are advertised in an opaque LSA via OSPF link-state flooding and are updated dynamically as allocations change. Given the knowledge of link attributes in the topology and the set of demands, the router performs an online CSPF to calculate the explicit paths. The path outputs of the CSPF are given to RSVP-TE to signal in the network. As TE tunnels are created in the network, the link resources change, i.e., available bandwidth is reduced on a link after a tunnel is allocated using RSVP-TE. Periodically, OSPF-TE will advertise the changes to the link attributes so that all routers can have an updated view of the network.

2.4.2.5 VPNs with MPLS

Figure 2.20 illustrates the key concept in how MPLS is used to create VPN services. VPN services here refer to carrier-based VPN services, specifically the ability of the service provider to create private network services on top of a shared infrastructure. For the purposes of this text, VPNs are of two basic types: a Layer 3 IP routed VPN or a Layer 2 switched VPN. Generalized MPLS (GMPLS) [19] can also be used for creating Layer 1 VPNs, which will not be discussed here.

A Layer 3 IP VPN service looks to customers of the VPN as if the provider built a router backbone for their own use – like having their own private ISP. VPN standards define the PE routers, CE routers, and backbone P-routers interconnecting the PEs.
Although the packets share (are mixed over) the ISP’s IP layer links, routing information and packets from different VPNs are virtually isolated from each other.

A Layer 2 VPN provides either point-to-point connection services or multipoint Ethernet switching services. Point-to-point connections can be used to support end-to-end services such as Frame Relay permanent virtual circuits, ATM virtual circuits, point-to-point Ethernet circuits (i.e., with no Media Access Control (MAC) learning or broadcasting), and even a circuit emulation over packet service. Interworking between connection-oriented services, such as Frame Relay to ATM interworking, is also defined. This kind of service is sometimes called a Virtual Private Wire Service (VPWS). Layer 2 VPN multipoint Ethernet switching services support a traditional Transparent LAN over a wide-area network, called Virtual Private LAN Service (VPLS) [24, 25].

Layer 3 VPNs over MPLS

As mentioned previously, Layer 3 VPNs maintain a separate virtual routing context for each VPN on the PE routers at the edge of the network. External CEs connect to the virtual routing context on a PE that belongs to a customer’s VPN.

Layer 3 VPNs implemented using MPLS are often referred to as BGP MPLS VPNs because of the important role BGP has in the implementation. BGP is used to carry VPN routes between the edges of the network. BGP keeps the potentially overlapping VPN address spaces unique by prepending onto the routes a route distinguisher (RD) that is unique to each VPN. The RD + VPN IPv4 prefix combination creates a new unique address space carried by BGP, sometimes called the VPNv4 address space.

VPN routes flow from one virtual routing instance into other virtual routing instances on PEs in the network using a BGP attribute called a Route Target (RT). An RT is an identifier configured by the ISP to identify all virtual routing instances that belong to a VPN.
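The roles of the RD and the RT can be illustrated with a small sketch. It assumes two customers that use the same private prefix; the class names, the RD strings, and the RT names (RT-red, RT-blue) are hypothetical, and the matching rule shown (import a route when it carries at least one of the instance’s configured targets) is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VpnRoute:
    rd: str                   # route distinguisher, unique per VPN
    prefix: str               # customer IPv4 prefix (may overlap across VPNs)
    route_targets: frozenset  # RTs attached to the route

    @property
    def vpnv4(self):
        # RD + prefix forms the unique key carried by BGP.
        return (self.rd, self.prefix)


class VirtualRouter:
    """A virtual routing instance on a PE (control-plane view only)."""

    def __init__(self, name, import_targets):
        self.name = name
        self.import_targets = frozenset(import_targets)
        self.table = {}

    def offer(self, route):
        # Accept the route only if it is targeted at this instance's VPN.
        if self.import_targets & route.route_targets:
            self.table[route.prefix] = route
            return True
        return False


# Two customers use the same prefix; the RDs keep the routes distinct.
red = VpnRoute("64512:1", "10.0.0.0/24", frozenset({"RT-red"}))
blue = VpnRoute("64512:2", "10.0.0.0/24", frozenset({"RT-blue"}))
bgp_table = {route.vpnv4: route for route in (red, blue)}

vr_red = VirtualRouter("vr-red", {"RT-red"})
imported = [vr_red.offer(route) for route in bgp_table.values()]

print(len(bgp_table))  # 2
print(imported)        # [True, False]
```

Both routes coexist in BGP because their keys differ, yet only the route carrying the matching target reaches the customer’s routing instance.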
RTs constrain the distribution of VPN routes among the edges of the network so that the VPN routes are only received by the virtual routing instances belonging to the intended (targeted) VPN.

We note that RDs and RTs are only used in the BGP control plane – they are not values that are applied to user packets themselves. Rather, for every advertised VPNv4 route, BGP also carries a label assignment that is unique to a particular virtual router on the advertising PE. Every VPN packet that is forwarded across the network receives two labels at the ingress PE: an inner label associated with the advertised VPNv4 route and an outer label associated with the LSP to reach the egress advertising PE (dictated by the BGP next-hop address). See Fig. 2.25 for a simplified example. In this example, there is a VPN advertising a route Z, which enters the receiving virtual router (vr1) and is distributed by BGP to other PE virtual routers using RTs. A packet entering the VPN destined toward Z is looked up in the virtual routing instance, where the two labels are found – the outer label to reach the egress PE and the inner label for the egress virtual routing instance.

[Figure: PE1 and PE2 connected by an LSP through LSR1, LSR2, and LSR3, with CE1, CE2, and links LNK1 and LNK2. Annotations recoverable from the figure: “LNK1 data: vr1; vr1: RT1, RD1; table: Rt Z → L4, PE2; PE2 → L1, LSR1”; “LNK2 data: vr1; vr1: RT1, RD1; table: Rt Z → L4, CE2, LNK2”; “L1 → L2”; “L2 → pop”; packet “L1|L4|Z|packet”; “Route Z”; “Li - labels”]
Fig. 2.25 In this VPN example, a virtual routing context (vr1) in the PEs contains the VPN label and routing information such as route target (RT1) and route distinguisher (RD1), attached CE interfaces, and next-hop lookup and label binding. VPN traffic is transported using a label stack of VPN label and interior route label

Layer 2 VPNs over MPLS

The implementation of Layer 2 VPNs over MPLS is similar to Layer 3 VPNs. Because there is no IP routing in the VPN service, there is instead a virtual switching context created on the edge PEs to isolate different VPNs.
These virtual switching contexts keep the address spaces of the edge services from conflicting with each other across different VPNs. Layer 2 VPNs use a two-label stack approach that is similar to Layer 3 VPNs. Reaching an egress PE from an ingress PE is done using the same network interior LSPs that the Layer 3 VPN service would use. And then, there is an inner label associated with either the VPWS or VPLS context at the egress PE. This inner label can be signaled using either LDP or BGP. The inner label and the packet encapsulation comprise a pseudowire, as defined in the PWE3 standards [16]. The pseudowire connects an ingress PE to an egress PE switching context and is identified by the inner label. The VPWS service represents a single point-to-point connection, so there will only be a single pseudowire setup in each direction. For VPLS however, carriers typically set up a full mesh of pseudowires/LSPs among all PEs belonging to that VPLS. Forwarding for a VPWS is straightforward: the CE connection is associated with the appropriate pseudowires in each direction when provisioned. For VPLS, forwarding is determined by the VPLS forwarding table entry for the destination Ethernet MAC address. Populating the forwarding table is based on source MAC address learning. The forwarding table records the inbound interface on which a source MAC was seen. If the destination MAC is not in the table, then the packet is flooded to all interfaces attached to the VPLS. Flooding of unknown destination MACs and broadcast MACs follows some special rules within a VPLS. All PEs within a backbone are assumed to be full mesh connected with pseudowires. So, packets received from the backbone are not flooded again into the backbone, but are only flooded onto CE interfaces. On the other hand, packets from a CE to be flooded are sent to all attached CE interfaces and all pseudowire interfaces toward the other backbone PEs. 
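The VPLS forwarding rules above (source MAC learning, flooding of unknown destinations, and no flooding from the backbone back into the backbone) can be sketched as follows. This is a simplified model of one PE’s switching context under stated assumptions: the class name VplsInstance, the interface names, and the shortened MAC strings are hypothetical, and aging of learned entries is omitted.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class VplsInstance:
    """One PE's switching context for a VPLS (toy model)."""

    def __init__(self, ce_ifaces, pseudowires):
        self.ce_ifaces = set(ce_ifaces)      # customer-facing interfaces
        self.pseudowires = set(pseudowires)  # full mesh toward other PEs
        self.mac_table = {}                  # MAC -> interface it was seen on

    def receive(self, in_iface, src_mac, dst_mac):
        """Learn the source MAC and return the set of output interfaces."""
        self.mac_table[src_mac] = in_iface
        known = self.mac_table.get(dst_mac)
        if dst_mac != BROADCAST and known is not None:
            return set() if known == in_iface else {known}
        # Unknown or broadcast destination: flood.
        if in_iface in self.pseudowires:
            # From the backbone: flood onto CE interfaces only.
            targets = set(self.ce_ifaces)
        else:
            # From a CE: flood to other CEs and all pseudowires.
            targets = self.ce_ifaces | self.pseudowires
        return targets - {in_iface}


pe = VplsInstance({"ce1", "ce2"}, {"pw-pe2", "pw-pe3"})

first = pe.receive("ce1", "aa", "bb")      # unknown destination from a CE
reply = pe.receive("pw-pe2", "bb", "aa")   # "aa" was learned on ce1
remote = pe.receive("pw-pe3", "cc", "dd")  # unknown destination from backbone

print(sorted(first))   # ['ce2', 'pw-pe2', 'pw-pe3']
print(sorted(reply))   # ['ce1']
print(sorted(remote))  # ['ce1', 'ce2']
```

The third case shows why the full mesh matters: because every PE already received the frame directly from the sending PE, flooding it back onto pseudowires would only create duplicates.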
There is also a VPLS variation called Hierarchical VPLS to constrain the potential explosion of mesh point-to-point LSPs needed among the PE routers. In this variation, a PE may act as a spoke with a single pseudowire attached to a core of meshed PEs. In this model, a flooding packet received at a mesh-connected PE from a spoke PE pseudowire is sent to all attached CEs and pseudowires. In such a model, the PE interconnectivity must be guaranteed to be loop-free or a spanning tree protocol may be run among the PEs for that VPLS.

2.5 Network Restoration and Planning

The design of an IP backbone is driven by the traffic demands that need to be supported and by network availability objectives. The network design tools model the traffic carried over the backbone links not only in a normal “sunny day” scenario, but also in the presence of network disruptions. Many carriers offer Service Level Agreements (SLAs). SLAs will vary across different types of services. For example, SLAs for private-line services are quite different from those for packet services. SLAs also usually differ among different types of packet services. The SLAs for general Internet, VPN, and IPTV services will generally differ. A packet-based SLA might be expressed in terms of Quality of Service (QoS) metrics. For example, the SLA for a premium IP service may cover up to three QoS metrics: latency, jitter, and packet loss. An example of a packet-loss SLA is “averaged over time period Y, the customer will receive at least X% of his/her packets transmitted.” Some of these packet services may be further differentiated by offering different levels of service, also called Class of Service (CoS). To provide its needed SLAs, an ISP establishes internal network objectives. Network availability is a key internal metric used to control packet loss. Furthermore, network availability is also sometimes used as the key QoS metric for private-line services.
Network availability is often stated colloquially in “9s”. For example, “four nines” of availability means the service is available at least 0.9999 of the time. Put another way, the service should not be down more than 0.0001 of the time (roughly 53 min per year). Given its prime importance, we will concentrate on network availability in the remainder of this section. The most important factors in designing and operating the IP backbone such that it achieves its target network availability are modeling its potential network disruptions and the response of the network to those disruptions. Network disruptions are most typically caused by network failures and maintenance activities. Maintenance activities include upgrading of equipment software, replacement of equipment, and reconfiguration of network topologies or line cards. Because of the complex layering and segmentation of networks surrounding the IP backbone and because of the variety and vintage of equipment that accumulates over the years, network planners, architects, network operators, and engineers spend considerable effort to maintain network availability. In this section, we will briefly describe the types of restoration methods we find at the various network layers. Then, we will describe how network disruptions affect the IP backbone, the types of restoration methods used to handle them, and finally how the network is designed to meet the needed availability. Table 2.3 summarizes typical restoration methods used in some of today’s network core layers that are most relevant to the IP backbone. See [11] for descriptions of restoration methods used in other layers shown in Fig. 2.3. In the next sections, we will describe the rows of this table. Note that the table is approximate and does not apply universally to all telecommunication carriers.
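As a quick check on the “nines” shorthand used earlier in this section, the following sketch converts a number of nines into a yearly downtime budget. The function name is hypothetical and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored

def max_downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget implied by an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)   # e.g., four nines -> 0.0001
    return unavailability * MINUTES_PER_YEAR

for n in (3, 4, 5):
    print(f"{n} nines: {max_downtime_minutes_per_year(n):.1f} min/year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the step from four to five nines is so demanding operationally.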
Table 2.3 Example of core-segment restoration methods

| Network layer | Restoration method(s) against network failures that originate at that layer or lower layers | Exemplary restoration time scale |
| --- | --- | --- |
| Fiber | No automatic rerouting | Hours (manual) |
| DWDM | 1) Manual; 2) 1+1 restoration (also called dedicated protection) | 1) Hours (manual); 2) 3–20 ms |
| SONET Ring | Bidirectional Line-Switched Rings (BLSR) | 50–100 ms |
| IOS (DCS) | Distributed path-based mesh restoration | Sub-second to seconds |
| W-DCS | No automatic rerouting | Hours |
| IP backbone | 1) IGP reconfiguration; 2) MPLS Fast Reroute (FRR) | 1) 10–60 s; 2) 50–100 ms |

2.5.1 Restoration in Non-IP Layers

2.5.1.1 Fiber Layer

As we described earlier, in most central offices today, optical interfaces on switching or transport equipment connect to fiber patch panels. Some carriers have installed an automated fiber patch panel, also called a Fiber Cross-Connect (FXC), which allows an operator to control the cross-connects remotely. Some of the enabling technologies include physical crossbars using optical collimators and Micro-Electro-Mechanical Systems (MEMS). A good overview of these technologies can be found in [12]. When disruptions occur to the fiber layer, most commonly from construction activity, network operators can reroute around the failed fiber by using a patch panel to cross-connect the equipment onto undamaged fibers. This may require coordination of cross-connects at intermediate central offices to patch a path through alternate COs if an entire cable is damaged. Of course, this typically is a slow manual process, as reflected in Table 2.3, and so higher-layer restoration is usually utilized for disruptions to the fiber layer.

2.5.1.2 DWDM Layer

Some readers may be surprised to learn that carriers have deployed few (if any) automatic restoration methods in their DWDM layers (neither metro nor core segment).
The one type of restoration occasionally deployed is one-by-one (1:1) or one-plus-one (1+1) tail-end protection switching, which switches at the endpoints of the DWDM layer connection. With 1+1 switching, the signal is duplicated and transmitted across two (usually) diversely routed connections. The path of the connection during the nonfailure state is usually called the working path (also called the primary or service path); the path of the connection during the failure state is called the restoration path (also called protection path or backup path). The receiver consists of a simple detector and switch that detects failure of the signal on the working path (more technically, detects performance errors such as average BER threshold crossings) and switches to the restoration path upon alarm. Once adequate signal performance is again achieved on the signal along the working path (including a time-out threshold to avoid link “flapping”), it switches back to the working path. In 1:1 protection switching, there is no duplication of signal, and thus the restoration connection can be used for other transport in nonfailure states. The transmitted signal is switched to the restoration path upon detection of failure of the service path and/or notification from the far end. Technically speaking, in ROADM or point-to-point DWDM systems, 1+1 or 1:1 protection switching is usually implemented electronically via the optical transponders. Consequently, these methods can be implemented at other transport layers, such as DCS, IOS, and SONET. The major advantage of 1+1 or 1:1 methods is that they can trigger in as little as 3–20 ms. However, because these methods require restoration paths that are dedicated (one-for-one) for each working connection, the resulting restoration capacity cannot be shared among other working connections for potential failures.
Furthermore, the restoration paths are diversely routed and are often much longer than their working paths. Consequently, 1+1 and 1:1 protection switching tend to be the costliest forms of restoration.

2.5.1.3 SONET Ring Layer

The two most common types of deployed SONET or SDH self-healing ring technology are Unidirectional Path Switched Ring (UPSR-2F) and Bidirectional Line-Switched Ring (BLSR-2F). The “2F” stands for “2-Fibers”. For simplicity, we will limit our discussion to SONET rings, but there is a very direct analogy for SDH rings. However, note that ADM-ADM ring links are sometimes transported over a lower DWDM layer, thus forming a “connection” that is routed over channels of DWDM systems, instead of direct fiber. Although there is no inherent topographical orientation in a ring, many people conceptually visualize each node of a SONET self-healing ring as an ADM with an east bidirectional OC-n interface (i.e., a transmit port and a receive port) and a west OC-n interface. Typically, n = 48 or 192. An STS-k SONET-Layer connection enters at an add/drop port of an ADM, routes around the ring on k STS-1 channels of the ADM–ADM links and exits the ring at an add/drop port of another ADM. The UPSR is the simplest of the devices and works similarly to the 1+1 tail-end switch described in Section 2.5.1.2, except that each direction of transmission of a connection routes counterclockwise on the “outer” fiber around the ring (west direction) and therefore an STS-k connection uses the same k STS-1 channels on all links around the ring. At each add/drop transmit port, the signal is duplicated in the opposite direction on the “inner” fiber. The selector responds to a failure as described above. The BLSR-2F partitions the bidirectional channels of its East and West high-speed links in half. The first half is used for working (nonfailure) state, and the second half is reserved for restoration. When a failure to a link occurs,
the surrounding ADMs loop back that portion of the connection paths onto the restoration channels around the opposite direction of the ring. The UPSR has very rapid restoration, but suffers the dedicated-capacity condition described in Section 2.5.1.2; as a consequence, UPSRs are now confined mostly to the metro network, in particular to the portion closest to the customer, often extending into the feeder network. Because BLSR signaling is used to advertise failures among ADMs and real-time intermediate cross-connections have to be made, a BLSR restores more slowly than a UPSR. However, the BLSR is capable of having multiple connections share restoration channels over nonsimultaneous potential network failures, and is thus almost always deployed in the middle of the metro network or parts of the core network. Rings are described in more detail in [11].

2.5.1.4 IOS Layer

The typical equipment that comprises today’s IOS layer uses distributed control to provision (set up) connections. Here, links of the IOS network (SONET bidirectional OC-n interfaces) are assigned routing weights. When a connection is provisioned over the STS-1 channels of an IOS network, its source node (IOS) computes its working path (usually along a minimum-weight path) and also computes a restoration path that is diversely routed from the working path. After the connection is set up along its working path, the restoration path is stored for future use. The nodes communicate the state of the network connectivity via topology update messages transmitted over the SONET overhead on the links between the nodes. When a failure occurs, the nodes flood advertisement messages to all nodes indicating the topology change. The source node for each affected connection then instigates the restoration process for its failed connections by sending connection request messages along the links of the (precalculated) restoration path, seeking spare STS-1 channels to reroute its connections.
Various handshakes among the nodes of the restoration paths are used to complete the rerouting of the connections. Note that in contrast to the dedicated and ring methods, the restoration channels are not prededicated to specific connections and, therefore, connections from a varied set of source/destination pairs can potentially use them. Such a method is called shared restoration because a given spare channel can be used by different connections across nonsimultaneous failures. Shared mesh restoration is generally more capacity-efficient than SONET rings in mesh networks (i.e., networks with average connectivity greater than 2). We now delve a little more into IOS restoration to make a key point that will become relevant to the IP backbone, as well. The example in Fig. 2.2 shows two higher-layer connections routing over the same lower-layer link. In light of the discussion above about the restoration path being diverse from the working path in the IOS layer, the astute reader may ask “diverse relative to what?” The answer is that, in general, the path should be diverse all the way down through the DWDM and Fiber Layers. This requires that the IOS links contain information about how they share these lower-layer links. Often, this is accomplished via a mechanism called “bundle groups”. That is, a bundle group is created for each lower-layer link, but is expressed as a group of IOS links that share (i.e., route over) that link. Diverse restoration paths can be discovered by avoiding IOS links that belong to the same bundle group as a link on the working path. Of course, the equipment in the IOS Layer cannot “see” its lower layers, and consequently has no idea how to define and create the bundle groups. Therefore, bundle groups are provisioned in the IOSs using an Operations Support System (OSS) that contains a database describing the mapping of IOS links to lower-layer networks.
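The bundle-group idea above can be sketched as follows: compute a minimum-weight working path, then exclude every link that shares a bundle group with any working-path link before computing the restoration path. This is a minimal sketch under stated assumptions; the topology, link names, weights, and the bundle group `fiber-1` are invented for illustration, and real equipment may use different diverse-routing algorithms.

```python
import heapq

def shortest_path(links, src, dst, banned=frozenset()):
    """Dijkstra over undirected links {name: (a, b, weight)}.
    Returns the list of link names on a minimum-weight path, or None."""
    adj = {}
    for name, (a, b, w) in links.items():
        if name in banned:
            continue
        adj.setdefault(a, []).append((b, w, name))
        adj.setdefault(b, []).append((a, w, name))
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w, name in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, (u, name)
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path, node = [], dst
    while node != src:
        node, name = prev[node]
        path.append(name)
    return path[::-1]

def diverse_restoration_path(links, bundle_groups, src, dst):
    """Return (working, restoration); restoration avoids shared bundle groups."""
    working = shortest_path(links, src, dst)
    if working is None:
        return None, None
    working_set = set(working)
    banned = set(working_set)
    for members in bundle_groups.values():
        if working_set & members:      # group shares risk with the working path
            banned |= members
    return working, shortest_path(links, src, dst, banned)

# Hypothetical example: links A-B and A-C ride the same fiber span out of A.
LINKS = {"A-B": ("A", "B", 1), "A-C": ("A", "C", 1), "C-B": ("C", "B", 1),
         "A-D": ("A", "D", 2), "D-B": ("D", "B", 2)}
BUNDLE_GROUPS = {"fiber-1": {"A-B", "A-C"}}
print(diverse_restoration_path(LINKS, BUNDLE_GROUPS, "A", "B"))
```

Without the bundle group, the apparent diverse path A-C, C-B would fail together with the working link A-B when the shared fiber is cut. Note that this two-step approach is a simplification: choosing the working path first can occasionally rule out a diverse pair that a joint computation would find.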
This particular example illustrates the importance of understanding network layering; without it, we will not have a reliable method to plan and engineer the network to meet the availability objective. This point will be equally important to the IP backbone. A set of bundled links is also referred to as a Shared Risk Link Group (SRLG) in the telecommunications industry, since it refers to a group of links that are subject to a shared risk of disruption.

2.5.1.5 W-DCS Layer and Ethernet Layer

There are few restoration methods provided at the W-DCS layer itself. This is because most disruptions to a W-DCS link result from a disruption of (1) a W-DCS line card or (2) a component in a lower layer over which the link routes. Disruptions of type (1) are usually handled by providing 1:1 restorable intra-office links between the W-DCS and TDM node (IOS or ADM). Disruptions of type (2) are restored by the lower TDM layers. This only leaves failure or maintenance of the W-DCS itself as an unrestorable network disruption. However, a W-DCS is much less sophisticated than a router and less subject to failure. Restoration of Layer 2 VPNs in an IP/MPLS backbone is discussed in Section 2.5.2. We note here that restoration in enterprise Ethernet networks is typically based on the Rapid Spanning Tree Protocol (RSTP). When enterprise Ethernet VPNs are connected over the IP backbone (such as VPLS), an enterprise customer who employs routing methods such as RSTP expects them to work in the extended network. Encapsulating the customer’s Ethernet frames inside pseudowires ensures that the client’s RSTP control packets are transported transparently across the wide area. For example, a client VPN may choose to restore local link disruptions by routing across other central offices or even distant metros. Since all this appears as one virtual network to the customer, such applications may be useful.
2.5.2 IP Backbone

There are two main restoration methods we describe for the IP layer: IGP reconfiguration and MPLS Fast Reroute (FRR).

2.5.2.1 OSPF Failure Detection and Reconvergence

In a formal sense, the IGP reconvergence process responds to topology changes. Such topology changes are usually caused by four types of events:

1. Maintenance of an IP layer component
2. Maintenance of a lower-layer network component
3. Failure of an IP layer component (such as a router line card or common component)
4. Failure of a lower-layer network component (such as a link)

When network operations staff perform planned maintenance on an IP layer link, it is typical to raise the OSPF administrative weight of the link to ensure that all traffic is diverted from the link (this is often referred to as “costing out” the link). In the second case, most carriers have a maintenance procedure where organizations that manage the lower-layer networks schedule their daily maintenance events and inform the IP layer operations organization. The IP layer operations organization responds by costing out all the affected links before the lower-layer maintenance event is started. In the first two cases (planned maintenance activity), the speed of the reconvergence process is usually not an issue. This is because the act of changing an IGP routing weight on a link causes LSAs to be issued. During the process of updating the link status and recomputation of the SPF tree, the affected links remain in service (i.e., “up”). Therefore, once the IGP reconfiguration process has settled, the routers can redirect packets to their new paths. While there may be a transient impact during the “costing out” period, in terms of transient loops and packet loss, the service impact is kept to a minimum by using this costing out technique to remove a link from the topology for performing maintenance.
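The effect of costing out a link can be seen in a small shortest-path sketch: raising the weight of one link moves the minimum-weight path elsewhere while the link itself stays up. The three-node topology and the weights are hypothetical; the value 65535 is used here simply as a very large weight (it is the largest value that fits the 16-bit OSPF interface cost).

```python
import heapq

def spf(weights, src, dst):
    """Minimum-weight path over undirected links {(u, v): weight}.
    Returns (cost, [nodes]) or None if dst is unreachable."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    heap, done = [(0, src, [src])], set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for nxt, w in adj.get(node, []):
            if nxt not in done:
                heapq.heappush(heap, (cost + w, nxt, path + [nxt]))
    return None

weights = {("A", "B"): 10, ("A", "C"): 10, ("C", "B"): 10}
print(spf(weights, "A", "B"))      # traffic uses the direct link A-B

weights[("A", "B")] = 65535        # "cost out" link A-B before maintenance
print(spf(weights, "A", "B"))      # traffic now routes via C
```

Because the costed-out link remains in the topology, it is still available as a last resort if the alternate path fails during the maintenance window.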
In the last two cases (failures), once the affected links go down, packets may be lost or delayed until the reconvergence process completes. Such a disruption may be unacceptable to delay- or loss-sensitive applications. This motivates us to examine how to reduce the time required for OSPF to converge from unexpected outages. This is the focus of the remainder of this section. While most large IP backbones route over lower layers, such as DWDM, those do not provide restoration. Layer 1 failure detection is a key component of the IP layer restoration process. A key component of the overall failure recovery time in OSPF-based networks is the failure detection time. However, lower-layer failure detection mechanisms sometimes do not coordinate well with higher-layer mechanisms and do not detect disruptions that originate in the IP layer control plane. As a result, OSPF routers periodically exchange Hello messages to detect the loss of a link adjacency with a neighbor. If a router does not receive a Hello message from its neighbor within a RouterDeadInterval, it assumes that the link to its neighbor has failed, or the neighbor router itself is down, and generates a new LSA to reflect the changed topology. All such LSAs generated by the routers affected by the failure are flooded throughout the network. This causes the routers in the network to redo the SPF calculation and update the next-hop information in their respective forwarding tables. Thus, the time required to recover from a failure consists of: (1) the failure detection time, (2) the LSA flooding time, and (3) the time to complete the new SPF calculations and update the forwarding tables. To avoid a false indication that an adjacency is down because of congestion-related loss of Hello messages, the RouterDeadInterval is usually set to be four times the HelloInterval, the interval between successive Hello messages sent by a router to its neighbor.
With the RFC suggested default values for these timers (HelloInterval value of 10 s and RouterDeadInterval value of 40 s), the failure detection time can take anywhere between 30 and 40 s. LSA flooding times consist of propagation delay and additional pacing delays inserted by the router. These pacing delays serve to rate-limit the frequency with which LSUpdate packets are sent on an interface. Once a router receives a new LSA, it schedules an SPF calculation. Since the SPF calculation using Dijkstra’s algorithm (see e.g., [8]) constitutes a significant processing load, a router typically waits for additional LSAs to arrive for a time interval corresponding to spfDelay (typically 5 s) before doing the SPF calculation on a batch of LSAs. Moreover, routers place a limit on the frequency of SPF calculations (governed by a spfHoldTime, typically 10 s, between successive SPF calculations), which can introduce further delays. From the description above, it is clear that reducing the HelloInterval can substantially reduce the Hello protocol’s failure detection time. However, there is a limit to which the HelloInterval can be safely reduced. As the HelloInterval becomes smaller, there is an increased chance that network congestion will lead to loss of several consecutive Hello messages and thereby cause a false alarm that an adjacency between routers is lost, even though the routers and the link between them are functioning. The LSAs generated because of a false alarm will lead to new SPF calculations by all the routers in the network. This false alarm would soon be corrected by a successful Hello exchange between the affected routers, which then causes a new set of LSAs to be generated and possibly new path calculations by the routers in the network. Thus, false alarms cause an unnecessary processing load on routers and sometimes lead to temporary changes in the path taken by network traffic. 
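The timer arithmetic above can be made concrete with a short sketch. It assumes the neighbor's last Hello arrived at most one HelloInterval before the failure, so the dead timer expires between (RouterDeadInterval − HelloInterval) and RouterDeadInterval after the failure. The function names are hypothetical, and the second function gives only a rough lower bound that ignores LSA generation, flooding, and pacing delays.

```python
def detection_window(hello_interval, dead_multiplier=4):
    """(earliest, latest) failure detection time in seconds after a failure."""
    dead_interval = dead_multiplier * hello_interval
    return dead_interval - hello_interval, dead_interval

def earliest_spf_start(hello_interval, spf_delay=5.0, dead_multiplier=4):
    """Rough lower bound on when the SPF calculation starts after a failure."""
    earliest_detect, _ = detection_window(hello_interval, dead_multiplier)
    return earliest_detect + spf_delay

for hello in (10, 1, 0.5):
    lo, hi = detection_window(hello)
    print(f"HelloInterval {hello} s: detection in {lo}-{hi} s, "
          f"SPF no earlier than {earliest_spf_start(hello)} s")
```

With the default 10 s HelloInterval, detection dominates the total; once the HelloInterval drops to around 1 s, the fixed spfDelay becomes the larger term, which anticipates why shrinking the HelloInterval alone yields diminishing returns.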
If false alarms are frequent, routers have to spend considerable time doing unnecessary LSA processing and SPF calculations, which may significantly delay important tasks such as Hello processing, thereby leading to more false alarms. False alarms can also be generated if a Hello message gets queued behind a burst of LSAs and thus cannot be processed in time. The possibility of such an event increases with the reduction of the RouterDeadInterval. Large LSA bursts can be caused by a number of factors such as simultaneous refresh of a large number of LSAs or several routers going down/coming up simultaneously. Choudhury [5] studies this issue and observes that reducing the HelloInterval lowers the threshold (in terms of number of LSAs) at which an LSA burst will lead to generation of false alarms. However, the probability of LSA bursts leading to false alarms is shown to be quite low. Since the loss and/or delayed processing of Hello messages can result in false alarms, there have been proposals to give such packets prioritized treatment at the router interface as well as in the CPU processing queue [5]. An additional option is to consider the receipt of any OSPF packet (e.g., an LSA) from a neighbor as an indication of the good health of the router’s adjacency with the neighbor. This provision can help avoid false loss of adjacency in the scenarios where Hello packets get dropped because of congestion, caused by a large LSA burst, on the link between two routers. Such mechanisms may help mitigate the false alarm problem significantly. However, it will take some time before these mechanisms are standardized and widely deployed. It is useful to make a realistic assessment regarding how small the HelloInterval can be, to achieve faster detection and recovery from network failures while limiting the occurrence of false alarms. We summarize below the key results from [13].
This assessment was done via simulations on the network topologies of commercial ISPs using a detailed implementation of the OSPF protocol in the NS2 simulator. The work models all the important OSPF protocol features as well as various standard and vendor-introduced delays in the functioning of the protocol. These are shown in Table 2.4. Goyal [13] observes that with the current default settings of the OSPF parameters, the network takes several tens of seconds before recovering from a failure. Since the main component in this delay is the time required to detect a failure using the Hello protocol, Goyal [13] examines the impact of lower HelloInterval values on failure detection and recovery times. Table 2.5 shows typical results for failure detection and recovery times after a router failure. As expected, the failure detection time is within the range of three to four times the value of HelloInterval. Once a neighbor detects the router failure, it generates a new LSA about 0.5 s after the failure detection. The new LSA is flooded throughout the network and will lead to scheduling of an SPF calculation 5 s (spfDelay) after the LSA receipt. This is done to allow one SPF calculation to take care of several new LSAs. Once the SPF calculation is done, the router takes about 200 ms more to update the forwarding table. After including the LSA propagation and pacing delays, one can expect the failure recovery to take place about 6 s after the ‘earliest’ failure detection by a neighbor router. Notice that many entries in Table 2.5 show the recovery to take place much sooner than 6 s after failure detection. This is partly an artifact of the simulation because the failure detection times reported by the simulator are the “latest” ones rather than the “earliest”. In one interesting case (seed 2, HelloInterval 0.75 s), the failure recovery takes place about 2 s after the ‘latest’ failure detection. 
This happens because the SPF calculation scheduled by an earlier false alarm takes care of the LSAs generated because of router failure. There are also many cases in which failure recovery takes place more than 6 s after failure detection (notice entries for HelloInterval 0.25 s, seeds 1 and 3). Failure recovery can be delayed because of several factors. The SPF calculation frequency of the routers is limited by spfHoldTime (typically 10 s), which can delay the new SPF calculation in response to the router failure. The delay caused by spfDelay is also a contribution. Finally, the routers with a low degree of connectivity may not get the LSAs in the first try because of loss due to congestion. Such routers may have to wait for 5 s (RxmtInterval) for the LSAs to be retransmitted.

Table 2.4 Various delays affecting the operation of OSPF protocol

| Category | Delay | Description |
| --- | --- | --- |
| Standard configurable delays | RxmtInterval | The time delay before an un-acked LSA is retransmitted. Usually 5 s. |
| Standard configurable delays | HelloInterval | The time delay between successive Hello packets. Usually 10 s. |
| Standard configurable delays | RouterDeadInterval | The time delay since the last Hello before a neighbor is declared to be down. Usually four times the HelloInterval. |
| Vendor-introduced configurable delays | Pacing delay | The minimum delay enforced between two successive Link-State Update packets sent down an interface. Observed to be 33 ms. Not always configurable. |
| Vendor-introduced configurable delays | spfDelay | The delay between the shortest path calculation and the first topology change that triggered the calculation. Used to avoid frequent shortest path calculations. Usually 5 s. |
| Vendor-introduced configurable delays | spfHoldTime | The minimum delay between successive shortest path calculations. Usually 10 s. |
| Standard fixed delays | LSRefreshTime | The maximum time interval before an LSA needs to be reflooded. Set to 30 min. |
| Standard fixed delays | MinLSInterval | The minimum time interval before an LSA can be reflooded. Set to 5 s. |
| Standard fixed delays | MinLSArrival | The minimum time interval that should elapse before a new instance of an LSA can be accepted. Set to 1 s. |
| Router-specific delays | Route install delay | The delay between the shortest path calculation and update of forwarding table. Observed to be 0.2 s. |
| Router-specific delays | LSA generation delay | The delay before the generation of an LSA after all the conditions for the LSA generation have been met. Observed to be around 0.5 s. |
| Router-specific delays | LSA processing delay | The time required to process an LSA including the time required to process the Link-State Update packet before forwarding the LSA to the OSPF process. Observed to be less than 1 ms. |
| Router-specific delays | SPF calculation delay | The time required to do shortest path calculation. Observed to be $0.00000247x^2 + 0.000978$ s on Cisco 3600 series routers; x being the number of nodes in the topology. |

The results in Table 2.5 show that a smaller value of HelloInterval speeds up the failure detection but is not effective in reducing the failure recovery times beyond a limit because of other delays like spfDelay, spfHoldTime, and RxmtInterval. Failure recovery times improve as the HelloInterval reduces down to about 0.5 s. Beyond that, as a result of more false alarms, we find that the recovery times actually go up.

Table 2.5 Failure detection time (FDT) and failure recovery time (FRT) for a router failure with different HelloInterval values

| Hello interval (s) | Seed 1 FDT (s) | Seed 1 FRT (s) | Seed 2 FDT (s) | Seed 2 FRT (s) | Seed 3 FDT (s) | Seed 3 FRT (s) |
| --- | --- | --- | --- | --- | --- | --- |
| 10 | 32.08 | 36.60 | 39.84 | 46.37 | 33.02 | 38.07 |
| 2 | 7.82 | 11.68 | 7.63 | 12.18 | 7.79 | 12.02 |
| 1 | 3.81 | 9.02 | 3.80 | 8.31 | 3.84 | 10.11 |
| 0.75 | 2.63 | 7.84 | 2.97 | 5.08 | 2.81 | 7.82 |
| 0.5 | 1.88 | 6.98 | 1.82 | 6.89 | 1.79 | 6.85 |
| 0.25 | 0.95 | 10.24 | 0.84 | 6.08 | 0.99 | 13.41 |

While it may be possible to further speed up the failure recovery by reducing the values of these delays, eliminating such delays altogether is not prudent. Eliminating spfDelay and spfHoldTime will result in potentially additional SPF calculations in a router in response to a single failure (or false alarm) as the different LSAs generated because of the failure arrive one after the other at the router.
The resulting overload on the router CPUs may have serious consequences for routing stability, especially when there are several simultaneous changes in the network topology. Failure recovery below the range of 1–5 s is difficult with OSPF.

In summary, OSPF recovery time can be lowered by reducing the value of HelloInterval. However, too small a value of HelloInterval will lead to many false alarms in the network, which cause unnecessary routing changes and may lead to routing instability. The optimal value for the HelloInterval that will lead to fast failure recovery in the network, while keeping the false alarm occurrence within acceptable limits for a network, is strongly influenced by the expected congestion levels and the number of links in the topology. While the HelloInterval can be much lower than the current default value of tens of seconds, it is not advisable to reduce it to the millisecond range because of potential false alarms. Further, it is difficult to prescribe a single HelloInterval value that will perform optimally in all cases. The network operator needs to set the HelloInterval conservatively, taking into account both the expected congestion and the number of links in the network topology.

2.5.2.2 MPLS Fast Reroute

MPLS Fast Reroute (FRR) was designed to improve restoration performance using the additional protocol layer provided by MPLS LSPs [17]. Primary and alternate (backup) LSPs are established. Fast rerouting over the alternate paths after a network disruption is achieved using preestablished router forwarding table entries. Equipment suppliers have developed many flavors of FRR, some of which are not totally compliant with standardized MPLS FRR. This section provides an overview of the basic concept. There are two basic varieties of backup path restoration in MPLS FRR, called next-hop and next-next-hop.
The next-hop approach identifies a unidirectional link to be protected and a backup (or bypass) unidirectional LSP that routes around the link if it fails. The protected link can be a router–router link adjacency or even another layer of LSP tunnel itself. The backup LSP routes over alternate links. The top graph in Fig. 2.26 illustrates a next-hop backup path for the potential failure of a given link (designated with an “X”). For now ignore the top path labeled “MPLS secondary LSP tunnel”, which will be discussed later. With the next-next-hop approach, the primary entities to protect are two-link working paths. The backup path is an alternate path over different links and routers than the protected entity. In general, a next-hop path is constructed to restore against individual link failures while next-next-hop paths are constructed to restore against both individual link failures and node failures. The trade-off is that next-hop paths are simpler to implement because all flows routing over the link can be rerouted similarly, whereas next-next-hop requires more LSPs and routing combinations. This is illustrated in the lower example of Fig. 2.26, wherein the first router along the path carries flows that terminate on different second hop routers, and therefore must create multiple backup LSPs that originate at that node.

Fig. 2.26 Example of Fast Reroute backup paths (figure labels: MPLS secondary LSP tunnel, MPLS primary LSP tunnels, PHY layer links, MPLS next-hop backup path, MPLS next-next-hop backup paths; “X” marks the failed link)

We will briefly describe an implementation of the next-hop approach to FRR. A primary end-to-end path is chosen by RSVP. This path is characterized by the Forwarding Equivalence Class (FEC) discussed earlier and reflects packets that are to be corouted and have similar CoS queuing treatment and ability to be restored with FRR. Often, a mesh of fully connected end-to-end LSPs between the backbone routers (BRs) is created. As discussed in earlier sections, an LSP is identified in forwarding tables by mappings of pairs of label and interface: (In-Label, In-Interface) → (Out-Label, Out-Interface). An end-to-end LSP is provisioned (set up) by choosing and populating these entries at each intermediate router along the path by a protocol such as RSVP-TE. For the source router of the LSP, the “In-Label” variable is equivalent to the FEC. As a packet hops along routers, the labels are replaced according to the mapping until it reaches the destination router, in which case, the MPLS shim headers are popped and packets are placed on the final output port. With next-hop, facility-based FRR, a backup (or bypass) LSP is set up for each link. For example, consider a precalculated backup path to protect a link between routers A and B, say (A-1, B-1), where A-1 is the transmit interface at router A, B-1 is the receive interface at router B, and L-1 is the MPLS label for the path over this link. The forwarding table entries are of form (L-i, A-k) → (L-1, A-1) at router A and (L-1, B-1) → (L-j, B-s) at router B. When this link fails, a Layer 1 alarm is generated and forwarded to the router controller or line card at A and B. For packets arriving at router A, mapping entries in the forwarding table with the Out-Interface = A-1 have another (outer) layer of label pushed on the MPLS stack to coincide with the backup path. This action is preloaded into the forwarding table and triggered by the alarm. Forwarding continues along the routers of this backup LSP by processing the outer layer labels as with any MPLS packet. The backup path ends at router B and, therefore, when the packets arrive at router B, their highest (exterior) layer label is popped.
Then, from the point of view of router B, after the outer label is popped, the MPLS header is left with (In-Label, In-Interface) = (L-1, B-1) and therefore the packets continue their journey beyond router B just as they would if link (A-1, B-1) were up. In this way, all LSPs that route over the particular link are rerouted (hence the term “facility based”). Various other specifications can be made to restrict the backup path to given classes of LSPs, for example to provide restoration for some IP CoSs but not others.

Another common implementation of next-hop FRR defines 1-hop pseudowires for each key link. Each pseudowire has a defined primary LSP and backup LSP (a capability found in most routers). If the link fails, a similar alarm mechanism causes the pseudowire to reroute over the backup LSP. When the primary LSP is again declared up, the pseudowire switches back to the primary path. An advantage of this method is that the pseudowire appears as a link to the IGP routing algorithm. Weights can be used to control how packets route over it or the underlying Layer 1 link. Section 2.6 illustrates this method for an IPTV backbone network.

MPLS FRR has been demonstrated by many vendors and carriers to work very rapidly (less than 100 ms) in response to single-link (IP layer PHY link) failures. Most FRR implementations behave similarly during the small interval immediately after the failure and before IGP reconvergence. However, implementations differ in what happens after IGP reconvergence. We describe two main approaches in the context of next-hop FRR here.

In the first approach, the backup LSP stays in place until the link goes back into service and IGP reconverges back to its nonfailure state. This is most common when a separate LSP or pseudowire is associated with each link in next-hop FRR. In this case, the link-LSP is rerouted onto its backup LSP and stays that way until the primary LSP is repaired.
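The label operations of facility-based next-hop FRR described above (swap at router A, push of an outer label when the protected interface is down, pop at router B) can be sketched in a few lines of Python. This is only an illustrative model, not a router API: the outer label BYP-7, the bypass interface A-2, and the arrival interface B-9 are hypothetical names, and for brevity the routers along the bypass LSP are assumed to leave the outer label unchanged, whereas real routers would swap it hop by hop.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    labels: list  # MPLS label stack; the last element is the top (outer) label

# Primary entries: (In-Label, In-Interface) -> (Out-Label, Out-Interface)
SWAP_A = {("L-i", "A-k"): ("L-1", "A-1")}
SWAP_B = {("L-1", "B-1"): ("L-j", "B-s")}

# Preloaded bypass action at A, keyed by the protected Out-Interface:
# push this outer label and transmit on this interface instead.
BYPASS_A = {"A-1": ("BYP-7", "A-2")}  # hypothetical values

# Tail end of the bypass LSP at B: the outer label identifies the
# protected interface that the bypass stands in for.
BYPASS_TAIL_B = {"BYP-7": "B-1"}

def forward_at_a(pkt, in_if, failed_ifs):
    """Swap the top label; if the out-interface is alarmed, push the bypass label."""
    out_label, out_if = SWAP_A[(pkt.labels[-1], in_if)]
    pkt.labels[-1] = out_label
    if out_if in failed_ifs:
        outer, out_if = BYPASS_A[out_if]
        pkt.labels.append(outer)
    return out_if

def forward_at_b(pkt, in_if):
    """Pop the bypass label if present, then forward as if the link were up."""
    if pkt.labels[-1] in BYPASS_TAIL_B:
        in_if = BYPASS_TAIL_B[pkt.labels.pop()]
    out_label, out_if = SWAP_B[(pkt.labels[-1], in_if)]
    pkt.labels[-1] = out_label
    return out_if

# Link (A-1, B-1) has failed: the packet leaves A on the bypass with two labels.
pkt = Packet(["L-i"])
bypass_if = forward_at_a(pkt, "A-k", failed_ifs={"A-1"})
stack_on_bypass = list(pkt.labels)
final_if = forward_at_b(pkt, "B-9")  # arrives on some other interface at B
```

After the pop at B, the packet is handled by the same (L-1, B-1) entry that serves the nonfailure case, which is why every LSP using the link is restored by the single bypass.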
In the second approach, FRR provides rapid restoration and then, after a short settling period, the network recomputes its paths [4]. Here, each primary end-to-end LSP is recomputed during the first IGP reconfiguration process after the failure. Since the IGP knows about the failed link(s), it reroutes the primary end-to-end LSPs around them and the backup LSPs become moot. This is illustrated in the three potential paths in the topmost diagram of Fig. 2.26. The IP flow routes along the primary LSP during the nonfailure state. Then, the given link fails and the path of the flow over the failed link deviates along the backup LSP, as shown by the lower dashed line. After the first IGP reconfiguration process, the end-to-end LSP path is recomputed, illustrated by the topmost dashed line. When a failed component is repaired or a maintenance procedure is completed, the disrupted links are put back into service. The process to return the network to its nonfailure state is often called normalization. During the normalization process, LSAs are broadcast by the IGP and the forwarding tables are recalculated. The normalization process is often controlled by an MPLS route mechanism/timer. A similar procedure would occur for next-next-hop.

The reason for the second approach is that while FRR enables rapid restoration, because these paths are segmental “patches” to the primary paths, the alternate route is often long and capacity-inefficient. With the first approach, IP flows continue routing over the backup paths until the repair is completed and alarms clear, which may span hours or days. Another reason is that if multiple link failures occur, then some of the backup FRR paths may fail; some response is needed to address this situation. These limitations of the first approach were early key inhibitors to implementation of FRR in large ISPs.
The key to implementing this second FRR strategy is that the switch from FRR backup paths to new end-to-end paths is hitless (i.e., causes negligible packet loss); otherwise, we may suffer three hits from each single failure (the failure itself, the process to reroute the end-to-end paths immediately after the failure, and then the process to revert to the original paths after repair). If the alternate end-to-end LSPs are set up in advance and the forwarding table changes are implemented efficiently on most routers (often using pointers), this process is essentially hitless for most IP unicast (point-to-point) applications. However, we note that today’s multicast does not typically enjoy hitless switchover to the new forwarding table because most multicast trees are built via join and prune request messages issued backwards (upstream) from the destination nodes. It is expected that different implementations of multicast will fix this problem in the future. We discuss this again in Section 2.6 and refer the reader to [36] for more discussion of hitless multicast.

For the network design phase of implementing next-hop FRR, each link (say L) along the primary path needs a predefined backup path whose routing is diverse in lower layers. That is, the paths of all lower-layer connections that support the links of the backup path are disjoint from the path of the lower-layer connection for link L. The key is in predefining the backup tunnels. While next-next-hop paths can also be used to restore against single-link failures, the network becomes more complex to design if there is a high degree of lower-layer link overlap. More generally, the major difficulty for the FRR approach is defining the backup LSPs so that the service paths can be rerouted, given a predefined set of lower-layer failures.
Furthermore, when multiple lower-layer failures occur and MPLS backup paths fail, FRR does not work and the network must revert to the slower primary path recalculation approach (described in the second approach above).

2.5.3 Failures Across Multiple Layers

Now that the reader is armed with background on network layering and restoration methods, we are poised to delve deeper into the factors and carrier decision variables that shape the availability of the IP backbone. Let us briefly revisit Fig. 2.9, which gives a simple example of the core ROADM Layer Diagram. Consider a backbone router (BR) in central office B with a link to one of the backbone routers in central office A. Furthermore, consider the remote access router (RAR) that is homed to the backbone router in office A. However, let us add a twist wherein the link between the RAR and BR routes over the IOS layer instead of directly onto the ROADM (DWDM layer) as pictured in Fig. 2.9. This can occur for RAR–BR links with lower bandwidth. This modification will illustrate more of the potential failure modes. In particular, we have constructed this simple example to illustrate several key points:

• Computing an estimate of the availability of the IP backbone involves analysis of many network layers.
• Network disruptions can originate from many different sources within each layer.
• Some lower layers may provide restoration and others do not; how does this affect the IP backbone?

Figure 2.27 gives examples of the types of individual component disruptions (“down events”) that might cause links to fail in this network example, but still only shows a few of the many disruptions that can originate at these layers. As one can see, this is a four-layer example, and some of the layers are skipped. Note that for simplicity, we illustrate point-to-point DWDM systems at the DWDM layer; however, the concepts apply equally well for ROADMs.
Some readers may think that the main source of network failures is fiber cuts and, therefore, that the entire area of multilayer restoration can be reduced to analyzing fiber cuts. However, this oversimplifies the problem. For example, an amplifier failure can often be as disruptive as a fiber cable cut and will likely result in the failure of multiple IP layer links. Furthermore, amplifier failures are more frequent. Let us examine the effect of some of the failures illustrated in Fig. 2.27.

Fig. 2.27 Example of component disruptions (failure or maintenance activity) at multiple layers. Layers shown: IP Layer (router common component, router line card), IOS Layer (IOS common component, IOS line card), ROADM/Point-to-point DWDM Layer (OTs, DWDM common component or amplifier, intra-office fiber), Fiber Layer (fiber cable). BR = Backbone Router; AR = Access Router; ROADM = Reconfigurable Optical Add/Drop Multiplexer; OT = Optical Transponder; DWDM = (Dense) Wavelength Division Multiplexer; IOS = Intelligent Optical Switch

IOS interface failure: The IOS network has restoration capability, as described in earlier sections. Consequently, the IOS layer reroutes its failed SONET STS-n connection that supports the RAR–BR link onto its restoration path. In this case, once the SONET alarms are detected by the two routers (the RAR and BR), they take the link out of service and generate appropriate LSAs to the correct IGP administrative areas or control domains to announce the topology change. Assuming that the IOS-layer restoration is successful, the RAR–BR link comes back after a short time (as specified in the IOS layer of Table 2.3) and the SONET alarm clears. After an appropriate time-out on the routers to avoid link flapping, the link is brought back up by the router and the topology change is announced via LSAs.
We note that in a typical AR/BR homing architecture, the LSAs from an AR–BR link are only announced in subareas and so do not affect other ARs or BRs.

Fiber cut: In the core network, the probability of a fiber cut is roughly proportional to the length of the fiber. Fiber cuts are less frequent than many of the other failures but highly disruptive: usually many IP layer links fail simultaneously because of the concentration of capacity enabled by DWDM.

Optical Transponder: OT failure is the most common of the failures shown in Fig. 2.27. However, a single OT failure only affects individual IP backbone links. Some of the more significant problems with OT failures are (1) performance degradation, where bit errors occasionally trip BER threshold crossing alerts, and (2) a nonnegligible probability of multiple failures in the network, in which an OT fails while another major failure is in progress or vice versa.

DWDM terminal or amplifier: Amplifier failure is usually the most disruptive of failures because of its impact (multiple wavelengths) and the sheer quantity of amplifiers, which are often placed every 50–100 miles, depending on the vintage and bit rate of the wavelengths of the DWDM equipment. Failure of the DWDM terminal equipment not associated with amplifiers and OTs is less probable because of the increased use of passive (nonelectrical or unpowered) components. Note that in Fig. 2.27, for the OT, fiber cut, and amplifier failure, the affected connections at their respective layers are unrestored. Thus, the IP layer must reroute around its lost link capacity.

Intra-office fiber: These disruptions usually occur from maintenance, reconfiguration, and provisioning activity in the central office.
This has been minimized over the years due to the use of fiber patch panels; however, when significant network capacity expansion or reconfiguration occurs, especially for the deployment of new technologies, architectures, or services, downtime from this class of failures typically spikes. It is typical to lump the intra-office fiber disruptions into the downtime for a line card or port and model them as one unit.

Router: These network disruptions include failure of router line cards, failure of router common equipment, and maintenance or upgrade of all or parts of the router. Note that for these disruptions that originate at the IP layer, no lower-layer restoration method can help because rerouting the associated connections at the lower layers will not bring the affected link back up. However, in the dual-homing AR–BR architecture, all the ARs that home to the affected router can alternatively reroute through the mate BR. The method of rerouting the AR traffic to the surviving AR–BR links differs per carrier. Usually, IGP reconfiguration is used. However, this can be unacceptably slow for some high-priority services, as evidenced by Table 2.3. Therefore, other faster techniques are sometimes used, such as Ethernet link load balancing or MPLS FRR.

We generalize some simple observations on multilayer restoration illustrated by Fig. 2.27 and its subsequent discussion:

1. Because of the use of express links, a single network failure or disruption at a lower layer usually results in multiple link failures at higher layers.
2. Failures that originate at an upper layer cannot be restored at a lower layer.
3. To meet most ISP network availability objectives, some form of restoration (even if rudimentary) must be provided in upper layers.

2.5.4 IP Backbone Network Design

Network design is covered in more detail in Chapter 5.
However, to tie together the concepts of network layering, network failure modeling, and restoration, we provide a brief description of IP network design here to illustrate its importance in meeting network availability targets. In this section, we give a brief description of how these factors are accommodated in the network design. To illustrate this, we describe a very simplified network design (or network planning) process as follows. This process would occur every planning period or whenever major changes to the network occur:

1. Derive a traffic matrix.
2. Input the existing IP backbone topology and compute any needed changes. That is, determine the homing of AR locations to the BR locations and determine which BR pairs are allowed to have links placed between them.
3. Determine the routing of BR–BR links over the lower-layer networks (e.g., DWDM, IOS, fiber).
4. Route the traffic matrix over the topology and size the links. This results in an estimate of network cost across all the needed layers.
5. Resize the links by finding their maximum needed capacity over all possible events in the Failure Set, which models potential network disruptions (both component failures and maintenance activity). This step simulates each failure event, determining which IP layer links or nodes fail after lower-layer restoration, if it exists, is applied and determining the capacity needed after traffic is rerouted using IP layer restoration.
6. Re-optimize the topology by going back to step 2 and iterating with the objective of lowering network cost.

Note in steps 2 and 3 that most carriers are reluctant to make large changes to the existing IP backbone topology, since these can be very disruptive and costly events. Therefore, steps 2 and 3 usually incur small topology changes from one planning period to the next. We will not describe detailed algorithms for the above here.
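As a rough illustration of steps 4 and 5 only, the following Python sketch routes a traffic matrix over shortest paths, repeats the routing for each event in a failure set, and keeps the largest load seen on each directed link. It is a toy model under stated assumptions, not the method of the cited references: the three-node topology, weights, demands, and the max_util utilization target are all hypothetical, each failure event is given directly as the set of IP layer links that are down (that is, after any lower-layer restoration), and ties between equal-cost paths are not split.

```python
import heapq
from collections import defaultdict

def shortest_path(adj, src, dst):
    """Dijkstra over adj = {node: {neighbor: weight}}; returns a node list or None."""
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, {}).items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def link_loads(adj, demands, failed=frozenset()):
    """Route every demand on its shortest path, skipping failed links."""
    live = {u: {v: w for v, w in nbrs.items() if frozenset((u, v)) not in failed}
            for u, nbrs in adj.items()}
    loads = defaultdict(float)
    for (src, dst), volume in demands.items():
        path = shortest_path(live, src, dst)
        if path is None:
            continue  # this demand cannot be carried in this scenario
        for u, v in zip(path, path[1:]):
            loads[(u, v)] += volume
    return loads

def size_links(adj, demands, failure_set, max_util=0.9):
    """Capacity per directed link = worst-case load over all scenarios / max_util."""
    needed = defaultdict(float)
    for failed in [frozenset()] + list(failure_set):  # nonfailure state first
        for link, load in link_loads(adj, demands, failed).items():
            needed[link] = max(needed[link], load / max_util)
    return dict(needed)

# Hypothetical example: a triangle of routers with IGP weights.
adj = {"A": {"B": 1, "C": 3}, "B": {"A": 1, "C": 1}, "C": {"A": 3, "B": 1}}
demands = {("A", "B"): 10.0, ("A", "C"): 4.0}
failure_set = [frozenset({frozenset(pair)}) for pair in (("A", "B"), ("B", "C"), ("A", "C"))]
capacity = size_links(adj, demands, failure_set)
```

In this toy case, link A→C carries nothing in the nonfailure state but must be sized for the full 14 units of demand because of the A–B failure, which is the kind of effect that observing nonfailure link loads alone would miss.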
Approaches to the above problem can be found in [22, 23]. The traffic matrix can come in a variety of forms, such as the peak 5-min average loads between AR pairs or average loads. Unfortunately, many organizations responsible for IP network design have little or no data about their current or future traffic matrices. In fact, many engineers who manage IP networks expand their network by simply observing link loads. When a link load exceeds some threshold, they add more capacity. Given no knowledge or high uncertainty of the true, stochastic traffic matrix, this may be a reasonable approach. However, network failures and their subsequent restorations are the phenomena that cause the greatest challenges with such a simple approach. Because of the extensive rerouting that can occur after a network failure, there is no simple or intuitive parameter to determine the utilization threshold for each link. Traffic matrix estimation is discussed in detail in Chapter 5.

A missing ingredient in the above network design algorithm is a description of how to model the network availability an ISP needs to achieve its SLAs. Theoretically, even if we assume the traffic matrix (present and/or future) is completely accurate, to achieve the network design availability objective, all the component failure modes and all the network layering must be modeled to design the IP backbone. The decision variables are the layers where we provide restoration (including what type of restoration should be used) and how much capacity should be deployed at each layer to meet the QoS objectives for the IP layer. This is further complicated by the fact that while network availability objectives for transport layers are often expressed in worst-case or average-case connection uptimes, IP backbone QoS objectives often use packet-loss metrics.
However, we can approximate the packet loss constraints in large IP layer networks by establishing maximum link utilization targets. For example, through separate analysis it might be determined that every flow can achieve the objective maximum packet loss target by not exceeding 90% utilization on any 40 Gb/s link, with perhaps lower utilization maxima needed on lower-rate links. Then, one can model when this utilization condition is met over the set of possible failures, including subsequent restoration procedures. By modeling the probabilities of the failure set, one can compute a network availability metric appropriate for packet networks. The probabilities of events in the failure set can be computed using Markov models and the Mean Time Between Failures (MTBF) and the Mean Time to Repair (MTTR) of the component disruptions. These parameters are usually obtained from a combination of equipment-supplier specifications, network observation/data, and carrier policies and procedures. A major stumbling block with this theoretical approach is that the failure event space is exponential in size. Even for very small networks and a few layers, it is intractable to compute all potential failures, let alone the subsequent restoration and network loss. An approach to probabilistic modeling to solve this problem is presented in more detail in Chapter 4 and in [28]. Armed with this background, we conclude this section by revisiting the issue of why we show the IP backbone routing over an unrestorable DWDM layer in the network layering of Fig. 2.3. This at first may seem counterintuitive because it is generally true that, per unit of capacity, the cost of links at lower layers is less than that of higher layers. Some of the reasons for this planning decision, which is consistent with most large ISPs, were hinted at in Section 2.5.3. We summarize them here. 1. 
Backbone router disruptions (failures or maintenance events) originate within the IP layer and cannot be restored at lower layers. Extra link capacity must be provided at the IP layer for such disruptions. Once placed, this extra capacity can then also be used for IP layer link failures that originate at lower layers. This obviates most of the cost advantages of lower-layer restoration.
2. Under nonfailure conditions, there is spare capacity available in the IP layer to handle uncertain demand. For example, restoration requirements aside, to handle normal service demand, IP layer links could be engineered to run below 80% utilization during peak intervals of the traffic matrix and well below that at off-peak intervals. If we allow higher utilization levels during network disruption events, then this provides an existing extra buffer during those events. Furthermore, there may be little appreciable loss during network disruptions during off-peak periods.

As QoS and CoS features are deployed in the IP backbone, there is yet another advantage to IP layer restoration. Namely, the IP layer can assign different QoS objectives to different service classes. For example, one such distinction might be to plan network restoration so that premium services receive better performance than best-effort services during network disruptions. In contrast, the DWDM layer cannot make such fine-grained distinctions; it either restores or does not restore the entire IP layer link, which carries a mixture of different classes of services.

2.6 IPTV Backbone Example

Some major carriers now offer nationwide digital television, high-speed Internet, and Voice-over-IP services over an IP network. These services typically include hundreds of digital television channels. Video content providers deliver their content to the service provider in digital format at select locations called super hub offices (SHOs).
This in turn requires that the carrier have the ability to deliver high-bandwidth IP streaming to its residential customers on a nationwide basis. If such content is delivered all the way to residential set-top boxes over IP, it is commonly called IPTV.

There are two options for providing such an IPTV backbone. The first option is to create a virtual network on top of the IP backbone. Since video service consists mostly of streaming channels that are broadcast to all customers, IP multicast is usually the most cost-effective protocol to transport the content. However, users have high expectations for video service and even small packet losses negatively impact video quality. This requires the IP backbone to be able to transport multicast traffic at a very high level of network availability and efficiency. The first option results in a mixture of best-effort traffic and traffic with very high quality of service on the same IP backbone, which in turn requires comprehensive mechanisms for restoration and priority queuing. Consequently, some carriers have followed the second option, wherein they create a separate overlay network on top of the lower-layer DWDM or TDM layers. In reality, this is another (smaller) IP layer network, with specialized traffic, network structure, and restoration mechanisms. We describe such an example in this section. Because of the high QoS objectives needed for broadcast TV services, the reader will find that this section builds on most of the previous material in this chapter.

2.6.1 Multicast-Based IPTV Distribution

Meeting the stringent QoS required to deliver a high-quality video service (such as low latency and loss) requires careful consideration of the underlying IP-transport network, network restoration, and video and packet recovery methods. Figure 2.28 (borrowed from [9]) illustrates a simplified architecture for a network providing IPTV service.
The SHO gathers content from the national video content providers, such as TV networks (mostly via satellite today), and distributes it to a large set of receiving locations, called video hub offices (VHOs). Each VHO in turn feeds a metropolitan area. IP routers are used to transport the IPTV content in the SHO and VHOs. The combination of SHO and VHO routers plus the links that connect them comprise the IPTV backbone. The VHO combines the national feeds with local content and other services and then distributes the content to each metro area. The long-distance backbone network between the SHO and the VHO includes a pair of redundant routers that are associated with each VHO. This allows for protection against router component failures, router hardware maintenance, or software upgrades.

Fig. 2.28 Example nationwide IPTV network (SHO and VHOs connected by the edges of the multicast tree, with dashed links used for restoration; each VHO feeds a metro network of routers, metro intermediate offices, video serving offices, DSLAMs, residential gateways, and set-top boxes). S/VHO = Super/Video Hub Office; DSLAM = Digital Subscriber Loop Access Multiplexer; RG = Residential Gateway

IP multicast is used for delivery as it provides economic advantages for the IPTV service to distribute video. With multicast, packets traverse each link at most once. The video content is encoded using an encoding standard such as H.264. Video frames are packetized and are encapsulated in the Real-Time Transport Protocol (RTP) and UDP. In this example, PIM-SSM is used to support IP multicast of the video content. Each channel from the national live feed at the SHO is assigned a unique multicast group. There are typically hundreds of channels assigned to standard-definition (SD) (1.5 to 3 Mb/s) and high-definition (HD) (6 to 10 Mb/s) video signals plus other multimedia signals, such as “picture-in-picture” channels and music. So, the live feed can be multiple gigabits per second in aggregate bandwidth.
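The aggregate-bandwidth estimate above is simple arithmetic, sketched below in Python. The per-channel rate ranges are the ones quoted in the text; the channel counts (400 SD and 200 HD) and the number of downstream VHOs are hypothetical, chosen only to show the order of magnitude. The second function contrasts multicast, where a link carries each packet at most once, with unicast replication from the SHO, where a link carries one copy per downstream receiver.

```python
def aggregate_feed_mbps(n_sd, n_hd, sd_rate_mbps, hd_rate_mbps):
    """Total live-feed rate in Mb/s for a channel lineup."""
    return n_sd * sd_rate_mbps + n_hd * hd_rate_mbps

def link_load_mbps(feed_mbps, downstream_vhos, multicast=True):
    """Load on one link that lies upstream of `downstream_vhos` receiving VHOs."""
    return feed_mbps if multicast else feed_mbps * downstream_vhos

# Hypothetical lineup: 400 SD and 200 HD channels.
low = aggregate_feed_mbps(400, 200, sd_rate_mbps=1.5, hd_rate_mbps=6.0)
high = aggregate_feed_mbps(400, 200, sd_rate_mbps=3.0, hd_rate_mbps=10.0)

# Same link with 10 VHOs downstream, multicast versus unicast replication.
mcast = link_load_mbps(high, downstream_vhos=10, multicast=True)
ucast = link_load_mbps(high, downstream_vhos=10, multicast=False)
```

Under these assumed counts the feed falls between 1.8 and 3.2 Gb/s, consistent with the "multiple gigabits per second" figure, and the unicast alternative would multiply that load by the number of downstream VHOs.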
2.6.2 Restoration Mechanisms

The IPTV network can use various restoration methods to deliver the needed video QoS to end-users. For example, it can recover from relatively infrequent and short bursts of loss using a combination of video and packet recovery mechanisms and protocols, including the Society of Motion Picture and Television Engineers (SMPTE; www.smpte.org/standards) 2022-1 Forward Error Correction (FEC) standard, retransmission approaches based on RTP/RTCP [33] and Reliable UDP (R-UDP) [31], and video player loss-concealment algorithms in conjunction with set-top box buffering. R-UDP supports retransmission-based packet-loss recovery. In addition to protecting against video impairments due to last-mile (loop) transmission problems in the access segment, a combination of these methods can recover from a network failure (e.g., fiber link or router line card) of 50 ms or less. Repairing network failures usually takes far more than 50 ms (potentially several hours), but when combined with link-based FRR, this restoration methodology could meet the stringent requirements needed for video against single-link failures.

Figure 2.29 (borrowed from [9]) illustrates how we might implement link-based FRR in an IPTV backbone by depicting a network segment with four node pairs that have defined virtual links (or pseudowires). This method is the pseudowire, next-hop FRR approach described in Section 2.5.2.2. For example, node pair E-C has a lower-layer link (such as SONET OC-n or Gigabit Ethernet) in each direction and a pseudowire in each direction (a total of four unidirectional logical links) used for FRR restoration. The medium dashed line shows the FRR backup path for the pseudowire E→C. Note that links such as E-A are for restoration and, hence, have no pseudowires defined. Pseudowire E→C routes over a primary path that consists of the single lower-layer link E→C (see the solid line in Fig. 2.29).
If a failure occurs to a lower-layer link in the primary path such as C-E, then the router at node E attempts to switch to the backup path using FRR. The path from the root to node A will switch to the backup path at node E (E-A-B-C). Once it reaches node C, it will continue on its previous (primary) path to node A (C-B-F-A). The entire path from E to A during the failure is shown by the outside dotted line. Although the path retraces itself between the routers B and C, the multicast traffic does not overlap because of the links’ unidirectionality. Also, although the IGP view of the topology realizes that the lower-layer links between E and C have gone “down,” because the pseudowire from E→C is still “up” and has the least weight, the shortest path tree remains unchanged. Consequently, the multicast tree remains unchanged. The IGP is unaware of the actual routing over the backup path.

Fig. 2.29 Fast Reroute in IPTV backbone (left: IGP view of the multicast tree and the backup path for pseudowire E→C; right: path of the flow from the Root to node A during the failure marked “X”). Legend: Layer 1 link (high weight); Layer 1 link (high weight, used for restoration only); pseudowire (low weight, sits on top of Layer 1 solid black link)

Note that these backup paths are precomputed, by analyzing all possible link failures in a comprehensive manner, a priori. If we route the pseudowire FRR backup path on a lower-layer path that is diverse from its primary path, FRR operates rapidly (say, around 50 ms), and we set the hold-down timers appropriately, IGP will not detect the effect of any single fiber or DWDM layer link failure. Therefore, the multicast tree will remain unaffected, reducing the outage time of any single-link failure from tens of seconds to approximately 50 ms. This order of restoration time is needed to achieve the stringent IPTV network availability objectives.
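The claim that the IGP's shortest path tree is unchanged as long as the low-weight pseudowire stays up can be checked with a small Python sketch. The four-node topology (S, X, Y, Z), the weights 1 and 100, and the frr flag are hypothetical and are not the topology of Fig. 2.29. The model assumes that with FRR a pseudowire remains up when its underlying Layer 1 link fails (because a diverse backup path exists), and that without FRR it goes down together with that link.

```python
import heapq

def spt_parents(links, root):
    """Shortest path tree over directed links {(u, v): weight}; returns {node: parent}."""
    adj = {}
    for (u, v), w in links.items():
        adj.setdefault(u, []).append((v, w))
    dist, parent, heap = {root: 0}, {}, [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return parent

def igp_links(layer1, pseudowires, failed_layer1=frozenset(), frr=True):
    """IGP view: surviving Layer 1 links plus pseudowires that are still up."""
    links = {(u, v): w for (u, v), w in layer1.items()
             if frozenset((u, v)) not in failed_layer1}
    for (u, v), w in pseudowires.items():
        if frr or frozenset((u, v)) not in failed_layer1:
            links[(u, v)] = min(w, links.get((u, v), float("inf")))
    return links

# Hypothetical ring S-X-Y-Z-S; Layer 1 links are bidirectional with high weight.
pairs = [("S", "X"), ("X", "Y"), ("Y", "Z"), ("S", "Z")]
layer1 = {}
for u, v in pairs:
    layer1[(u, v)] = layer1[(v, u)] = 100
# Low-weight pseudowires along the tree; S-Z is used for restoration only.
pseudowires = {("S", "X"): 1, ("X", "Y"): 1, ("Y", "Z"): 1}
cut = frozenset({frozenset(("X", "Y"))})

before = spt_parents(igp_links(layer1, pseudowires), "S")
with_frr = spt_parents(igp_links(layer1, pseudowires, cut, frr=True), "S")
without_frr = spt_parents(igp_links(layer1, pseudowires, cut, frr=False), "S")
```

In this toy case the tree with FRR is identical to the nonfailure tree, while without FRR nodes Y and Z change parents, which would force the multicast tree to be rebuilt after IGP reconvergence.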
2.6.3 Avoiding Congestion from Traffic Overlap

A drawback of restoration using next-hop FRR is that since it reroutes traffic on a link-by-link basis, it can suffer traffic overlap during link failures, thus requiring more link capacity to meet the target availability. Links are deployed bidirectionally, and traffic overlap means that the packets of the same multicast flows travel over the same link (in the same direction) two or more times. If we avoid overlap, we can run the links at higher utilization and thus design more cost-effective networks. This requires that the multicast tree and backup paths be constructed so that traffic does not overlap.

To illustrate traffic overlap, Fig. 2.30a shows a simple network topology with node S as the source and nodes d1 to d8 as the destinations. Here, each router is connected by a pair of directed links (in opposite directions). The two links of the pair are assigned the same IGP weight and the multicast trees are derived from these weights. Figure 2.30a illustrates two sets of link weights. Figure 2.30b shows the multicast tree derived from the first set of weights. In this case, there exists a single-link failure that causes traffic overlap. For example, the dotted line shows the backup route for link d1–d4. If link d1–d4 fails, then the rerouted traffic will overlap with other traffic on links S–d2 and d2–d6, thereby resulting in congestion on those links. Client routers downstream of d2 and d6 will see impairments as a result of this congestion. It is desirable to avoid this congestion wherever possible by constructing a multicast tree such that the backup path for any single-link failure does not overlap with any downstream link on the multicast tree. This is achieved by choosing OSPF link weights suitably. The tree derived from the second set of weights is shown in Fig. 2.30c.
In this case, the backup paths do not cause traffic overlap in response to any single-link failure. The multicast tree link is now from d6 to d2. The backup path for link d1–d4 is the same as in Fig. 2.30b. Observe that traffic on this backup path does not travel in the same direction as any link of the multicast tree. An algorithm to define FRR backup paths and IGP weights so that the multicast tree does not overlap from any single failure can be found in [10].

[Fig. 2.30 Example of traffic overlap from single-link failure. Panels: (a) Topology, with each link labeled with its two IGP weights; (b) Multicast tree with 1st weights; (c) Multicast tree with 2nd weights.]

2.6.4 Combating Multiple Concurrent Failures

The algorithm and protocol in [10] help in avoiding traffic overlap of the multicast tree during single-link failures. However, multiple link failures can still cause overlap. An example is shown in Fig. 2.31. Assume that links d1–d4 and d3–d8 are both down. If the backup path for edge d1–d4 is d1-S-d2-d6-d5-d4 (as shown in Fig. 2.30b and in Fig. 2.31) and the backup path for edge d3–d8 is d3-S-d2-d6-d7-d8, traffic will overlap on edges S–d2 and d2–d6. There would be significant traffic loss due to congestion if the links of the network are sized to handle only a single stream of multicast traffic. This situation arises essentially because MPLS FRR operates at Layer 2 and therefore the IGP is unaware of the FRR backup paths. Furthermore, the FRR backup paths are precalculated, and there is no real-time (dynamic) accommodation for different combinations of multiple-link failures. In reality, multiple (double and even triple) failures can happen. When they occur, they can have a large impact on the performance of the network.

[Fig. 2.31 Example of traffic overlap from multiple link failures. Nodes S and d1 to d8, with failed links d1–d4 and d3–d8 marked.]
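The double-failure case can be checked mechanically by counting how many rerouted streams cross each directed link. The sketch below is illustrative only: the helper names `directed_links` and `link_load` are hypothetical, and it models just the two backup paths named in the text, not the multicast tree itself (whose full edge set cannot be recovered from the extracted figure).

```python
from collections import Counter

def directed_links(path):
    """Turn a node sequence into its list of directed links."""
    return list(zip(path, path[1:]))

def link_load(paths):
    """Count how many of the given paths traverse each directed link."""
    load = Counter()
    for path in paths:
        load.update(directed_links(path))
    return load

# Backup paths quoted in the text for the two failed links.
backup_d1_d4 = ["d1", "S", "d2", "d6", "d5", "d4"]
backup_d3_d8 = ["d3", "S", "d2", "d6", "d7", "d8"]

load = link_load([backup_d1_d4, backup_d3_d8])
overlapped = sorted(link for link, count in load.items() if count > 1)
```

Here `overlapped` contains the directed links S→d2 and d2→d6, matching the edges the text identifies. Because links are counted per direction, a path that uses the same fiber in the opposite direction does not register as overlap, which is the property Sect. 2.6.3 relies on.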
Yuksel et al. [36] describe an approach that builds on the FRR mechanism but limits its use to a short period. When a single link fails and a pseudowire’s primary path fails, the traffic is rapidly switched over to the backup path as described above. However, soon afterwards, the router sets the virtual link weight to a high value and thus triggers the IGP reconvergence process – this is colloquially called “costing out” the link. Once IGP routing converges, a new PIM tree is rebuilt automatically. This avoids long periods where routing occurs over the FRR backup paths, which are unknown to the IGP. It ensures rapid restoration from single-link failures while allowing the multicast tree to adapt dynamically to any additional failures that might occur during a link outage. Only during the short, transient period between the start of FRR and the completion of IGP reconvergence could another failure expose the network to overlapping paths on the same link. The potential downside of this approach is that it incurs two more network reconvergence processes: one right after FRR has occurred and another when the failure is repaired. If it is not carefully executed, this alternative approach can cause many new video interruptions due to small “hits” after single failures. Yuksel et al. [36] propose a careful multicast recovery methodology to accomplish this approach while avoiding such drawbacks. A key component of the method is the make-before-break change of the multicast tree – that is, the requirement to hitlessly switch traffic from the old multicast tree to the new multicast tree. When the failure is repaired, the method normalizes the multicast tree to its original shortest path tree, again in a hitless manner.
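The make-before-break idea can be sketched as a tiny state machine for one router and one (S,G) group. This is a minimal sketch under stated assumptions, not the method of [36]: the class `MulticastBranch`, its method names, and the message log are hypothetical, and it assumes only the rule described in this section, namely that the prune toward the old parent is held back until data arrives from the new parent.

```python
class MulticastBranch:
    """Upstream state of one router for one (S, G) group (illustrative only)."""

    def __init__(self, parent):
        self.active_parent = parent   # upstream neighbor currently feeding us
        self.pending_parent = None    # new upstream neighbor being joined
        self.sent = []                # log of (message, neighbor) pairs "sent"

    def switch_parent(self, new_parent):
        """Begin moving to `new_parent`: send the join, but do not prune yet."""
        self.pending_parent = new_parent
        self.sent.append(("JOIN", new_parent))

    def on_data(self, from_neighbor):
        """Data from the pending parent confirms the join; only then prune."""
        if self.pending_parent is not None and from_neighbor == self.pending_parent:
            self.sent.append(("PRUNE", self.active_parent))
            self.active_parent = self.pending_parent
            self.pending_parent = None

# Hypothetical walk-through: the router moves from parent B to parent C.
branch = MulticastBranch("B")
branch.switch_parent("C")
branch.on_data("B")                      # old branch still delivers; nothing is pruned
sent_before_confirmation = list(branch.sent)
branch.on_data("C")                      # first packet from C triggers the prune of B
```

Between the join and the first packet from the new parent, the router keeps receiving on the old branch, so no packets are lost during the switchover in this simplified model.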
The key modification to the multicast tree-building process (pruning and joining nodes) is that the prune message to remove the branch to the previous parent is not sent until the router receives PIM-SSM data packets from its new parent for the corresponding (S,G) group. Another motivation for this modification is that current PIM-SSM multicast does not have an explicit acknowledgement to a join request. It is only through the receipt of a data packet on that interface that the node knows that the join request was successfully received and processed at the upstream node. The soft-state approach of IP Multicast (refreshing the state by periodically sending join requests) is also used to ensure consistency. This principle guides the tree reconfiguration process at a node in reaction to a failure. In this way, routers do not lose data packets during the switchover period. Of course, this primarily works in the PIM-SSM case, where there is a single source.

As we can observe from the description above, building an IPTV backbone with high network availability draws on most of the protocols, multilayer failure models, and restoration machinery we have described in the previous sections of the chapter. In particular, given the underlying probabilities of network failures plus these complex failure and restoration mechanisms, such an approach must include a network design methodology to evaluate and estimate the theoretical network availability of the IPTV backbone. If such a methodology were not used, a carrier would run the risk of having its video customers dissatisfied with their video service because of inadequate network availability.

2.7 Summary

This chapter presents an overview of the layered network design that is typical in a large ISP backbone. We emphasized three aspects that influence the design of an IP backbone.
The first aspect is that the IP network design is strongly influenced by its relationship with the underlying network layers (such as DWDM and TDM layers) and the network segments (core, metro, and access). ISP networks use a hierarchy of specialized routers, generally called access and backbone routers. At the edge of the network, the location of access routers, and the types of interfaces that they need to support are strongly influenced by the way the customers connect to the backbone through the metro network. In the core of a large carrier network, backbone routers are interconnected using DWDM transmission technology. As IP traffic is the dominant source of demand for the DWDM layer, the backbone demands drive requirements for the DWDM layer. The need for multiple DWDM links has driven the evolution of aggregate links in the core. The second aspect is that ISP networks have evolved from traditional IP forwarding to support MPLS. The separation of routing and forwarding and the ability to support a routing hierarchy allow ISPs to support new functionality including Layer 2 and Layer 3 VPNs and flexible traffic engineering that could not be as easily supported in a traditional IP network. Finally, this chapter provided an overview of the issues that affect IP network reliability, including the impact of network disruptions at multiple network layers and, conversely, how different network layers respond to disruptions through network restoration. We described how failures and maintenance events originate at various network layers and how they impact the IP backbone. We presented an overview of the performance of OSPF failure recovery to motivate the need for MPLS Fast Reroute. We summarized the interplay between network restoration and the network design process. To tie these concepts together, we presented a “case study” of an IPTV backbone. 
An IPTV network can be thought of as an IP layer with a requirement for very high performance, essentially high network availability and low packet loss. This requires the interlacing of multiple protocols, such as R-UDP, MPLS Fast Reroute, IP Multicast, and Forward Error Correction. We described how lower-layer failures (including multiple failures) affect the IP layer and how these IP layer routing and control protocols respond. Understanding the performance of network restoration protocols and the overall availability of the given network design requires careful modeling of the types and likelihood of network failures, as well as the behavior of the restoration protocols.

This chapter endeavored to lay a good foundation for reading the remaining chapters of this book. We conclude by alerting the reader to an important observation about IP network design. Telecommunications and its technologies undergo constant change. Therefore, this chapter describes a point in time. The contents of this chapter are different from what they would have been 5 years ago. There will be further changes over the next 5 years and, consequently, the chapter written 5 years from now may look quite different.

References

1. AT&T (2003). Managed Internet Service Access Redundancy Options, from http://www.pnetcom.com/AB-0027.pdf. Accessed 15 April 2009.
2. Awduche, D., Berger, L., Gan, D., Li, T., Srinivasan, V., & Swallow, G. (2001). RSVP-TE: Extensions to RSVP for LSP Tunnels. IETF RFC 3209, Dec. http://tools.ietf.org/html/rfc3209. Accessed 29 January 2010.
3. Braden, R., Zhang, L., Berson, S., Herzog, S., & Jamin, S. (1997). Resource ReSerVation Protocol (RSVP) – Version 1 Functional Specification. IETF RFC 2205, Sept. http://tools.ietf.org/html/rfc2205. Accessed 29 January 2010.
4. Chiu, A., Choudhury, G., Doverspike, R., & Li, G. (2007). Restoration design in IP over reconfigurable all-optical networks. NPC 2007, Dalian, P.R. China, September 2007.
5. Choudhury, G. (Ed.) (2005). Prioritized Treatment of Specific OSPF Version 2 Packets and Congestion Avoidance. IETF RFC 4222, Oct.
6. Ciena Core Director. http://www.ciena.com/products/products coredirector product overview.htm. Accessed 13 April 2009.
7. Cisco (1999). Tag Switching in Internetworking Technology Handbook, Chapter 23. http://www.cisco.com/en/US/docs/internetworking/technology/handbook/Tag-Switching.pdf. Accessed 26 December 2009.
8. Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms, second edition (pp. 595–601). Cambridge: MIT Press, New York: McGraw-Hill. ISBN 0-262-03293-7. Section 24.3: Dijkstra’s algorithm.
9. Doverspike, R., Li, G., Oikonomou, K. N., Ramakrishnan, K. K., Sinha, R. K., Wang, D., et al. (2009). Designing a reliable IPTV network. IEEE Internet Computing Magazine, May/June, pp. 15–22.
10. Doverspike, R., Li, G., Oikonomou, K., Ramakrishnan, K. K., & Wang, D. (2007). IP backbone design for multimedia distribution: architecture and performance. INFOCOM-2007, Anchorage, Alaska, April 2007.
11. Doverspike, R., & Magill, P. (2008). Commercial optical networks, overlay networks and services. In I. Kaminow, T. Li, & A. Willner (Eds.), Chapter 13 in Optical fiber telecommunications VB. San Diego, CA: Academic.
12. Feuer, M., Kilper, D., & Woodward, S. (2008). ROADMs and their system applications. In I. Kaminow, T. Li, & A. Willner (Eds.), Chapter 8 in Optical fiber telecommunications VB. San Diego, CA: Academic.
13. Goyal, M., Ramakrishnan, K. K., & Feng, W. (2003). “Achieving Faster Failure Detection in OSPF Networks,” IEEE International Conference on Communications (ICC 2003), Alaska, May 2003.
14. IEEE 802.1Q-2005 (2005). Virtual Bridged Local Area Networks. ISBN 0-7381-3662-X.
15. IEEE: 802.1Qay – Provider Backbone Bridge Traffic Engineering. http://www.ieee802.org/1/pages/802.1ay.html. Accessed 7 October 2008.
16. IETF PWE3: Pseudo Wire Emulation Edge to Edge (PWE3) Working Group. http://www.ietf.org/html.charters/pwe3-charter.html. Accessed 7 November 2008.
17. IETF RFC 4090 (2005). Fast Reroute Extensions to RSVP-TE for LSP Tunnels. http://www.ietf.org/rfc/rfc4090.txt. May 2005. Accessed 7 November 2008.
18. ITU-T G.709, “Interfaces for the Optical Transport Network,” March 2003.
19. ITU-T G.7713.2. Distributed Call and Connection Management: Signalling mechanism using GMPLS RSVP-TE.
20. Kalmanek, C. (2002). A Retrospective View of ATM. ACM Sigcomm CCR, Vol. 32, Issue 5, Nov, ISSN: 0146-4833.
21. Katz, D., Kompella, K., & Yeung, D. (2003). IETF RFC 3630: Traffic Engineering (TE) Extensions to OSPF Version 2. http://tools.ietf.org/html/rfc3630. Accessed 4 May 2009.
22. Klincewicz, J. G. (2005). Issues in link topology design for IP networks. SPIE Conference on performance, quality of service and control of next-generation communication networks III, SPIE Vol. 6011, Boston, MA.
23. Klincewicz, J. G. (2006). Why is IP network design so difficult? Eighth INFORMS telecommunications conference, Dallas, TX, March 30–April 1, 2006.
24. Kompella, K., & Rekhter, Y. (2007). IETF RFC 4761: Virtual private LAN service (VPLS) using BGP for auto-discovery and signaling. http://tools.ietf.org/html/rfc4761. Accessed 26 December 2009.
25. Lasserre, M., & Kompella, V. (2007). IETF RFC 4762: Virtual private LAN service (VPLS) using label distribution protocol (LDP) signaling. http://tools.ietf.org/html/rfc4762. Accessed 26 December 2009.
26. Moy, J. (1998). IETF RFC 2328: OSPF Version 2. http://tools.ietf.org/html/rfc2328. Accessed 26 December 2009.
27. Nortel (2007). Adding scale, QoS and operational simplicity to Ethernet. http://www.nortel.com/solutions/collateral/nn115500.pdf. Accessed 26 December 2009.
28. Oikonomou, K., Sinha, R., & Doverspike, R. (2009). Multi-Layer Network Performance and Reliability Analysis. The International Journal of Interdisciplinary Telecommunications and Networking (IJITN), Vol. 1 (3), pp. 1–29, Sept.
29. Optical Internetworking Forum (OIF) (2008). OIF-UNI-02.0-Common – User Network Interface (UNI) 2.0 Signaling Specification: Common Part. http://www.oiforum.com/public/documents/OIF-UNI-02.0-Common.pdf.
30. Oran, D. (1990). IETF RFC 1142: OSI IS-IS intra-domain routing protocol. http://tools.ietf.org/html/rfc1142.
31. Partridge, C., & Hinden, R. (1990). Version 2 of the Reliable Data Protocol (RDP), IETF RFC 1151. April.
32. Perlman, R. (1999). Interconnections: Bridges, Routers, Switches, and Internetworking Protocols, 2e. Addison-Wesley Professional Computing Series.
33. Schulzrinne, H., Casner, S., Frederick, R., & Jacobson, V. (2003). RTP: A Transport Protocol for Real-Time Application, IETF RFC 3550. http://www.ietf.org/rfc/rfc3550.txt. Accessed 26 December 2009.
34. Sycamore Intelligent Optical Switch (2009). http://www.sycamorenet.com/products/sn16000.asp. Accessed 13 April 2009.
35. Telcordia GR-253-CORE (2000). Synchronous Optical Network (SONET) Transport Systems: Common Generic Criteria.
36. Yuksel, M., Ramakrishnan, K. K., & Doverspike, R. (2008). Cross-layer failure restoration for a robust IPTV service. LANMAN-2008, Cluj-Napoca, Romania, September.
37. Zimmermann, H. (1980). OSI reference model – the ISO model of architecture for open systems interconnection. IEEE Transactions on Communications, 28(Suppl. 4), 425–432.
Glossary of Acronyms and Key Terms

1:1: One-by-one (signal switched to restoration path on detection of failure)
1+1: One-plus-one (signal duplicated across both service path and restoration path; receiver chooses surviving signal upon detection of failure)
Access Network Segment: The feeder network and loop segments associated with a given metro segment
ADM: Add/Drop Multiplexer
Administrative Domain: Routing area in IGP
Aggregate Link: Bundles multiple physical links between a pair of routers into a single virtual link from the point of view of the routers. Also called bundled or composite link
AR: Access Router
AS: Autonomous System
ASBR: Autonomous System Border Router
ATM: Asynchronous Transfer Mode
AWG: Arrayed Waveguide Grating
B-DCS: Broadband Digital Cross-connect System (cross-connects at DS-3 or higher rate)
Backhaul: Using TDM connections that encapsulate packets to connect customers to packet networks
BER: Bit Error Rate
BGP: Border Gateway Protocol
BLSR: Bidirectional Line-Switched Ring
BR: Backbone Router
Bundled Link: See Aggregate Link
CE switch: Customer-Edge switch
Channelized: A TDM link/connection that multiplexes lower-rate signals into its time slots
CHOC Card: CHannelized OC-n card
CIR: Committed Information Rate
CO: Central Office
Composite Link: See Aggregate Link
Core Network Segment: Equipment in the POPs and network structures that connect them for intermetro transport and switching
CoS: Class of Service
CPE: Customer Premises Equipment
CSPF: Constraint-based Shortest Path First
DCS: Digital Cross-connect System
DDoS: Distributed Denial of Service (security attack on router)
DoS: Denial of Service (security attack on router)
DS-0: Digital Signal – level 0 (a pre-SONET signal carrying one voice-frequency channel at 64 kb/s)
DS-1: Digital Signal – level 1 (a 1.544 Mb/s signal). A channelized DS-1 carries 24 DS0s
DS-3: Digital Signal – level 3 (a 44.736 Mb/s signal). A channelized DS-3 carries 28 DS1s
DWDM: Dense Wavelength-Division Multiplexing
E-1: European plesiochronous (pre-SDH) rate of 2.0 Mb/s
eBGP: External Border Gateway Protocol
EGP: Exterior Gateway Protocol
EIGRP: Enhanced Interior Gateway Routing Protocol
EIR: Excess Information Rate
EPL: Ethernet Private Line
FCC: Federal Communications Commission
FE: Fast Ethernet (100 Mb/s)
FEC: Forward Error Correction – bit-error recovery technique in TDM transmission and some IPs
FEC: Forwarding Equivalence Class – classification of flows defined in MPLS
Feeder Network: The portion of the access network between the loop and first metro central office
FRR: Fast Re-Route
FXC: Fiber Cross-Connect
Gb/s: Gigabits per second (1 billion bits per second)
GigE: Gigabit Ethernet (nominally 1 Gb/s)
GMPLS: Generalized MPLS
HD: High definition (short for HDTV)
HDTV: High-definition TV (television with resolution exceeding 720 × 1280)
Hitless: Method of changing network connections or routes that incurs negligible loss
iBGP: Interior Border Gateway Protocol
IETF: Internet Engineering Task Force
IGP: Interior Gateway Protocol
Internet Route Free Core: Where MPLS removes external BGP information plus Layer 3 address lookup from the interior of the IP backbone
IGMP: Internet Group Management Protocol
Inter-office Links: Links whose endpoints are contained in different central offices
Intra-office Links: Links that are totally contained within the same central office
IOS: Intelligent Optical Switch
IP: Internet Protocol
IPTV: Internet Protocol television (i.e., entertainment-quality video delivered over IP)
IROU: Indefeasible Right of Use
IS-IS: Intermediate-System-to-Intermediate-System (IP routing and control plane protocol)
ISO: International Organization for Standardization (not an acronym)
ISP: Internet Service Provider
ITU: International Telecommunication Union
Kb/s: Kilobits per second (1,000 bits per second)
LAN: Local Area Network
LATA: Local Access and Transport Area
Layer n: A colloquial packet protocol layering model, with origins in the OSI reference model. Today, roughly Layer 3 corresponds to IP packets, Layer 2 to MPLS LSPs, pseudowires, or Ethernet-based VLANs, and Layer 1 to all lower-layer transport protocols
LDP: Label Distribution Protocol
LMP: Link Management Protocol
Local Loop: The portion of the access segment between the customer and feeder network. Also called “last mile”
LSA: Link-State Advertisement
LSDB: Link-State Database
LSP: Label Switched Path
LSR: Label Switch Router
MAC: Media Access Control
MAN: Metropolitan Area Network
Mb/s: Megabits per second (1 million bits per second)
MEMS: Micro-Electro-Mechanical Systems
Metro Network Segment: The network layers of the equipment located in the central offices of a given metropolitan area
MPEG: Moving Picture Experts Group
MPLS: Multiprotocol Label Switching
MSO: Multiple System Operator (typically coaxial cable companies)
MSP: Multi-Service Platform – a type of ADM enhanced with many forms of interfaces
MTBF: Mean Time Between Failure
MTSO: Mobile Telephone Switching Office
MTTR: Mean Time to Repair
Multicast: Point-to-multipoint flows in packet networks
N-DCS: Narrowband Digital Cross-connect System (cross-connects at DS0 rate)
n-degree ROADM: A ROADM that can fiber to more than three different ROADMs (also called multidegree ROADM)
Next-hop: Method in MPLS FRR that routes around a down link
Next-next-hop: Method in MPLS FRR that routes around a down node
Normalization: Step in network restoration after all failures are repaired to bring the network back to its normal state
NTE: Network Terminating Equipment
OC-n: Optical Carrier – level n (designation of optical transport of a SONET STS-n)
ODU: Optical channel Data Unit – protocol data unit in ITU OTN
O-E-O: Optical-to-Electrical-to-Optical
OIF: Optical Internetworking Forum
OL: Optical Layer
OSPF: Open Shortest Path First
OSPF-TE: Open Shortest Path First – Traffic Engineering
OSS: Operations Support System
OT: Optical Transponder
OTN: Optical Transport Network – ITU optical protocol
P Router: Provider Router
PBB-TE: Provider Backbone Bridge – Traffic Engineering
PBT: Provider Backbone Transport
PE Router: Provider-Edge Router
PIM: Protocol-Independent Multicast
PL: Private Line
P-NNI: Private Network-to-Network Interface (ATM routing protocol)
POP: Point Of Presence
PPP: Point-to-Point Protocol
PPPoE: Point-to-Point Protocol over Ethernet
Pseudowire: A virtual connection defined in the IETF PWE3 that encapsulates higher-layer protocols
PVC: Permanent Virtual Circuit
PWE3: Pseudo-Wire Emulation Edge-to-Edge
QoS: Quality of Service
RAR: Remote Access Router
RD: Route Distinguisher
Reconvergence: IGP process to update network topology and adjust routing tables
RIB: Routing Information Base
ROADM: Reconfigurable Optical Add/Drop Multiplexer
RR: Route Reflector
RSTP: Rapid Spanning Tree Protocol
RSVP: Resource Reservation Protocol
RT: Route Target (also Remote Terminal in metro TDM networks)
RTP: Real-time Transport Protocol
SD: Standard Definition (television with resolution of about 640 × 480)
SDH: Synchronous Digital Hierarchy (a synchronous optical networking standard used outside North America, documented by the ITU in G.707 and G.708)
Serving CO: The first metro central office to which a given customer homes
SHO: Super Hub Office
SLA: Service Level Agreement
SRLG: Shared Risk Link Group
SONET: Synchronous Optical Network (a synchronous optical networking standard used in North America, documented in GR-253-CORE from Telcordia)
SONET/SDH self-healing rings: Typically UPSR or BLSR rings
SPF: Shortest Path First
STS-n: Synchronous Transport Signal – level n (a signal level of the SONET hierarchy with a data rate of n × 51.84 Mb/s)
SVC: Switched Virtual Circuit
TCP: Transmission Control Protocol
TDM: Time Division Multiplexing
UDP: User Datagram Protocol
UNI: User-Network Interface
Unicast: Point-to-point flows in packet networks
UPSR: Unidirectional Path-Switched Ring
VHO: Video Hub Office
VLAN: Virtual Local Area Network
VoD: Video on Demand
VoIP: Voice-over-Internet Protocol
VPLS: Virtual Private LAN Service (i.e., Transparent LAN Service)
VPN: Virtual Private Network
WAN: Wide Area Network
Wavelength continuity: A restriction in DWDM equipment that a through connection must be optically cross-connected to the same wavelength on both fibers
W-DCS: Wideband Digital Cross-connect System (cross-connects at DS-1, SONET VT-n or higher rate)
WDM: Wavelength-Division Multiplexing

Part II Reliability Modeling and Network Planning

Chapter 3 Reliability Metrics for Routers in IP Networks

Yaakov Kogan

3.1 Introduction

As the Internet has become an increasingly critical communication infrastructure for business, education, and society in general, the need to understand and systematically analyze its reliability has become more important.
Internet Service Providers (ISPs) face the challenge of continuously upgrading the network and growing network capacity while providing a service that meets stringent customer-reliability expectations. While telecommunication companies have long experience providing reliable telephone service, the challenge for an ISP is more difficult because changes in Internet technology, particularly router software, are significantly more frequent and less rigorously tested than was the case in circuit-switched telephone networks. ISPs cannot wait until router technology matures – a large ISP has to meet high reliability requirements for critical applications like financial transactions, Voice over IP, and IPTV using commercially available technology. The need to use less mature technology has resulted in a variety of redundancy solutions at the edge of the network, and in well-thought-out designs for a resilient core network that is shared by traffic from all applications.

The reliability objective for circuit-switched telephone service of “no more than 2 hours downtime in 40 years” has been applied to voice communication since 1964 [1]. It has been achieved using expensive redundancy solutions for both switches and transmission facilities. Though routers are less reliable than circuit switches, commercial IP networks have three main advantages over legacy telephone networks when designing for reliability. First, packet switching is a far more economically efficient mechanism for multiplexing network resources than circuit switching, given the bursty nature of data traffic. Second, protocols like Multi-Protocol Label Switching (MPLS) support a range of network restoration options that are more economically efficient in restoring from failures of transmission facilities than traditional 1:1 redundancy. Third, commercial IP networks can provide different levels of redundancy to different commercial customers, for example, by offering access diversity or multihoming options, pricing the service depending on its reliability. This allows Internet service providers to satisfy customers who are price-sensitive [2] while recovering the high cost of redundancy from customers who require increased reliability to support mission-critical applications.

The reliability of modern provider edge routers, which have a large variety of interface cards, cannot be accurately characterized by a single downtime or reliability metric, because such a metric requires averaging the contributions of the various line cards, which may hide the poor reliability of some components. We address this challenge by introducing granular metrics for quantifying the reliability of IP routers. Section 3.2 provides an overview of the main router elements and redundancy mechanisms. In Section 3.3, we use a simplified router reliability model to demonstrate the application of different reliability metrics. In Section 3.4, we define metrics for measuring the reliability of IP routers in production networks. Section 3.5 provides an overview of challenges with measuring end-to-end availability.

Y. Kogan, AT&T Labs, 200 S. Laurel Ave, Middletown, NJ 07748, USA; e-mail: yaakovkogan@att.com
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_3, © Springer-Verlag London Limited 2010

3.2 Redundancy Solutions in IP Routers

This section provides an overview of the primary elements of a modern router and associated redundancy mechanisms, which are important for availability modeling of services in IP networks. A high-speed IP router is a special multiprocessor system with two types of processors, each with its own memory and CPU: Route Processors (RPs) and Line Cards (LCs).
Each line card receives packets from other routers via one or more logical interfaces and forwards them to outbound logical interfaces using information in its local Forwarding Information Base (FIB). The route processor controls the operation of the entire router, runs the routing protocols, maintains the necessary databases for route processing, and updates the FIB on each line card. This separation implies that each LC can continue forwarding packets based on its copy of the FIB when the RP fails. Figure 3.1 provides a simplified illustration of router hardware architecture, where two route processors (active and backup) and multiple line cards are interconnected through a switch fabric. The Monitor bus is used exclusively for transmission of error and management messages that help to isolate the fault when a component is faulty and to restore the normal operation of the router if the failed component is backed up by a redundant unit. Data traffic never crosses the Monitor bus; it travels across the switch fabric. These hardware (HW) components operate under the control of an Operating System (OS). Additional details for Cisco and Juniper routers can be found in [3, 4] and [5], respectively.

[Fig. 3.1 Generic router hardware architecture. Components shown: Line Card 1 through Line Card n, switch fabric, active RP, backup RP, Monitor bus, cooling system, power supplies.]

A typical Mean Time Between Failures (MTBF) for both RPs and LCs is about 100,000 h (see, e.g., Table 9.3 in [6]). This MTBF accounts only for hard failures requiring replacement of the failed component, in contrast with soft failures, from which the router can recover, for example, by card reset. A typical example of a soft hardware failure is a parity error. Router vendors do not usually provide an MTBF for the OS, as it varies over a wide range.
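To put these figures in perspective, the following sketch converts an MTBF into a steady-state availability and an expected downtime per year, and compares the result with the telephone objective quoted in the Introduction. It is a rough illustration under stated assumptions: the standard formula availability = MTBF / (MTBF + MTTR) is used, the 4 h repair time is a hypothetical value chosen only for the example, and the function names are not from the chapter.

```python
HOURS_PER_YEAR = 8760  # non-leap year

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single repairable component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_minutes_per_year(avail):
    """Expected downtime per year implied by an availability value."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60

# Hard failures only: MTBF of about 100,000 h (from the text),
# with an ASSUMED 4 h to replace the failed card.
card_availability = availability(100_000, 4)
card_downtime = downtime_minutes_per_year(card_availability)

# Telephone objective from the Introduction: 2 h of downtime in 40 years.
telephone_availability = 1.0 - 2 / (40 * HOURS_PER_YEAR)
telephone_downtime = downtime_minutes_per_year(telephone_availability)
```

Under these assumptions a single non-redundant card already exceeds the telephone downtime objective several times over, before soft failures and software errors are counted, which is one way to see why redundancy mechanisms matter.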
In our experience, a new OS version may have an MTBF well below 100,000 h as a result of undetected software errors that are first encountered after the OS is deployed to the field. The MTBF for a stable OS is typically above 100,000 h, though even with a stable OS, changes in the operating environment can trigger latent software errors. Without redundancy solutions at the edge of the network, component failures interrupt customer traffic until the failed component is recovered by reset, which may take about a minute, or until it is replaced, which can take hours. To reduce failure impacts, shared HW components whose failure would impact the entire router (e.g., RP, switch fabric, power supply, and cooling system) are typically redundant. In this case, the restoration time (assuming a successful failover to the redundant component) is determined by the failover time. For example, in Cisco 12000 series routers [3] and the Juniper T640 router [7], the switch fabric consists of five cards, four of which are active and one of which provides redundancy with a subsecond restoration time when an active card fails. Failure of one power supply or cooling element does not have any impact on service. RP redundancy is provided by a configuration with two RP cards: primary and backup. A first attempt at reducing the failover time was to run the backup RP in standby mode with partial synchronization between the active and standby RPs, which enables the standby RP to maintain all Layer 1 and Layer 2 sessions and recover the routing database from adjacent nodes when the primary RP fails. However, when a primary RP fails, BGP adjacencies with adjacent routers go down. The loss of BGP adjacency has the same effect on network routing as failure of the entire router until the standby RP comes on-line and re-establishes BGP adjacencies with its neighbors.
During this time, the routing protocols will reconverge to another route and then back again that will cause transient packet 100 Y. Kogan loss – a phenomenon known as “route flapping.” (Route flapping occurs when a router alternately advertises a network destination via one route, then another (or as unavailable, and then available again) in quick sequence [8].) To prevent the adjacent routers from declaring the failed router out of service and removing it from their routing tables and forwarding databases, vendors have developed high availability (HA) routing protocol extensions, which allow a router to restart its routing software gracefully in such a way that packet forwarding is not disrupted when the primary RP fails. If the routers adjacent to a given router support these extensions, they will continue to advertise routes from the restarting router during the grace period. Cisco’s and Juniper’s HA routing protocol extensions are known under the name of Non-Stop Forwarding (NSF) [9] and Graceful Restart (GR) [10], respectively. A detailed description of the Cisco NSF support for BGP, OSPF, IS-IS, and EIGRP routing protocols as well as for MPLS-related protocols can be found in [9]. Here, we describe the BGP protocol extension procedures that follow the implementation specification provided in the IETF proposed standard “Graceful Restart Mechanism for BGP” [11]. Let R1 be the restarting router and R2 be a peer. The goal is to restart a BGP session between R1 and peering routers without redirecting traffic around R1. 1. R1 and R2 signal each other that they understand Graceful Restart in their initial exchange of BGP OPEN messages when the initial BGP connection is established between R1 and R2. 2. An RP failover occurs, and the router R1 BGP process starts on the newly active RP. R1 does not have a routing information base and must reacquire it from its peer routers. 
R1 will continue to forward IP packets destined for (or through) peer routers (R2) using the last updated FIB.
3. When R2 detects that the TCP session with R1 is cleared, it marks the routes learned from R1 as STALE but continues to use them to forward packets. R2 also initializes a Restart-timer for R1. Router R2 will remove all STALE routes unless it receives an OPEN message from R1 within the specified Restart-time.
4. R1 establishes a new TCP session with R2 and sends an OPEN message to R2, indicating that its BGP software has restarted. When R2 receives this OPEN message, it resets its own Restart-timer and starts a Stalepath-timer.
5. Both routers have now re-established their session. R2 begins to send UPDATE messages to R1. R1 starts an Update-delay timer and waits up to 120 s to receive End-of-RIB (EOR) from all its peers.
6. When R1 receives EOR from all its peers, it begins the BGP Route Selection Process.
7. When this process is complete, R1 begins to send UPDATE messages to R2. R1 indicates completion of updates by EOR, and R2 starts its Route Selection Process.
8. While R2 waits for an EOR, it also monitors the Stalepath-timer. If the timer expires, all STALE routes are removed and the “normal” BGP process takes effect. When R2 has completed its Route Selection Process, any STALE entries are refreshed with newer information or removed from the BGP RIB and FIB. The network is now converged.
One drawback of NSF/GR is that there is a potential for transient routing loops or packet loss if a restarting router loses its forwarding state (e.g., owing to a power failure). A second drawback of NSF/GR is that it can prolong delays of network-layer re-routing in cases where the service is not restored by RP failover. In addition, to be effective in a large ISP backbone, NSF/GR extensions would need to be deployed on all of the peering routers. However, the OSPF NSF extension is Cisco proprietary.
The respective drafts were submitted to the IETF but not approved as standards. Since most large ISP networks use routers from multiple vendors, the lack of standardization and universal adoption by vendors limits the usefulness of the NSF and GR extensions. Another approach to router reliability, called Non-Stop Routing (NSR), is free from the drawbacks of graceful restart. It is a self-contained solution that does not require protocol extensions and has a faster failover time. With NSR, the standby RP runs its own version of each protocol and there is continuous synchronization between the active and standby RPs to the extent that it enables the standby RP to take over when the active RP fails without any disruption in the existing peering sessions. The first implementation of NSR was done by Avici Systems [12] in 2003 in the Terabit Switch Router (TSR) router that was used in the AT&T core network. Later, other router vendors implemented their versions of NSR (see, e.g., [13]). It is important to note that router outages can be divided into two categories: planned and unplanned outages. Much of the preceding discussion focused on RP failures or unplanned outages. Planned outages are caused by scheduled maintenance activities, which include software and hardware upgrades as well as card replacement and installation of additional line-cards. Router vendors are developing a software solution on top of NSR to support in-service software upgrade, or ISSU (see, e.g., [13–15]). The goal of ISSU is a significant reduction in downtime due to software upgrades, potentially eliminating this category of downtime if both the old and new SW versions support ISSU. We now turn our attention to line-card failures. 
Line-card failures are distinct from link failures: while link failures can often be recovered by the underlying transport technology, e.g., a SONET ring (see Chapter 2), line-card failures require traffic to be handled by a redundant line-card provisioned on the same or a different router. Line-card redundancy is particularly important for reducing the outage duration of PE (provider-edge) routers that terminate thousands of low-speed customer ports. The first candidate for redundancy is an uplink LC that is used for connection to a P (core) router. Without redundancy, any uplink LC downtime will cause PE router isolation. In addition, a redundant uplink LC allows us to connect a PE router to two P routers using physically diverse transport links. This configuration results in the near elimination of PE router downtime caused by periodic maintenance activities on P routers, under the assumption that maintenance is not performed on these two P routers simultaneously. PE router downtime is nearly eliminated in this case because the probability of PE isolation caused by the failure of the second uplink or the other P router is negligibly small if the maintenance window is short. Restoration from an uplink LC failure is provided at the IP layer with a restoration time of the order of 10 s, as described in Chapter 2.
SONET interfaces on IP routers may support the ability to automatically switch traffic from a failed line-card to a redundant line-card, using a technique called Automatic Protection Switching (APS) [16]. Implementation of APS requires installation of two identical line-cards; one card is designated as primary, the other as secondary. A port on the primary LC is configured as the working interface and the port with the same port number on the secondary LC as the protection interface. The ports form a single virtual interface.
Ports on the secondary LC cannot be configured with services; they can only be configured as protection ports for the corresponding ports on the primary LC. The protection and working interfaces are connected to a SONET ADM (Add-Drop Multiplexer), which sends the same signal payload to the working and protection interfaces. When the working interface fails, its traffic is switched to the protection interface. According to our experience, the switchover time is of the order of 1 min. Hitless switchover requires protocol synchronization between the line-cards, which was not available at the time of writing of this chapter. APS is only available in a 1:1 configuration. As a result, it is considered to be expensive.
An alternative line-card redundancy approach developed at AT&T [17] is based on a new ISP edge architecture called RouterFarm. RouterFarm utilizes 1:N redundancy, in which a single PE backup router can support multiple active routers. The RouterFarm architecture supports customer access links that connect to PE routers over a dynamically reconfigurable access network. When a PE router fails or is taken out of service for planned maintenance, control software rehomes the customer access links from the affected router to a selected backup router and copies the appropriate router configuration data to the backup router. Service is provided by the backup router once the rehoming is complete. After the primary router is repaired or required maintenance is performed, customers can be rehomed back to the primary router.
3.3 Router Reliability Modeling
As described in Section 3.2, router outages can be divided into two categories: planned and unplanned. Planned outages are caused by scheduled maintenance activities. Customers with a single connection to an ISP edge router are notified in advance about planned maintenance. Outages outside of the maintenance window are referred to as unplanned.
The common practice is to evaluate router reliability metrics for planned and unplanned outages separately. Table 3.1 provides an example¹ of a downtime calculation for software (SW) and hardware (HW) upgrades that require the entire router to be taken out of service. The downtime is calculated from the upgrade frequency per year in the second column and the mean upgrade duration in the third column. The total mean downtime per year for planned outages is 42 min.

¹ All examples are for illustrative purposes only and are not meant to model or describe any network or vendor’s product.

Table 3.1 Planned downtime for SW and HW upgrades
Activity      Freq/year   Duration (min)   Downtime (min/year)
SW upgrade    2           15               30
HW upgrade    0.2         60               12

The router downtime is close to 0 for unplanned outages if the router supports RP and LC redundancy. If LC redundancy is not supported, unplanned router downtime depends on the ratio $r_{LC}/m_{LC}$, where $r_{LC}$ and $m_{LC}$ denote the LC MTTR (Mean Time To Repair) and MTBF, respectively. Using the fact that $r_{LC} \ll m_{LC}$, one can approximate the downtime probability by $r_{LC}/m_{LC}$ and calculate the average unplanned router downtime per year as

$$ d_{LC} = (r_{LC}/m_{LC}) \times 525{,}600 \quad (\text{min/year}). $$

The factor $525{,}600 = 365 \times 24 \times 60$ is the number of minutes in a 365-day year. With stable hardware and software, $r_{LC}/m_{LC} \approx 4 \times 10^{-5}$ and the unplanned downtime $d_{LC}$ is around 21 min, which is less than the planned downtime due to upgrades by a factor of 2.
The reliability improvement due to RP and LC redundancy for unplanned outages can be evaluated using the following simplified router reliability model, described by a system consisting of two independent components representing the LC and RP. Component 1 corresponds to the LC and component 2 corresponds to the RP. Each component alternates between periods when it is up and periods when it is down. The system is working if both components are up.
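Before the model is developed further, the downtime figures quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are our own, the upgrade data are those of Table 3.1, and the ratio of LC MTTR to MTBF is the value of 4 × 10^-5 assumed in the text for stable hardware and software.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 min in a 365-day year


def planned_downtime(activities):
    """Sum (frequency per year) x (mean duration in min) over all activities."""
    return sum(freq * duration for freq, duration in activities.values())


def unplanned_downtime(mttr_over_mtbf):
    """Approximate yearly downtime in minutes, valid when MTTR << MTBF."""
    return mttr_over_mtbf * MINUTES_PER_YEAR


# Table 3.1: activity -> (frequency per year, mean duration in minutes)
upgrades = {"SW upgrade": (2, 15), "HW upgrade": (0.2, 60)}

planned = planned_downtime(upgrades)    # 2*15 + 0.2*60 min/year
unplanned = unplanned_downtime(4e-5)    # ratio assumed in the text
print(f"planned: {planned:.0f} min/year, unplanned: {unplanned:.1f} min/year")
```

The two results differ by roughly the factor of 2 noted above.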
For nonredundant component $i$, $i = 1, 2$, denote the MTBF and MTTR by $m_i$ and $r_i$, respectively. For a component consisting of primary and backup units, we assume that once a primary unit fails, the backup unit starts to function with probability $p_i$ after a random delay with mean $\delta_i \ll r_i$. With probability $1 - p_i$, the switchover to the backup unit fails, in which case the mean downtime is $r_i$. Thus, the MTTR for a redundant component is

$$ b_i = p_i \delta_i + (1 - p_i) r_i. \tag{3.1} $$

Two important particular cases correspond to $p_i = 0$ (no redundancy) and $\delta_i = 0$ (instantaneous switchover). The MTBF for a redundant component is

$$ c_i = \begin{cases} m_i & \text{if } \delta_i > 0, \\ m_i/(1 - p_i) & \text{if } \delta_i = 0. \end{cases} \tag{3.2} $$

The steady state probability that the system (component) is working is referred to as availability. The complementary probability is referred to as unavailability. Based on our assumptions, the availability of component $i$ is

$$ A_i = \frac{c_i}{c_i + b_i} \tag{3.3} $$

and the system availability is

$$ A = A_1 A_2. \tag{3.4} $$

In our case, $r_i \ll m_i$, which allows us to obtain the following simple approximation for the system unavailability:

$$ U = 1 - A_1 A_2 = 1 - (1 - U_1)(1 - U_2) \approx U_1 + U_2 \tag{3.5} $$

where $U_i = b_i/(c_i + b_i)$ is the unavailability of component $i$. Another important reliability metric is the rate $f_s$ at which the system fails. In our case (see, e.g., 7c in [18])

$$ f_s \approx 1/c_1 + 1/c_2. \tag{3.6} $$

Redundancy without instantaneous switchover decreases the mean component downtime $b_i$ and hence the component and system unavailability. However, the system failure rate does not decrease because the component uptime $c_i = m_i$ remains unchanged if $\delta_i > 0$. Instantaneous switchover decreases both the unavailability and the system failure rate. The availability of LCs and RPs with no redundancy is typically better than 0.9999 (four nines) but worse than 0.99999 (five nines). We can compute an estimate of the improvement due to redundancy using Eq. (3.1).
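The model of Eqs. (3.1)–(3.6) is small enough to evaluate directly. The Python sketch below is one possible rendering under stated assumptions: the function names and the name `delta` for the mean switchover delay are our own, and the MTBF and MTTR values (100,000 h and 4 h) are hypothetical inputs chosen only so that the nonredundant availability falls between four and five nines.

```python
def component_metrics(m, r, p=0.0, delta=0.0):
    """Return (MTBF c, MTTR b, unavailability U) of one component.

    m, r  : MTBF and MTTR of a nonredundant unit (same time unit)
    p     : probability of a successful switchover (p = 0: no redundancy)
    delta : mean switchover delay (delta = 0: instantaneous switchover)
    """
    if delta == 0 and p >= 1:
        raise ValueError("p must be < 1 when delta == 0")
    b = p * delta + (1.0 - p) * r            # Eq. (3.1)
    c = m if delta > 0 else m / (1.0 - p)    # Eq. (3.2)
    return c, b, b / (c + b)                 # U_i = b_i / (c_i + b_i)


def system_metrics(components):
    """components: (c, b, U) tuples of independent components in series."""
    availability = 1.0
    for _, _, u in components:
        availability *= 1.0 - u                              # Eq. (3.4)
    exact_u = 1.0 - availability
    approx_u = sum(u for _, _, u in components)              # Eq. (3.5)
    failure_rate = sum(1.0 / c for c, _, _ in components)    # Eq. (3.6)
    return exact_u, approx_u, failure_rate


# Hypothetical inputs, in hours: MTBF 100,000 h, MTTR 4 h
lc_plain = component_metrics(m=100_000, r=4.0)
lc_redundant = component_metrics(m=100_000, r=4.0, p=0.95, delta=0.2)
rp_redundant = component_metrics(m=100_000, r=4.0, p=0.95, delta=0.2)

improvement = lc_plain[1] / lc_redundant[1]   # ratio of mean downtimes b_i
exact_u, approx_u, rate = system_metrics([lc_redundant, rp_redundant])
print(f"downtime improvement: {improvement:.1f}x")
print(f"system unavailability: exact {exact_u:.3e}, approx {approx_u:.3e}")
print(f"system failure rate: {rate:.1e} per hour")
```

With a positive switchover delay the failure rate is unchanged by redundancy, as Eq. (3.2) implies; only the mean downtime shrinks.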
If the redundancy of component $i$ is characterized by a probability of successful switchover $p_i = 0.95$ and $\delta_i/r_i = 0.05$, then the mean component downtime $b_i$, and therefore its unavailability, would decrease by about a factor of 10, resulting in a component availability exceeding five nines. The system availability would be limited by the availability of any nonredundant component.
3.4 Reliability Metrics for Routers in Access Networks
Figure 3.2 depicts a typical Layer 3 access topology for enterprise customers. It includes $n$ provider-edge routers PE1, ..., PEn and two core or backbone routers P1 and P2, which are responsible for delivering traffic from customer edge (CE) routers at a customer location into the commercial IP network backbone.

Fig. 3.2 Access network elements (diagram labels: CE, PE1, ..., PEn, P1, P2, Backbone)

The service provided by an ISP to an enterprise customer is typically associated with a customer “access port.” An access port is a logical interface on the line-card in a PE, where the link from a customer’s CE router terminates. In general, a PE has a variety of line-cards with different port densities depending on the port speed. For example, a channelized OC-12 card provides up to 336 T1/E1 ports, while a channelized OC-48 card can provide up to either 48 T3 ports, or 16 OC3 ports, or 4 OC12 ports. In Fig. 3.2, each PE is dual-homed to two different P (core) routers using two physically diverse transport links terminating on different line-cards at the PE router. (These transport links are referred to as uplinks.) The links that connect P routers at different nodes are generally provided by an underlying transport network. Dual-homing reduces the customer impact of outages from a potentially long repair interval to short-duration packet loss caused by protocol reconvergence.
Dual-homing is used to address the following outage scenarios:
• Outage of uplink transport equipment
• Outage of an uplink line-card at PE routers
• Outage of an uplink line-card at P routers
• Outage of one P router or its associated backbone links
Customer downtime can be caused by a failure in a PE component, such as a failed interface or line-card, or by a total PE outage. Our goal in this section is to provide a practical way of applying the traditional reliability metrics like availability and MTBF to a large network of edge routers. The calculation of these metrics is straightforward in the case of $K$ identical systems $s_1, \ldots, s_K$, where each system alternates between periods when it is up and periods when it is down. Assume that $k \le K$ different systems $s_{i_1}, \ldots, s_{i_k}$ failed during a time interval of length $T$, and let $t_j$ be the total outage duration of system $j$. The unavailability $U_j$ of system $j$ can be estimated as

$$ U_j = t_j/T \quad \text{for } j = i_1, \ldots, i_k \tag{3.7} $$

and $U_j = 0$ otherwise. Then, the average unavailability is

$$ U = \frac{\sum_{j=1}^{K} U_j}{K} = \frac{\sum_{j=1}^{k} t_{i_j}}{KT} \tag{3.8} $$

and the average availability is

$$ A = 1 - U. \tag{3.9} $$

Finally, the average time between failures is estimated as $KT/L$, where $L \ge k$ is the total number of failures during the time interval $T$. There are two main difficulties with extending these estimates to routers. First, routers experience failures of a single line-card in addition to entire router failures. Second, routers may not be identical. The initial approach to overcome these difficulties was to assign to each failure a weight that represents the fraction of the access network impacted by the failure. Such an approach is adequate for access networks consisting of the same type of routers and line-cards with port speeds in a sufficiently narrow range, which was the case for early access networks with Cisco’s 7500 routers.
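For the baseline case of K identical systems, the estimates in Eqs. (3.7)–(3.9) and the average time between failures KT/L can be computed from a simple outage log. The following Python sketch is illustrative only: the record format, the function name, and the sample data (50 systems observed for 1,000 h) are assumptions made for this example.

```python
def fleet_estimates(outages, num_systems, period):
    """Estimate fleet-wide reliability metrics from an outage log.

    outages     : list of (system_id, outage_duration) records seen in `period`
    num_systems : K, the number of identical systems monitored
    period      : T, the length of the observation interval
    Returns (average unavailability, average availability, time between failures).
    """
    downtime = {}                          # t_j: total outage duration per system
    for system_id, duration in outages:
        downtime[system_id] = downtime.get(system_id, 0.0) + duration

    # Eq. (3.7) for each failed system, averaged over all K systems (Eq. 3.8)
    unavailability = sum(t / period for t in downtime.values()) / num_systems
    availability = 1.0 - unavailability    # Eq. (3.9)

    # K*T/L, where L = len(outages) is the total number of failures
    tbf = num_systems * period / len(outages) if outages else float("inf")
    return unavailability, availability, tbf


# Hypothetical log: L = 3 failures on k = 2 distinct systems, durations in hours
log = [("s3", 2.0), ("s3", 1.0), ("s17", 0.5)]
u, a, tbf = fleet_estimates(log, num_systems=50, period=1000.0)
print(f"U = {u:.1e}, A = {a:.5f}, time between failures = {tbf:.0f} h")
```

Systems that never failed contribute zero unavailability but still count in K, which is why the sum is divided by `num_systems` rather than by the number of failed systems.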
Modern access networks may consist of several router platforms, and high-speed routers may have line-cards with port speeds varying over a wide range. For these networks, averaging failures over various router platforms and line-cards with different port speeds is not sufficient. We start by presenting the existing averaging techniques and demonstrating their deficiencies, and then describe a granular approach where availability is described by a vector with components representing the availability for each type of access line-card. Two frequently used expressions for calculating the fraction of the impacted access network are based on different parameterizations of impacted access ports in service and have the following forms [19]:

$$ f = \frac{\text{Number of impacted access ports in service}}{\text{Total number of all access ports in service}} \tag{3.10} $$

and

$$ f = \frac{\text{Total bandwidth of impacted access ports in service}}{\text{Total bandwidth of all access ports in service}} \tag{3.11} $$

Having the fraction $f_i$ of access ports impacted and the failure duration $D_i$ for each failure $i$, $i = 1, \ldots, L$, during a time interval of length $T$, we can estimate the average access unavailability and availability as

$$ U_{access} = \sum_{i=1}^{L} \frac{f_i D_i}{T} \quad \text{and} \quad A_{access} = 1 - U_{access} \tag{3.12} $$

respectively. Formally, one can use Eq. (3.12) with port-weighting or bandwidth-weighting fractions $f_i$ for estimating the average unavailability (availability) of any access network with different router platforms. However, there are several problems with these averaging techniques that limit their usefulness:
• Port-weighted fraction (3.10) emphasizes line-card failures with low-speed ports, while failures of high-speed ports are heavily discounted because the port density on a line-card is inversely proportional to the port speed.
• Bandwidth-weighted fraction (3.11) assigns lower weight to failures of line-cards with low-speed ports because they do not utilize the entire bandwidth of the line-card.
• Any averaging over different router platforms, or even for one router platform with a variety of line-cards that have different quality of hardware and software, may hide defects.
These issues are illustrated by the following example. Consider an access network consisting of 100 Cisco gigabit switch routers (GSRs) and assume that each router has two access line-cards of each of the following three types:
• Channelized OC12 with up to 336 T1 ports
• Channelized OC48: one card with up to 48 T3 ports, while the other card has either up to 16 OC3 ports (50 routers) or up to 4 OC12 ports (50 routers)
• 1-port OC48.
The total number of ports in service and their respective bandwidth (BW) are shown in Table 3.2. The number of ports in the third column of Table 3.2 is obtained by multiplying the number of ports in service given in the second column of Table 3.3 by the total number of cards with the respective port speed. For T1 and OC48, the total number of cards of each type is $200 = 2 \times 100$. For T3, OC3, and OC12, the total number of cards is 100, 50, and 50, respectively. In Table 3.3, we use Eqs. (3.10) and (3.11) to calculate the port-weight and bandwidth-weight for the failure of one line-card depending on the number of ports in service given in the second column. The bandwidth of a line-card is obtained as the product of the number of ports in service, given in the second column of Table 3.3, and the respective speed given in the second column of Table 3.2. One can see that port-weighting practically disregards failures of line-cards with OC48 and OC12 ports, while the contribution of failures of line-cards with T3 and OC3 ports is discounted relative to T1 ports by a factor of 6.7 and 20, respectively. As a result, the availability of the access network is dominated by the availability of the channelized OC12 card with T1 ports. As one could expect, bandwidth-weighting is biased toward failures of line-cards with an OC48 port.
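The weights discussed in this example follow mechanically from Eqs. (3.10) and (3.11). The Python sketch below recomputes them from the port speeds, ports in service per card, and card counts given in the example and in Tables 3.2 and 3.3; the data layout and function name are our own choices for illustration.

```python
# port type -> (speed in Mbps, ports in service per card, number of cards)
CARDS = {
    "T1":   (1.5,  200, 200),
    "T3":   (45,    30, 100),
    "OC3":  (155,   10,  50),
    "OC12": (622,    3,  50),
    "OC48": (2400,   1, 200),
}

# Network-wide totals (Table 3.2): ports in service and bandwidth in Mbps
total_ports = sum(ports * cards for _, ports, cards in CARDS.values())
total_bw = sum(speed * ports * cards for speed, ports, cards in CARDS.values())


def card_weights(port_type):
    """Return (port-weight, bandwidth-weight) for the failure of one card."""
    speed, ports, _ = CARDS[port_type]
    port_weight = ports / total_ports            # Eq. (3.10)
    bw_weight = speed * ports / total_bw         # Eq. (3.11)
    return port_weight, bw_weight


print(f"total ports: {total_ports}, total BW: {total_bw / 1000:.1f} Gbps")
for port_type in CARDS:
    p_w, bw_w = card_weights(port_type)
    print(f"{port_type:5s} P-weight = {p_w:.6f}  BW-weight = {bw_w:.5f}")
```

The output matches Table 3.3 to the precision shown there: port-weighting is dominated by the T1 card, whereas bandwidth-weighting gives the largest weight to the 1-port OC48 card.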
Under bandwidth-weighting, however, failures of the other line-cards, except for the channelized OC12 card with T1 ports, become more visible than under port-weighting.

Table 3.2 Total number of ports in service and their bandwidth
Port    Speed (Mbps)   Number of ports   BW (Gbps)
T1      1.5            40,000            60.0
T3      45             3,000             135.0
OC3     155            500               77.5
OC12    622            150               93.3
OC48    2,400          200               480.0
Total                  43,850            845.8

Table 3.3 Port-weight and bandwidth-weight per line-card
Port    In service   P-weight   BW-weight
T1      200          0.00456    0.00035
T3      30           0.00068    0.00160
OC3     10           0.00023    0.00183
OC12    3            0.000068   0.00221
OC48    1            0.000023   0.00284

As a result of these problems with port- and bandwidth-weighting techniques, a more useful approach is to evaluate the average availability for each router platform and for each type of access LC separately. The increasing variety of edge routers and access line-cards justifies such an approach, since it allows the ISP to track the reliability with finer granularity. Consider a set of edge routers of the same type with $J$ types of access line-cards, which are monitored for failures during a time interval of length $T$. For each customer-impacting failure $i$, $i = 1, \ldots, L$, we record the number $n_{ij}$ of type $j$ cards affected and the respective failure duration $t_{ij}$. In the case of access line-card redundancy, only failures of the active (primary) line-card are counted, and then only if the failover to the backup line-card was not hitless. The average unavailability of a type $j$ access line-card is calculated as

$$ U_j = \frac{\sum_{i=1}^{L} n_{ij} t_{ij}}{N_j T} \tag{3.13} $$

where $N_j$ is the total number of type $j$ active cards. The average unavailability can be expressed as

$$ U_j = \frac{R_j}{M_j} \tag{3.14} $$

where

$$ R_j = \frac{\sum_{i=1}^{L} n_{ij} t_{ij}}{\sum_{i=1}^{L} n_{ij}} \tag{3.15} $$

is the average repair time for an LC of type $j$, and

$$ M_j = \frac{N_j T}{\sum_{i=1}^{L} n_{ij}} \tag{3.16} $$

can be interpreted as the average time between router failures impacting customers on access line-cards of type $j$. Metric $M_j$ can be considered as an extension of the traditional field hardware MTBF.
For the field MTBF, only individual line-card failures that require card replacement are counted in the denominator. In $M_j$, we count all failures of type $j$ cards outside the maintenance window, including those caused by reset or software bugs, and all impacted cards of type $j$ in the case of an entire router failure. This distinction is important since we want a metric that accurately captures the customer impact caused by all HW and SW failures. For example, each reset of an active (primary) line-card can cause a protocol reconvergence event resulting in short-duration packet loss. Metrics $R$, $M$, and $U$ can also be defined for the entire population of access line-cards without differentiating failures by LC type. Denote

$$ N = \sum_{j=1}^{J} N_j, \quad n = \sum_{j=1}^{J} \sum_{i=1}^{L} n_{ij}, \quad t = \sum_{j=1}^{J} \sum_{i=1}^{L} n_{ij} t_{ij}. \tag{3.17} $$

Then

$$ R = \frac{t}{n}, \quad M = \frac{NT}{n} \tag{3.18} $$

and the average unavailability is

$$ U = \frac{R}{M}. \tag{3.19} $$

The value of using $M_j$ in addition to the average unavailability is demonstrated by the following example.
Example 3.1. Consider a set of 400 routers and let $T = 1{,}000$ h. Each router has two cards of Type 1, three cards of Type 2, and five cards of Type 3. The number of failures for the entire router and for each card type, with their durations, is given in Table 3.4. In the case of single card failures, $n_{ij} = 1$ if an LC of type $j$ failed and $n_{ij} = 0$ otherwise. In the case of an entire router failure, $(n_{i1}, n_{i2}, n_{i3}) = (2, 3, 5)$. In this example, we assume a constant failure duration $t_{ij} = t_j$ for type $j$ cards and a constant duration of the entire router failure. The failure duration is measured in hours. The failure parameters in Table 3.4 are referred to as Scenario 1. We also consider a Scenario 2, in which the only difference from Scenario 1 is that the number of failures of entire routers is increased from 1 to 5. The reliability metrics for the two scenarios are given in Table 3.5.
The results in columns $R$ and $M$ for LC Type $j$, $j = 1, 2, 3$, and for All Cards are calculated using Eqs. (3.15), (3.16), and (3.18), respectively. The unavailability for LC Type $j$, $j = 1, 2, 3$, and for All Cards is calculated using Eqs. (3.14) and (3.19), respectively. The defects per million (DPM) is a commonly used metric that is obtained by multiplying the respective unavailability by 1,000,000. Note that for All Cards, the DPM is below 10 in both scenarios, implying a high availability exceeding 99.999% (five nines), while the average time between customer-impacting failures $M$ in Scenario 2 is almost half of that in Scenario 1. Therefore, DPM, in contrast with the average time between customer-impacting failures, is not sensitive to the frequency of short failures of the entire router.

Table 3.4 Failures and their duration: Scenario 1
Failure      # Failures   Duration (h)
Router       1            0.1
LC type 1    30           0.8
LC type 2    6            1.5
LC type 3    2            0.5

Table 3.5 Reliability metrics
             Scenario 1                   Scenario 2
LC type      R (h)   M (h)     DPM        R (h)   M (h)    DPM
1            0.76    25,000    30.25      0.63    20,000   31.25
2            1.03    133,333   7.75       0.50    57,143   8.75
3            0.21    285,714   0.75       0.13    74,074   1.75
All Cards    0.73    83,333    8.75       0.44    45,455   9.75

If an ISP were only tracking DPM and router outages increased from one outage per 1,000 h to five outages per 1,000 h, it might miss the significant decrease in reliability as seen from the customer’s perspective. The metrics in the All Cards row hide a low average time between failures and high DPM for LC Type 1 in both scenarios. The average time between customer-impacting failures by LC type amplifies the difference between the two scenarios. For example, for LC Type 3, the average time between failures $M_3$ decreased almost by a factor of 4 in Scenario 2, in comparison with Scenario 1. This example illustrates the importance of measuring reliability metrics by the type of access line-cards.
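The figures of Example 3.1 can be reproduced with a short Python sketch. It applies Eqs. (3.13)–(3.19) to the failure data of Table 3.4, counting every card on a failed router as impacted for the duration of the router outage; the variable and function names are our own and are used only for illustration.

```python
ROUTERS = 400
T = 1000.0                                    # observation interval, hours
CARDS_PER_ROUTER = {1: 2, 2: 3, 3: 5}         # LC type -> cards per router
CARD_FAILURES = {1: (30, 0.8), 2: (6, 1.5), 3: (2, 0.5)}  # type -> (count, hours)
ROUTER_OUTAGE_HOURS = 0.1                     # duration of an entire router failure


def metrics(router_failures):
    """Return {row label: (R, M, DPM)} for each LC type and for all cards."""
    rows = {}
    total_n = total_t = total_cards = 0.0
    for lc_type, per_router in CARDS_PER_ROUTER.items():
        count, hours = CARD_FAILURES[lc_type]
        impacted_by_router = router_failures * per_router
        n = count + impacted_by_router                               # sum of n_ij
        t = count * hours + impacted_by_router * ROUTER_OUTAGE_HOURS  # sum of n_ij * t_ij
        cards = ROUTERS * per_router                                 # N_j
        rows[f"LC type {lc_type}"] = (t / n, cards * T / n, 1e6 * t / (cards * T))
        total_n, total_t, total_cards = total_n + n, total_t + t, total_cards + cards
    rows["All Cards"] = (total_t / total_n,
                         total_cards * T / total_n,
                         1e6 * total_t / (total_cards * T))
    return rows


for label, router_failures in (("Scenario 1", 1), ("Scenario 2", 5)):
    print(label)
    for row, (r, m, dpm) in metrics(router_failures).items():
        print(f"  {row:10s} R = {r:.2f} h  M = {m:,.0f} h  DPM = {dpm:.2f}")
```

Changing only the number of router failures from 1 to 5 leaves the All Cards DPM below 10 while nearly halving M, in agreement with Table 3.5 for Example 3.1.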
The example also illustrates the significant impact that even short-duration outages of an entire router have on reliability. Furthermore, it shows why the nonstop routing and in-service software-upgrade capabilities described in Section 3.2 are considered to be so important by ISPs.
3.5 End-to-End Availability
Evaluation of the end-to-end availability requires evaluation of the backbone availability in addition to the access availability discussed in Section 3.4. Given the scale and complexity of a large ISP backbone, there is no generally agreed upon approach for measuring and modeling end-to-end availability. Chapter 4 provides a fairly general approach for performance and reliability (performability) evaluation of networks consisting of independent components with a finite number of failure modes. Its application involves the steady state probability distribution that is used for calculation of the expected value of the measure $F$ defined on the set of network states. This section presents a brief overview of some results related to state aggregation and the selection of function $F$ for evaluating the backbone availability.
Large ISP backbones are typically designed to ensure that the network stays connected under all single-failure scenarios. Furthermore, the links are designed with enough capacity to carry the peak traffic load under all single-failure scenarios. Therefore, the majority of failures do not cause loss of backbone connectivity. Typically, when a failure happens, P routers detect the failure and trigger a failover to a backup path. If the failover were hitless and the backup path did not increase the end-to-end delay and also had enough capacity to carry all traffic, then the failure would not have any customer impact. Failures impacting customer traffic include the following events:
1. Loss of connectivity
2. Increased end-to-end delay on the backup path
3. Packet loss due to insufficient capacity of the backup path
4. Routing reconvergence triggered by the original failure. Such a reconvergence may cause packet loss during several seconds.
Assume that the duration of each event can be measured. Two approaches to measuring the backbone availability are based on knowing the actual point-to-point traffic demand matrix, which allows us to calculate the amount of impacted traffic for each event. In the first approach [20], only events 3 and 4 are included. The backbone unavailability is defined as the fraction of traffic lost over a given time period. In the second approach [21], all four events are included. Availability is measured for each origin–destination pair as the percentage of time that the network can satisfy a service-level agreement including 100% connectivity and thresholds on packet loss and delay. The main complexity in the implementation of either approach is in measuring event durations. The determination of event durations requires specially designed network instrumentation involving synthetic (active) measurements. Reference [22] describes a standardized point-to-point approach to path-level measurements, and reference [23] describes a novel approach that uses a single measurement host to collect network-wide one-way performance data. These approaches also require a well-thought-out data management infrastructure and computationally intensive processing of their output [24]. Application of the edge-to-edge availability distribution to the evaluation of VoIP (Voice over IP) reliability [25] is addressed in [26].
References
1. Malec, H. (1998). Communications reliability: A historical perspective. IEEE Transactions on Reliability, 47, 333–345.
2. Claffy, kc., Meinrath, S., & Bradner, S. (2007). The (un)economic Internet? IEEE Internet Computing, 11, 53–58.
3. Bollapragada, V., Murphy, C., & White, R.
(2000). Inside Cisco IOS software architecture. Indianapolis, IN: Cisco Press.
4. Schudel, G., & Smith, D. (2008). Internet protocol operations fundamentals. In Router security strategies. Indianapolis, IN: Cisco Press.
5. Garrett, A., Drenan, G., & Morris, C. (2002). Juniper networks field guide and reference. Reading, MA: Addison-Wesley.
6. Oggerino, C. (2001). High availability network fundamentals: A practical guide to predicting network availability. Indianapolis, IN: Cisco Press.
7. T640 Internet router node overview, from http://www.juniper.net/techpubs/software/nog/noghardware/download/t640-router.pdf.
8. Route flapping, from http://en.wikipedia.org/wiki/Route flapping.
9. Cisco nonstop forwarding with stateful switchover (2006). Deployment guide. Cisco Systems, from http://www.cisco.com/en/US/technologies/tk869/tk769/technologies white paper0900 aecd801dc5e2.html.
10. Graceful restart concepts, from http://www.juniper.net/techpubs/software/junos/junos93/swconfig-high-availability/graceful-restart-concepts.html#section-graceful-restart-concepts.
11. Sangli, S., Chen, E., Fernando, R., & Rekhter, Y. (2007). Graceful restart mechanism for BGP. RFC 4724. Internet Official Protocol Standards, from http://www.ietf.org/rfc/rfc4724.txt.
12. Kaplan, H. (2002). NSR Non-stop routing technology. White paper. Avici Systems Inc., from http://www.avici.com/technology/whitepapers/reliability series/NSRTechnology.pdf.
13. Router high availability for IP networks (2005). White paper. Alcatel, from http://www.telecomreview.ca/eic/site/tprp-gecrt.nsf/vwapj/Router HA for IP.pdf/$FILE/Router HA for IP.pdf.
14. ISSU: A planned upgrade tool (2009). White paper. Juniper Networks, from http://www.juniper.net/us/en/local/pdf/whitepapers/2000280-en.pdf.
15. Cisco IOS XE In Service Software Upgrade process (2009). Cisco Systems, from http://www.cisco.com/en/US/docs/ios/ios xe/ha/configuration/guide/ha-inserv updg xe.pdf.
16. Single-router APS for the Cisco 12000 series router, from http://www.cisco.com/en/US/docs/ios/12 0s/feature/guide/12ssraps.pdf.
17. Agarwal, M., Bailey, S., Greenberg, A., et al. (2006). RouterFarm: Towards a dynamic manageable network edge. In SIGCOMM’06 Workshops, Pisa, Italy.
18. Ross, S. (1989). Introduction to probability models. San Diego, CA: Academic.
19. Access availability of routers in IP-based networks (2003). Committee T1 tech rep T1.TR.78–2003.
20. Kogan, Y., Choudhury, G., & Tarapore, P. (2004). Evaluation of impact of backbone outages in IP networks. In ITCOM 2004, Philadelphia, PA.
21. Wang, H., Gerber, A., Greenberg, A., et al. (2007). Towards quantification of IP network reliability, from http://www.research.att.com/jiawang/rmodel-poster.pdf.
22. Ciavattone, L., Morton, A., & Ramachandran, G. (2003). Standardized active measurements on a Tier 1 IP backbone. IEEE Communications Magazine, 41, 90–97.
23. Burch, L., & Chase, C. (2005). Monitoring link delays with one measurement host. ACM SIGMETRICS Performance Evaluation Review, 33, 10–17.
24. Choudhury, G., Eisenberg, M., Hoeflin, D., et al. (2007). New reliability metrics and measurement techniques for IP networks. Proceedings of Distributed computer and communication networks, RAS, Moscow, 126–130.
25. Johnson, C., Kogan, Y., Levy, Y., et al. (2004). VoIP Reliability: A service provider perspective. IEEE Communications Magazine, 42, 48–54.
26. Lai, W., Levy, Y., & Saheban, F. (2007). Characterizing IP network availability and VoIP service reliability. Proceedings of Distributed computer and communication networks, RAS, Moscow, 126–130.
Chapter 4 Network Performability Evaluation
Kostas N. Oikonomou
4.1 Introduction
This chapter is an introduction to the area of performability evaluation of networks.
The term performability, which stands for performance plus reliability, was introduced in the 1980s in connection with the performance evaluation of fault-tolerant, degradable computer systems [23].¹ In network performability evaluation, we are interested in investigating a network’s performance not only in the “perfect” state, where all network elements are operating properly, but also in states where some elements have failed or are operating in a degraded mode (see, e.g., [8]).

The following example introduces the main ideas. Consider the network (graph) of Fig. 4.1. On the left, the network is in its perfect state, and on the right one node and one edge have failed.² Node and edge failures occur independently, according to certain probabilities, which we assume to be known. An assignment of “working” or “failed” states to the network elements defines a state of the network. By the independence assumption, the probability of that state is the product of the state probabilities of the elements. There are two traffic flows in this network: one from node 1 to node 5, and the other from node 7 to node 3. The flows are deterministic, of constant size, and there is no queuing at the nodes. Our interest is in the latency of each flow, defined as the minimum number of hops (edges) that the flow must traverse to get to its destination when it is routed on the shortest path. In each state of the network, a flow has a given latency: in the perfect state, both flows have latency 2 (hops), but in the example failure state the first flow has latency 3 and the second has latency $\infty$, because its source, node 7, has failed. The simplest characterization of the latency metric would be to find its expected value over the possible network states, of which there are $2^{17} \approx 130{,}000$. A more complete characterization would be to find its entire probability distribution. This would allow one to answer questions such as “what is the probability that the latency of flow 1 does not exceed 3?”, and “what upper bound on the latency of flow 2 can be guaranteed with probability 0.999?”. The answers to these questions (performability guarantees) are useful in setting performance targets for the network, or SLAs.

Fig. 4.1 A 7-node, 10-edge network with $2^{17}$ possible states, shown in its perfect state (left) and after the failure of node 7 and edge (1,6) (right). The performance metric is traffic latency, measured in hops

¹ Unfortunately, the terminology is not completely standard and some authors still use the term “reliability” for what we call performability; see, e.g., [1]. One may also encounter other terms such as “availability” or “dependability”.
² When a node fails, we consider that all edges incident to it also fail.

K.N. Oikonomou, 200 Laurel Ave, Middletown, NJ, 07748; e-mail: ko@research.att.com
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_4, © Springer-Verlag London Limited 2010

This basic example illustrates several points, all of which will be covered in more detail in later sections.

Reliability/Performance Trade-Off in the Analysis

A fundamental fact is that the size of the state space is exponential in the number of network elements. In the above example, if the number of network elements is doubled, the number of network states becomes about $17 \times 10^{9}$, and this is still a small network, with only 34 elements; a network model with several hundred elements would be much more typical. This means that for any realistic network model the state space is practically infinite, so the amount of work that can be done in each state to compute the performance metrics is limited. In other words, in performability analysis there is a fundamental trade-off between the reliability (state space) and performance aspects.
A consequence of this trade-off is that the performance model cannot be as detailed as it would be in a pure performance analysis: in the example, we assumed constant traffic flows and no queuing at nodes. Another aspect of the trade-off is that only the investigation of the steady-state behavior of the model is, in general, feasible: in the example, we treated the network elements as two-valued random variables, not as two-state random processes. However, a mitigating factor is that the network states generally have very different probabilities, so that we may be able to calculate bounds on the performance metrics by computing their values only on a reasonable number of states, those with high probability. With this fundamental trade-off in mind, we now discuss ways in which the simple performability model of the example can be extended.

Enhancements to the Simple Model

To make the model presented in the example more useful for a realistic analysis, we could add capacities to the graph’s edges. We could also add sizes to the traffic flows, and have more sophisticated routing that allows only shortest paths that have enough capacity for a flow. Further, for a better latency measure, we could add lengths to the graph edges.

Another category of enhancements would be aimed at representing failures more realistically. To begin with, the network elements could be allowed to have more than one failure mode, e.g., an edge could operate at full capacity, half capacity, or zero capacity (failed). We could separate the network elements from the entities that fail by introducing “components” that have failure modes and affect the graph elements in certain ways. For example, such a component could represent an optical fiber over which two graph edges are carried, and whose failure (cut) would fail both of these edges at the same time.
In Section 4.2 we describe a hierarchical network model that has all the features mentioned above, among others. Finally, we could allow different types of routing for traffic flows, and also introduce the notion of network restoration into the model. These additions are described in Section 4.3.

Network Performability in the Literature

A number of network performability studies have appeared in the literature. Levy and Wirth [21] investigate the call completion rate in a communications network. Alvarez et al. [4] study performability guarantees for the time required to satisfy a web request in a network with up to 50 nodes, where only nodes can fail, but without restoration. Levendovszky et al. [19] study the expected lost traffic in the Hungarian backbone SDH network with 52 nodes and 59 links, and no restoration. Carlier et al. [7] use a three-level network model, and study expected lost traffic in a 111-node, 180-link network using k-shortest path restoration. Gomes and Craveirinha [12] study a 46-node, 693-link representation of the Lisbon urban network with a three-level performability model, and compute blocking probabilities for a Poisson model of the network traffic, with no restoration. Finally, layered specification of a network for the purposes of performability evaluation has been used in [7, 12], which separate the network into a “physical” and a “functional” layer, and in [22], which uses a special-purpose separation into “node cluster” and “call-processing path” layers. Some further references are given in Section 4.4.3.

Chapter Outline

In Section 4.2 we describe a four-level, hierarchical network model, suited for performability analysis, and illustrate it with an IP-over-optical network example. In Section 4.3 we discuss the performability evaluation problem in general, give a mathematical formulation, present the state-generation approach to the performability evaluation of networks, and discuss basic performance measures and related issues. We also introduce the nperf network performability analyzer, a software package developed at AT&T Labs Research. In Section 4.4 we conclude by presenting two case studies that illustrate the material of this chapter, the first involving an IPTV distribution network, and the second dealing with architecture choices for network access.

4.2 Hierarchical Network Model

For the purpose of our performability modeling, we will think of a “real” network as consisting of three layers³: a traffic layer, a transport layer, and a physical layer. On the other hand, as shown in Fig. 4.2, our performability model is divided into four levels: traffic, graph, component, and reliability. (In terms of the ISO OSI reference model, both models address layers 1 through 3.) To illustrate the correspondence between the three network layers and the four model levels, we use the case of an IP-over-optical “real” network. The four-level performability model applies to many other types of real networks as well: for example, Oikonomou et al. [25] describe its application to a set of satellites that communicate among themselves and with a set of ground stations via microwave links, whereas the ground stations are interconnected by a terrestrial network.

4.2.1 IP-Over-Optical Network Example

A modern commercial packet network typically consists of IP routers connected by links, which are transported by an underlying optical network. We describe how we model the traffic, transport, and physical layers of such a network, and how we map them to the levels of the performability model in Fig. 4.2. (For more on this topic, see Chapter 2.)

Traffic Layer

Based on an estimate of the peak or average traffic pattern, we create a matrix giving the demand or “flow” between each pair of routers. (Methods for creating such a traffic matrix from measurements are described in Chapter 5.) A demand has a rate, a unit, and possibly a type or class associated with it.
³ We say “real” because any description is itself at some level of abstraction and omits aspects which may be important if one adopts a different viewpoint.

Fig. 4.2 The four-level network performability model used by the nperf performability analyzer: traffic level (point-to-point demands), graph level (routing and restoration), component level, and reliability level (working/failed processes with failure rates $\lambda_i$ and repair rates $\mu_i$). $F$ is the performance measure, discussed in Section 4.3.3

Transport Layer Nodes

A network node represents an IP router. At the component level this node expands into a data plane, a control plane, a hardware and software upgrade component, and a number of networking interfaces (line cards/ports). The data plane, or switching fabric, is responsible for routing packets, while the control plane computes routing tables and processes other network signaling protocols, such as OSPF or BGP. When a data plane component fails, all the links incident to its router fail. When a control plane component fails, the router continues to switch packets, but cannot participate in rerouting, including restoration. Failure of a port component fails the corresponding link(s). The “upgrade” component represents the fact that, periodically, the router is effectively down because it is undergoing an upgrade of its hardware or software. (This is by no means a very sophisticated router reliability model (see Chapter 3), but it exemplifies the performance-reliability trade-off discussed in Section 4.1.)

Finally, fix one of the above classes of components, say router cards. At the reliability level we think of all these components as independent copies of a continuous-time Markov process (see, e.g., [5] or [6]) with failure transition rate $\lambda$ and repair transition rate $\mu$, which may be specified in terms of MTBF (mean time between failures, $\lambda = 1/\mathrm{MTBF}$) and MTTR (mean time to repair, $\mu = 1/\mathrm{MTTR}$).
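As a worked illustration of the reliability level, the sketch below converts an (MTBF, MTTR) pair into the rates $\lambda$ and $\mu$ and then into the steady-state probabilities of the working and failed states. The function name and the numerical values are hypothetical, chosen only for illustration; the expression $\mu/(\lambda+\mu)$ is the standard steady-state result for a two-state continuous-time Markov process.

```python
def two_state_steady_state(mtbf_hours, mttr_hours):
    """Return (lam, mu, p_working, q_failed) for a binary component.

    lam = 1/MTBF is the failure rate and mu = 1/MTTR the repair rate.
    For a two-state continuous-time Markov process, the steady-state
    probability of the working state is mu / (lam + mu).
    """
    if mtbf_hours <= 0 or mttr_hours <= 0:
        raise ValueError("MTBF and MTTR must be positive")
    lam = 1.0 / mtbf_hours
    mu = 1.0 / mttr_hours
    p_working = mu / (lam + mu)  # algebraically equal to MTBF / (MTBF + MTTR)
    return lam, mu, p_working, 1.0 - p_working


# Hypothetical router card: fails on average every 100,000 h, repaired in 4 h.
lam, mu, p, q = two_state_steady_state(100_000.0, 4.0)
print(f"lambda={lam:.2e}/h  mu={mu:.2f}/h  p={p:.6f}  q={q:.2e}")
```

These steady-state probabilities are the kind of per-component quantities that a state-probability calculation needs as input; the rates themselves matter only for time-dependent questions.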
Transport Layer Links

A link between routers fails if either of the port components at its endpoints fails, if a data plane of one of the endpoint nodes fails, or if a lower-layer component over which the link is routed fails (e.g., a network span, discussed next). Two network nodes may be connected by multiple parallel links. These parallel links may be grouped into a type of virtual link called a composite or bundled link, whose capacity is the sum of the capacities of its constituent links. For the purposes of IP routing, the routers see only a single bundled link. When a constituent link fails, the capacity of the bundled link is reduced accordingly. A bundled link fails (or more precisely is “taken out of service”) when the aggregate capacity of its non-failed constituent links falls below a specified threshold.

Physical Layer Spans

We use the term “span” to refer to the network equipment and media (e.g., optical fiber) at the physical layer that carries the transport-layer links. Failure of a span component affects all transport-layer links that are routed over this span. When modeling an IP-over-optical layered network, the physical layer usually uses dense wavelength division multiplexing (DWDM), and a span consists of a concatenation of point-to-point DWDM systems called optical transport systems (OTS).⁴ In turn, an OTS is composed of many elements, such as optical multiplexers/demultiplexers, optical amplifiers, and optical transponders. Also, a basic constraint in commercial transport networks is that a span is considered to be working only if both of its directions are working. With this assumption, it is not difficult to compute the failure probability of a span based on the failure probabilities of its individual elements in both directions. Thus, for simplicity, we generally represent a network span by a single “lumped” component whose MTBF and MTTR are calculated as explained in [28].
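The span failure probability mentioned above can be sketched as follows, under the assumption (made here for illustration) that the elements of a span fail independently and that the span works only if every element in both directions works. The element names and probabilities are hypothetical, and the calculation of the lumped MTBF and MTTR described in [28] is not reproduced.

```python
from math import prod


def span_failure_probability(element_failure_probs):
    """Failure probability of a span whose elements are all required.

    element_failure_probs lists the failure probabilities of every element
    of the span, covering both directions. The span is working only if all
    elements are working; independence between elements is assumed.
    """
    p_working = prod(1.0 - q for q in element_failure_probs)
    return 1.0 - p_working


# Hypothetical OTS elements for one direction; the other direction is
# assumed to contain the same kinds of elements.
one_direction = {"mux/demux": 1e-5, "amplifier": 2e-5, "transponder": 5e-5}
both_directions = list(one_direction.values()) * 2
q_span = span_failure_probability(both_directions)
print(f"span failure probability = {q_span:.3e}")
```

For small element failure probabilities the result is close to the sum of the individual probabilities, which is why a long span with many elements can dominate the unavailability of the links routed over it.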
⁴ There are more complex DWDM systems with various optically-transparent “add/drop” capabilities, which, for simplicity, we do not discuss here.

Other Types of Components

A set of fibers that is likely to fail together because they are contained in a single conduit/bundle can be represented by a fiber cut component that brings down all network spans (hence all the higher IP-layer links) that include this fiber bundle. Other types of catastrophic failures of sets of graph nodes and edges may be similarly represented.

So far we have mentioned only binary components, i.e., with just two modes of operation, “working” or “failed”. We discuss components with more than two modes in Section 4.2.2.2.

4.2.2 More on the Graph and Component Levels

4.2.2.1 Graph Element Attributes

The graph is the level of the performability model at which the network routing and restoration algorithms operate. Graph edges have associated capacities and (routing) costs. In general, an edge’s capacity can be a vector, and this vector has a capacity threshold associated with it, such that the edge is considered failed if the sum of the capacities of its non-failed elements falls below the threshold. An edge with vector capacity can directly represent a bundled link. The nperf performability analyzer presented in Section 4.3 also allows many other attributes for edges, such as lengths, latencies, etc., as well as operations on these attributes. These operations are covered in Section 4.2.2.3.

4.2.2.2 Multi-Mode Components

Each component, representing an independent failure or degradation mechanism, has a single working mode and an arbitrary number of failure modes. If it has a single failure mode it is referred to as a “binary” component, otherwise it is called “multi-mode”. In the nperf analyzer a component is represented by a star Markov process, as shown in Fig. 4.3.
At the reliability level, the $i$th failure mode of a particular component is defined by its mean time between failures and its mean time to repair by setting $\lambda_i = 1/\mathrm{MTBF}_i$ and $\mu_i = 1/\mathrm{MTTR}_i$. We now give some examples of using multi-mode components in network modeling.

Fig. 4.3 A multi-mode component with $m$ failure modes $f_1, \ldots, f_m$ (left), and the special case of a binary component (right). The components are continuous-time Markov processes of the “star” form. The $i$th mode is entered with (failure) rate $\lambda_i$ and exited with (repair) rate $\mu_i$

Router Upgrades

We mentioned in Section 4.2.1 (binary) software and hardware upgrade components for routers. Now suppose that there is an intelligent network maintenance policy in place, by which router upgrades are scheduled so that only one router in the network undergoes a software or hardware upgrade at any time. This policy cannot be modeled by using binary upgrade components associated with the routers because, by independence, nothing prevents more than one of them from failing at the same time. However, for an $n$-router network, the mutually exclusive upgrade events can be represented by defining an $(n+1)$-mode component whose mode 1 corresponds to no upgrades occurring anywhere in the network, and each of the remaining $n$ modes corresponds to the upgrade of a single router.

Traffic Matrix

Suppose we want to take into account daily variations in traffic patterns/levels, e.g., for 60% of a typical day the traffic is represented by matrix $T_1$, for 20% by matrix $T_2$, and for another 20% by matrix $T_3$. This can be done by letting the traffic matrix be controlled by a multi-mode component whose modes $w, f_1, f_2$ have probabilities 0.6, 0.2, 0.2, respectively, and set the traffic matrix to $T_1, T_2, T_3$, respectively.

Restoration

Figure 4.2 implicitly assumes that network restoration happens at only one level.
However, multi-mode components afford the capability to model restoration occurring at more than one network layer. The details of how this is done, using the example of IP over SONET, can be found in [25].

4.2.2.3 Failure Mapping

Recall that failure of a binary component may affect a whole set of graph-level elements: the spans of Section 4.2.1 are an example. More generally, when a multi-mode component enters one of its failure modes, the effect on a graph element is to change some of the element’s attributes. For example, the capacity of an edge may decrease, or a node may become unable to perform routing. Depending on the final values of the attributes, e.g., total edge capacity $\le$ some threshold, the graph element may be considered “failed”. We refer to the effects of the components on the graph as the component-to-graph-level failure mapping. Some of the ways that a component can affect a graph element attribute are to add a constant to it, subtract a constant from it, multiply it by a constant, or set its value to a constant.

4.3 The nperf Network Performability Analyzer

In this section, we begin by discussing how the general (i.e., not specific to networks) performability evaluation problem can be defined mathematically, and then discuss various aspects of this definition. We then review the so-called state generation approach to performability evaluation, and some basic ingredients of the performance measures used when evaluating the performability of networks. We finally present an outline of the nperf network performability analyzer, a tool developed at AT&T Labs Research. Useful background on performability in general is in [16] and in [32]. A more extensive reference on the nperf analyzer itself and the material of this section is [28].

4.3.1 The Performability Evaluation Problem

It is useful to understand the mathematical formulation of the network performability evaluation problem.
Let $C = \{c_1, \ldots, c_n\}$ be a set of “components”, each of which is either working or failed. (As already mentioned in Section 4.2.2, components can be in more than two states, called “modes” to distinguish them from network states, but to simplify the exposition here we restrict ourselves to two-mode, or “binary”, components.) Abstractly, a component represents a failure or degradation mechanism; examples were given in Section 4.2.1. Component $c_i$ is in its working mode with probability $p_i$ and in its failed mode with probability $q_i = 1 - p_i$, both assumed known. Our basic assumption is that all components are independent of one another, so that, e.g., the probability that $c_i$ is down, $c_j$ is up, and $c_k$ is down is $q_i p_j q_k$. A network state is an assignment of a mode to every component in $C$ and can be represented by a binary $n$-vector. The set of all network states $S(C)$ has size $2^n$, and the probability of a particular state is the product of the mode probabilities of the $n$ components. Let $F$ be a vector-valued performance measure (a function) defined on $S(C)$, mapping each state to an $m$-tuple of real numbers; examples are given in Section 4.3.3. The performability evaluation problem consists in computing the expected value of the measure $F$ over the set $S(C)$ of network states:

\[ \bar{F} = \sum_{s \in S(C)} F(s) \Pr(s). \tag{4.1} \]

There are various points to note here.

Complexity

It is well known that the exact evaluation of (4.1) is difficult, even if $F$ is very simple. Intuitively this is because the size of the state space $S(C)$ is exponential in the size of the set of components $C$. For a more precise demonstration of the complexity, suppose that each component corresponds to an edge of a graph, the graph’s nodes do not fail, and we want to know the probability that there is a path between two specific nodes $a$ and $b$ of the graph.
This is known as the Two-Terminal Network Reliability evaluation problem, and in this case $F$ takes only two values: $F(s)$ is 1 if there is a path from $a$ to $b$ in the graph state $s$, and 0 otherwise. Despite the very simple $F$, this problem is known to be #P-complete (see, e.g., [15, 32], or [8]). A consequence of this computational complexity is that, in general, only approximate performability evaluation is feasible. We will return to this in Section 4.3.2.

Performability Guarantees

In practice, we are interested in computing more sophisticated characteristics of $F$ than its expectation $\bar{F}$, such as the probability, over the set of network states, that $F$ is less than some number $x$, or greater than some number $y$. For example, we may want to claim that “with probability at least 99.9%, at most 2% of the total traffic is down, and with probability at least 90% at most 10% of it is down”. Formally, such claims are statements of the type

\[ \Pr(F < x_1) \ge P_1,\ \Pr(F < x_2) \ge P_2,\ \ldots \quad \text{or} \quad \Pr(F > y_1) \le Q_1,\ \Pr(F > y_2) \le Q_2,\ \ldots \tag{4.2} \]

that hold over the entire network state space; they are known as performability guarantees, and they can, for example, be used to set SLAs. The important point is that the computation of (4.2) reduces easily to just the computation of expectations of the type (4.1); see, e.g., [28].

Network

When we are using the formalism leading to (4.1) to evaluate the performability of a network, all the complexity is in the measure $F$. As Fig. 4.2 shows, $F$ then includes the failure mapping from the component to the graph level, the routing and restoration algorithms, and the traffic level.

Time

Recalling the reliability level of Fig. 4.2, each $c_i$ is in reality a two-state Markov process, whose state fluctuates in time. If so, what is the meaning of the expectation $\bar{F}$ of the measure $F$?
It can be shown that if we average $F$ over a long time as the network moves through its states, this average will approach $\bar{F}$, provided we take the probabilities $p_i$ and $q_i$ associated with $c_i$ to be the steady-state probabilities of the working and failed states of the Markov process representing $c_i$.

Steady State

The reader familiar with the performance analysis of Markov reward models (see, e.g., [5, 11]) will recognize that the definition (4.1) of the performability evaluation problem is based on steady-state expectations of measures. In many cases, however, transient (also known as finite-time) measures may be of interest. The evaluation of such measures on very large state spaces is much more difficult than that of steady-state measures, and outside the scope of the treatment in this chapter, but it is currently an area of further development of the nperf tool.

4.3.2 State Generation and Bounds

A number of approaches to computing the expectation $\bar{F}$ in (4.1) approximately have been developed. Without attempting to be comprehensive, they can be classified into (a) upper and lower bounds for certain $F$ such as connectivity (using the notions of cut and path sets), or special network/graph structures (see [16, 32]), (b) “most probable states” methods ([13, 14, 16, 17, 31–33]), (c) Monte Carlo sampling approaches ([7, 16]), and (d) probabilistic approximation algorithms for simple $F$, e.g., [18]. Methods of types (a) and (b) produce algebraic bounds on $\bar{F}$ (i.e., not involving any random sampling), while (c) and (d) yield statistical bounds.

Here we will discuss the “most probable states” methods, which are algorithms for generating network states in order of decreasing probability. The rationale is that if the component failure probabilities are small, most of the probability mass is concentrated on a relatively small fraction of the state space.
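To make this rationale concrete, the following sketch enumerates the states of a small system of independent binary components in order of decreasing probability and tracks how much probability they cover. It is a simple best-first search written for illustration only, not one of the published algorithms cited above and not the algorithm used by nperf; the function name, the restriction to failure probabilities of at most 0.5, and the example values are assumptions.

```python
import heapq
from math import prod


def most_probable_states(q, k):
    """Yield up to k (probability, failed_components) pairs, most probable first.

    q[i] is the failure probability of binary component i. Components are
    independent and 0 < q[i] <= 0.5 is assumed, so that the all-working
    state is the most probable one.
    """
    if not all(0.0 < qi <= 0.5 for qi in q):
        raise ValueError("this sketch assumes 0 < q[i] <= 0.5")
    # Failing component i multiplies the state probability by q/(1-q) <= 1.
    order = sorted(range(len(q)), key=lambda i: q[i] / (1.0 - q[i]), reverse=True)
    ratio = [q[i] / (1.0 - q[i]) for i in order]
    p_perfect = prod(1.0 - qi for qi in q)

    # Heap entries: (-probability, positions in `order` of failed components).
    heap = [(-p_perfect, ())]
    produced = 0
    while heap and produced < k:
        neg_p, failed = heapq.heappop(heap)
        yield -neg_p, sorted(order[j] for j in failed)
        produced += 1
        nxt = failed[-1] + 1 if failed else 0
        if nxt < len(q):
            # Child 1: additionally fail the next component in ratio order.
            heapq.heappush(heap, (neg_p * ratio[nxt], failed + (nxt,)))
            if failed:
                # Child 2: replace the last failed component by the next one.
                last = failed[-1]
                heapq.heappush(
                    heap, (neg_p / ratio[last] * ratio[nxt], failed[:-1] + (nxt,)))


# Hypothetical example: 17 elements, each failing with probability 0.01.
q = [0.01] * 17
covered = 0.0
for prob, failed in most_probable_states(q, 18):
    covered += prob
print(f"18 of {2**17} states cover probability {covered:.4f}")
```

In this example the 18 generated states are the perfect state and the 17 single-failure states, which together account for most of the probability even though they are a tiny fraction of the state space.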
Thus, as these methods generate states one by one and evaluate $F$ on them, they are attempting to update $\bar{F}$ with terms of highest value first. The most probable states methods are particularly well suited to evaluating the performability of complex networks because they make no assumptions (at least to first order) about what the performance measure $F$ might be or what properties it might have, which is especially important in view of the fact that the complexity of network routing and restoration schemes is included in $F$.

The classical algorithms of [13, 33] apply to systems of only binary components, whereas the algorithms of [14, 17, 30] can handle arbitrary multi-mode components. nperf uses a hybrid state-generation algorithm described in [28], which handles arbitrary multi-mode components and is suited especially to “mostly binary” systems, that is, systems where the proportion of components with more than two modes is small. We find that such systems dominate performability models for practical networks.

To explain what we mean by “at least to first order”, let $\omega$ and $\alpha$ be the smallest and largest values of $F$ over $S(C)$, and suppose we generate the $k$ highest-probability elements of $S(C)$. If these states have total probability $P$, we have the algebraic lower and upper bounds on $\bar{F}$

\[ \bar{F}_l = \sum_{i=1}^{k} F(s_i) \Pr(s_i) + (1 - P)\,\omega, \qquad \bar{F}_u = \sum_{i=1}^{k} F(s_i) \Pr(s_i) + (1 - P)\,\alpha, \tag{4.3} \]

first pointed out in [20]. The bounds (4.3) are valid for arbitrary $F$, but may sometimes require the generation of a large number of states to achieve a small enough $\bar{F}_u - \bar{F}_l = (1 - P)(\alpha - \omega)$. Tighter bounds are possible, but only by requiring $F$ to have some special property, such as monotonicity, limited growth, etc. See [27] for further details.

4.3.3 Performance Measures

There are two measures of fundamental importance in network performability analysis, both having to do with lost traffic.
These are

\[ \begin{aligned} t_{\mathrm{lnr}}(s) &= \text{total traffic lost because of no route in } s, \\ t_{\mathrm{lcg}}(s) &= \text{total traffic lost because of congestion in } s. \end{aligned} \tag{4.4} \]

(We do not mean to imply that these are the only measures of importance. Depending on the application, the focus may shift to considerations other than lost traffic, e.g., to latency, or to many others.) To define terms, we refer to the IP-over-optical example of Section 4.2.1. A demand corresponds to a source-destination pair of routers; we use traffic to mean the size (volume) of a demand, or of a set of demands.

The definition of $t_{\mathrm{lnr}}$ is straightforward: a demand fails if a link (multi-edge) on its route fails, and a failed demand is lost because of no route if no path for it can be found after the network restoration process completes. $t_{\mathrm{lnr}}(s)$ is the sum of the volumes of all lost demands in state $s$.

Our definition of $t_{\mathrm{lcg}}$ is more involved.⁵ If the network routing allows congestion, a demand is congested if its route includes an edge with utilization that exceeds a threshold $U_c$. $t_{\mathrm{lcg}}$ is a certain function (not the sum) of all congested demands. Suppose we fix a routing $R$ in state $s$; then we define $t_{\mathrm{lcg}}$ to be the total traffic offered to the network minus the maximum possible total flow $F$ that can be carried in state $s$ using routing $R$ without congestion. Here “there is congestion under $R$” means “there is a (working) edge with utilization above the threshold $U_c$”. Equation (4.5) formalizes this definition. Note that if the network uses flow control, such as TCP in an IP network, the flow control will “throttle” traffic as soon as it detects congestion, so that few packets will be really lost; in that case it is more accurate to call our measure loss in network bandwidth.

Now using the “link-path” formulation [29], let $D$ be the set of all subdemands (path flows) and $D(e, R)$ be the set of subdemands using the non-failed edge $e$ under the routing $R$. Also let $f_d$ be the flow corresponding to subdemand $d$.
Then $F$ is the solution of the linear program

\[ F = \max \sum_{d \in D} f_d \tag{4.5} \]

subject to

\[ \forall e: \ \sum_{d \in D(e,R)} f_d \le U_c\, c_e, \qquad f_d \le v_d, \]

where $c_e$ is the capacity of edge $e$ and $v_d$ the volume of demand $d$.

⁵ This definition is by no means unique; we claim only that it is useful in a wide variety of contexts.

Consistent with what we noted in Section 4.3.1, the above discussion centered around steady-state expectations of measures as the quantities of interest. In the context of the case study in Section 4.4.2 we will touch on one interesting sub-class of finite-time measures, event counts.

4.3.4 Network Routing and Restoration

The presence of network routing and restoration in the performance measure makes the performability analysis of networks different from other such analyses. The nperf analyzer incorporates three main kinds of network routing methods:

Uncapacitated Minimum-Cost. This is meant to represent routing by, e.g., the OSPF (Open Shortest Path First) protocol [24]. Link costs correspond to OSPF administrative weights. OSPF path computation does not take into account the capacities or utilizations of the links. Another main IP-layer routing protocol, IS-IS (Intermediate System to Intermediate System), behaves similarly for our purposes.

“Optimal” Routing. This routing is based on multi-commodity flows ([2, 29]). nperf incorporates both integral and non-integral (“real”) multi-commodity flow methods. These methods could be regarded as representing variants of OSPF-TE. Details are in [28].

Multicast Routing. This type of routing sends the traffic originating from a source node on a shortest-path tree rooted at this node and spanning a set of destination nodes. The shortest paths to the destinations are determined by so-called reverse-path forwarding.

These routing methods are not meant to be emulations of real network protocols; they include only the features of these protocols that are important for the kind of analysis that nperf is aimed at.
In particular, a lot of details associated with timing and signaling are absent (another instance of the reliability/performance trade-off noted in Section 4.1).

4.3.5 Outline of the nperf Analyzer

With the above material in mind, Fig. 4.4 depicts the structure of the core of the nperf tool. At the top we have the most probable state generation algorithms of [13, 28, 33], mentioned in Section 4.3.2. The “routers” at the bottom of the figure are the routing methods discussed in Section 4.3.4: “iMCF” corresponds to integral multi-commodity flow, “rMCF” to non-integral (“real” or “fractional”) multi-commodity flow, and “USP” to uncapacitated shortest paths. The four-level network model is specified by a set of plain text files, listed in Table 4.1.

Fig. 4.4 Structure of the core nperf software: state generation algorithms (YK, Hybrid, GC); the hierarchical network model defining $F = (f_1, \ldots, f_m)$, with reliability level R, component level C, failure map C → G, graph level G, failure map G → D, and demand (traffic) level D; the iMCF, rMCF, USP, and multicast tree routers; and the resulting bounds on $\Pr(f_i \le x_i)$ for the measure $F$

Table 4.1 Network model specification files
net.graph             Specifies the network graph (nodes and edges)
net.dmd, net.units    Specify the traffic demands, if the network has a traffic layer
net.comp              Specifies the network components and the C → G failure mapping
net.rel               Lists (MTBF, MTTR) pairs for the modes of the components
net.perf              Parameters for the performance measure(s)

The MTBFs for the components are typically obtained from a combination of manufacturer data and in-house testing. The MTTRs are usually determined by network maintenance policies, except for some special types of repairs, such as a software reboot. (Of course, one always has the freedom to use hypothetical values when performing a “what-if” analysis.)
Uncertainties in the MTBFs and MTTRs may be dealt with by repeating an analysis with different values of the MTBFs and/or MTTRs, and nperf has some facilities to ease this task. A more sophisticated alternative is to assign uncertainties (prior probability distributions) to the MTBFs and MTTRs and propagate them to posterior distributions on $\bar{F}$ via a Bayesian analysis. However, this is outside the scope of this chapter.

4.3.6 Related Tools

Performance and reliability analyses of systems are vast areas with many ramifications. At this point there exist a number of tools that are, in one way or another, related to some of what nperf does. Table 4.2 mentions some of the author’s favorites, all in the public domain; the interested reader may pursue them further. Vis-a-vis these tools, the main distinguishing features of nperf are that it is geared toward networks (hierarchical model, routing, restoration), and represents them by large numbers of relatively simple independent (noninteracting) components.

Table 4.2 Publicly-available tools that have some relation to nperf. Web sites valid as of 2009
Ptolemy       Modeling and design of concurrent, real-time, embedded systems
              http://ptolemy.eecs.berkeley.edu/
Tangram II    Computer and communication system modeling
              http://www.land.ufrj.br/tools/tangram2/tangram2.html
Möbius        Model-based environment for validation of system reliability, availability, security, and performance
              http://www.mobius.uiuc.edu/
PRISM         Probabilistic model checker
              http://www.prismmodelchecker.org/
TOTEM         Toolbox for Traffic Engineering Methods
              http://totem.run.montefiore.ulg.ac.be/

4.4 Case Studies

We conclude by presenting two case studies that, among other things, illustrate the application of the nperf tool. The first study is on a multicast network for IPTV distribution, and the second involves choosing among a set of topologies for network access.
4.4.1 An IPTV Distribution Network

In this study we analyzed a design for an IPTV distribution network similar to the one discussed in [9], but with 65 nodes distributed across the continental US. These nodes are called VHOs (Video Head Offices), and there is an additional node called an SHO (Super Hub Office), which is the source of all the traffic. The traffic stream from the SHO is sent to the VHOs by multicast (footnote 6): when a node receives a packet, it puts a copy of it on each of its outgoing links. Thus traffic flows on the edges of a multicast tree rooted at the SHO, and each VHO is a node on this tree. The tree forms a sub-network of the provider’s overall network. The multicast sub-network uses two mechanisms to deal with failures:

- fast re-route: each edge of the tree has a predefined backup path, which uses edges of the encompassing network that are not on the tree.
- tree re-computation: if a tree edge fails, and fast re-route is unable to protect it because the backup path itself has also failed, a new tree is computed. This computation is done by so-called reverse path forwarding: each VHO computes a shortest path from it to the SHO, and the SHO then sends packets along each such path in the reverse direction.

The advantage of fast re-route (FRR) is that it takes much less time than tree re-computation: milliseconds instead of seconds. Given a properly designed FRR capability, an interesting feature of the multicast network from the viewpoint of performability analysis is that it essentially tolerates any single link failure (footnote 7). Therefore, interesting behavior appears only under failures of higher multiplicity. Indeed, it turns out that multiple failures can result in congestion: the backup paths for different links are not necessarily disjoint, and so when FRR is used to bypass a whole set of failed links, a particular network link belonging to more than one backup path may receive traffic belonging to more than one flow.
If the link capacity is such that this causes congestion, the congestion will last until the failure is repaired, which may take time of the order of hours. One way to deal with this problem is to compute a new multicast tree after FRR is done, and to begin using this new tree as soon as the computation is complete, as suggested in [9]. This retains the speed advantage of FRR and limits the duration of any congestion to the tree re-computation time.

For this network, performance must be guaranteed for every VHO (worst case), not just overall. So, in the terms of Section 4.3.3, the multicast performability measures are two 65-element vectors, one for loss due to no path and one for loss due to congestion, whose elements are computed on each network state.

We now summarize some of the results of this study. An initial network design, known as design A, was carried out by experienced network designers. Its performance, after normalizing the expectations of the measures by the total traffic and converting the result to time per year (footnote 8), is shown in Fig. 4.5, top. Since this was a well-designed network to begin with, its levels of traffic loss were quite low, better than “five 9s”. Within these low levels, Fig. 4.5 shows that the loss due to no path, the $t_{\mathrm{lnr}}$ of (4.4), is dominant for most VHOs, but some of them also exhibit significant loss because of congestion ($t_{\mathrm{lcg}}$).

Footnote 6: Specifically by Protocol Independent Multicast (PIM).
Footnote 7: By “link” here we mean an edge at the graph level of the model of Fig. 4.2.
Footnote 8: For example, a traffic loss of 0.01% of the total translates to 1/10,000 of a year, i.e., about 52 min/year.

[Figure 4.5: two bar-chart panels (top and bottom), each titled τOSPF = 1 sec, τFRR = 0.05 sec. x-axis: VHO # (1 to 65); y-axis: time / yr.; series: No path, Congestion.]
Fig. 4.5 Expected lost traffic, expressed in time per year, because of no path and congestion in design A (top), and in design C (bottom). These are the $t_{\mathrm{lnr}}$ and $t_{\mathrm{lcg}}$ defined in (4.4). Design C is A with tuned OSPF weights. For the purposes of comparing the two designs, the time unit of the y-axis is irrelevant

Even though the performability of this network was entirely acceptable, we decided to see if the loss due to congestion could be reduced. A detailed study of the network states generated by nperf that led to congestion in Fig. 4.5, top, revealed that they were double and triple failures. Further, we found that for VHOs 30 to 41 congestion could be practically eliminated by tuning a certain set of OSPF link weights. The result, known as design C, performed as shown in Fig. 4.5, bottom. It can be seen that a lot of congestion-induced losses were eliminated while the loss due to no path remained at the same level throughout, and this was achieved without adding to the cost of the network design at all. See [10] for more details on the subject of reliable IPTV network design.

4.4.2 Access Topology Choices

An issue that arose for a major Internet service provider was that traffic in its network was increasing, but the backbone routers had limited expansion capability (numbers of slots in the chassis). To get around this limitation it was proposed to introduce intermediate aggregation routers in the access part of the network, and the question was how this would affect the reliability of the access. The configuration of the provider’s backbone offices before the introduction of aggregation routers is shown in Fig. 4.6, top, and is referred to as “base”; there is a “local” variant in which all routers are located within a single office, and a “remote” variant in which the routers are in different offices.
In reality there are many access routers connecting to a pair of backbone routers, but showing just one in Fig. 4.6 is enough for our purposes. There were two proposals for introducing the aggregation routers, called the “box” and the “butterfly” designs, shown in Fig. 4.6 middle and bottom. These had local and remote variants as well. Further, there was a premium “diverse” option in the butterfly remote design in which the links between a backbone router and its two aggregation routers were carried on two separate underlying optical transport (DWDM) systems, instead of the same transport (the “common” option). It was clear that the box alternative was cheaper because of fewer links, but what was the reduction in availability relative to the costlier butterfly design? Also, how did either of these options compare with the existing base design? The failure modes of interest in all these designs were network spans, router ports, and software failures or procedural errors; these failure modes are depicted as components in Fig. 4.7. The metric chosen to compare the availabilities of the various designs was the mean time between access disconnections, i.e., situations where the access router A had no path to any backbone router BB. Note that network restoration is immaterial for such events. nperf models for the designs of Fig. 4.6 were constructed; given the metric of interest, the models did not include a traffic layer. Typical values for the reliability attributes of the components were selected as in Chapter 3. At a high level, note that the longer links between the aggregation and backbone routers in the remote designs are less reliable than the corresponding links in the local designs. The results of the study are summarized in Table 4.3. The mean access disconnection times are separated into two categories, of which “hardware” includes the first three types of components listed in Fig. 4.7. 
The most notable result in Table 4.3 is that irrespective of the architecture, software and procedural errors are by far the dominant cause for access router isolations. These events are the ones that cannot be helped by redundancy. The second most important feature is that compared to the base case, the introduction of aggregators doubles the risk of access router isolation due to software and procedural errors, again irrespective of the design. With respect to hardware failures in the local case, the box design increases the risk of isolation by a factor of 3 compared to the base case, but the butterfly design is just as good as the base. In the remote case, the box design is about twice as bad as the base, but the butterfly is in fact better, by at least a factor of 2.75.

[Figure 4.6: topology diagrams of the base (local, remote), box (local, remote), and butterfly (local, remote diverse, remote common) configurations, each built from backbone routers BB1 and BB2, aggregation routers AG1 and AG2 (box and butterfly only), and an access router A.]
Fig. 4.6 “Base”, “box”, and “butterfly” access configurations. Each has a “local” and a “remote” version. The remote versions have routers spread among different offices (the enclosing blue boxes). BB are backbone routers, AG are aggregation routers, and A is an access router

[Figure 4.7: component diagrams for the two topologies named in the caption, with routers Z, BB1, BB2, AG1, AG2, and A. Legend: network equipment (DWDM) span; router port (module) pair; router port (module); software failure or procedural error.]
Fig. 4.7 Components for the simplest “base” and most complex “butterfly remote common” topologies. A component affects the edges or nodes which it overlaps in the diagram (the connection to the Z router is fictitious, representing the part of the network beyond the backbone routers, which is common to all alternatives)

Table 4.3 Mean access disconnection time (years), i.e., time between disconnections of access router A from both backbone routers BB, for the access topologies of Fig. 4.6

                        Hardware    Software & procedural error
Local
  Base                  700         10
  Box                   232         5
  Butterfly             699         5
Remote
  Base                  120         10
  Box                   61          5
  Butterfly diverse     676         5
  Butterfly common      329         5

Summarizing availability by reporting only means makes comparisons easy, but hides information that is important in assessing the risk. By making the reasonable assumption that the isolation events occur according to a Poisson process with the mean times between events given in Table 4.3, the 5-year mean corresponds to a rate of $\lambda = 0.2$ events per year. The number of events in a single year then has the Poisson distribution $P(k) = e^{-\lambda}\lambda^{k}/k!$, so one isolation event occurs with probability of about 16% and two events with probability of about 2%.

4.4.3 Other Studies

Besides what was presented above, nperf has been used in a variety of other studies: the performability of a backbone network under two different types of routing was analyzed in [3], the performability of a multimedia distribution network that tolerates any single link failure was studied in [9, 10], two-layer IP-over-SONET restoration in a satellite network was investigated in [25], and techniques for setting thresholds for bundled links in an IP backbone network were studied in [26].

4.5 Conclusion

This chapter presents an overview of analyzing the combined performance and reliability, known as performability, of networks. Performability analysis may be thought of as repeating a performance analysis in many different states (failures or degradations) of the network, and is thus much more difficult than either reliability or performance analysis on its own. Successful analysis rests on finding a point on the reliability–performance spectrum appropriate to the problem at hand.
Our particular approach to network performability analysis is based on a four-level hierarchical network model, and on the nperf software tool, which embodies a number of methods known in the literature together with some new techniques developed by us, and which is under active development in AT&T Labs Research (finite-time measures, quality-of-service additions to the traffic layer, etc.). We illustrated the ideas of analyzing performability by two case studies carried out with nperf and gave references to other studies in the literature.

References

1. Aven, T., & Jensen, U. (1999). Stochastic models in reliability. New York: Springer.
2. Ahuja, R., Magnanti, T., & Orlin, J. (1998). Network flows. Englewood Cliffs, NJ: Prentice-Hall.
3. Agrawal, G., Oikonomou, K. N., & Sinha, R. K. (2007). Network performability evaluation for different routing schemes. Proceedings of the OFC. Anaheim, CA.
4. Alvarez, G., Uysal, M., & Merchant, A. (2001). Efficient verification of performability guarantees. In PMCCS-5: The fifth international workshop on performability modelling of computer and communication systems. Erlangen, Germany.
5. Bolch, G., Greiner, S., de Meer, H., & Trivedi, K. S. (2006). Queueing networks and Markov chains. Wiley, New Jersey.
6. Bremaud, P. (2008). Markov chains, Gibbs fields, Monte Carlo simulation, and queues. New York: Springer.
7. Carlier, J., Li, Y., & Lutton, J. (1997). Reliability evaluation of large telecommunication networks. Discrete Applied Mathematics, 76(1–3), 61–80.
8. Colbourn, C. J. (1999). Reliability issues in telecommunications network planning. In B. Sansó (Ed.), Telecommunications network planning. Boston: Kluwer.
9. Doverspike, R. D., Li, G., Oikonomou, K. N., Ramakrishnan, K. K., & Wang, D. (2007). IP backbone design for multimedia distribution: architecture and performance. In Proceedings of the IEEE INFOCOM, Alaska.
10. Doverspike, R. D., Li, G., Oikonomou, K. N., Ramakrishnan, K. K., Sinha, R. K., Wang, D., & Chase, C. (2009).
Designing a reliable IPTV network. IEEE Internet Computing, 13(3), 15–22.
11. de Souza e Silva, E., & Gail, R. (2000). Transient solutions for Markov chains. In W. K. Grassmann (Ed.), Computational probability. Kluwer, Boston.
12. Gomes, T. M. S., & Craveirinha, J. M. F. (1997). A case study of reliability analysis of a multiexchange telecommunication network. In C. G. Soares (Ed.), Advances in safety and reliability. Elsevier Science.
13. Gomes, T. M. S., & Craveirinha, J. M. F. (April 1998). Algorithm for sequential generation of states in failure-prone communication network. IEE Proceedings-Communications, 145(2).
14. Gomes, T., Craveirinha, J., & Martins, L. (2002). An efficient algorithm for sequential generation of failures in a network with multi-mode components. Reliability Engineering & System Safety, 77, 111–119.
15. Garey, M., & Johnson, D. (1978). Computers and intractability: a guide to the theory of NP-completeness. San Francisco, CA: Freeman.
16. Harms, D. D., Kraetzl, M., Colbourn, C. J., & Devitt, J. S. (1995). Network reliability: experiments with a symbolic algebra environment. Boca Raton, FL: CRC Press.
17. Jarvis, J. P., & Shier, D. R. (1996). An improved algorithm for approximating the performance of stochastic flow networks. INFORMS Journal on Computing, 8(4).
18. Karger, D. (1995). A randomized fully polynomial time approximation scheme for the all-terminal network reliability problem. In Proceedings of the 27th ACM STOC.
19. Levendovszky, J., Jereb, L., Elek, Zs., & Vesztergombi, Gy. (2002). Adaptive statistical algorithms in network reliability analysis. Performance Evaluation, 48(1–4), 225–236.
20. Li, V. K., & Silvester, J. A. (1984). Performance analysis of networks with unreliable components. IEEE Transactions on Communications, 32, 1105–1110.
21. Levy, Y., & Wirth, P. E. (1989). A unifying approach to performance and reliability objectives. In Teletraffic science for new cost-effective systems, networks and services, ITC-12.
Elsevier Science.
22. Mendiratta, V. B. (2001). A hierarchical modelling approach for analyzing the performability of a telecommunications system. In PMCCS-5: the fifth international workshop on performability modelling of computer and communication systems.
23. Meyer, J. F. (1995). Performability evaluation: where it is and what lies ahead. In First IEEE computer performance and dependability symposium (IPDS), pp 334–343. Erlangen, Germany.
24. Moy, J. T. (1998). OSPF: anatomy of an internet routing protocol. Reading, MA: Addison Wesley.
25. Oikonomou, K. N., Ramakrishnan, K. K., Doverspike, R. D., Chiu, A., Martinez Heath, M., & Sinha, R. K. (2007). Performability analysis of multi-layer restoration in a satellite network. Managing traffic performance in converged networks, ITC 20 (LNCS 4516). Springer.
26. Oikonomou, K. N., & Sinha, R. K. (2008). Techniques for probabilistic multi-layer network analysis. In Proceedings of the IEEE Globecom, New Orleans.
27. Oikonomou, K. N., & Sinha, R. K. (February 2009). Improved bounds for performability evaluation algorithms using state generation. Performance Evaluation, 66(2).
28. Oikonomou, K. N., Sinha, R. K., & Doverspike, R. D. (2009). Multi-layer network performance and reliability analysis. The International Journal of Interdisciplinary Telecommunications & Networking (IJITN), 1(3).
29. Pióro, M., & Medhi, D. (2004). Routing, flow, and capacity design in communication and computer networks. Morgan-Kaufmann.
30. Rauzy, A. (2005). An m log m algorithm to compute the most probable configurations of a system with multi-mode independent components. IEEE Transactions on Reliability, 54(1), 156–158.
31. Shier, D. R., Bibelnieks, E., Jarvis, J. P., & Lakin, R. J. (1990). Algorithms for approximating the performance of multimode systems. In Proceedings of IEEE Infocom.
32. Shier, D. R. (1991). Network reliability and algebraic structures. Oxford: Clarendon.
33. Yang, C. L., & Kubat, P.
(1990). An algorithm for network reliability bounds. ORSA Journal on Computing, 2(4), 336–345.

Chapter 5
Robust Network Planning

Matthew Roughan

M. Roughan, School of Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia; e-mail: matthew.roughan@adelaide.edu.au
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_5, © Springer-Verlag London Limited 2010

5.1 Introduction

Building a network encompasses many tasks: from network planning to hardware installation and configuration, to ongoing maintenance. In this chapter, we focus on the process of network planning. It is possible (though not always wise) to design a small network by eye, but automated techniques are needed for the design of large networks. The complexity of such networks means that any “ad hoc” design will suffer from unacceptable performance, reliability, and/or cost penalties.

Network planning involves a series of quantitative tasks: measuring the current network traffic and the network itself; predicting future network demands; determining the optimal allocation of resources to meet a set of goals; and validating the implementation. A simple example is capacity planning: deciding the future capacities of links in order to carry forecast traffic loads, while minimizing the network cost. Other examples include traffic engineering (balancing loads across our existing network) and choosing the locations of Points-of-Presence (PoPs), though we do not consider this latter problem in detail in this chapter because of its dependence on economic and demographic concerns rather than those of networking.

Many academic papers about these topics focus on individual components of network planning: for instance, how to make appropriate measurements, or particular optimization algorithms. In contrast, in this chapter we will take a system view. We will present each part as a component of a larger system of network planning. In the process of describing how the various components of network planning interrelate, we observe several recurring themes:

1. Internet measurements are of varying quality. They are often imperfect or incomplete and can contain errors or ambiguities. Measurements should not be taken at face value, but need to be continually recalibrated [48], so that we have some understanding of the errors, and can take them into account in subsequent processing. We will describe common measurement strategies in Section 5.2.
2. Analysis and modeling of data can allow us to estimate and predict otherwise unmeasurable quantities. However, in the words of Box and Draper, “Essentially, all models are wrong, but some are useful” [9]. We must be continually concerned with the quality of model-based predictions. In particular, we must consider where they apply, and the consequences of using an inaccurate model. A number of key traffic models are described in Section 5.3, and their use in prediction is described in Section 5.4.
3. Decisions based on quantitative data are at best as good as their input data, but can be worse. The quality of input data and resulting predictions are variable, and this can have consequences for the type of planning processes we can apply. Numerical techniques that are sensitive to such errors are not suitable for network engineering. Discussion of robust, quantitative network engineering is the main consideration of Sections 5.5 and 5.6.

Noting all of the above, it should not be surprising that a robust design process requires validation. The strategy of “set and forget” is not viable in today’s rapidly changing networking environment. The errors in initial measurements and predictions, and the possibility of mistakes in deployment, mean that we need to test whether the implementation of our plan has achieved our goals.
Moreover, actions taken at one level of operations may impact others. For example, Qiu et al. [51] noted that attempts to balance network loads by changing routing can cause higher-layer adaptive mechanisms such as overlay networks to change their decisions. These higher-level changes alter traffic, leading to a change of the circumstances that originally led us to reroute traffic.

Thus, the process of measure → analyze/predict → control → validate should not stop. Once we complete this process, the cycle begins again, with our validation measurements feeding back into the process as the input for the next round of network planning, as illustrated in Fig. 5.1. This cycle allows our planning process to correct problems, leading to a robust process. In many ways this resembles the more formal feedback used in control systems, though robust planning involves a range of tasks not typically modeled in formal control theory. For instance, the lead times for deploying network components such as new routers are still quite long. It can take months to install, configure, and test new equipment when done methodically. Even customers ordering access facilities can experience relatively long intervals from order to delivery, despite the obvious benefits to both parties of a quick startup. So if our network plan is incorrect, we cannot wait for the planning cycle to complete to redress the problem. We need processes where the cycle time is shorter.

[Figure 5.1: cycle diagram linking measurement, analysis/prediction, and decision/control.]
Fig. 5.1 Robust network planning is cyclic

It is relatively simple to reroute traffic across a network. It usually requires only small changes to router configurations, and so can be done from day to day (or even faster if automated). Rebalancing traffic load in the short term, in the interim before the network capacities can be physically changed, can alleviate congestion caused by failures of traffic predictions.
This process is called traffic engineering.

Another aspect of robust planning is incorporation of reliability analysis. Internet switches and routers fail from time to time, and must sometimes be removed from service for maintenance. The links connecting routers are also susceptible to failures, given their vulnerability to natural or man-made accident (the canonical example is the careless back-hoe driver). Most network managers plan for the possibility of node or link failures by including redundant routers and links in their network. A network failure typically results in traffic being rerouted using these redundant pathways. Often, however, network engineers do not plan for overloads that might occur as a result of the rerouted traffic. Again, we need a robust planning process that takes into account the potential failure loads. We call this approach network reliability analysis.

We organize this chapter around the key steps in network planning. We first consider the standard network measurements that are available today. Their characteristics determine much of what we can accomplish in network planning. We then consider models and predictions, and finally the processes used in making decisions and controlling our network. As noted, robust planning does not stop there: we must continue to monitor our network. There are, however, a number of additional steps we can perform in order to achieve a robust network plan, and we consider them in the final section of this chapter.

The focus of this chapter is backbone networks. Though many of the techniques described here remain applicable to access networks, there are a number of critical differences. For instance, access network traffic is often very bursty, and this affects the approaches we should adopt for prediction and capacity planning. Nevertheless, the fundamental ideas of robust planning that we discuss here remain valid.
5.2 Standard Network Measurements

Internet measurements are considered in more detail in Chapters 10 and 11, but a significant factor in network planning is the type of measurements available, and so we need some planning-specific discussion. In principle, it is possible to collect extremely good data, but in practice the measurements are often flawed, and the nature of the flaws is important when considering how to use the data.

The traffic data we might like to collect is a packet trace, consisting of a record of all packets on a subsection of a network along with timestamps. There are various mechanisms for collecting such a trace, for instance, placing a splitter into an optical fiber, using a monitor port on a router, or simply running tcpdump on one of the hosts on a shared network segment. A packet trace gives us all of the information we could possibly need but is prohibitively expensive at the scale we require for planning. The problem with a packet trace (apart from the cost of installing dedicated devices) is that the amount of data involved can be enormous; for example, on an OC48 (2.5 Gbps) link, one might collect more than a terabyte of data per hour. More importantly, a packet trace is overkill. For planning we do not need such detail, but we do need good coverage of the whole network. Packet traces are only used on lower speed networks, or for specific studies of larger networks.

There are several approaches we can use to reduce data to a more manageable amount. Filtering, so that we view only a segment of the traffic (say the HTTP traffic), is useful for some tasks, but not planning. A more useful approach is aggregation, where we only store records for some aggregated version of the traffic, thereby reducing the number of such records needed. A common form of aggregation is at the flow level, where we aggregate the traffic through some common characteristics.
The definition of “flow” depends on the keys used for aggregation, but we mean here flows aggregated by the five-tuple formed from IP source and destination address, TCP/UDP source and destination port numbers, and protocol number. Flow data is typically collected within some time frame, for instance, 15 min periods. What is more, flow-level collection is often a feature of a router, and so does not require additional measurement infrastructure other than the Network Management Station (NMS) at which the data is stored. However, the volume of data can still be large (one network under study collected 500 GB of data per day), and the collection process may impact the performance of the router. As a result, flow-level data is often collected in conjunction with a third method for data reduction: sampling.

Sampling can be used both before the flows are created, and afterward. Prior to flow aggregation, sampling is used at rates of around 1:100–1:500 packets. That is, at most 1% of packets are sampled. This has the advantage that less processing is required to construct flow records (reducing the load on the router collecting the flows) and typically fewer flow records will be created (reducing memory and data transmission requirements). However, sampling prior to flow aggregation does have flaws. Most obviously, it biases the data collection toward long flows: these flows (involving many packets) are much more likely to be sampled than short flows. However, this has rarely been seen as a problem in network planning, where we are not typically concerned with the flow length distribution. Sampling can also be used after flow aggregation to reduce the transmission and storage requirements for such data. The degree of sampling depends on the desired trade-off between accuracy of measurements and storage requirements for the data.
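The bias of packet sampling toward long flows, noted above, can be seen from a short calculation. Under the simplifying assumption that each packet is sampled independently with probability p (a router may instead sample every N-th packet), a flow of n packets appears in the sampled data with probability 1 − (1 − p)^n. The sketch below evaluates this for a few hypothetical flow sizes; the function name and the values are illustrative only.

```python
def flow_sampling_probability(n_packets: int, p: float) -> float:
    """Probability that at least one packet of an n-packet flow is sampled.

    Assumption: packets are sampled independently, each with probability p.
    """
    if n_packets < 0:
        raise ValueError("n_packets must be non-negative")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return 1.0 - (1.0 - p) ** n_packets


# Hypothetical flow sizes at a 1:100 packet sampling rate.
SAMPLING_RATE = 1.0 / 100.0
for n in (1, 10, 100, 1000, 10000):
    prob = flow_sampling_probability(n, SAMPLING_RATE)
    print(f"{n:6d} packets -> seen with probability {prob:.4f}")
```

At 1:100 sampling a single-packet flow is recorded only 1% of the time, whereas a flow of a few thousand packets is almost certain to be recorded. Total packet or byte volumes can still be estimated by scaling the sampled counts by 1/p, which is consistent with the observation that this bias matters little when flow-length distributions are not of interest.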
Good statistical approaches for this sampling, and for estimating the resulting accuracy of the samples, are available [16, 17], though, as noted above, these are predominantly aimed at preserving details such as flow-length distributions, which are largely inconsequential for the type of planning discussed here, so sampling prior to flow construction is often sufficient for planning.

Of more importance here is the fact that any type of sampling introduces errors into measurements. Any large-scale flow archives must involve significant sampling, and so will contain errors.

An alternative to flow-level data is data collected via the Simple Network Management Protocol (SNMP) [39]. Its advantage over flow-level data collection is that it is more widely supported, and less vendor specific. However, the data provided is less detailed. SNMP allows an NMS to poll MIBs (Management Information Bases) at routers. Routers maintain a number of counters in these MIBs. The widely supported MIB-II contains counters of the number of packets and bytes transmitted and received at each interface of a router. In effect, we can see the traffic on each link of a network. In contrast to flow-level data, SNMP can only see link volumes, not where the traffic is going.

SNMP has a number of other issues with regard to data collection. The polling mechanism typically uses UDP (the User Datagram Protocol), and SNMP agents are given low priority at routers. Hence SNMP measurements are not reliable, and it is difficult to ensure that we obtain uniformly sampled time series. The result is missing and error-prone data.

Flow-level data contains only flow start and stop times, not details of packet arrivals, and typically SNMP is collected at 5-min intervals. The limit on timescale of both data sets is important in network planning. We can only see average traffic rates over these periods, not the variations inside these intervals.
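Because the MIB-II interface counters are cumulative, an average rate over a polling interval is obtained by differencing two successive polls. A minimal sketch follows, assuming 32-bit octet counters that wrap at 2^32 (as the MIB-II ifInOctets and ifOutOctets counters do), at most one wrap between polls, and poll timestamps in seconds; the function name and the sample values are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32


def average_rate_bps(octets_prev: int, octets_curr: int,
                     t_prev: float, t_curr: float,
                     modulus: int = COUNTER32_MODULUS) -> float:
    """Average bit rate between two polls of a cumulative octet counter.

    Assumptions: the counter wrapped at most once between the polls, and
    the timestamps are the actual poll times in seconds.
    """
    elapsed = t_curr - t_prev
    if elapsed <= 0:
        raise ValueError("polls must be in increasing time order")
    # The modulo makes the difference non-negative across a single wrap.
    delta_octets = (octets_curr - octets_prev) % modulus
    return 8.0 * delta_octets / elapsed


# Hypothetical polls taken 300 s (5 min) apart.
print(average_rate_bps(0, 37_500_000, 0.0, 300.0))       # no wrap
print(average_rate_bps(2 ** 32 - 1000, 500, 0.0, 300.0))  # one wrap
```

Dividing by the actual elapsed time rather than the nominal 5 min keeps the estimate meaningful when a poll is delayed or lost. On fast links a 32-bit counter can wrap more than once in a polling interval, in which case this sketch underestimates the rate. In every case the result is only an average over the interval.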
Congestion and subsequent packet loss, however, often occur on timescales much shorter than these measurement intervals. The result is that such average measurements must always be used with care. Typically some overbuild of capacity is required to account for the sub-interval variations in traffic. The exact overbuild will depend on the network in question, and has typically been derived empirically through ongoing performance and traffic measurements. Values are usually fairly conservative in major backbones, resulting in apparent underutilization (though this term is unfair as it concerns average utilizations, not peak loads), and more aggressive in smaller networks.

In addition to traffic data, network planning requires a detailed view of any existing network. We need to know
• The (layer 3) topology (the locations of, and the links between, routers)
• The network routing policies (for instance, link weights in a shortest-path protocol, areas in protocols such as OSPF, and BGP policies where multiple interdomain links exist)
• The mapping between current layer 3 links and physical facilities (WDM equipment and optical fibers), and the details of the available physical network facilities and their associated costs

The topology and routing data is principally needed to allow us to map traffic to links. The mapping is usually expressed through the routing matrix. Formally, $A = \{A_{ir}\}$ is the matrix defined by

\[ A_{ir} = \begin{cases} F_{ir}, & \text{if traffic for } r \text{ traverses link } i, \\ 0, & \text{otherwise,} \end{cases} \tag{5.1} \]

where $F_{ir}$ is the fraction of traffic from source/destination pair $r = (s, d)$ that traverses link $i$. A network with $N$ nodes and $L$ links will have an $L \times N(N-1)$ routing matrix.

Network data is also used to assess how changes in one component will affect the network (e.g., how changes in OSPF link weights will impact link loads); to determine shared risk-of-failure between links; and to determine how to improve our network incrementally without completely rebuilding it in each planning cycle.
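To illustrate how the routing matrix of (5.1) maps traffic to links, the sketch below computes the load on each link as the sum over source/destination pairs r of A_ir times the demand of r. The three-node line topology, the demand values, and the function name are hypothetical and chosen only for illustration. All entries here are 0 or 1 because each pair has a single path; fractional entries would arise if traffic were split across several paths.

```python
def link_loads(routing_matrix, demands):
    """Return the load on each link: y_i = sum_r A[i][r] * x[r]."""
    return [sum(a_ir * x_r for a_ir, x_r in zip(row, demands))
            for row in routing_matrix]


# Hypothetical line topology a - b - c with one directed link per direction.
# Columns (source/destination pairs): (a,b) (a,c) (b,a) (b,c) (c,a) (c,b)
ROUTING_MATRIX = [
    [1, 1, 0, 0, 0, 0],  # link a->b carries (a,b) and (a,c)
    [0, 1, 0, 1, 0, 0],  # link b->c carries (a,c) and (b,c)
    [0, 0, 1, 0, 1, 0],  # link b->a carries (b,a) and (c,a)
    [0, 0, 0, 0, 1, 1],  # link c->b carries (c,a) and (c,b)
]
DEMANDS = [10, 5, 8, 3, 2, 7]  # traffic per source/destination pair

if __name__ == "__main__":
    print(link_loads(ROUTING_MATRIX, DEMANDS))
```

With N = 3 nodes and L = 4 directed links the matrix is 4 x 6, matching the L x N(N-1) size given above. Recomputing the matrix for a changed set of link weights and re-running the product is one simple way to assess the impact of a routing change on link loads.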
Incremental improvement is an important point because, although it might be preferable to rebuild a network from scratch, the capital value of legacy equipment usually prevents this option, except at rare intervals.

For a small, static network, the network data may be maintained in a database; however, best practice for large, complex, or dynamic networks is to use tools to extract the network structure directly from the network. There are several methods available for discovering this information. SNMP can provide this information through the use of various vendor tools (e.g., HP OpenView or Cisco NCM), but it is not the most efficient approach. A preferable approach for finding layer 3 information is to parse the configuration files of routers directly, for instance, as described in [22,24]. The technique has been applied in a number of networks [5,38]. The advantages of using configuration files are manifold. The detail of information available is unparalleled in other data sources. For instance, we can see details of the links (such as their composition, should a single logical link be composed of more than one physical link).

The other major approach for garnering topology and routing information is to use a route monitor. Internet routing is built on top of distributed computations supported by routing protocols. The distribution of these protocols is often considered a critical component in ensuring reliability of the protocols in the face of network failures. The distribution also introduces a hook for topology discovery. If any router must be able to build its routing table from the routing information distributed through these protocols, then it must have considerable information about the network topology. Hence, we can place a dummy router into the network to collect such information. Such routing monitors have been deployed widely over the last few years. Their advantage is that they can provide an up-to-date dynamic view.
Examples of such monitors exist for OSPF [61, 62] and IS-IS [1, 30], as well as for BGP (the Border Gateway Protocol) [2, 3].

5.3 Analysis and Modeling of Internet Traffic

5.3.1 Traffic Matrices

We will now consider the analysis and modeling of Internet data, in particular, traffic data. When considering inputs to network planning, we frequently return to the topic of traffic matrices. These are the measurements needed for many network planning tasks, and thus the natural structure around which we shall frame our analysis.

A Traffic Matrix (TM) describes the amount of traffic (the number of packets or, more commonly, bytes) transmitted from one point in a network to another during some time interval. TMs are naturally represented by a three-dimensional data structure $T_t(i, j)$, which represents the traffic volume (in bytes or packets) from $i$ to $j$ during a time interval $[t, t + \Delta t)$. The locations $i$ and $j$ are generally considered to be physical geographic locations, making $i$ and $j$ spatial variables. However, in the Internet, it is common to associate $i$ and $j$ with logical structures related to the address structure of the Internet, i.e., IP addresses, or natural groupings of such by common prefix corresponding to a subnet.

Origin/Destination Matrices

One natural approach to describe traffic matrices is with respect to traffic volumes between IP addresses or prefixes. We refer to this as an origin/destination TM because the IP addresses represent the closest approximation we have for the end points of the network (though HTTP-proxies, firewalls, NAT, and other middle-boxes may be obscuring the true end-to-end semantics). IPv4 admits nearly $2^{32}$ potential addresses, so we cannot describe the full matrix at this level of granularity. Typically, such a traffic matrix would be aggregated into blocks of IP addresses (often using routing prefixes to form the blocks as these are natural units for the control of traffic).
The origin/destination matrix is our ideal input for many network planning tasks, but the Internet is made up of many connected networks. Any one network operator only sees the traffic carried by its own network. This reduced visibility means that our observed traffic matrix is only a segment of the real network traffic, so we cannot really observe the origin/destination TM. Instead we typically observe the ingress/egress traffic matrix.

Ingress/Egress versus Origin/Destination

A more practical TM, the ingress/egress TM, provides traffic volumes from ingress link to egress link across a single network. Note that networks often interconnect at multiple points. The choice of which route to use for egress from a network can profoundly change the nature of ingress/egress TMs, so these may have quite different properties to the origin/destination matrix. Forming an ingress/egress TM from an origin/destination TM involves a simple mapping of prefixes to ingress/egress locations in a network, but in practice this mapping can be difficult unless we monitor traffic as it enters the network. We can infer egress points of traffic using the routing data described above, but inferring ingress is more difficult [22, 23], so it is better to measure this directly.

Spatial Granularity of Traffic Matrices

As we have started to see with origin/destination traffic matrices, we can measure them at various levels of granularity (or resolution). The same is true of ingress/egress TMs. At the finest level, we measure traffic per ingress/egress link (or interface). However, it is common to aggregate this data to the ingress/egress router. We can often group routers into larger subgroups. A common such group is a Point-of-Presence (PoP), though there are other sub- and super-groupings (e.g., topologically equivalent edge routers are sometimes grouped, or we may form a regional group).
Given subsets $S$ and $D$ of locations, we may simply aggregate a TM across these by taking

\[ T_t(S, D) = \sum_{i \in S} \sum_{j \in D} T_t(i, j). \tag{5.2} \]

Typical large networks might have 10s of PoPs and 100s of routers, and so such TMs are of a more workable size. In addition, as we aggregate traffic into larger groupings, statistical multiplexing reduces the relative variance of the traffic and allows us to make better estimates of traffic properties such as the mean and variance.

Temporal Granularity of Traffic Matrices

We cannot make instantaneous measurements of a traffic matrix. All such observations occur over some time interval $[t, t + \Delta t)$. It would be useful to make the interval $\Delta t$ smaller (for instance, for detecting anomalies), but typically we face a trade-off against the errors and uncertainties in our measurements. A longer time interval allows more “averaging-out” of errors, and minimizes the impact of missing data. The best choice of time interval for TMs is typically determined by the task at hand and the network under study, but a common choice is a 1-hour interval. In addition to being easily understood by human operators, this interval integrates enough SNMP or flow-level data to reduce the impact of (typical) missing data and errors, while allowing us to still observe important diurnal patterns in the traffic.

5.3.2 Patterns in Traffic

It is useful to have some understanding of the typical patterns we see in network traffic. Such patterns are only visible at a reasonable level of aggregation (otherwise random temporal variation dominates our view of the traffic), but for high degrees of aggregation (such as router-to-router traffic matrices on a large backbone network) the pattern can be very regular. There are two main types of patterns that have been observed: patterns across time, and patterns in the spatial structure. Each is discussed below.

Temporal Patterns

Internet traffic has been observed to follow both daily (diurnal) and weekly cycles [33–35,57,64].
The origin of these cycles is quite intuitive. They arise because most Internet traffic is currently generated by humans whose activities follow such cycles. Typical examples are shown in Figs. 5.2 and 5.3. Figure 5.2 shows an RRDTool graph [footnote 1] of the traffic on a link of the Australian Academic Research Network (AARNet). Figure 5.3 shows the total traffic entering AT&T’s North American backbone network at a Point of Presence (PoP) over two consecutive weeks in May 2001. The figure illustrates the daily and weekly variations in the traffic by overlaying the traffic from the 2 weeks. The striking similarity between traffic patterns from week to week is a reflection of the high level of aggregation that we see in a major backbone network.

Footnote 1: RRDTool (the Round Robin Database tool) [47] and its predecessor MRTG (the Multi-Router Traffic Grapher [46]) are perhaps the most common tools for collecting and displaying SNMP traffic data.

[Fig. 5.2 Traffic on one link in the Australian Academic Research Network (AARNet) for just over 1 week. The two curves show traffic in either direction along the link. Vertical axis: bits per second (0.0 M to 20.4 M); horizontal axis: day of the week (Sat to Sun).]

[Fig. 5.3 Total traffic into a region over 2 consecutive weeks. The solid line is the first week’s data (starting on May 7), and the dashed line shows the second week’s data. The second figure zooms in on the shaded region of the first. Panel titles: “Traffic: 07-May-2001 (GMT)” and “Traffic: 08-May-2001 (GMT)”; vertical axis: traffic rate; horizontal axis: time (GMT).]

The observation of cycles in traffic is not new. For many years they have been seen in telephony [13]. Typically telephone service capacity planning has been based on a “busy hour”, i.e., the hour of the day that has the highest traffic.
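As a small illustration of the busy-hour idea, the sketch below finds the hour of the day with the highest mean traffic from a set of hourly measurements. Averaging each hour over several days before taking the maximum is an assumption made here for robustness, not a definition taken from the text, and the function name and sample values are hypothetical.

```python
def busy_hour(samples):
    """Find the hour of day with the highest mean traffic.

    samples: iterable of (hour_of_day, volume) pairs, possibly
    covering several days.
    Returns (busiest_hour, {hour: mean_volume}).
    """
    totals, counts = {}, {}
    for hour, volume in samples:
        totals[hour] = totals.get(hour, 0.0) + volume
        counts[hour] = counts.get(hour, 0) + 1
    means = {hour: totals[hour] / counts[hour] for hour in totals}
    return max(means, key=means.get), means


if __name__ == "__main__":
    # Two days of (hypothetical) measurements at 03:00, 09:00 and 20:00.
    data = [(3, 1), (9, 10), (20, 30), (3, 1), (9, 20), (20, 50)]
    hour, means = busy_hour(data)
    print(hour, means[hour])
```

Running the same calculation separately for each part of a geographically dispersed network would show whether the busy hours coincide.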
The time of the busy hour depends on the application and customer base. Access networks typically have many domestic consumers, and consequently their busy hour is in the evening when people are at home. On the other hand, the busy hour of business customers is typically during the day. Obviously, time-zones have an effect on the structure of the diurnal cycle in traffic, and so networks with a wide geographic dispersion may experience different busy hours on different parts of their network.

In addition to cyclical patterns, Internet traffic has shown strong growth over many years [45]. This long-term trend has often been approximated by exponential growth, although care must be taken because sometimes such estimates have been based on poor (short or erratic) data [45]. Long-term trends should be estimated from multiple years of carefully collected data.

[Fig. 5.4 ABS traffic measurements showing Australian Internet traffic, with an exponential fit to the data from 2000 to 2005. Data is shown by ‘o’, and the fit by the straight line. Note that the line continuing past 2005 is a prediction based on the pre-2005 data, showing also the 95th percentile confidence bounds for the predictions. Vertical axis: traffic (PB/quarter), logarithmic; horizontal axis: year (2000 to 2009).]

One public example is the data collected by the Australian Bureau of Statistics (ABS) [footnote 2], which has collected historical data on Australian ISP traffic for many years. Figure 5.4 shows Australia’s network traffic in petabytes per quarter with a log-y axis. Exponential growth appears as a straight line on the log-graph, so we can obtain simple predictions of traffic growth through linear regression. The figure shows such a prediction based on pre-2005 data. It is interesting to note that the most recent data point does not, as one might assume without analysis, represent a significant drop in traffic growth.
Relative to the long-term data the last point simply represents a reversion to the long-term growth from rather exceptional traffic volumes over 2007. We will discuss such prediction in more detail in the following sections.

Footnote 2: www.abs.gov.au

Standard time-series analysis [10] can be used to build a model of traffic containing long-term trends, cyclical components (often called seasonal components in other contexts), and random fluctuations. We will use the following notation here:

\[ S(t) = \text{seasonal (cyclical) component}, \tag{5.3} \]
\[ L(t) = \text{long-term trend}, \tag{5.4} \]
\[ W(t) = \text{random fluctuations}. \tag{5.5} \]

The seasonal component is periodic, i.e., $S(t + kT_S) = S(t)$ for all integers $k$, where $T_S$ is the period (which is either 24 hours or 1 week). Before we can consider how to estimate the seasonal (and trend) components of the traffic, we must model these components [footnote 3]. At the most basic level, consider the traffic to consist of two components, a time varying (but deterministic) mean $m(t)$ and a stochastic component $W(t)$. At this level we could construct the traffic by addition or multiplication of these components (both methods are used in econometric and census data). However, in traffic data, a more appropriate model [43, 56] is

\[ x(t) = m(t) + \sqrt{a\, m(t)}\; W(t), \tag{5.6} \]

where $a$ is called the peakedness of the traffic, $W(t)$ is a stochastic process with zero mean and unit variance, and $x(t)$ represents the average rate of some traffic (say a particular traffic matrix element) at time $t$. More highly aggregated traffic is smoother, and consequently would have a smaller value for $a$. The reason for this choice of model lies in the way network traffic behaves when aggregated. When multiple flows are aggregated onto a non-congested link, we should expect them to obey the same model (though perhaps with different parameters).
Our model has this property. For instance, take $N$ traffic streams $x_i$ with means $m_i$, peakedness $a_i$, and stochastic components $W_i$ that are independent realizations of a (zero mean, unit variance) Gaussian process. The multiplexed traffic stream is

\[ x = \sum_{i=1}^{N} m_i + \sum_{i=1}^{N} \sqrt{a_i m_i}\; W_i. \tag{5.7} \]

The mean of the new process is $m = \sum_{i=1}^{N} m_i$, and the peakedness (derived from the variance) is $a = \frac{1}{m} \sum_{i=1}^{N} a_i m_i$, which is a weighted average of the component peakednesses. The relative variance becomes

\[ V_x = \mathrm{Var}\{x\} / E\{x\}^2 = \frac{1}{m^2} \sum_{i=1}^{N} a_i m_i. \tag{5.8} \]

If we take $N$ identical streams, each with mean $m_0$ and peakedness $a_0$, then $V_x = a_0 / (N m_0)$, so the relative variance decreases as we multiplex more together, which is to be expected. The result is that in network traffic the level of aggregation is important in determining the relative variance: more highly aggregated traffic exhibits less random behavior. The data in Fig. 5.3 from AT&T shows an aggregate of a very large number of customers (an entire PoP of one of North America’s largest networks). The consequence is that we can see the traffic is very smooth. In contrast the traffic shown in Fig. 5.2 is much less aggregated, and shows more random fluctuations.

Footnote 3: The reader should beware of methods that do not explicitly model the data, because in these methods there is often an implicit model.

The model described above is not perfect (none are), but it is useful because (i) it allows us to calculate variances for aggregated traffic streams in a consistent way and to use these when planning our network, and (ii) its parameters are relatively easy to measure, and therefore to use in traffic analysis. To do so, however, we find it useful to split the mean $m(t)$ into the cyclic component (which we denote $S(t)$) and the long-term trend $L(t)$ by taking the product

\[ m(t) = L(t) S(t). \tag{5.9} \]

We combine the two components through a product because as the overall load increases the range of variation in the size of cycles also increases.
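The aggregation property of (5.7) and (5.8) is easy to check numerically. The following sketch computes the mean, peakedness, and relative variance of a multiplexed stream from the means and peakednesses of its components, assuming (as in the text) that the stochastic components are independent. The function name and the example values are hypothetical.

```python
def multiplex(streams):
    """Combine independent streams that follow the model of (5.6).

    streams: list of (mean, peakedness) pairs.
    Returns (mean, peakedness, relative_variance) of the aggregate,
    where relative_variance = Var{x} / E{x}^2.
    """
    mean = sum(m_i for m_i, _ in streams)
    variance = sum(a_i * m_i for m_i, a_i in streams)  # Var of each term is a_i * m_i
    peakedness = variance / mean
    return mean, peakedness, variance / mean ** 2


if __name__ == "__main__":
    # Multiplex 1, 4 and 16 identical streams (mean 10, peakedness 2).
    for n in (1, 4, 16):
        print(n, multiplex([(10.0, 2.0)] * n))
```

For identical streams the peakedness of the aggregate stays at 2 while the relative variance falls in proportion to 1/N, which is the smoothing effect visible when comparing Fig. 5.3 with Fig. 5.2.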
When estimating parameters of our models, it is important to allow for unusual or anomalous events, for instance, a Denial of Service (DoS) attack. These events are rare (we hope), but it is important to separate them from the normal traffic. Such events can sometimes be very large, but we do not plan network capacity to carry DoS attacks! The network is planned around the paying customers. We separate them by including an impulsive term, $I(t)$, in the model, so that the complete model is

\[ x(t) = L(t) S(t) + \sqrt{a\, L(t) S(t)}\; W(t) + I(t). \tag{5.10} \]

We will further discuss this model in Section 5.4, where we will consider how to estimate its parameters, and to use it in prediction.

Spatial Patterns

Temporal models are adequate for many applications: for instance, where we consider dimensioning of a single bottleneck link (perhaps in the design of an access network). However, spatial patterns in traffic provide us with additional planning capabilities. For instance, if two traffic sources are active at different times, then clearly we can carry them both with less capacity than if they are active simultaneously.

Spatial patterns refer to the structure of a Traffic Matrix (TM) at a single time interval. It is common that TM elements are strongly correlated because they show similar diurnal (and weekly) patterns. For example, in a typical network (without wide geographic distribution) one will find that the busy hour is almost the same for all elements of the TM, but there is additional structure.

For a start, TMs often come from skewed distributions. A common example is where the distribution follows a rough 80–20 law (80% of traffic is generated by the largest 20% of TM elements). Similar distributions have often been observed, though often even more skewed: for instance, 90–10 laws are not uncommon. However, the distribution is not “heavy-tailed”. Observed distributions have shown a lighter tail than the log-normal distribution [55].
Consequently, traffic matrix work often concentrates on these larger flows, but traditional (rather than heavy-tailed) statistical techniques are still applicable.

Another simple feature one might naively expect of TMs, symmetry, is not present. Internet routing is naturally asymmetric, as is application traffic (a large amount of traffic still follows a client–server model, which results in strongly asymmetric traffic). Hence, the matrix will not (generally) be symmetric [21], i.e., $T(i, j) \neq T(j, i)$.

We observe some additional structure in these matrices. The simplest model that describes some of the observed structure is the gravity model. In network applications, gravity models have been used to model the volume of telephone calls in a network [31]. Gravity models take their name from Newton’s law of gravitation, and are commonly used by social scientists to model the movement of people, goods, or information between geographic areas [49,50,63]. In Newton’s law of gravitation the force is proportional to the product of the masses of the two objects divided by the distance squared. Similarly, in gravity models for interactions between cities, the relative strength of the interaction might be modeled as proportional to the product of the cities’ populations, so a general formulation of a gravity model is given by

\[ T(i, j) = \frac{R_i A_j}{f_{ij}}, \tag{5.11} \]

where $R_i$ represents the repulsive factors that are associated with leaving from $i$; $A_j$ represents the attractive factors that are associated with going to $j$; and $f_{ij}$ is a friction factor from $i$ to $j$.

The gravity model was first used in the context of Internet traffic matrices in [67], where we can naturally interpret the repulsion factor $R_i$ as the volume of incoming traffic at location $i$, and the attractivity factor $A_j$ as the outgoing traffic volume at location $j$.
The friction matrix $f_{ij}$ encodes the locality information specific to different source–destination pairs. However, as locality is not as large a factor in Internet traffic as in the transport of physical goods, it is common to assume $f_{ij} = \text{const}$. The resulting gravity model simply states that the traffic exchanged between locations is proportional to the volumes entering and exiting at those locations. Formally, let $T^{\text{in}}(i)$ and $T^{\text{out}}(j)$ denote the total traffic that enters the network via $i$, and exits via $j$, respectively. The gravity model can then be computed by

\[ T(i, j) = \frac{T^{\text{in}}(i)\, T^{\text{out}}(j)}{T^{\text{tot}}}, \tag{5.12} \]

where $T^{\text{tot}}$ is the total traffic across the network. Implicitly, this model relies on a conservation assumption, i.e., traffic is neither created nor destroyed in the network, so that $T^{\text{tot}} = \sum_k T^{\text{in}}(k) = \sum_k T^{\text{out}}(k)$. The assumption may be violated, for instance, when congestion causes packet loss. However, in most backbones congestion is kept low, and so the assumption is reasonable.

In the form just described, the gravity model has distinct limitations. For instance, real traffic matrices may have non-constant $f_{ij}$ (perhaps as a result of different time zones). Moreover, even if an origin/destination traffic matrix matches the gravity model well, the ingress/egress TM may be systematically distorted [7]. Typically, networks use hot-potato routing, i.e., they choose the egress point closest to the ingress point, and this results in a systematic distortion of ingress/egress traffic matrices away from the simple gravity model. These distortions and others related to the asymmetry of traffic and distance sensitivity may be incorporated in generalizations of the gravity model where sufficient data exists to measure such deviations [13, 21, 67].

The use of temporal patterns in planning is relatively obvious. The use of spatial patterns such as the gravity model is more subtle.
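The simple gravity model of (5.12) takes only a few lines to compute. The sketch below builds a traffic matrix from the per-location totals entering and leaving the network; the function name, the tolerance used to check the conservation assumption, and the example totals are hypothetical.

```python
def gravity_model(traffic_in, traffic_out, rel_tol=1e-6):
    """Estimate T(i, j) = T_in(i) * T_out(j) / T_tot, as in (5.12).

    traffic_in[i]:  total traffic entering the network at location i.
    traffic_out[j]: total traffic leaving the network at location j.
    """
    total_in, total_out = sum(traffic_in), sum(traffic_out)
    if total_in <= 0:
        raise ValueError("total traffic must be positive")
    # Conservation assumption: traffic is neither created nor destroyed.
    if abs(total_in - total_out) > rel_tol * total_in:
        raise ValueError("ingress and egress totals do not match")
    return [[t_in * t_out / total_in for t_out in traffic_out]
            for t_in in traffic_in]


if __name__ == "__main__":
    for row in gravity_model([60.0, 30.0, 10.0], [50.0, 25.0, 25.0]):
        print(row)
```

By construction the row sums of the result reproduce the ingress totals and the column sums reproduce the egress totals. Note that this plain form also assigns traffic to the diagonal elements T(i, i) and assumes a constant friction factor, so it inherits the limitations discussed above.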
The spatial structure gives us the capability to fill in missing values of the traffic matrix when our data is not perfect. Hence we can still plan our network, even in the extreme case where we have no data at all.

5.3.3 Application Profile

We have so far discussed network traffic along two dimensions: the temporal and spatial. There is a third aspect of traffic to consider: its application breakdown, or profile. Common applications on the Internet are email, web browsing (and other server-based interactions), peer-to-peer file transfers, video, and voice. Each may have a different traffic matrix, and as some networks move toward differentiated Quality of Service (QoS) for different classes of traffic, we may have to plan networks based on these different traffic matrices.

Even where differentiated service is not going to be provided, a knowledge of the application classes in our network can be very useful. For instance:
• Voice traffic is less variable than data, and so can require less overhead for sub-measurement-interval variations.
• Peer-to-peer applications typically generate more symmetric traffic than web traffic, and so downstream capacity (toward customer eyeballs) is likely to be more balanced when peer-to-peer applications dominate.
• We may be planning to eliminate some types of traffic in future networks (e.g., peer-to-peer traffic has often been considered to violate service agreements that prohibit running servers).

The breakdown of traffic on a network is not trivial to measure. As noted, typical flow-level data collection includes TCP/UDP port numbers, and these are often associated with applications using the IANA (Internet Assigned Numbers Authority) list of registered ports [footnote 4]. However, the port numbers used today are often associated with incorrect applications because:
• Ports are not defined with IANA for all applications, e.g., some peer-to-peer applications.
• An application may use ports other than its well-known ports to circumvent access control restrictions, e.g., nonprivileged users often run WWW servers on ports other than port 80, which is restricted to privileged users on most operating systems, while port 80 is often used for applications other than HTTP in order to work around firewalls.
• In some cases server ports are dynamically allocated as needed. For example, FTP allows the dynamic negotiation of the server port used for the data transfer. This server port is negotiated on an initial TCP connection, which is established using the well-known FTP control port, but which would appear as a separate flow.
• Malicious traffic (e.g., DoS attacks) can generate a large volume of bogus traffic that should not be associated with the applications that normally use the affected ports.

Footnote 4: http://www.iana.org/assignments/port-numbers

In addition, there are some incorrect implementations of protocols, and ambiguous port assignments that complicate the problem. Better approaches to classification of traffic exist (e.g., [58]), but are not always implemented on commercial measurement systems.

Application profiles can be quite complex. Typical Internet providers will see some hundreds of different applications. However, there are two major simplifications we can often perform. The first is a clustering of applications into classes. QoS sometimes forms natural classes (e.g., real-time vs bulk-transfer classes), but regardless we can often group many applications into similarly structured classes, e.g., we can group a number of protocols (IMAP, POP, SMTP, etc.) into one class “email”. Common groupings are shown in Table 5.1, along with exemplar applications. There may be a larger number of application classes, and often there is a significant group of unknown applications, but a typical application profile is highly skewed. Again, it is common to see 80–20 or 90–10 rules.
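A minimal sketch of port-based classification and the resulting application profile is given below. The port-to-class table is a small, hand-picked subset of well-known port assignments chosen for illustration, the class names follow the groupings used in this section, and the function names are hypothetical. As discussed above, port numbers are an unreliable guide to the application, so the output should be read as a rough profile only.

```python
# Illustrative subset of well-known ports mapped to application classes.
PORT_CLASS = {
    20: "bulk-data", 21: "bulk-data",              # FTP data / control
    22: "interactive", 23: "interactive",          # SSH, Telnet
    25: "email", 110: "email", 143: "email",       # SMTP, POP3, IMAP
    53: "network control", 179: "network control", # DNS, BGP
    80: "www", 443: "www",                         # HTTP, HTTPS
    119: "news",                                   # NNTP
}


def classify(src_port, dst_port):
    """Guess the application class of a flow from its port numbers."""
    for port in (dst_port, src_port):
        if port in PORT_CLASS:
            return PORT_CLASS[port]
    return "unknown"


def application_profile(flows):
    """Return {class: fraction of bytes}, largest class first.

    flows: iterable of (src_port, dst_port, num_bytes).
    """
    totals = {}
    for src_port, dst_port, num_bytes in flows:
        app_class = classify(src_port, dst_port)
        totals[app_class] = totals.get(app_class, 0) + num_bytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda item: -item[1])
    return {app_class: volume / grand_total for app_class, volume in ranked}


if __name__ == "__main__":
    flows = [(51000, 80, 800), (25, 52000, 100), (40000, 40001, 100)]
    print(application_profile(flows))
```

Flows whose ports match nothing in the table fall into the "unknown" class, which in practice can be a significant share of the traffic.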
When a few applications dominate the profile in this way, it is common to focus attention on those applications that generate the most traffic, reducing the complexity of the profile. However, care must be taken because some applications that generate relatively little traffic on average may be considered very important, and/or may generate high volumes of traffic for short bursts. There are several such examples in enterprise networks: for instance, consider a CEO’s once-a-week company-wide broadcast, or nightly backups. Both generate a large amount of traffic, but in a relatively short time interval, so their proportion of the overall network traffic may be small. More generally, much of the control-plane traffic (e.g., routing protocol traffic) in networks is relatively low volume, but of critical importance.

Table 5.1 Typical application classes grouped by typical use

Class              Example applications
Bulk-data          FTP, FTP-Data
Database access    Oracle, MySQL
Email              IMAP, POP, SMTP
Information        finger, CDDBP, NTP
Interactive        SSH, Telnet
Measurement        SNMP, ICMP, Netflow
Network control    BGP, OSPF, DHCP, RSVP, DNS
News               NNTP
Online gaming      Quake, Everquest
Peer-to-peer       Kazaa, Bit-torrent
Voice over IP      SIP, Skype
www                HTTP, HTTPS

5.4 Prediction

There are two common scenarios for network planning:
1. Incremental planning for network evolution
2. Green-fields planning

In the first case, we have an existing network. We can measure its current traffic, and extrapolate trends to predict future growth. In combination with business data, quite accurate assessments of future traffic are possible. Typically, temporal models are sufficient for incremental network planning, though better results might be possible with recently developed full spatio-temporal models [52].

In green-fields planning, we have the advantage that we are not constrained in our network design. We may start with a clean slate, without concerning ourselves with a legacy network.
However, in such planning we have no measurements on which to base predictions. All is not lost, however, as we may exploit the spatial properties of traffic matrices in order to obtain predictions.

We discuss each of these cases below. There are other scenarios of concern to the network planner. For example:
• Network mergers, for instance, when two companies merge and subsequently combine their networks.
• Network migrations, for instance, as significant services such as voice or frame-relay are migrated to operate on a shared backbone.
• Addition (or loss) of a large customer (say a broadband access provider, a major content provider, or a hosting center).
• A change in interdomain routing relationships. For instance, the conversion of a customer to a peer would mean that traffic no longer transits from that peer, altering traffic patterns.

The impact of these types of event is obviously dependent on the relative volume of the traffic affected. Such events can be particularly significant for smaller networks, but it is not unheard of for them to cause unexpected demands on the largest networks (for instance, the migration of an estimated half-million customers from Excite@home to AT&T in 2002 [footnote 5]). However, the majority of such cases can be covered by one or both of the techniques below.

Footnote 5: http://news.cnet.com/ExciteHome-to-shut-down-ATT-drops-bid/2100-1033_3-276550.html

5.4.1 Prediction for Incremental Planning

Incremental planning involves extending, or evolving, a current network to meet changing patterns of demands, or changing goals. The problem involves prediction of future network demands, based on extrapolation of past and present network measurements. The planning problems we encounter are often constrained by the fact that we can make only incremental changes to our network, i.e., we cannot throw away the existing network and start from a clean slate, but let us first consider the problem of making successful traffic predictions.
Obviously, our planning horizon (the delay between our planning decisions and their implementation) is critical. The shorter this horizon, the more accurate our predictions are likely to be, but the horizon is usually determined by external factors such as delays between ordering and delivery of equipment, test and verification of equipment, planned maintenance windows, availability of technical staff, and capital budgeting cycles. These are outside the control of the network planner, so we treat the planning horizon as a constant. The planning horizon also suggests how much historical data is needed. It is a good idea to start with historical data extending several planning horizons into the past. Such a record allows not only better determination of trends, but also an assessment of the quality of our prediction process through analysis of past planning periods. If such data is unavailable, then we must consider green-fields planning (see Section 5.4.2), though informed by what measurements are available. Given such a historical record, our primary means for prediction is temporal analysis of traffic data. That is, we consider the traffic measurements of interest (often a traffic matrix) as a set of time-series. However, as noted earlier the more highly we aggregate traffic, the smaller its relative variance, and the easier it is to work with. As a result, it can be a good idea to predict traffic at a high level of aggregation, and then use a spatial model to break it into components. For instance, we might perform predictions for the total traffic in each region of our network, and then break it into components using the current traffic matrix percentages, rather than predicting each element of the traffic matrix separately. 
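The idea of predicting at a high level of aggregation and then breaking the prediction into components can be sketched as follows. The code scales each element of the current traffic matrix by the ratio of the predicted total to the current total, which preserves the current traffic matrix percentages. The function name and the example figures are hypothetical, and the sketch assumes that the spatial split of the traffic remains stable over the planning horizon.

```python
def split_forecast(current_tm, predicted_total):
    """Split a predicted total across TM elements in current proportions.

    current_tm: list of rows of measured traffic volumes.
    predicted_total: forecast of the total traffic for the same region.
    """
    current_total = sum(sum(row) for row in current_tm)
    if current_total <= 0:
        raise ValueError("current traffic matrix must carry some traffic")
    return [[predicted_total * volume / current_total for volume in row]
            for row in current_tm]


if __name__ == "__main__":
    current = [[0, 30], [10, 0]]  # current volumes (total 40)
    print(split_forecast(current, 80))  # total forecast to double
```

Only the single aggregate series needs to be forecast, and that series has a smaller relative variance than any individual matrix element.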
There are many techniques for prediction, but we concentrate here on just one, which works reasonably well for a wide range of traffic. As in all of the work presented here, the key is not the individual algorithms but their robust application through a process of measurement, planning, and validation.

5.4.1.1 Extracting the Long-Term Trend

We will exploit the previously presented temporal model for traffic, and note that the key to providing predictions for use in planning is to estimate the long-term trend in the data. We could form such an estimate simply by aggregating our time series over periods of 1 week (to average away the diurnal and weekly cycles) and then performing standard trend analysis. However, knowledge of the cycles in traffic data is often useful. Sometimes we design networks to satisfy the demand during a “busy hour.” More generally, though, the busiest hours for different components of the traffic may not match (particularly in international networks distributed over several time zones), and so we need to plan our network to have sufficient capacity at all hours of the day or night.

Hence, the approach we present provides the capability to estimate both the long-term trend and the seasonal components of the traffic. It also allows an estimate of the peakedness, providing the ability to estimate the statistical variations around the expected traffic behavior. The method is hardly the only applicable time-series algorithm for this type of analysis (for another example see [44]), but it has the advantage of being relatively simple.

The method is based on a simple signal processing tool, the Moving Average (MA) filter, which we discuss in detail below. The moving average can be thought of as a simple low-pass filter, as it “passes” low frequencies, or long-term behavior, but removes short-term variations. As such it is ideally suited to extracting the trend in our traffic data.
Although there are many forms of moving average, we shall restrict our attention to the simplest: a rectangular moving average

$$ \mathrm{MA}_x(t; n) = \frac{1}{2n+1} \sum_{s=t-n}^{t+n} x(s), \tag{5.13} $$

where $n$ is the width of the filter, and $2n+1$ is its length. The length of the filter must be longer than the period of the cyclic component in order to filter out that component. Longer filters are often used to allow for averaging out of the stochastic variation as well. The shortest filter we should consider for extracting the trend is three times the period, which in Internet traffic data is typically 1 week. For example, given traffic data $x(t)$, measured in 1-hour intervals, we could form our estimate $\hat{L}(t)$ of the trend by taking a filter of length about 3 weeks, i.e., $24 \times 7 \times 3 = 504$ samples. Since the length $2n+1$ is odd, we might take $\hat{L}(t) = \mathrm{MA}_x(t; 252)$, a filter of 505 samples, where $\mathrm{MA}_x$ is defined in (5.13).

Care must always be taken around the start and end of the data. Within $n$ data points of the edges the MA filter will be working with incomplete data, and so these estimates should be discounted in further analysis.

Once we have obtained estimates for the long-term trend, we can model its behavior. Over the past decade, the Internet has primarily experienced exponential growth (for instance, see Fig. 5.4 or [45]), i.e.,

$$ L(t) = L(0)\, e^{\beta t}, \tag{5.14} $$

where $L(0)$ is the starting value, and $\beta$ is the growth rate. If exponential growth is suspected, the standard approach is to transform the data using the log function so that we see

$$ \log L(t) = \log L(0) + \beta t, \tag{5.15} $$

where we can now estimate $L(0)$ and $\beta$ from linear regression of the observed data. Care should obviously be taken that this model is reasonable. Regression provides diagnostic statistics to this end, but comparisons to other models (such as a simple linear model) can also be helpful. Such a model can be easily extrapolated to provide long-term predictions of traffic volumes.
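As an illustration of the trend extraction just described, the sketch below applies a rectangular moving average and then fits the log-linear model by least squares. It is a minimal sketch under stated assumptions: the hourly data are synthetic (an exponential trend multiplied by a weekly sinusoid), and the function name `moving_average` is ours, not part of any library.

```python
import numpy as np

def moving_average(x, n):
    """Rectangular moving average of length 2n+1.
    Estimates within n samples of either edge are left as NaN,
    since the filter would be working with incomplete data there."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    kernel = np.ones(2 * n + 1) / (2 * n + 1)
    out[n:len(x) - n] = np.convolve(x, kernel, mode="valid")
    return out

# Synthetic hourly traffic (an assumption for illustration only):
# exponential trend times a weekly cycle, 20 weeks of data.
hours = np.arange(24 * 7 * 20)
beta_true = 1e-4                                  # growth rate per hour
x = 100.0 * np.exp(beta_true * hours) * (1.0 + 0.3 * np.sin(2 * np.pi * hours / 168))

n = 252                                           # 2n+1 = 505 samples, about 3 weeks
L_hat = moving_average(x, n)

# Linear regression of log(trend) against time, using only valid estimates.
valid = ~np.isnan(L_hat)
beta_hat, log_L0_hat = np.polyfit(hours[valid], np.log(L_hat[valid]), 1)
print("growth rate per hour:", beta_hat)
print("starting value:", np.exp(log_L0_hat))
```

With real measurements the regression residuals should also be inspected before the fitted model is extrapolated.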
Standard diagnostics from the regression can also be used to provide confidence bounds for the predictions, allowing us to predict “best” and “worst” case scenarios for traffic growth. An example of such predictions is given in Fig. 5.4, using the data from 2000 to 2004 to estimate the trend, and then extrapolating this until 2009. The figure shows the extrapolated optimistic and pessimistic trend estimates. We can see that actual traffic growth from 2005 to 2007 was on the optimistic side of growth, but that in 2008 the measured traffic was again close to the long-term trend estimate. This example clearly illustrates that understanding the potential variations in our trend estimate is almost as important as obtaining the estimate in the first place. It also illustrates how instructive historical data can be in assessing appropriate models and prediction accuracy.

Often, in traffic studies, managers are keen to know the doubling time, the time it takes traffic to double. This can be easily calculated by estimating the value of $t$ such that $L(t) = 2L(0)$, or $e^{\beta t} = 2$. Again, taking logs we get the doubling time

$$ t = \frac{1}{\beta} \ln 2. \tag{5.16} $$

The Australian data shown in Fig. 5.4 has a doubling time of 477 days.

The trend by itself can inform us of growth rate, but modeling the cyclic variations in traffic is also useful. We do this by extending the concept of moving average to the seasonal moving average, but before doing so we broadly remove the long-term trend from the data (by dividing our measurements $x(t)$ by $\hat{L}(t)$).

5.4.1.2 Extracting the Cyclical Component

The goal of a Seasonal Moving Average (SMA) is to extract the cyclic component of our traffic. We know, a priori, the period (typically 7 days) and so the design of a filter to extract this component is simple. It resembles the MA used previously in that it is an average, but in this case it is an average of measurements separated in time by the period.
More precisely, we form the SMA of the traffic with the estimated trend removed, e.g.,

$$ \hat{S}(t) = \frac{1}{N} \sum_{n=0}^{N-1} \frac{x(t + nT_S)}{\hat{L}(t + nT_S)}, \tag{5.17} $$

where $T_S$ is the period, and $N T_S$ is the length of the filter. In effect the SMA estimates the traffic volume for each time of day and week as if they were separate time series. It can be combined with a short MA filter to provide some additional smoothing of the results if needed.

The advantage of using an SMA as opposed to a straightforward seasonal average is that the cyclical component of network traffic can change over time. Using the SMA allows us to see such variability, while still providing a reasonably stable model for extrapolation. There is a natural trade-off between the length of the SMA, and the amount of change we allow over time (longer filters naturally smooth out transient changes). Typically, the length of filter desired depends on the planning horizon under which we are operating. The SMA can be extrapolated in various ways, but the simplest is to repeat the last cycle measured in our data into the future, as if the cyclical component remained constant. Hence, when operating with a short planning horizon (say a week), we can allow noticeable week-to-week variations, and still obtain reasonable predictions, and so a filter length of three to four cycles is often sufficient. Where our planning horizon is longer (say a year) we must naturally assume that the week-to-week variations in the cyclical behavior are smaller in order to extrapolate, and so we use a much longer SMA, preferably at least of the order of the length of the planning horizon.

5.4.1.3 Estimating the Magnitude of Random Variations

Once we understand the periodic and trend components of the traffic, the next thing to capture is the random variation around the mean. Most metrics of variation used in capacity planning do not account for the time-varying component, and so are limited to busy-hour analysis.
In comparison, we now have an estimate of $\hat{m}(t) = \hat{L}(t)\hat{S}(t)$, and so can use (5.6) to estimate the stochastic or random component of our traffic by $z(t) = (x(t) - \hat{m}(t))/\sqrt{\hat{m}(t)}$. We can now measure the variability of the random component of the traffic using the variance of $z(t)$, which forms an estimate $\hat{a}$ for the traffic’s peakedness. The estimator for $\hat{a}$, including the correction for bias, is given in [57].

Note that it is also important to separate the impulsive, anomaly terms from the more typical variations. There are many anomaly detection techniques available (see [66] for a review of a large group of such algorithms). These algorithms can be used to select anomalous data points that can then be excluded from the above analysis.

5.4.1.4 From Traffic Matrix to Link Loads

Once we have predictions of a TM, we often need to use these to compute the link loads that would result. The standard approach is to write the TM in vectorized form $\mathbf{x}$, where the vector $\mathbf{x}$ consists of the columns of the TM (at a particular time) stacked one on top of another. The link loads $\mathbf{y}$ can then be estimated through the equation

$$ \mathbf{y} = A\mathbf{x}, \tag{5.18} $$

where $A$ is the routing matrix. The equation above can also be extended to project observations or predictions of a TM over time into equivalent link loads.

Although there are multiple time-series approaches that can be used to predict future behavior (e.g., Holt-Winters [11]), our approach has the advantage that it naturally incorporates multiplexing. As a result, Eq. 5.18 can be extended to other aspects of the traffic model. For instance, the variances of independent flows are additive (the variance of the multiplexed traffic is the sum of the variances of the components), and so the variance of link traffic follows the same relationship, i.e.,

$$ \mathbf{v}_y = A\mathbf{v}_x, \tag{5.19} $$

where $\mathbf{v}_y$ and $\mathbf{v}_x$ are the variances of the link loads and TM, respectively. We can use $\mathbf{v}_y$ to deduce peakedness parameters for the link traffic using (5.7).
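The seasonal and random-variation estimates of Sections 5.4.1.2 and 5.4.1.3 can be sketched as follows. This is a hedged illustration, not the estimator of [57]: the data are synthetic, the trend is assumed to be already known, the generating model (mean m(t) plus Gaussian noise with variance a·m(t)) is our assumption about the form of the traffic model, and the peakedness estimate omits the bias correction.

```python
import numpy as np

def seasonal_moving_average(x, L_hat, period, n_cycles, start=0):
    """For each offset within one period, average the detrended samples
    x / L_hat taken one period apart, over n_cycles cycles."""
    x = np.asarray(x, dtype=float)
    L_hat = np.asarray(L_hat, dtype=float)
    stop = start + period * n_cycles
    detrended = x[start:stop] / L_hat[start:stop]
    return detrended.reshape(n_cycles, period).mean(axis=0)

# Synthetic hourly data (assumptions for illustration only).
rng = np.random.default_rng(0)
period, n_cycles = 168, 8
t = np.arange(period * n_cycles)
L_hat = 100.0 * np.exp(1e-4 * t)                 # trend, assumed already estimated
S_true = 1.0 + 0.3 * np.sin(2 * np.pi * t / period)
m = L_hat * S_true                               # mean traffic
a_true = 2.0                                     # hypothetical peakedness
x = m + np.sqrt(a_true * m) * rng.standard_normal(t.size)

S_hat = seasonal_moving_average(x, L_hat, period, n_cycles)
m_hat = L_hat * np.tile(S_hat, n_cycles)         # estimated mean: trend times cycle
z = (x - m_hat) / np.sqrt(m_hat)                 # normalized random component
a_hat = z.var()                                  # uncorrected peakedness estimate
print("estimated peakedness:", a_hat)
```

Because the cycle is estimated from the same samples, the uncorrected variance tends to underestimate the peakedness when few cycles are averaged, which is one reason a bias correction is needed in practice.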
So far, we have assumed that the network (at least the location of links, and the routing) is static. In reality, part of network planning involves changing the network, and so the matrix $A$ is really a potential variable. When we consider network planning, $A$ appears implicitly as one of our optimization variables. Likewise, $A$ may change in response to link or router failures.

The reason traffic matrices are so important is that they are, in principle, invariant under changes to $A$. Hence predictions of link loads under changes in $A$ can be easily made. For example, imagine a traffic engineering problem where we wish to balance the load on a network’s internal links more effectively by changing the routing. In doing so, the link loads are not invariant (the whole point of traffic engineering is to change these). However, the ingress/egress TM is invariant, and projecting this onto the links (via the routing matrix) will predict the link loads under proposed routing changes.

In reality, invariance is an approximation. Real TMs are not invariant under all network changes. For instance, if network capacities are chosen to be too small, congestion will result, and the Transmission Control Protocol (TCP) will act to alleviate this congestion by reducing the actual traffic carried on the network, thereby changing the traffic matrix. In general, different sets of measurements will have different degrees of invariance. For instance, an origin/destination TM is invariant to changes in egress points (due to routing changes), whereas an ingress/egress TM is not. It is clearly better to use the right data set for each planning problem, but the desired data is not always available.

The lack of true invariance is one of the key reasons for the cyclic approach to network planning. We seek to correct any problems caused by variations in our inputs in response to our new network design.
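The projection of a fixed traffic matrix onto link loads under two alternative routings can be sketched as below. The three-node topology, the demand values, and the variances are hypothetical, and the demands are assumed to be independent so that their variances add on each link.

```python
import numpy as np

# Hypothetical 3-node network with links e0: 0->1, e1: 1->2, e2: 0->2.
# Vectorized TM (Mbps), one entry per demand: [0->1, 1->2, 0->2].
x = np.array([40.0, 30.0, 50.0])
v_x = np.array([16.0, 9.0, 25.0])    # variances of the (assumed independent) demands

# Routing matrices: entry (e, k) is 1 if demand k crosses link e.
A_direct = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1]], dtype=float)
# Proposed change: route the 0->2 demand via node 1 instead of the direct link.
A_via_1 = np.array([[1, 0, 1],
                    [0, 1, 1],
                    [0, 0, 0]], dtype=float)

results = {}
for name, A in [("direct", A_direct), ("via node 1", A_via_1)]:
    y = A @ x        # predicted link loads
    v_y = A @ v_x    # predicted link-load variances
    results[name] = (y, v_y)
    print(name, "loads:", y, "variances:", v_y)
```

The traffic matrix is the same in both cases; only the routing matrix changes, which is what makes the link-load prediction under a proposed routing change straightforward.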
5.4.2 Prediction for Green-Fields Planning

The above section assumes that we have considerable historical data to which we apply time-series techniques to extrapolate trends, and hence predict the future traffic demands on our network. This has two major limitations:

1. IP traffic is constrained by the pipe through which it passes. TCP congestion control ensures that such traffic does not overflow by limiting the source transmission rate. In most networks our measurements only provide the carried load, not the offered load. If the network capacities change, the traffic may increase in response. This is a concern if our current network is loaded to near its capacity, and in this case we must discount our measurements, or at least treat them with caution.
2. When we design a new network there is nothing in place for us to measure.

We will start by considering available strategies for the latter case. We can draw inspiration from the spatial models previously presented. The fact that the simple gravity model describes, to some extent, the spatial structure of Internet traffic matrices presents us with a simple approach to estimate an initial traffic matrix.

The first step is to estimate the total expected traffic for the network, based on demographics and market projections. Let us take a simple example: in Australia the Australian Bureau of Statistics (ABS) measures Internet usage. Across a wide customer base the average usage per customer was roughly 3 GB/month (since 2006). The total traffic for our network is the usage per customer multiplied by the projected number of customers. We can derive traffic estimates per marketing region in the same fashion. Note that the figure used above is for the broad Australian market and is unlikely to be correct elsewhere (typical Australian ISPs have a tiered pricing structure). Where more detailed figures exist in particular markets these should be used.

The second step is to estimate the “busy-hour” traffic.
As we have seen previously the traffic is not uniformly distributed over time. In the absence of better data, we might look at existing public measurements (such as presented in Figs. 5.2 and 5.3, or as appears in [44]) where the peak to mean ratio is of the order of 3 to 2. Increasing our traffic estimates by this factor gives us an estimate of the peak traffic loads on the network. The third step is to estimate a traffic matrix. The best approach, in the absence of other information, to derive the traffic matrix is to apply the gravity model (5.12). In the simple case, the gravity model would be applied directly using the local regional traffic estimates. However, where additional information about the expected application profile exists, we might use this to refine the results using the “independent flow model” of [21]. Additional structural information about the network might allow use of the “generalized gravity model” of [68]. Each of these approaches allows us to use additional information, but in the absence of such information the simple gravity model gives us our initial estimate of the network traffic matrix. What about the case where we have historical network traffic measurements, but suspect that the network is congested so that the carried load is significantly below the offered load? In this case, our first step is to determine what parts of the traffic matrix are affected. If a large percentage of the traffic matrix is affected, then the only approach we have available is to go back through the historical record until we reach a point (hopefully) where the traffic is not capacity constrained. This has limitations: for one thing, we may not find a sufficient set of data where capacity constraints have left the measurements uncorrupted. Even where we do obtain sufficient data, the missing (suspect) measurements increase the window over which we must make predictions, and therefore the potential errors in these predictions. 
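The three green-fields steps described earlier in this section (total traffic, busy-hour scaling, gravity model) can be sketched as follows. The customer counts are hypothetical, a 30-day month and 1 GB = 10^9 bytes are assumed, each region's incoming and outgoing totals are assumed equal, and the gravity model is taken in its simplest product form (traffic from i to j proportional to the product of the two regional totals), which is our assumption about the form of (5.12).

```python
import numpy as np

SECONDS_PER_MONTH = 30 * 24 * 3600   # assumes a 30-day month

def greenfield_tm(customers, gb_per_month=3.0, peak_to_mean=1.5):
    """Return a busy-hour traffic matrix (Mbps) between regions."""
    customers = np.asarray(customers, dtype=float)
    # Step 1: mean traffic per region, usage per customer times customers.
    mean_mbps = customers * gb_per_month * 8e9 / SECONDS_PER_MONTH / 1e6
    # Step 2: scale by the peak-to-mean ratio to get busy-hour traffic.
    peak_mbps = peak_to_mean * mean_mbps
    # Step 3: simple gravity model; entry (i, j) is proportional to the
    # product of the totals of regions i and j (diagonal = local traffic).
    return np.outer(peak_mbps, peak_mbps) / peak_mbps.sum()

# Hypothetical projected customer numbers for three marketing regions.
tm = greenfield_tm([100_000, 50_000, 50_000])
print(np.round(tm, 1))
print("total busy-hour traffic (Mbps):", round(tm.sum(), 1))
```

With these assumptions each row of the matrix sums to the busy-hour total of the corresponding region, so the regional estimates from the first two steps are preserved.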
However, if only a small part of the traffic matrix is affected we may exploit techniques developed for traffic matrix inference to fill in the suspect values with more accurate estimates. These methods originated due to the difficulties in collecting flow-level data to measure traffic matrices directly. Routers (particularly older routers) may not support an adequate mechanism for such measurements (or suffer a performance hit when the measurements are used), and installation of stand-alone measurement devices can be costly. On the other hand, the Simple Network Management Protocol (SNMP) is almost ubiquitously available, and has little overhead. Unfortunately, it provides only link-load measurements, not traffic matrices. However, the two are simply related by (5.18). Inferring $\mathbf{x}$ from $\mathbf{y}$ is a so-called “network tomography” problem. For a typical network the number of link measurements is $O(N)$ (for a network of $N$ nodes), whereas the number of traffic matrix elements is $O(N^2)$, leading to a massively underconstrained linear inverse problem. Some type of side information is needed to solve such problems, usually in the form of a model that roughly describes a typical traffic matrix. We then estimate the parameters of this crude model (which we shall call $\mathbf{m}$), and perform a regularization with respect to the model and the measurements by solving the minimization problem

$$ \operatorname*{argmin}_{\mathbf{x}} \; \|\mathbf{y} - A\mathbf{x}\|_2^2 + \lambda^2 d(\mathbf{x}, \mathbf{m}), \tag{5.20} $$

where $\|\cdot\|_2$ denotes the $\ell^2$ norm, $\lambda > 0$ is a regularization parameter, and $d(\mathbf{x}, \mathbf{m})$ is a distance between the model $\mathbf{m}$ and our estimated traffic matrix $\mathbf{x}$. Examples of suitable distance metrics are standard or weighted Euclidean distance and the Kullback–Leibler divergence. Approaches of this type, generally called strategies for regularization of ill-posed problems, are more generally described in [29], but have been used in various forms in many works on traffic matrix inference.
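For the special case where the distance is the squared Euclidean distance, the regularized problem reduces to an ordinary least-squares problem, which the sketch below solves by stacking the measurement and model terms. This is only an illustration under that assumption: the routing matrix, demands, and prior model are hypothetical, the function name is ours, and non-negativity of the estimate is not enforced.

```python
import numpy as np

def regularized_tm_estimate(A, y, m, lam):
    """Minimize ||y - A x||_2^2 + lam^2 ||x - m||_2^2 over x by solving
    the equivalent stacked least-squares problem."""
    n = A.shape[1]
    A_aug = np.vstack([A, lam * np.eye(n)])     # measurements, then model rows
    b_aug = np.concatenate([y, lam * m])
    x_hat, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
    return x_hat

# Hypothetical example: 2 link measurements, 3 unknown demands.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
x_true = np.array([40.0, 30.0, 50.0])
y = A @ x_true                                  # observed link loads
m = np.array([45.0, 25.0, 45.0])                # crude prior model of the TM

x_hat = regularized_tm_estimate(A, y, m, lam=0.01)
print("estimate:", np.round(x_hat, 2))
print("link loads reproduced:", np.round(A @ x_hat, 2))
```

A small regularization parameter makes the estimate fit the link loads almost exactly while staying as close as possible to the model; a large value pulls the estimate toward the model at the expense of the measurements.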
The method works because the measurements leave the problem underconstrained, thereby allowing many possible traffic matrices that fit the measurements, but the model allows us to choose one of these as best. Furthermore, through the regularization parameter $\lambda$, the method allows us to trade off our belief about the accuracy of the model against the expected errors in the measurements.

We can utilize TM structure to interpolate missing values by solving a similar optimization problem

$$ \operatorname*{argmin}_{\mathbf{x}} \; \|\mathcal{A}(\mathbf{x}) - M\|_2^2 + \lambda^2 d(\mathbf{x}, \mathbf{m}_g), \tag{5.21} $$

where $\mathcal{A}(\mathbf{x}) = M$ expresses the available measurements as a function of the traffic matrix (whether these be link measurements or direct measurements of a subset of the TM elements we do not care), and $\mathbf{m}_g$ is the gravity model. This regularizes our model with respect to the measurements that are considered valid. Note that the gravity model in this approach will be skewed by missing elements, so this approach is only suitable for interpolation of a few elements of the traffic matrix. If larger numbers of elements are missing, we can use more complicated techniques such as those proposed in [52] to interpolate the missing data.

5.5 Optimal Network Plans

Once we have obtained predictions of the traffic on our network we can commence the actual process of making decisions about where links and routers will be placed, their capacities, and the routing policies that will be used. In this section we discuss how we may optimize these quantities against a set of goals and constraints.

The first problem we consider concerns capacity planning. If this component of our network planning worked as well as desired, we could stop there. However, errors in predictions, coupled with the long planning horizon for making changes to a network, mean that we need also to consider a short-term way of correcting such problems. The solution is typically called traffic engineering or simply load balancing, and is considered in Section 5.5.2.
5.5.1 Network Capacity Planning

There are many good optimization packages available today. Commercial tools such as CPLEX are designed specifically for solving optimization problems, while more general purpose tools such as Matlab often include optimization toolkits that can be used for such problems. Even Excel includes some quite sophisticated optimization tools, and so we shall not consider optimization algorithms in detail here. Instead we will formulate the problem, and provide insight into the practical issues.

There are three main components to any optimization problem: the variables, the objective, and the constraints. The variables here are obviously the locations of links, and their capacities.

The objective function (the function which we aim to minimize) varies depending on business objectives. For instance, it is common to minimize the cost of a network (either its capital or ongoing cost), or packet delays (or some other network performance metric). The many possible objectives in network design result in different problem formulations, but we concentrate here on the most common objective of cost minimization.

The cost of a network is a complex function of the number and type of routers used, and the capacities of the links. It is common, however, to break up the problem hierarchically into inter-PoP and intra-PoP design, and we consider the two separately here.

The constraints in the problem fall into several categories:

1. Capacity constraints require that we have “sufficient” link capacity. These are the key constraints for this problem so we consider these in more detail below.
2. Other technological constraints, such as limited port numbers per router.
3. Constraints arising as a result of the difficulties in multiobjective optimization. For example, we may wish to have a network with good performance and low cost. However, multiobjective optimization is difficult, so instead we minimize cost subject to a constraint on network performance.
4. Reliability constraints require that the network functions even under network failures. This issue is so important that other chapters of this book have been devoted to it, but we shall consider some aspects of this problem here as well.

5.5.1.1 Capacity Constraints and Safe-Operating Points

Unsurprisingly, the primary constraints in capacity planning are the capacity constraints. We must have a network with sufficient capacity to carry the offered traffic. The key issue is our definition of “sufficient.” There are several factors that go into this decision:

1. Traffic is not constant over the day, so we must design our network to carry loads at all times of day. Often this is encapsulated in “busy hour” traffic measurements, but busy hours may vary across a large network, and between customers, and so it is better to design for the complete cycle.
2. Traffic has observable fluctuations around its average behavior. Capacity planning can explicitly allow for these variations.
3. Traffic also has unobservable fluctuations on shorter times than our measurement interval. Capacity planning must attempt to allow for these variations.
4. There will be measurement and prediction errors in any set of inputs.

Ideally, we would use queueing models to derive an exact relationship between measured traffic loads and variations, and so determine the required capacities. However, despite many recent advances in data traffic modeling, we are yet to agree on sufficiently precise and general queueing models to determine sufficient capacity from numerical formulae. There is no “Erlang-B” formula for data networks. As a result, most network operators use some kind of engineering rule of thumb, which comes down to an “over-engineering factor” to allow for the above sources of variability. We adopt the same approach here, but the term “over-engineering factor” is misleading. The factor allows for known variations in the traffic.
The network is not over-engineered; it only appears so if capacity is directly compared to the available but flawed measurements. In fact, if we follow a well-founded process, the network can be quite precisely engineered (see footnote 6). We therefore prefer to use the term Safe Operating Point (SOP).

Footnote 6: It is a common complaint that backbone networks are underutilized. This complaint typically ignores the issues described above. In reality, many of these networks may be quite precisely engineered, but crude average utilization numbers are used to defer required capacity increases.

A SOP is defined statistically with respect to the available traffic measurements on a network. For instance, with 5-min SNMP traffic measurements, we might define our SOP by requiring that the load on the links (as measured by 5-min averages) should not exceed 80% of link capacity more than five times per month. The predicted traffic model could then be used to derive how much capacity is needed to achieve this bound.

Traffic variance depends on the application profile and the scale of aggregation. Moreover, the desired trade-off between cost and performance is a business choice for network operators. So there is no single SOP that will satisfy all operators. Given the lack of precision in current queueing models and measurements, the SOP needs to be determined by each network operator experimentally, preferably starting from conservative estimates. Natural variations in network conditions often allow enough scope to see the impact of variable levels of traffic, and from these determine more accurate SOP specifications, but to do this we need to couple traffic and performance measurements (a topic we consider later).

A secondary set of capacity constraints arises because there is a finite set of available link types, and capacity must be bought in multiples of these links.
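As an illustration of how an SOP of the kind described above might be applied, the sketch below tests the example rule (5-min averages above 80% of capacity at most five times per month) and picks the smallest capacity from a discrete set that satisfies it. The function names, the synthetic month of load samples, and the list of candidate capacities are all hypothetical.

```python
import numpy as np

def meets_sop(loads, capacity, utilization=0.8, max_exceedances=5):
    """True if the load exceeds utilization * capacity in at most
    max_exceedances measurement intervals."""
    loads = np.asarray(loads, dtype=float)
    exceedances = int(np.sum(loads > utilization * capacity))
    return exceedances <= max_exceedances

def smallest_safe_capacity(loads, capacities, **sop):
    """Smallest candidate capacity meeting the SOP, or None if none does."""
    for c in sorted(capacities):
        if meets_sop(loads, c, **sop):
            return c
    return None

# Hypothetical month of predicted 5-min averages (Mbps): 30 days x 288 samples.
rng = np.random.default_rng(1)
loads = 500.0 + 60.0 * rng.standard_normal(30 * 288)

# Hypothetical set of purchasable link capacities (Mbps).
chosen = smallest_safe_capacity(loads, [155, 622, 1000, 2488])
print("smallest capacity meeting the SOP:", chosen)
```

In practice the load samples would come from the predicted traffic model rather than a random generator, and the thresholds would be tuned by the operator as described above.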
For instance, many high-speed networks use either SONET/SDH links (typically giving 155 Mbps times powers of 4) and/or Ethernet link capacities (powers of 10 from 10 Mbps to 10 Gbps). We will denote the set of available link capacities (including zero) by $\mathcal{C}$.

Finally, most high-speed link technologies are duplex, and so we need to allocate capacity in each direction, but we typically do so symmetrically (i.e., a link has the same capacity from $i \to j$ as from $j \to i$ even when the traffic loads in each direction are different).

5.5.1.2 Intra-PoP Design

We divide the network design or capacity planning problem into two components and first consider the design of the network inside a PoP. Typically this involves designing a tree-like network to aggregate traffic up to regional hubs, which then transit the traffic onto a backbone (see footnote 7). The exact design of a PoP is considered in more detail in Chapter 4, but note that in each of the cases considered there we end up with a very similar optimization problem at this level.

Footnote 7: In small PoPs, a single router (or redundant pair) may be sufficient for all needs. Little planning is needed in this case beyond selecting the model of router, and so we do not include this simple case in the following discussions.

There are two prime considerations in such planning. Firstly, it is typical that the majority of traffic is nonlocal, i.e., that it will transit to or from the backbone. Local traffic between routers within the PoP in the Internet is often less than 1% of the total. There are exceptions to this rule, but these must be dealt with on an individual basis. Secondly, limitations on the number of ports on most high-speed routers mean that we need at least one layer of aggregation routers to bring traffic onto the backbone: for instance, see Fig. 5.5. For clarity, we show a very simple design (see Chapter 4 for more examples). In our example, Backbone Routers (BRs) and the corresponding links to Aggregation Routers (ARs) are assigned in pairs in order to provide redundancy, but otherwise the topology is a simple tree.

[Fig. 5.5 A typical PoP design. Aggregation Routers (AR) are used to increase the port density in the PoP and bring traffic up to the Backbone Routers (BR).]

There are many variations on this design; for instance, additional BRs may be needed, or multiple layers. However, in our simple model, the design is determined primarily by the limitations on port density. The routers lie within a single PoP, so links are short and their cost has no distance dependence (and they are relatively cheap compared to wide-area links). The number of ARs that can be accommodated depends on the number of ports that can be supported by the BRs, so we shall assume that ARs have a single high-capacity uplink to each BR to allow for a maximum expansion factor in a one-level tree.

As a result, the job of planning a PoP is primarily one of deciding how many ARs are needed. As noted earlier we do not need a TM for this task. The routing in such a network is predetermined, and so current port allocations and the uplink load history are sufficiently invariant for this planning task. We use these to form predictions of future uplink requirements and the loads on each router. When predictions show that a router is reaching capacity (either in terms of uplink capacity, traffic volume, or port usage) we can install additional routers based on our predictions over the planning horizon for router installation.

There is an additional improvement we can make in this type of problem. It is rare for customers to use the entire capacity of their link to our network, and so the uplink capacity between AR and BR in our network need not be the sum of the customers’ link capacities.
We can take advantage of this fact through simple measurement-based planning, but with the additional detail that we may allocate customers with different traffic patterns to routers in such a way as to leverage different peak hours and traffic asymmetries (between input and output traffic), so as to further reduce capacity requirements.

The problem resembles the bin packing problem. Given a fixed link capacity $C$ for the uplinks between ARs and BRs, and $K$ customers with peak traffic demands $\{T_i\}_{i=1}^{K}$, the bin packing problem would be as follows: determine the smallest integer $B$ such that we can find a $B$-partition $\{S_k\}_{k=1}^{B}$ of the customers (see footnote 8) such that

$$ \sum_{i \in S_k} T_i \le C \quad \text{for all } k = 1, \ldots, B. \tag{5.22} $$

Footnote 8: A $B$-partition of our customers is a group of $B$ non-empty subsets $S_k \subseteq \{1, 2, \ldots, K\}$ that are disjoint, i.e., $S_i \cap S_j = \emptyset$ for all $i \neq j$, and which include all customers, i.e., $\bigcup_{k=1}^{B} S_k = \{1, 2, \ldots, K\}$.

The number of subsets $B$ gives the number of required ARs, and although the problem is NP-hard, there are reasonable approximation algorithms for its solution [18], some of which are online, i.e., they can be implemented without reorganization of existing allocations.

The real problem is more complicated. There are constraints on the number of ports that can be supported by ARs dependent on the model of ARs being deployed, constraints on router capacity, and in addition, we can take advantage of the temporal and directional characteristics of traffic. Customer demands take the form $[I_i(t), O_i(t)]$, where $I_i(t)$ and $O_i(t)$ are incoming and outgoing traffic demands for customer $i$ at time $t$. So the appropriate condition for our problem is to find the minimal number $B$ of ARs such that

$$ \sum_{i \in S_k} I_i(t) \le C \quad \text{and} \quad \sum_{i \in S_k} O_i(t) \le C \quad \text{for all } k, t. \tag{5.23} $$

This is the so-called vector bin packing problem, which has been used to model resource constrained processor scheduling problems, and good approximations have been known for some time [15, 28].
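A simple greedy heuristic for the vector bin packing condition above is sketched below. It is a first-fit rule chosen for brevity, not one of the approximation algorithms of [15, 28]; the function name, the customer demand profiles, and the uplink capacity are hypothetical, and port-count and router-capacity constraints are ignored.

```python
import numpy as np

def first_fit_vector(demands_in, demands_out, capacity):
    """Assign customers to ARs with a first-fit rule so that, on every AR,
    summed incoming and outgoing demand stays within capacity in every
    time slot. Inputs have shape (customers, time slots).
    Returns a list of customer-index lists, one per AR."""
    demands_in = np.asarray(demands_in, dtype=float)
    demands_out = np.asarray(demands_out, dtype=float)
    bins, load_in, load_out = [], [], []
    for i in range(demands_in.shape[0]):
        if np.any(demands_in[i] > capacity) or np.any(demands_out[i] > capacity):
            raise ValueError(f"customer {i} does not fit on any uplink")
        for k in range(len(bins)):
            fits_in = np.all(load_in[k] + demands_in[i] <= capacity)
            fits_out = np.all(load_out[k] + demands_out[i] <= capacity)
            if fits_in and fits_out:
                bins[k].append(i)
                load_in[k] = load_in[k] + demands_in[i]
                load_out[k] = load_out[k] + demands_out[i]
                break
        else:
            # No existing AR can take this customer: open a new one.
            bins.append([i])
            load_in.append(demands_in[i].copy())
            load_out.append(demands_out[i].copy())
    return bins

# Hypothetical demands in two time slots (day, night), arbitrary units.
incoming = [[8, 1], [7, 2], [1, 8], [2, 7]]   # customers 0-1 peak by day, 2-3 by night
outgoing = [[2, 1], [2, 1], [1, 2], [1, 2]]
assignment = first_fit_vector(incoming, outgoing, capacity=10)
print(assignment)
```

In this example the day-peaking and night-peaking customers are paired, so two ARs suffice, whereas packing on peak demand alone (peaks of 8, 7, 8, and 7 against a capacity of 10) would need four.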
The major advantage of this type of approach is that customers with different peak traffic periods can be combined onto one AR so that their joint traffic is more evenly distributed over each 24-hour period. Likewise, careful distribution of customers whose primary traffic flows into our network (for instance, hosting centers) together with customers whose traffic flows out of the network (e.g., broadband access companies) can lead to more symmetric traffic on the uplinks, and hence better overall utilization.

In practice, multiplexing gains may improve the situation, so that less capacity is needed when multiple customers’ traffic is combined, but this effect only plays a dominant role when large numbers (say hundreds) of small customers are being combined.

5.5.1.3 Inter-PoP Backbone Planning

The inter-PoP backbone design problem is somewhat more complicated. We start by assuming that we know the locations at which we wish to have PoPs. The question of how to optimize these locations does come up, but it is common that these locations are predetermined by other aspects of business planning.

In inter-PoP planning, distance-based costs are important. The cost of a link is usually considered to be proportional to its length, though this is approximate. The real cost of a link has a fixed component (in the equipment used to terminate a line) in addition to distance-dependent terms derived from the cost to install a physical line, e.g., costs of cables, excavation, and rights of way. Even where leased lines are used (so there are minimal installation costs) the original capital costs of the lines are usually passed on through some type of distance-sensitive pricing.

In addition, higher speed links generally cost more. The exact model for such costs can vary, but a large component of the bandwidth-dependent costs is in the end equipment (router interface cards, WDM mux/demux equipment, etc.).
In reality, costs are often very complicated: vendors may have discounts for bulk purchases, whereas cutting-edge technology may come at a premium cost. However, link costs are often approximated as linear with respect to bandwidth because we could, in principle, obtain a link with capacity $4c$ by combining four links of capacity $c$. In the simple case then, cost per link has the form

$$f(d_e, c_e) = \alpha + \beta d_e + \gamma c_e, \tag{5.24}$$

where $\alpha$ is the fixed cost of link installation, $\beta$ is the link cost per unit distance, and $\gamma$ is the cost per unit bandwidth. As the distance of a link is typically a fixed property of the link, we often rewrite the above cost in the form

$$f_e(c_e) = \alpha_e + \gamma c_e, \tag{5.25}$$

where now the cost function depends on the link index $e$.

We further simplify the problem by assuming that BRs are capable of dealing with all traffic demands so that only two (allowing for redundancy) are needed in each PoP, thus removing the costs of the router from the problem.

Finally, we simplify our approach by assuming that routes are chosen to follow the shortest possible geographic path in our network. There are reasons (which we shall discuss in the following section) why this might not be the case; however, a priori, it makes sense to use the shortest geographic path. There are costs that arise from distance. Most obviously, if packets traverse longer paths, they will experience longer delays, and this is rarely desirable. In addition, packets that traverse longer paths use more resources. For instance, a packet that traverses two hops rather than one uses up capacity on two links rather than one.

As noted earlier, we need to specify the problem constraints, the basic set of which are intended to ensure that there is sufficient capacity in the network. When congestion is avoided, queueing delays will be minimal, and hence delays across the network will be dominated by propagation delays (the speed of light cannot be increased).
So ensuring sufficient capacity implicitly serves the purpose of reducing networking delays. As noted, we adopt the approach of specifying an SOP, which we do in the form of a factor $\rho \in (0, 1)$, which specifies the traffic limit with respect to capacity. That is, we shall require that the link capacity $c_e$ be sufficient that traffic takes up only $\rho$ of the capacity, leaving $1 - \rho$ of the capacity to allow for unexpected variations in the traffic.

The possible variables are now the link locations and their capacities. So, given the (vectorized) traffic matrix $x$, our job is to determine link locations and capacities $c_e$, which implicitly define the network routes (and hence the routing matrix $A$), such that we solve

$$\begin{aligned} \text{minimize} \quad & \sum_{e \in E} \left[ \alpha_e I(c_e > 0) + \gamma c_e \right] \\ \text{such that} \quad & Ax \le \rho c, \quad c_e \in \mathcal{C}, \end{aligned} \tag{5.26}$$

where $Ax = y$ gives the link loads, $c$ is the vector of link capacities, $E$ is the set of possible links, $I(c_e > 0)$ is an indicator function (which is 1 where we build a link, and 0 otherwise), and $\mathcal{C}$ is the set of available link capacities (which includes 0). Implicit in the above formulation is the routing matrix $A$, which results from the particular choice of links in the network design, so $A$ is in fact a function of the network design. Its construction imposes constraints requiring that all traffic on the network can be routed. The problem can be rewritten in a more explicit form using flow-based constraints, but the above formulation is convenient for explaining the differences and similarities between the range of problems we consider here.

There may be additional constraints in the above-mentioned problem resulting from router limitations, or due to network performance requirements. For instance, if we have a maximum throughput on each router, we introduce a set of constraints of the form $Bx \le r$, where $r$ are router capacities, and $B$ is similar to a routing matrix in that it maps end-to-end demands to the routers along the chosen path.
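To make the objective and the capacity constraint of (5.26) concrete, the following minimal sketch evaluates one candidate design, assuming the routing matrix is already known for that design. The function names, the three-link example, and all cost figures are hypothetical; `rho` stands for the SOP factor and `unit_cost` for the cost per unit bandwidth.

```python
def link_loads(A, x):
    """Link loads y = A x for a routing matrix A (one row per link)."""
    return [sum(a * d for a, d in zip(row, x)) for row in A]


def is_feasible(A, x, capacities, rho):
    """Capacity constraint: every link load stays within rho times capacity."""
    return all(y <= rho * c for y, c in zip(link_loads(A, x), capacities))


def design_cost(capacities, fixed_costs, unit_cost):
    """Objective: a fixed cost for each built link plus a bandwidth cost."""
    return sum(f + unit_cost * c
               for f, c in zip(fixed_costs, capacities) if c > 0)


# Hypothetical three-link design carrying two demands.
A = [[1, 0],   # link 0 carries demand 0
     [1, 1],   # link 1 carries both demands
     [0, 1]]   # link 2 carries demand 1
x = [4, 1]
capacities = [10, 10, 10]
fixed_costs = [100, 150, 120]

print(link_loads(A, x))                           # [4, 5, 1]
print(is_feasible(A, x, capacities, rho=0.6))     # True
print(design_cost(capacities, fixed_costs, 2))    # 430
```

A real design tool would search over link placements and capacities, regenerating the routing matrix for each candidate; the sketch only scores a single fixed candidate.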
Port constraints on a router might be expressed by taking constraints of the form $\sum_j I(c_{i,j} > 0) \le p_i$, where $p_i$ is the port limit on router $i$. Port constraints are complicated by the many choices of line cards available for high-speed routers, and so have sometimes been ignored, but they are a key limitation in many networks. The issue is sometimes avoided by separation of inter- and intra-PoP design, so that a high port density on BRs is not needed.

The other complication is that we should aim to optimize the network for 24 × 7 operations. We can do so simply by including one set of capacity constraints for each time of day and week, i.e., $A x_t \le \rho c$. The resulting constraints are in exactly the same form as in (5.26) but their number increases. However, it is common that many of these constraints are redundant, and so can be removed from the optimization (without effect) by a pre-filtering phase.

The full optimization problem is a linear integer program, and there are many tools available for solution of such programs. However, it is not uncommon to relax the integer constraints to allow any $c_e \ge 0$. In this case, there is no point in having excess capacity, and so we can replace the link capacity constraint by $Ax = \rho c$. We then obtain the actual design by rounding up the capacities. This approach reduces the numerical complexity of the problem, but results in a potentially suboptimal design. Note though, that integer programming problems are often NP-hard, and consequently solved using heuristics, which likewise can lead to suboptimal designs. Relaxation to a linear program is but one of a suite of techniques that can be used to solve problems in this context, often in combination with other methods.

Moreover, it is common in the mathematical community to focus on finding provably optimal designs, but this is not a real issue here. In practical network design we know that the input data contains errors, and our cost models are only approximate.
Hence, the mathematically optimal solution may not have the lowest cost of all realizable networks. The mathematical program only needs to provide us with a very good network design.

The components of real networks suffer outages on a regular basis: planned maintenance and accidental fiber cuts are simple examples (for more details see Chapters 3 and 4). The final component of network planning that we discuss here is reliability planning: analyzing the reliability of a network. There are many algorithms aimed at maintaining network connectivity, ranging from simple designs such as rings or meshes, to formal optimization problems including connectivity constraints. Commonly, networks are designed to survive all single link or node outages, though more careful planning would concern all Shared Risk Groups (SRGs), i.e., groups of links and/or nodes that share fates under common failures. For instance, IP links that use wavelengths on the same fiber will all fail simultaneously if the fiber is cut.

However, when a link (or SRG) fails, maintaining connectivity is not the only concern. Rerouted traffic creates new demands on links. If this demand exceeds capacity, then the resulting congestion will negatively impact network performance. Ideally, we would design our network to accommodate such failures, i.e., we would modify our earlier optimization problem (5.26) as follows:

$$\begin{aligned} \text{minimize} \quad & \sum_{e \in E} \left[ \alpha_e I(c_e > 0) + \gamma c_e \right] \\ \text{such that} \quad & Ax \le \rho c, \ \text{and} \ A_i x \le \rho_F c, \quad \forall i \in \mathcal{F}, \end{aligned} \tag{5.27}$$

where $\rho_F$ is the SOP applied under failure loads, $\mathcal{F}$ is the set of all failure scenarios considered likely enough to include, and $A_i$ is the routing matrix under failure scenario $i$. Naively implemented with $\rho_F = \rho$, this approach has the limitation that the capacity constraints under failures can come to dominate the design of the network so that most links will be heavily underutilized under normal conditions.
Hence, we allow the SOPs with respect to normal loads, $\rho$, and failure loads, $\rho_F$, to be different, with $\rho < \rho_F < 1$, so that the mismatch is somewhat balanced, i.e., under normal conditions links are not completely underutilized, but there is likely to be enough capacity under common failures. For example, we might require that under normal loads, peak utilizations remain at 60%, while under failures, we allow loads of 85%.

Additionally, the number of possible failure scenarios can be quite large, and as each introduces constraints, it may not be practical to consider all failures. We may need to focus on the likely failures, or those that are considered to be most potentially damaging. However, it is noteworthy that only constraints that involve rerouting need be considered. In most failures, a large number of links will be unaffected, and hence the constraints corresponding to those links will be redundant, and may be easily removed from the problem.

The above formulation presumes that we design our network from scratch, but this is the exception. We typically have to grow our network incrementally. This introduces challenges: for instance, it is easy to envisage a series of incremental steps that are each optimal in themselves, but which result in a highly suboptimal network over time. So it is sometimes better to design an optimal network from scratch, particularly when the network is growing very quickly. In the meantime we can include the existing network through a set of constraints in the form $c_e \ge l_e + c'_e$, where $l_e$ is the legacy link capacity on link $e$, and $c'_e$ is the additional link capacity.
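The effect of using two different SOPs can be seen in a small sizing sketch. It assumes the link loads under normal routing and under each failure scenario have already been computed (so it ignores the dependence of routing on the design), and it picks the smallest available capacity satisfying both constraints. The function name `size_capacities`, the loads, and the list of available capacities are hypothetical; `rho` and `rho_f` denote the normal and failure SOP factors.

```python
def size_capacities(normal_loads, failure_loads, rho, rho_f, modules):
    """Smallest available capacity per link meeting both SOP constraints.

    normal_loads  -- load on each link under normal routing
    failure_loads -- one list of link loads per failure scenario
    rho, rho_f    -- SOP factors for normal and failure conditions
    modules       -- available link capacities (hypothetical values)
    """
    sized = []
    for e, load in enumerate(normal_loads):
        # Worst load seen on this link across all failure scenarios.
        worst = max((scenario[e] for scenario in failure_loads), default=0)
        required = max(load / rho, worst / rho_f)
        fitting = [m for m in modules if m >= required]
        if not fitting:
            raise ValueError(f"no available capacity covers link {e}")
        sized.append(min(fitting))
    return sized


# Two parallel routes: when one link fails, the other carries all 8 units.
normal = [3, 5]
failures = [[0, 8], [8, 0]]
modules = [0, 2.5, 10, 40]

print(size_capacities(normal, failures, rho=0.6, rho_f=0.85, modules=modules))
print(size_capacities(normal, failures, rho=0.6, rho_f=0.6, modules=modules))
```

With the 60%/85% split both links fit in the 10-unit capacity, whereas applying the 60% limit to failure loads as well pushes both links to the 40-unit capacity, which would sit largely idle under normal conditions.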
The real situation is complicated by some additional issues: (i) typical IP router load balancing is not well suited for multiple parallel links of different capacities, so we must choose between increasing capacity through additional links (with capacity equal to the legacy links) or paying to replace the old links with a single higher capacity link; and (ii) the costs for putting additional capacity between two routers may be substantially different from the costs for creating an entirely new link. Some work [40] has considered the problem of evolvability of networks, but without all of the additional complexities of IP network management, so determining long-term solutions for optimal network evolution is still an open problem.

5.5.2 Traffic Engineering

In practice, it takes substantial time to build or change a network, despite modern innovations in reconfigurable networks. Typical changes to a link involve physically changing interface cards, wiring, and router configurations. Today these changes are often made manually. They also need to be performed carefully, through a process where the change is documented, carefully considered, acted upon, and then tested. The time to perform these steps can vary wildly between companies, but can easily be 6 months once budget cycles are taken into account.

In the meantime we might find that our traffic predictions are in error. The best predictions in the world cannot cope with the convulsive changes that seem to occur on a regular basis in the Internet. For instance, the introduction of peer-to-peer networking both increased traffic volumes dramatically in a very short time frame, and changed the structure of this traffic (peer-to-peer traffic is more symmetric than the previously dominant client–server model). YouTube again reset providers’ expectations for traffic. The result will be a suboptimal network, in some cases leading to congestion.
As noted, we cannot simply redesign the network, but we can often alleviate congestion by better balancing loads. This process, called traffic engineering (or just load balancing), allows us to adapt the network on shorter time scales than capacity planning. It is quite possible to manually intervene in a network’s traffic engineering on a daily basis. Even finer time scales are possible in principle if traffic engineering is automated, but this is uncommon at present because there is doubt about the desirability of frequent changes in routing. Each change to routing protocols can require a reconvergence, and can lead to dropped packets. More importantly, if such automation is not very carefully controlled it can become unstable, leading to oscillations and very poor performance.

The Traffic Engineering (TE) problem is very similar to the network design problem. The goal or optimization objective is often closely related to that in design. The constraints are usually similar. The major difference is in the planning horizon (typically days to weeks), and as a result the variables over which we have control. The restriction imposed by the planning horizon for TE is that we cannot change the network hardware: the routers and links between them are fixed. However, we can change the way packets are routed through the network, and we can use this to rebalance the traffic across the existing network links.

There are two methods of TE that are most commonly talked about. The most often mentioned uses MultiProtocol Label Switching (MPLS) [54], by which we can arbitrarily tunnel traffic across almost any set of paths in our network. Finding a general routing minimizing max-utilization is an instance of the classical multi-commodity flow problem, which can be formulated as a linear program [6, Chapter 17], and is hence solvable using commonly available tools.
We shall not spend much time on MPLS TE, because there is sufficient literature already (for instance, see [19, 36]). We shall instead concentrate on a simpler, less well known, and yet almost as powerful method for TE. Remember that we earlier argued that shortest-geographic paths made sense for network routing. In fact, shortest-path routing does not need to be based on geographic distances. Most modern Interior Gateway Protocols allow administratively defined distances (for instance, Open Shortest Path First (OSPF) [42] and Intermediate System-Intermediate System (IS-IS) [14]). By tweaking these distances we can improve network performance. By making a link distance smaller, you can make a link more “attractive”, and so route more traffic on this link. Making the distance longer can remove traffic. Configurable link weights can be used, for example, to direct traffic away from expensive (e.g., satellite) links. However, we can formulate the TE problem more systematically. Let us consider a shortest-path protocol with administratively configured link weights (the link distances) we on each link e. We assume that the network is given (i.e., we know its link locations and capacities), and that the variables that we can control are the link weights. Our objective is to minimize the congestion on our network. Several metrics can be used to describe congestion. Network-wide metrics such as that proposed in [25, 26] can have advantages, but we use the common metric of maximum utilization here for its simplicity. In many cases, there are additional “human” constraints on the weights we can use in the above optimization. For instance, we may wish that the resulting weights do not change “too much” from our existing weights. Each change requires reconfiguration of a router, and so reducing the number of changes with respect to the existing routing may be important. 
Likewise, the existing weights are often chosen not just for the sake of distance, but also to make the network conceptually simpler. For instance, we might choose smaller weights inside a “region” and large weights between regions, where the regions have some administrative (rather than purely geographical) significance. In this case, we may wish to preserve the general features of the routing, while still fine-tuning the routes. We can express these constraints in various ways, but we do so below by setting minimum and maximum values for the weights. Then the optimization problem can be written: choose the weights $w$ such that we

$$\begin{aligned} \text{minimize} \quad & \max_{e \in E} \; y_e / c_e \\ \text{such that} \quad & Ax = y, \quad w_e^{\min} \le w_e \le w_e^{\max}, \quad \forall e \in E, \end{aligned} \tag{5.28}$$

where $A$ is the routing matrix generated by shortest-path routing given by link weights $w_e$, and the link utilizations are given by $y_e / c_e$ (the link load divided by its capacity). The $w_e^{\min}$ and $w_e^{\max}$ constrain the weights for each link into a range determined by existing network policies (perhaps within some bound of the existing weights). Additional constraints might specify the maximum number of weights we are allowed to change, or require that link weights be symmetric, i.e., $w_{(i,j)} = w_{(j,i)}$.

The problem is in general NP-hard, so it is nontrivial to find a solution. Over the years, many heuristic methods [12, 20, 25, 26, 37, 41, 53] have been developed for the solution of this problem.

The exciting feature of this approach is that it is very simple. It uses standard IP routing protocols, with no enhancements other than the clever choice of weights. One might believe that the catch is that it cannot achieve the same performance as full MPLS TE. However, the performance of the above shortest-path optimization has been shown on real networks to suffer only by a few percent [59, 60], and importantly, it has been shown to be more robust to errors in the input traffic matrices than MPLS optimization [60].
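The weight-setting problem can be sketched on a toy network as follows. This is a deliberately naive local search, not one of the published heuristics cited above: it routes each demand on a single least-weight path (equal-cost splitting is ignored), computes the maximum utilization, and changes one integer weight at a time within the allowed range while the objective improves. The topology, demands, weight bounds, and function names are all hypothetical.

```python
import heapq


def shortest_path(links, weights, src, dst):
    """Links on a least-weight path from src to dst (Dijkstra).

    links maps a link id to its (tail, head) nodes. Ties are resolved by
    search order; equal-cost load splitting is deliberately ignored.
    """
    adjacency = {}
    for link, (tail, head) in links.items():
        adjacency.setdefault(tail, []).append((head, link))
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for head, link in adjacency.get(node, []):
            if d + weights[link] < dist.get(head, float("inf")):
                dist[head] = d + weights[link]
                prev[head] = (node, link)
                heapq.heappush(heap, (dist[head], head))
    path, node = [], dst
    while node != src:
        node, link = prev[node]  # KeyError if dst is unreachable
        path.append(link)
    return path[::-1]


def max_utilization(links, capacity, weights, demands):
    """Largest ratio of link load to link capacity over all links."""
    load = dict.fromkeys(links, 0)
    for (src, dst), volume in demands.items():
        for link in shortest_path(links, weights, src, dst):
            load[link] += volume
    return max(load[link] / capacity[link] for link in links)


def tune_weights(links, capacity, weights, demands, w_min, w_max):
    """Change one weight at a time while the maximum utilization improves."""
    weights = dict(weights)
    best = max_utilization(links, capacity, weights, demands)
    improved = True
    while improved:
        improved = False
        for link in links:
            for w in range(w_min, w_max + 1):
                trial = {**weights, link: w}
                value = max_utilization(links, capacity, trial, demands)
                if value < best:
                    weights, best, improved = trial, value, True
    return weights, best


# Hypothetical triangle: the direct link A->C starts with a high weight.
links = {"AB": ("A", "B"), "BC": ("B", "C"), "AC": ("A", "C")}
capacity = {"AB": 10, "BC": 10, "AC": 10}
weights = {"AB": 1, "BC": 1, "AC": 4}
demands = {("A", "C"): 6, ("B", "C"): 6}

before = max_utilization(links, capacity, weights, demands)
tuned, after = tune_weights(links, capacity, weights, demands, w_min=1, w_max=5)
print(before, after, tuned)
```

In this example the initial weights send both demands over link BC, overloading it; a single weight change moves the A to C demand onto the direct link and halves the maximum utilization. Practical heuristics differ mainly in how they explore the weight space and in how they model equal-cost paths.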
This type of robustness is critical to real implementations. Moreover, the approach can be used to generate a set of weights that work well over the whole day (despite variations in the TM over the day) [60], or that can help alleviate congestion in the event of a link failure [44], a problem that we shall consider in more detail in the following section.

5.6 Robust Planning

A common concern in network planning is the consequence of mistakes. Traffic matrices used in our optimizations may contain errors due to measurement artifacts, sampling, inference, or predictions. Furthermore, there may be inconsistencies between our planned network design, and the actual implementation through misconfiguration or last minute changes in constraints. There may be additional inconsistencies introduced through the failure of invariance in TMs used as inputs, for example, caused by congestion alleviation in the new network.

Robust planning is the process of acknowledging these flaws, and still designing good networks. The key to robustness is the cyclic approach described in Section 5.1: measure → predict → plan → and then measure again. However, with some thought, this process can be made tighter. We have already seen one example of this through TE, where a short-term alteration in routing is used to counter errors in predicted traffic. In this section we shall also consider some useful additions to our kitbag of robust planning tools.

5.6.1 Verification Measurements

One of the most common sources of network problems is misconfiguration. Extreme cases of misconfigurations that cause actual outages are relatively obvious (though still time-consuming to fix). However, misconfigurations can also result in more subtle problems. For instance, a misconfigured link weight can mean that traffic takes unexpected paths, leading to delays or even congestion.

One of the key steps to network planning is to ensure that the network we planned is the one we observe.
Various approaches have been used for router configuration validation: these are considered in more detail in Chapter 9. In addition, we recommend that direct measurements of the network routing, link loads, and performance be made at all times.

Routing can be measured through mechanisms such as those discussed in Section 5.2 and in more detail in Chapter 11. When performed from edge node to edge node, we can use such measurements to confirm that traffic is taking the routes we intended it to take in our design.

By themselves, routing measurements only confirm the direction of traffic flows. Our second requirement is to measure link traffic to ensure that it remains within the bounds we set in our network design. Unexpected traffic loads can often be dealt with by TE, but only once we realize that there is a problem.

Finally, we must always measure performance across our network. In principle, the above measurements are sufficient, i.e., we might anticipate that a link is congested only if traffic exceeds the capacity. However, in reality, the typical SNMP measurements used to measure traffic on links are 5-min averages. Congestion can occur on smaller time scales, leading to brief, but nonnegligible packet losses that may not be observable from traffic measurements alone. We aim to reduce these through choice of SOP, but note that this choice is empirical in itself, and an accurate choice relies on feedback from performance measurements.

Moreover, other components of a network have been known to cause performance problems even on a lightly loaded network. For instance, such measurements allowed us to discover and understand delays in routing convergence times [32, 61], and that during these periods bursts of packet loss would occur, from which improvements to Interior Gateway Protocols have been made [27]. The importance of the problem would never have been understood without performance measurements.
Such measurements are discussed in more detail in Chapter 10.

5.6.2 Reliability Analysis

IP networks and the underlying SONET/WDM strata on which they run are often managed by different divisions of a company, or by completely different companies. In our planning stages, we would typically hope for joint design between these components, but the reality is that the underlying physical/optical networks are often multiuse, with IP as one of several customers (either externally or internally) that use the same infrastructure. It is often hard to prescribe exactly which circuits will carry a logical IP link. Therefore, it is hard in some cases to determine, prior to implementation, exactly which SRGs exist. We may insist, in some cases, that links are carried over separate fibers, or even purchase leased lines from separate companies, but even in these cases great care should be taken. For instance, it was only during the Baltimore train tunnel fire (2001) [4] that it was discovered that several providers ran fiber through the same tunnel.

Our earlier network plan can only accommodate planned network failure scenarios. In robust planning, we must somehow accommodate the SRGs that have arisen in the implementation of our planned network. The first step, obviously, is to determine the SRGs. The required data mapping IP links to physical infrastructure is often stored in multiple databases, but with care it is possible to combine these to obtain a list of SRGs.

Once we have a complete list of failure scenarios we could go through the planning cycle again, but as noted, the time horizon for this process would leave our network vulnerable for some time. Before replanning, therefore, we perform a network reliability analysis. This is a simple process of simulating each failure scenario, and assessing whether the network has sufficient capacity, i.e., whether $A_i x \le \rho_F c$, where $\rho_F$ is the SOP allowed under failures. If this condition is already satisfied, then no action needs to be taken.
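The analysis loop itself is short. The sketch below assumes that a routing matrix has already been derived for each failure scenario (for example by rerunning the routing computation with the failed links removed), and reports the links whose load would exceed the failure SOP. The scenario names, matrices, traffic volumes, and function names are hypothetical, and `rho_f` denotes the SOP factor applied under failures.

```python
def overloaded_links(A_i, x, capacities, rho_f):
    """Indices of links whose load under routing A_i exceeds rho_f * capacity."""
    loads = [sum(a * d for a, d in zip(row, x)) for row in A_i]
    return [e for e, (y, c) in enumerate(zip(loads, capacities))
            if y > rho_f * c]


def reliability_analysis(scenarios, x, capacities, rho_f):
    """Map each failure scenario that causes congestion to its overloaded links."""
    report = {}
    for name, A_i in scenarios.items():
        links = overloaded_links(A_i, x, capacities, rho_f)
        if links:
            report[name] = links
    return report


# Links: 0 = A-C direct, 1 = A-B, 2 = B-C. Demands: A->C and B->C.
x = [6, 5]
capacities = [10, 10, 10]
scenarios = {
    # Direct link cut: A->C is rerouted over links 1 and 2.
    "link0_cut": [[0, 0], [1, 0], [1, 1]],
    # Link A-B cut: it carried nothing, so routing is unchanged.
    "link1_cut": [[1, 0], [0, 0], [0, 1]],
}

print(reliability_analysis(scenarios, x, capacities, rho_f=0.85))
# {'link0_cut': [2]}
```

An empty report corresponds to the case where the condition holds for every scenario; any entry identifies a scenario and the links that would need attention.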
However, where the condition is violated, we must take one of two actions. The most obvious approach to deal with a specific vulnerability is to expedite an increase in capacity. It is often possible to reduce the planning horizon for network changes at an increased cost. Where small changes are needed, this may be viable, but it is clearly not satisfactory to try to build the whole network in this way.

The second alternative is to once again use traffic engineering. MPLS provides mechanisms to create failover paths; however, it does not tell us where to route these to ensure that congestion does not occur. Some additional optimization and control is needed. However, we cannot do this after the failure, or recovery will take an unacceptable amount of time. Likewise, it is impractical in today’s networks to change link weights in response to failures. However, previous studies have shown that shortest-path link weight optimization can be used to provide a set of weights that will alleviate congestive effects under failures [44], and such techniques have (anecdotally) been used in large networks with success.

5.6.3 Robust Optimization

The fundamental issue we deal with is “Given that I have errors in my data, how should I perform optimization?” Not all the news is bad. For instance, once we acknowledge that our data is not perfect, we realize that finding the mathematically optimal solution for our problem is not needed. Instead, heuristic solutions that find a near optimal solution will be just as effective. This chapter is not principally concerned with optimization, and so we will not spend a great deal of time on specific algorithms, but note that once we decide that heuristic solutions will be sufficient, several meta-heuristics such as genetic algorithms and simulated annealing become attractive.
They are generally easy to program, and very flexible, and so allow us to use more complex constraints and optimization objective functions than we might otherwise have chosen. For instance, it becomes easy to incorporate the true link costs, and technological constraints on available capacities.

The other key aspect to optimization in network planning directly concerns robustness. We know there are errors in our measurements and predictions. We can save much time and effort in planning if we accommodate some notion of these errors in our optimization. A number of techniques for such optimization have been proposed, including oblivious routing [8] and Valiant network design [69, 70]. These papers present methods to design a network and/or its routing so that it will work well for any arbitrary traffic matrix. However, this is perhaps going too far. In most cases we do have some information about possible traffic whose use is bound to improve our network design.

A simple approach is to generate a series of possible traffic matrices by adding random noise to our predicted matrix, i.e., by taking $x_i = x + e_i$, for $i = 1, 2, \ldots, M$. Where sufficient historical data exist, the noise terms $e_i$ should be generated in such a way as to model the prediction errors. We can then optimize against the set of TMs, i.e.,

$$\begin{aligned} \text{minimize} \quad & \sum_{e \in E} \left[ \alpha_e I(c_e > 0) + \gamma c_e \right] \\ \text{such that} \quad & A x_i \le \rho c, \quad \forall i = 1, 2, \ldots, M. \end{aligned} \tag{5.29}$$

Once again this can increase the number of constraints dramatically, particularly in combination with reliability constraints, unless we realize that again many of these constraints will be redundant, and can be pruned by preprocessing.

The above approach is somewhat naive. The size of the set of TMs to use is not obvious. Also we lack guidance about the choice we should make for the SOP factor $\rho$. In principle, we already accommodate variations explicitly in the above optimization and so we might expect $\rho = 1$.
However, as before we need $\rho < 1$ to accommodate inter-measurement time interval variations in traffic, though the choice should be different than in past problems. Moreover, there may be better robust optimization strategies that can be applied in the future. For instance, robust optimization has been applied to the traffic engineering problem in [65], where the authors introduce the idea of COPE (Common-case Optimization with a Penalty Envelope), where the goal is to find the optimal routing for a predicted TM, and to ensure that the routing will not be “too bad” if there are errors in the prediction.

5.6.4 Sensitivity Analysis

Even where we believe that our optimization approach is robust, we must test this hypothesis. We can do so by performing a sensitivity analysis. The standard approach in such an analysis is to vary the inputs and examine the impact on the outputs. We can vary each possible input to detect robustness to errors in this input, though the most obvious to test is sensitivity to variations in the underlying traffic matrix. We can test such sensitivity by considering the link loads under a set of TMs generated, as before, by adding prediction errors, i.e., $x_i = x + e_i$, for $i = 1, 2, \ldots, M$, and then simply calculating the link loads $y_i = A x_i$.

There is an obvious relationship to robust optimization, in that we should not be testing against the same set of matrices against which we optimized. Moreover, in sensitivity analysis it is common to vary the size of the errors. However, simple linear algebra allows us to reduce the problem to a fixed load component $y = Ax$ and a variable component $w_i = A e_i$, which scales linearly with the size of the errors, and which can be used to see the impact of errors in the TM directly.

5.7 Summary

“Reliability, reliability, reliability” is the mantra of good network operators. Attaining reliability costs money, but few companies can afford to waste millions of dollars on an inefficient network.
This chapter is aimed at demonstrating how we can use robust network planning to attain efficient but reliable networks, despite the imprecision of measurements, uncertainties of predictions, and general vagaries of the Internet.

Reliability should mean more than connectivity. Network performance measured in packet delay or loss rates is becoming an important metric for customers deciding between operators. Network design for reliability has to account for possible congestion caused by link failures. In this chapter we consider methods for designing networks where performance is treated as part of reliability.

The methodology proposed here is built around a cyclic approach to network design exemplified in Fig. 5.1. The process of measure → analyze/predict → control → validate should not end, but rather, validation measurements are fed back into the process so that we can start again. In this way, we attain some measure of robustness to the potential errors in the process. However, the planning horizon for network design is still quite long (typically several months) and so a combination of techniques such as traffic engineering are used at different time scales to ensure robustness to failures in predicted behavior. It is the combination of this range of techniques that provides a truly robust network design methodology.

Acknowledgment

This work was informed by the period M. Roughan was employed at AT&T research, and the author owes his thanks to researchers there for many valuable discussions on these topics. M. Roughan would also like to thank the Australian Research Council from whom he receives support, in particular through grant DP0665427.

References

1. Python routing toolkit (‘pyrt’). Retrieved from http://ipmon.sprintlabs.com/pyrt/.
2. Ripe NCC: routing information service. Retrieved from http://www.ripe.net/projects/ris/.
3. University of Oregon Route Views Archive Project. Retrieved from www.routeviews.org.
4. CSX train derailment. Nanog mailing list.
Retrieved July 18, 2001 from http://www.merit.edu/mail.archives/nanog/2001-07/msg00351.html.
5. Abilene/Internet2. Retrieved from http://www.internet2.edu/observatory/archive/datacollections.html#netflow.
6. Ahuja, R. K., Magnanti, T. L., & Orlin, J. B. (1993). Network flows: Theory, algorithms, and applications. Upper Saddle River, NJ: Prentice Hall.
7. Alderson, D., Chang, H., Roughan, M., Uhlig, S., & Willinger, W. (2006). The many facets of Internet topology and traffic. Networks and Heterogeneous Media, 1(4), 569–600.
8. Applegate, D., & Cohen, E. (2003). Making intra-domain routing robust to changing and uncertain traffic demands: Understanding fundamental tradeoffs. In ACM SIGCOMM (pp. 313–324), Karlsruhe, Germany.
9. Box, G. E. P., & Draper, N. R. (2007). Response surfaces, mixtures and ridge analysis (2nd ed.). New York: Wiley.
10. Brockwell, P., & Davis, R. (1987). Time series: Theory and methods. New York: Springer.
11. Brutag, J. D. (2000). Aberrant behavior detection and control in time series for network monitoring. In Proceedings of the 14th Systems Administration Conference (LISA 2000), New Orleans, LA, USA, USENIX.
12. Buriol, L. S., Resende, M. G. C., Ribeiro, C. C., & Thorup, M. (2002). A memetic algorithm for OSPF routing. In Proceedings of the 6th INFORMS Telecom (pp. 187–188).
13. Cahn, R. S. (1998). Wide area network design. Los Altos, CA: Morgan Kaufmann.
14. Callon, R. (1990). Use of OSI IS-IS for routing in TCP/IP and dual environments. Network Working Group, Request for Comments: 1195.
15. Chekuri, C., & Khanna, S. (2004). On multidimensional packing problems. SIAM Journal of Computing, 33(4), 837–851.
16. Duffield, N., & Lund, C. (2003). Predicting resource usage and estimation accuracy in an IP flow measurement collection infrastructure. In ACM SIGCOMM Internet Measurement Conference, Miami Beach, Florida, October 2003.
17. Duffield, N., Lund, C., & Thorup, M. (2004). Flow sampling under hard resource constraints. SIGMETRICS Performance Evaluation Review, 32(1), 85–96.
18. Coffman, J. E. G., Garey, M. R., & Johnson, D. S. (1997). Approximation algorithms for bin packing: A survey. In D. Hochbaum (Ed.), Approximation algorithms for NP-hard problems. Boston: PWS Publishing.
19. Elwalid, A., Jin, C., Low, S. H., & Widjaja, I. (2001). MATE: MPLS adaptive traffic engineering. In INFOCOM (pp. 1300–1309).
20. Ericsson, M., Resende, M., & Pardalos, P. (2002). A genetic algorithm for the weight setting problem in OSPF routing. Journal of Combinatorial Optimization, 6(3), 299–333.
21. Erramilli, V., Crovella, M., & Taft, N. (2006). An independent-connection model for traffic matrices. In ACM SIGCOMM Internet Measurement Conference (IMC06), New York, NY, USA, ACM (pp. 251–256).
22. Feldmann, A., Greenberg, A., Lund, C., Reingold, N., & Rexford, J. (2000). Netscope: Traffic engineering for IP networks. IEEE Network Magazine, 14(2), 11–19.
23. Feldmann, A., Greenberg, A., Lund, C., Reingold, N., Rexford, J., & True, F. (2001). Deriving traffic demands for operational IP networks: Methodology and experience. IEEE/ACM Transactions on Networking, 9, 265–279.
24. Feldmann, A., & Rexford, J. (2001). IP network configuration for intradomain traffic engineering. IEEE Network Magazine, 15(5), 46–57.
25. Fortz, B., & Thorup, M. (2000). Internet traffic engineering by optimizing OSPF weights. In Proceedings of the 19th IEEE Conference on Computer Communications (INFOCOM) (pp. 519–528).
26. Fortz, B., & Thorup, M. (2002). Optimizing OSPF/IS-IS weights in a changing world. IEEE Journal on Selected Areas in Communications, 20(4), 756–767.
27. Francois, P., Filsfils, C., Evans, J., & Bonaventure, O. (2005). Achieving sub-second IGP convergence in large IP networks. SIGCOMM Computer Communication Review, 35(3), 35–44.
28. Garey, M., Graham, R., Johnson, D., & Yao, A. (1976). Resource constrained scheduling as generalized bin packing. Journal of Combinatorial Theory A, 21, 257–298.
29. Hansen, P. C. (1997). Rank-deficient and discrete ill-posed problems: Numerical aspects of linear inversion. Philadelphia, PA: SIAM.
30. Iannaccone, G., Chuah, C.-N., Mortier, R., Bhattacharyya, S., & Diot, C. (2002). Analysis of link failures over an IP backbone. In ACM SIGCOMM Internet Measurement Workshop, Marseilles, France, November 2002.
31. Kowalski, J., & Warfield, B. (1995). Modeling traffic demand between nodes in a telecommunications network. In ATNAC’95.
32. Labovitz, C., Ahuja, A., Bose, A., & Jahanian, F. (2000). Delayed Internet routing convergence. In Proceedings of ACM SIGCOMM.
33. Lakhina, A., Crovella, M., & Diot, C. (2004). Characterization of network-wide anomalies in traffic flows. In ACM SIGCOMM Internet Measurement Conference, Taormina, Sicily, Italy.
34. Lakhina, A., Crovella, M., & Diot, C. (2004). Diagnosing network-wide traffic anomalies. In ACM SIGCOMM.
35. Lakhina, A., Papagiannaki, K., Crovella, M., Diot, C., Kolaczyk, E. D., & Taft, N. (2004). Structural analysis of network traffic flows. In ACM SIGMETRICS/Performance.
36. Lakshman, U., & Lobo, L. (2006). MPLS traffic engineering. Cisco Press. Available from http://www.ciscopress.com/articles/article.asp?p=426640.
37. Lin, F., & Wang, J. (1993). Minimax open shortest path first routing algorithms in networks supporting the SMDS services. In Proceedings of the IEEE International Conference on Communications (ICC), 2, 666–670.
38. Maltz, D., Xie, G., Zhan, J., Zhang, H., Hjalmtysson, G., & Greenberg, A. (2004). Routing design in operational networks: A look from the inside. In ACM SIGCOMM, Portland, OR, USA.
39. Mauro, D. R., & Schmidt, K. J. (2001). Essential SNMP. Sebastopol, CA: O’Reilly.
40. Maxemchuk, N. F., Ouveysi, I., & Zukerman, M. (2000). A quantitative measure for comparison between topologies of modern telecommunications networks. In IEEE Globecom.
41. Mitra, D., & Ramakrishnan, K. G. (1999).
A case study of multiservice, multipriority traffic engineering design for data networks. In Proceedings of the IEEE GLOBECOM (pp. 1077–1083). 42. Moy, J. T. (1998). OSPF version 2. Network Working Group, Request for comments: 2328, April 1998. 43. Norros, I. (1994). A storage model with self-similar input. Queueing Systems, 16, 387–396. 44. Nucci, A., & Papagiannaki, K. (2009) Design, measurement and management of large-scale IP networks. New York: Cambrigde University Press. 45. Odlyzko, A. M. (2003). Internet traffic growth: Sources and implications. In B. B. Dingel, W. Weiershausen, A. K. Dutta, & K.-I. Sato (Eds.), Optical transmission systems and equipment for WDM networking II (Vol. 5247, pp. 1–15). Proceedings of SPIE. 46. Oetiker, T. MRTG: The multi-router traffic grapher. Available from http://oss.oetiker.ch/mrtg//. 47. Oetiker, T. RRDtool. Available from http://oss.oetiker.ch/rrdtool/. 48. Paxson, V. (2004). Strategies for sound Internet measurement. In ACM Sigcomm Internet Measurement Conference (IMC), Taormina, Sicily, Italy. 49. Potts, R. B., & Oliver, R. M. (1972). Flows in transportation networks. New York: Academic Press. 50. Pyhnen, P. (1963). A tentative model for the volume of trade between countries. Weltwirtschaftliches Archive, 90, 93–100. 51. Qiu, L., Yang, Y. R., Zhang, Y., & Shenker, S. (2003). On selfish routing in internet-like environments. In ACM SIGCOMM (pp. 151–162). 52. Qui, L., Zhang, Y., Roughan, M., & Willinger, W. (2009). Spatio-Temporal Compressive Sensing and Internet Traffic Matrices”, Yin Zhang, Matthew Roughan, Walter Willinger, and Lili Qui, ACM Sigcomm, pp. 267–278, Barcellona, August 2009. 53. Ramakrishnan, K., & Rodrigues, M. (2001). Optimal routing in shortest-path data networks. Lucent Bell Labs Technical Journal, 6(1), 117–138. 54. Rosen, E. C., Viswanathan, A., & Callon, R. (2001). Multiprotocol label switching architecture. Network Working Group, Request for Comments: 3031, 2001. 55. Roughan, M. (2005). 
Simplifying the synthesis of Internet traffic matrices. ACM SIGCOMM Computer Communications Review, 35(5), 93–96. 56. Roughan, M., & Gottlieb, J. (2002). Large-scale measurement and modeling of backbone Internet traffic. In SPIE ITCOM, Boston, MA. 57. Roughan, M., Greenberg, A., Kalmanek, C., Rumsewicz, M., Yates, J., & Zhang, Y. (2003). Experience in measuring Internet backbone traffic variability: Models, metrics, measurements and meaning. In Proceedings of the International Teletraffic Congress (ITC-18) (pp. 221–230). 5 Robust Network Planning 177 58. Roughan, M., Sen, S., Spatscheck, O., & Duffield, N. (2004). Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification. In ACM SIGCOMM Internet Measurement Workshop (pp. 135–148). Taormina, Sicily, Italy. 59. Roughan, M., Thorup, M., & Zhang, Y. (2003). Performance of estimated traffic matrices in traffic engineering. In ACM SIGMETRICS (pp. 326–327). San Diego, CA. 60. Roughan, M., Thorup, M., & Zhang, Y. (2003). Traffic engineering with estimated traffic matrices. In ACM SIGCOMM Internet Measurement Conference (IMC) (pp. 248–258). Miami Beach, FL. 61. Shaikh, A., & Greenberg, A. (2001). Experience in black-box OSPF measurement. In Proceedings of the ACM SIGCOMM Internet Measurement Workshop (pp. 113–125). 62. Shaikh, A., & Greenberg, A. (2004). OSPF monitoring: Architecture, design and deployment experience. In Proceedings of the USENIX Symposium on Networked System Design and Implementation (NSDI). 63. Tinbergen, J. (1962). Shaping the world economy: Suggestions for an international economic policy. The Twentieth Century Fund. 64. Uhlig, S., Quoitin, B., Balon, S., & Lepropre, J. (2006). Providing public intradomain traffic matrices to the research community. ACM SIGCOMM Computer Communication Review, 36(1), 83–86. 65. Wang, H., Xie, H., Qiu, L., Yang, Y. R., Zhang, Y., & Greenberg, A. (2006). COPE: Traffic engineering in dynamic networks. In ACM SIGCOMM (pp. 99–110). 
66. Zhang, Y., Ge, Z., Roughan, M., & Greenberg, A. (2005). Network anomography. In Proceedings of the Internet Measurement Conference (IMC ’05), Berkeley, CA. 67. Zhang, Y., Roughan, M., Duffield, N., & Greenberg, A. (2003). Fast accurate computation of large-scale IP traffic matrices from link loads. In ACM SIGMETRICS (pp. 206–217). San Diego, CA. 68. Zhang, Y., Roughan, M., Lund, C., & Donoho, D. (2003). An information-theoretic approach to traffic matrix estimation. In ACM SIGCOMM (pp. 301–312). Karlsruhe, Germany. 69. Zhang-Shen, R., & McKeown, N. (2004). Designing a predictable Internet backbone. In HotNets III, San Diego, CA, November 2004. 70. Zhang-Shen, R., & McKeown, N. (2005). Designing a predictable Internet backbone with Valiant load-balancing. In Thirteenth International Workshop on Quality of Service (IWQoS), Passau, Germany, June 2005. Part III Interdomain Reliability and Overlay Networks Chapter 6 Interdomain Routing and Reliability Feng Wang and Lixin Gao 6.1 Introduction Routing as the “control plane” of the Internet plays a crucial role on the performance of data plane in the Internet. That is, routing aims to ensure that there are forwarding paths for delivering packets to their intended destinations. Routing protocols are the languages that individual routers speak in order to cooperatively achieve the goal in a distributed manner. The Internet routing architecture is structured in a hierarchical fashion. At the bottom level, an Autonomous System (AS) consists of a network of routers under a single administrative entity. Routing within an AS is achieved via an Interior Gateway Protocol (IGP) such as OSPF or IS-IS. At the top level, an interdomain routing protocol glues thousands of ASes together and plays a crucial role in the delivery of traffic across the global Internet. In this chapter, we provide an overview of the interdomain routing architecture and its reliability in maintaining global reachability. 
Border Gateway Protocol (BGP) is the current de facto standard for interdomain routing. As a path vector routing protocol, BGP requires each router to advertise only its best route for a destination to its neighbors. Each route includes attributes such as the AS path (the sequence of ASes to traverse to reach the destination) and the local preference (indicating the preference order used in selecting the best route). Rather than simply selecting the route with the shortest AS path, routers can apply complex routing policies (such as setting a higher local preference value for a route through a particular AS) to influence best route selection and to decide whether to propagate the selected route to their neighbors. Although BGP is a simple path vector protocol, configuring BGP routing policies is quite complex. Each AS typically configures its routing policy according to its own goals, such as load-balancing traffic among its links, without coordinating with other networks. However, arbitrary policy configurations might lead to route divergence or persistent oscillation of the routing protocol. That is, although BGP allows flexibility in routing policy configuration, BGP itself does not guarantee routing convergence. Arbitrary policy configurations, whether unintentional mistakes or intentionally malicious configurations, can lead to persistent route oscillation [9, 11].

F. Wang, School of Engineering and Computational Sciences, Liberty University, e-mail: fwang@liberty.edu
L. Gao, Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, Amherst, MA 01002, USA, e-mail: lgao@ecs.umass.edu
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_6, © Springer-Verlag London Limited 2010

Besides being a policy-based routing protocol, BGP has many features that are intended to let it scale to a network as large as the global Internet.
One feature is that BGP sends incremental updates upon routing changes rather than sending complete routing information: BGP-speaking routers send new routes only when there are changes. Related to the incremental update feature, BGP uses a timer, referred to as the Minimum Route Advertisement Interval (MRAI) timer, to enforce a minimum amount of time that must elapse between routing updates, in order to limit the number of updates for each prefix. Therefore, BGP does not react immediately to changes in topology or routing policy configuration. Rather, it controls the frequency with which route changes can be made in order to avoid overloading router CPUs and to reduce route flapping. While MRAI timers can be effective in reducing routing update frequency, the slow reaction to changes can delay route convergence. More importantly, during the delayed route convergence process, routes among neighboring routers might be inconsistent. This can lead to transient routing loops or transient routing outages (referred to as transient routing failures) caused by the delay in discovering alternate routes.

The goal of this chapter is to provide an overview of BGP, to give practical guidelines for configuring BGP routing policy, and to offer a framework for understanding how undesirable routing states, such as persistent routing oscillation and transient routing failures or loops, can arise. We also present a methodology for measuring the extent to which these undesirable routing states can affect the quality of end-to-end packet delivery. We further describe proposed solutions for reliable interdomain routing. Toward this end, the chapter is organized as follows. We begin with an introduction to BGP in Section 6.2. We first describe the interdomain routing architecture, and then illustrate the details of how BGP enables ASes to exchange global reachability information, along with the various BGP route attributes.
We further present routing policy configurations that enable each individual AS to meet its traffic engineering goals or its commercial agreements. In Section 6.3, we introduce multihoming technology. Multihoming allows an AS to have multiple connections to upstream providers in order to survive a single point of failure. We present various multihoming approaches, such as multihoming to multiple upstream providers or to a single upstream provider, to show the redundancy and load-balancing benefits associated with being multihomed. In Section 6.4, we highlight the limitations of BGP. For example, the protocol design does not guarantee that routing will converge to a stable route. We further show how incentive-compatible routing policies can prevent routing oscillation, and how transient routing failures or loops can occur even under incentive-compatible routing configurations or a redundant underlying infrastructure. Having understood the potential transient routing failures and routing loops, in Section 6.5 we describe a measurement methodology and measurement results that quantify the impact of transient routing failures and routing loops on end-to-end path performance. This illustrates how severely routing outages can affect the quality of packet delivery. In Section 6.6, we present a detailed overview of the existing solutions for achieving reliable interdomain routing. We show that both protocol extensions and routing policies can enhance the reliability of interdomain routing. Finally, we conclude the chapter by pointing out possible future research directions in Section 6.7.

6.2 Interdomain Routing

This section introduces the interdomain routing architecture, the interdomain routing protocol (BGP), and BGP routing policy configuration.

6.2.1 Interdomain Routing Architecture

The Internet consists of a large collection of hosts interconnected by networks of links and routers. The Internet is divided into thousands of ASes.
Examples range from college campuses and corporate networks to global Internet Service Providers (ISPs). An AS has its own routers and routing policies, and connects to other ASes to exchange traffic with remote hosts. A router typically has very detailed knowledge of the topology within its AS, and limited reachability information about other ASes. Figure 6.1 shows an example of the Internet topology, where there are large transit ISPs such as MCI or AT&T, and stub ASes, such as the University of Massachusetts’ network, which do not provide transit service to other ASes.

Fig. 6.1 An example topology of interconnection among Internet service providers and stub networks

Note that the topologies of the transit ISPs and stub ASes shown in this example are much simpler than those in reality. Typically, a large transit ISP consists of hundreds or thousands of routers. ASes interconnect at public Internet exchange points (IXPs) such as MAE-EAST or MAE-WEST, or over dedicated point-to-point links. Public exchange points typically consist of a shared medium, such as a Gigabit Ethernet or an ATM switch, that interconnects routers from several different ASes. Physical connectivity at the IXP does not necessarily imply that every pair of ASes exchanges traffic with each other. AS pairs negotiate contractual agreements that control the exchange of traffic. These relationships include provider-to-customer, peer-to-peer, and backup, and are discussed in more detail in Section 6.4.1.

Each AS has responsibility for carrying traffic to and from a set of customer IP addresses. The scalability of the Internet routing infrastructure depends on the aggregation of IP addresses in contiguous blocks, called prefixes, each consisting of a 32-bit IP address and a mask length (e.g., 1.2.3.0/24). An IP address is generally written as four octets, each shown as a decimal number from 0 to 255.
The mask length indicates the number of significant bits in the IP address. That is, a prefix aggregates all IP addresses that match the IP address in the significant bits. For example, prefix 1.2.3.0/24 represents all addresses between 1.2.3.0 and 1.2.3.255. An AS employs an intradomain routing protocol (IGP) such as OSPF or IS-IS to determine how to reach routers and networks within itself, and employs an interdomain routing protocol, i.e., Border Gateway Protocol (BGP) in the current Internet, to advertise the reachability of networks (represented as prefixes) to neighboring ASes.

6.2.2 IGP

Each AS uses an intradomain routing protocol, or IGP, for routing within the AS. There are two classes of IGP: (1) distance-vector and (2) link-state routing protocols. In distance-vector routing, every routing message propagated by a router to its neighbors contains the length of the shortest path to a destination. In link-state routing, every router learns the entire network topology along with the link costs. It then computes the shortest path (or the minimum cost path) to each destination. When a network link changes state, a notification, called a link state advertisement (LSA), is flooded throughout the network. All routers note the change and recompute their routes accordingly.

6.2.3 BGP

The interdomain routing protocol, BGP, is the glue that pieces together the diverse networks, or ASes, that comprise the global Internet today. It is used among ASes to exchange network reachability information. Each AS has one or more border routers that connect to routers in neighboring ASes, and possibly a number of internal BGP-speaking routers. BGP is a path-vector routing protocol in which routers exchange the path used for reaching a destination. By including the path in the route update information, one can avoid loops by eliminating any path that traverses the same node twice.
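The loop-avoidance rule can be illustrated with a minimal Python sketch. The function names and the representation of an AS path as a list of AS numbers are assumptions made for illustration only; they are not part of BGP. The sketch checks only for the receiver's own AS number, which is the form of the check that a BGP router applies.

```python
def creates_loop(as_path, own_as):
    """True if own_as already appears in the advertised AS path."""
    return own_as in as_path


def extend_path(as_path, own_as):
    """Prepend own_as before propagating; return None if that would form a loop."""
    if creates_loop(as_path, own_as):
        return None
    return [own_as] + list(as_path)


# AS 3 hears path (1 0) and may propagate (3 1 0).
print(extend_path([1, 0], own_as=3))     # [3, 1, 0]
# AS 3 hears path (2 3 0): it is already on the path, so the route is discarded.
print(extend_path([2, 3, 0], own_as=3))  # None
```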
Using a path vector protocol, routers running BGP distribute reachability information about destinations (network prefixes) by incrementally sending route updates, which contain route announcements or withdrawals, to their neighbors. BGP constructs paths by successively propagating advertisements between pairs of routers that are configured as BGP peers. Each advertisement concerns a particular prefix and includes the list of ASes along the path (the AS path) to the network containing the prefix. By representing the path as the sequence of ASes to be traversed, BGP hides the details of the topology and routing information inside each AS. Before accepting an advertisement, the receiving router checks for the presence of its own AS number in the AS path in order to discard routes with loops. Upon receiving an advertisement, a BGP-speaking router must decide whether or not to use this path and, if the path is chosen, whether or not to propagate the advertisement to neighboring ASes (after adding its own AS number to the AS path). BGP requires that a router advertise only its best route for each destination to its neighbors. A BGP-speaking router withdraws an advertisement when the prefix is no longer reachable via this route, which may lead to a sequence of withdrawals by upstream ASes that are using this path. When an event affects a router’s best route to a destination, that router computes a new best route and advertises the routing change to its neighbors. If the router no longer has any route to the destination, it sends a withdrawal message for that destination to its neighbors. When an event causes a set of routers to lose their current routing information, the routing change is propagated to other routers.
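The announce/withdraw behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a BGP implementation: the class `Router`, its method names, and the rule "shortest AS path, ties broken by neighbor name" are hypothetical stand-ins, and the sketch omits the loop check, the prepending of the router's own AS number, routing policy, and the MRAI timer.

```python
class Router:
    """Keeps the routes learned per neighbor and reports changes of the best route."""

    def __init__(self):
        self.routes = {}  # prefix -> {neighbor: as_path}
        self.best = {}    # prefix -> (neighbor, as_path)

    def _select(self, prefix):
        candidates = self.routes.get(prefix, {})
        if not candidates:
            return None
        # Stand-in selection rule (assumption): shortest AS path, then neighbor name.
        neighbor = min(candidates, key=lambda n: (len(candidates[n]), n))
        return (neighbor, candidates[neighbor])

    def _update(self, prefix):
        old = self.best.get(prefix)
        new = self._select(prefix)
        if new == old:
            return None                      # best route unchanged: send nothing
        if new is None:
            del self.best[prefix]
            return ("withdraw", prefix)      # no route left: send a withdrawal
        self.best[prefix] = new
        return ("announce", prefix, new[1])  # advertise the new best route

    def announce(self, neighbor, prefix, as_path):
        self.routes.setdefault(prefix, {})[neighbor] = tuple(as_path)
        return self._update(prefix)

    def withdraw(self, neighbor, prefix):
        self.routes.get(prefix, {}).pop(neighbor, None)
        return self._update(prefix)


r = Router()
print(r.announce("A", "d", (1, 0)))     # ('announce', 'd', (1, 0))
print(r.announce("B", "d", (2, 3, 0)))  # None: the best route did not change
print(r.withdraw("A", "d"))             # ('announce', 'd', (2, 3, 0))
print(r.withdraw("B", "d"))             # ('withdraw', 'd')
```

Only a change of the best route produces a message, which mirrors the incremental nature of BGP updates.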
To limit the number of updates that a router has to process within a short time period, a rate-limiting timer, called the Minimum Route Advertisement Interval (MRAI) timer, determines the minimum amount of time that must elapse between routing updates to a neighbor [26]. This has the potential to reduce the number of routing updates, as a single routing change might trigger multiple transient routes during the path exploration or route convergence process before the final stable route is determined. If new routes are selected multiple times while waiting for the expiration of the MRAI timer, the most recently selected route is advertised when the timer expires. To avoid long-lasting loss of connectivity, RFC 4271 [26] specifies that the MRAI timer is applied only to BGP announcements, not to explicit withdrawals. However, some router implementations might apply the MRAI timer to both announcements and withdrawals.

BGP sessions can be established between router pairs in the same AS (we refer to such a session as an iBGP session) or in different ASes (we refer to such a session as an eBGP session). Figure 6.2 illustrates examples of iBGP and eBGP sessions. Each BGP-speaking router originates updates for one or more prefixes, and can send the updates to its immediate neighbors via an iBGP or eBGP session.

Fig. 6.2 Internal BGP (iBGP) versus external BGP (eBGP)

iBGP sessions are established between routers in the same AS in order for the routers to exchange routes learned from other ASes. In the simplest case, each router has an iBGP session with every other router (i.e., a fully meshed iBGP configuration).
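As a small worked example of what a full mesh implies, the number of iBGP sessions among n routers is n(n − 1)/2, one per router pair. The helper name below is hypothetical and the router counts are illustrative values, not measurements.

```python
def full_mesh_sessions(n_routers):
    """Number of iBGP sessions when every router pairs with every other router."""
    return n_routers * (n_routers - 1) // 2


for n in (10, 100, 1000):
    print(n, full_mesh_sessions(n))
# 10 45
# 100 4950
# 1000 499500
```

The quadratic growth is the scalability concern for ASes with hundreds or thousands of routers.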
In the fully meshed iBGP configuration, a route received from an iBGP router cannot be sent to another iBGP-speaking router, since a route via an iBGP peer should be received directly from that iBGP peer. In practice, an AS with hundreds or thousands of routers may need to improve scalability by using route reflectors to avoid a fully meshed iBGP configuration. This optimization is intended to reduce iBGP traffic without affecting the routing decision. Each route reflector and its clients (i.e., iBGP neighbors that are not route reflectors themselves) form a cluster. Figure 6.3 shows an example of route reflector clusters, where cluster 1 contains route reflector RR1 and its three clients. Typically, route reflectors and their clients are located in the same facility, e.g., in the same Point of Presence (PoP). Route reflectors themselves are fully meshed. For example, in Fig. 6.3, the three route reflectors RR1, RR2, and RR3 are fully meshed. A route reflector selects the best route among the routes learned via clients in the cluster, and sends the best route to all other clients in the cluster except the one from which the best route is learned, as well as to all other route reflectors. Similarly, it also reflects routes learned from other route reflectors to all of its own clients.

Fig. 6.3 An example of route reflector configuration for scaling iBGP

6.2.4 Routing Policy and Route Selection Process

Fig. 6.4 Import policies, route selection, and export policies

The simplest routing policy is shortest AS path routing, where each AS selects a route with the shortest AS path. BGP, however, allows much more flexible routing policies than shortest AS path routing.
An AS can favor a path with a longer AS path length by assigning it a higher local preference value. BGP also allows an AS to send a hint to a neighbor about the preference that should be given to a route by using the community attribute. BGP also enables an AS to control how traffic enters its network by assigning a different multiple exit discriminator (MED) value to the advertisements it sends on each link to a neighboring AS. Otherwise, the neighboring AS would select the link based on the link cost within its own intradomain routing protocol. An AS can also discourage traffic from entering its network by performing AS prepending, which inflates the length of the AS path by listing an AS number multiple times.

Processing an incoming BGP update involves three steps, as shown in Fig. 6.4:
1. Import policies that decide which routes to consider
2. Path selection that decides which route to use
3. Export policies that decide whether (and what) to advertise to a neighboring AS

An AS can apply both implicit and explicit import policies. Every eBGP peering session has an implicit import policy that discards a routing update when the receiving BGP speaker’s AS already appears in the AS path; this is essential to avoid introducing a cycle in the AS path. The explicit import policy includes denying or permitting an update, and assigning a local preference value. For example, an explicit import policy could assign a local preference of 100 if a particular AS appears in the AS path, or deny any update that includes AS 2 in the path. After applying the import policies to a route update from an eBGP session, each BGP-speaking router follows a route selection process that picks the best route for each prefix, which is shown in Table 6.1.

Table 6.1 Steps in the BGP path selection process
1. Highest local preference
2. Shortest AS path
3. Lowest origin type
4. Smallest MED
5. Smallest IGP path cost to egress router
6. Smallest next-hop router id
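The ordering in Table 6.1 can be sketched as a lexicographic comparison. The following Python sketch is an illustration under assumptions: the `Route` record, its field names, and the sample values are hypothetical, router ids are modeled as plain integers, and MED is compared across all candidates, whereas routers commonly compare MED only among routes from the same neighboring AS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    local_pref: int   # higher is preferred
    as_path: tuple    # shorter is preferred
    origin: int       # lower is preferred (0 = IGP, 1 = EGP, 2 = incomplete)
    med: int          # smaller is preferred
    igp_cost: int     # smaller cost to the egress router is preferred
    router_id: int    # smaller next-hop router id is preferred


def selection_key(route):
    """Tuple whose lexicographic order follows steps 1 to 6 of Table 6.1."""
    return (-route.local_pref, len(route.as_path), route.origin,
            route.med, route.igp_cost, route.router_id)


def best_route(routes):
    return min(routes, key=selection_key)


short = Route(local_pref=90, as_path=(2, 0), origin=0, med=0, igp_cost=9, router_id=2)
longer = Route(local_pref=100, as_path=(1, 4, 0), origin=0, med=0, igp_cost=8, router_id=1)

# Local preference overrides AS-path length, so the longer path is chosen.
print(best_route([short, longer]) is longer)  # True
```

Encoding the steps as a single sort key keeps each tie-break applied only when all earlier steps are equal.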
The BGP-speaking router picks the route with the highest local preference, breaking ties by selecting the route with the shortest AS path. Note that local preference overrides the AS-path length. Among the remaining routes, the BGP-speaking router picks the one with the smallest MED, breaking ties by selecting the route with the smallest IGP cost to the BGP-speaking router that passes the route via an iBGP session. Note that, since the tie-breaking process draws on intradomain cost information, two BGP-speaking routers in the same AS may select different best routes for the same prefix. If a tie still exists, the BGP-speaking router picks the route with the smallest next-hop router ID.

Each BGP-speaking router sends only its best route (one best route for each prefix) via BGP sessions, including eBGP and iBGP sessions. The BGP-speaking router applies implicit and explicit export policies on each eBGP session to a neighboring BGP speaker. Each BGP-speaking router applies an implicit policy that sets the MED to its default value, sets the next hop to the interface that connects the BGP session, and prepends the AS number of the BGP-speaking router to the AS path. Explicit export policies include permitting or denying the route, assigning a MED, assigning a community set, and prepending the AS number one or more times to the AS path. For example, an AS could prepend its AS number several times to the AS path for a prefix.

Although the BGP route selection process aims to select routes based mostly on BGP attributes, it is not totally independent of the IGP. In fact, IGP cost can influence route selection when the best path is decided by comparing the IGP cost to the egress routers. We refer to this tie-breaking step of BGP route selection as hot-potato routing, since with all other BGP attributes being equal, each AS selects the route with the shortest path to exit its network. For example, in Fig. 6.5, AS 3 learns BGP routes to a destination originated by AS 0 at egress routers C1 and C2, from AS 1 and AS 2, respectively. The value on each link within AS 3 represents the corresponding IGP cost. Suppose that the two learned routes to the destination have identical local preferences. We see that the AS path lengths of the two routes are equal. Router C3 learns two routes, from C1 and C2, respectively, and selects the one learned from C1 as the best route because the IGP cost of path (C3 C1) is smaller than that of path (C3 C2). Similarly, router C4 selects the route learned from C2 as the best route because path (C4 C2) has a smaller IGP cost than path (C4 C1). However, hot-potato routing means that changing an IGP weight can cause BGP-speaking routers to select a different best route and, therefore, to shift egress routers. For instance, if the IGP link cost between routers C1 and C3 is changed from 8 to 10, router C3 will change its egress router from C1 to C2.

Fig. 6.5 An example illustrating hot-potato routing at AS 3. The value around a link represents an IGP weight

Fig. 6.6 Local preference configuration

BGP routing policy configuration is typically specified in a router configuration file. A BGP routing policy can be assigned based on the destination prefix or the next-hop AS. For example, in Fig. 6.6, AS 0 advertises a prefix “10.1.1.0/24” to the Internet. AS 3 connects to AS 1 and AS 2, and will get routing updates about the destination “10.1.1.0/24” from the two ASes. AS 3 decides what path its outbound traffic to the destination is going to take. Suppose that AS 3 prefers to use the connection via AS 1 to reach the destination.
As shown in the following configuration based on Cisco IOS commands, router RTA at AS 3 sets an explicit import policy that assigns a local preference value of 100 to the route from AS 1:

    router bgp 3
     neighbor 1.1.1.1 remote-as 1
     neighbor 1.1.1.1 route-map AS1-IN in
     neighbor 4.4.4.2 remote-as 3
    access-list 1 permit 0.0.0.0 255.255.255.255
    route-map AS1-IN permit
     match ip address 1
     set local-preference 100

We describe the commands in the above configuration as follows. The first command starts a BGP process with AS number 3 at router RTA. The second command sets up an eBGP session with the router at AS 1. The route-map command associated with the neighbor statement applies route map AS1-IN to inbound updates from AS 1. The fourth command, like the first neighbor command, sets up a BGP session, in this case an iBGP session with router RTB. The access-list command creates access list 1, which permits all advertisements. The route-map command creates a route map named AS1-IN that uses access list 1 to identify the routes to be assigned a local preference of 100.

6.2.5 Convergence Process of BGP

In this section, we illustrate how BGP routing processes converge to stable routes. Figure 6.7 shows an example of a routing policy configuration of a simple topology. In this chapter, we simplify the representation of the network using the graph-theoretic notions of nodes and edges, where a node represents either an AS or a BGP-speaking router, and an edge represents the link between two nodes. In this example, we use a node to represent an AS. Furthermore, throughout this chapter, we focus on one destination prefix, d, which is always originated from AS 0. The figure indicates the export policy by showing all AS paths that an AS can receive from the adjacent router on the associated interface (referred to as permissible AS paths). The figure also indicates the import policy by ordering the paths in descending order of local preference.

Fig. 6.7 An example of policy configuration that converges. The paths around a node represent its permissible AS paths, ordered in descending order of preference (AS 1: (1 0), (1 2 0); AS 2: (2 3 0), (2 0); AS 3: (3 1 0), (3 0))

The BGP routing process converges as follows.

1. Destination prefix d is announced to ASes 1, 2, and 3 via direct links.
2. ASes 1, 2, and 3 each choose their direct path as the best route, since it is the only route they have received, and announce these direct paths to their neighbors.
3. AS 1 now has two paths, (1 0) and (1 2 0), since these are its only permissible paths. AS 2 now has two paths, (2 0) and (2 3 0). AS 3 now has two paths, (3 0) and (3 1 0). According to the local preference of each AS, AS 1 chooses (1 0) as its best route, AS 3 chooses (3 1 0), and AS 2 chooses (2 3 0).
4. AS 3 announces its best path (3 1 0), and thereby implicitly withdraws its earlier announcement of (3 0) to AS 2. Now, with (2 0) as its only path, AS 2 chooses (2 0) as its best path.
5. AS 2 announces its best path to both AS 1 and AS 3. However, this announcement does not change the route that AS 1 or AS 3 chooses. All ASes have therefore reached a stable route, no router needs to send new update messages, and the BGP process has converged.

Note that during the convergence process, each AS selects and/or announces its best route asynchronously, at times determined by the expiration of MRAI timers. We simplify the process by assuming that route announcements are performed in lock step. Nevertheless, it can be proved that in this example, no matter what the exact steps of the convergence process are, the stable route reached by each AS is the same.
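The lock-step process above can be sketched in a few lines of Python. This is a minimal illustration, not a BGP implementation: the names `PERMISSIBLE`, `best_path`, and `simulate` are hypothetical, the permissible paths are those listed for Fig. 6.7, and the sketch assumes that every AS re-selects its route simultaneously in each round and that the direct links to AS 0 stay up.

```python
# Permissible paths of the Fig. 6.7 example, most preferred first.
# A path is a tuple of AS numbers that ends at the origin, AS 0.
PERMISSIBLE = {
    1: [(1, 0), (1, 2, 0)],
    2: [(2, 3, 0), (2, 0)],
    3: [(3, 1, 0), (3, 0)],
}

def best_path(asn, chosen, permissible):
    """Return the most preferred permissible path that is currently available."""
    for path in permissible[asn]:
        rest = path[1:]
        # A path is available if it is the direct path to AS 0, or if the
        # next-hop AS currently uses (and hence announces) the rest of it.
        if rest == (0,) or chosen.get(path[1]) == rest:
            return path
    return None

def simulate(permissible, max_rounds=20):
    """Lock-step rounds: each AS re-selects using its neighbors' previous choices."""
    chosen = {asn: None for asn in permissible}
    history = [dict(chosen)]
    for _ in range(max_rounds):
        new = {asn: best_path(asn, chosen, permissible) for asn in permissible}
        history.append(new)
        if new == chosen:
            return new, history   # no AS changed its route: converged
        chosen = new
    return None, history          # no fixed point within max_rounds

stable, history = simulate(PERMISSIBLE)
for round_no, routes in enumerate(history):
    print(round_no, routes)
```

Under these assumptions the second round reproduces step 3 above, with AS 2 on (2 3 0) and AS 3 on (3 1 0), and the run ends with AS 1 on (1 0), AS 2 on (2 0), and AS 3 on (3 1 0).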
6.3 Multihoming Technology

In this section, we provide an overview of current multihoming technology, which is widely used to provide redundant connections. Multihoming refers to the technology where an AS connects to the Internet through multiple connections via one or more upstream providers. It is intended to enhance the reliability of Internet connectivity. When one of the connections fails or is under maintenance, the AS can still connect to the Internet via the other connections. Multihoming can be achieved using BGP configuration, static routes, Network Address Translation (NAT), or a combination of these. In this section, we focus on multihoming with BGP configuration.

The redundancy provided by multihoming can bring additional complexity to the network configuration. First, it is imperative to designate primary and backup connections in such a manner that when the primary connection fails, traffic automatically falls back to the backup connection. Second, it is desirable to distribute traffic across multiple connections. Traffic can be classified into inbound and outbound traffic. Outbound traffic is the traffic originating within the multihomed AS or its customers and destined to other ASes; inbound traffic is the traffic destined to the AS or its customers and coming from other ASes.

A multihomed AS can be multihomed to a single provider or to multiple providers. We describe how each case can be configured in Sections 6.3.1 and 6.3.2, respectively.

6.3.1 Multihoming to a Single Provider

The simplest way for an AS to connect to the Internet is by setting up a single connection with a provider. However, the AS then has only one connection to send and receive data. This single-homed configuration is not resilient to a single point of failure, such as a link or router failure or maintenance.
To address this issue, the AS can set up multiple connections to the provider. Four types of connections can be established between an AS and its provider. We describe each type as follows.

Fig. 6.8 Four types of multihoming connections: (a) SSA, (b) SMA, (c) MMA, (d) MMB

Multiple Connections Between a Single Customer Router and a Single Provider Access Router (SSA). An AS has a single border router connected to its provider’s access router with multiple links. As illustrated in Fig. 6.8a, AS 0 has a single border router BoR1, which connects to AS 1’s access router, AR1, via two links. If one of the links fails, the other link can be used.

Multiple Connections Between a Single Customer Router and Multiple Provider Access Routers (SMA). An AS has a single router connected to its provider’s multiple access routers. For example, in Fig. 6.8b, BoR1 connects to AS 1 at both AR1 and AR2. This configuration can maintain connectivity under a single failure of a link or an access router, but not under a failure of the customer router.

Multiple Connections Between Multiple Customer Routers and Multiple Provider Access Routers (MMA). An AS has multiple routers connected to its provider’s multiple access routers. Note that those multiple access routers at the provider are connected to the same backbone router. For example, in Fig. 6.8c, AS 0 has two routers: BoR1 and BoR2. Each border router connects to an access router (AR) in AS 1. This configuration can maintain connectivity under a single failure of an access router or a border router. However, the two access routers connect to the same backbone router, BaR1. A failure at BaR1 can cause both connections to become unavailable.

Multiple Connections Between Multiple Customer Routers and Multiple Provider Backbone Routers (MMB). An AS has multiple connections between its multiple border routers and multiple backbone routers at its provider.
This configuration can achieve higher reliability than that of MMA. For example, in Fig. 6.8d, AS 0 has two border routers, BoR1 and BoR2, which are connected to geographically separate backbone routers at AS 1. AS 0’s BoR1 connects to AS 1’s access router AR1, and they are at the same geographical location, while the border router BoR2 is connected to another backbone router, BaR1. A private physical connection connects the customer AS’s border router BoR2 and the backbone router BaR1. This method can maintain connectivity even under a failure of a backbone router.

Next, we describe how an AS can control traffic over the primary and backup links. First, we discuss the control of outbound traffic. A multihomed AS can assign different local preference values to the routes learned from its provider to control its outgoing traffic. For example, in Fig. 6.8b, BoR1 will receive two identical routes for each destination prefix. AS 0 can assign a higher local preference value to the routes received through one particular connection, so that they are preferred over the routes for the same destination received through the other connection. Multihomed configurations of SSA, MMA, or MMB can apply the same method to control outbound traffic over the primary link.

In addition, an AS multihomed to a single provider with SSA can use another method: setting the next hop to a virtual address to control outbound traffic. For example, in Fig. 6.8a, AR1 can be assigned a virtual address on a loopback interface, 20.10.10.1. BoR1 will set up a connection with the loopback address. As a result, all routes that BoR1 receives from AR1 will have the same next hop, 20.10.10.1. Since next hop 20.10.10.1 can be reached via two connections, outbound traffic can be distributed over the two links.

Second, we discuss how an AS multihomed to a single provider can control its inbound traffic.
In this case, the multihomed AS can tweak BGP attribute values, such as AS path length or MED, to influence route selection at the provider’s routers. For example, an AS can prepend its AS number to the AS path of the route update announced via the backup link, or send the route update via the backup link with a higher MED value than that sent via the primary link. As a result, the primary link is used in normal situations since it has a shorter AS path or a lower MED value. When the primary link is down, the backup link will be used.

6.3.2 Multihoming to Multiple Providers

The availability of the Internet connectivity provided by upstream providers is very important for an AS. Multihoming to more than one provider can ensure that the AS maintains global Internet connectivity even if the connection to one of its providers fails [1]. For example, in Fig. 6.9, AS 0 is multihomed to two upstream providers: AS 1 and AS 2. AS 0 may use one of its providers as its primary provider and the other as a backup provider. When connectivity through the primary provider fails, AS 0 still has connectivity to the Internet through the backup provider.

Fig. 6.9 An example of an AS multihomed to two upstream providers

A multihomed AS can be configured to direct its outbound traffic through the primary provider. Only when the connection through the primary provider fails does its outbound traffic use the connection through the backup provider. To achieve this goal, a multihomed AS can use the same approach described for an AS multihomed to a single provider. That is, the AS may assign a higher local preference to the route through the primary provider than to the route through the backup provider.

A multihomed AS may also want to control which provider its inbound traffic uses.
There are several approaches to control the route used for inbound traffic. The simplest approach is to advertise its prefixes only to the primary provider so that inbound traffic uses the primary provider. For example, in Fig. 6.9, AS 0 can advertise its prefix only to its primary provider, say, AS 1. However, such selective advertisement cannot provide the redundancy afforded by multihoming. In the above example, if the link between AS 1 and AS 0 fails, AS 0 becomes unreachable until AS 0 notices the failure and advertises its prefixes to the backup provider, AS 2. In this case, the time it takes to fail over to the backup provider depends on how fast the multihomed AS detects the failure and decides to announce its prefixes to the backup provider, and on how fast the announcement propagates to the global Internet.

Alternatively, an AS can control the route taken by inbound traffic by splitting its prefix into several more specific prefixes and advertising each more specific prefix to its primary provider. For example, in Fig. 6.10, AS 0 has a prefix, “12.0.0.0/19”. AS 0 splits the prefix into two more specific prefixes: “12.0.0.0/20” and “12.0.16.0/20”. AS 0 can announce “12.0.0.0/20” to AS 1, and “12.0.16.0/20” to AS 2. At the same time, AS 0 can advertise its prefix “12.0.0.0/19” to both providers. As a result, inbound traffic to “12.0.0.0/20” comes from AS 1, while inbound traffic to “12.0.16.0/20” comes from AS 2. This approach can balance the traffic load between the two providers by designating each one as the primary provider for a specific prefix. At the same time, the approach can tolerate failure of links to providers. For example, if the link between AS 0 and AS 1 fails, destinations within prefix “12.0.0.0/20” can still be reached via AS 2 since prefix “12.0.0.0/19” is announced via AS 2.
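The effect of the split can be illustrated with longest-prefix matching. The following is a minimal sketch using Python's standard `ipaddress` module; the functions `announcements` and `inbound_provider` and the sample addresses are hypothetical, and the sketch models only the longest-prefix match at a remote router, not BGP route selection between equally specific routes.

```python
import ipaddress

# The aggregate owned by AS 0 and its two halves, as in the example above.
aggregate = ipaddress.ip_network("12.0.0.0/19")
more_specifics = list(aggregate.subnets(prefixlen_diff=1))
# more_specifics == [12.0.0.0/20, 12.0.16.0/20]

def announcements(link_to_as1_up=True):
    """Routes visible in the Internet, as (prefix, provider) pairs.

    AS 0 announces 12.0.0.0/20 plus the /19 via AS 1, and
    12.0.16.0/20 plus the /19 via AS 2. If the link to AS 1 is down,
    everything announced through AS 1 disappears.
    """
    routes = [(more_specifics[1], "AS 2"), (aggregate, "AS 2")]
    if link_to_as1_up:
        routes += [(more_specifics[0], "AS 1"), (aggregate, "AS 1")]
    return routes

def inbound_provider(address, routes):
    """Pick the provider of the longest matching prefix for an address."""
    addr = ipaddress.ip_address(address)
    matches = [(net, provider) for net, provider in routes if addr in net]
    net, provider = max(matches, key=lambda m: m[0].prefixlen)
    return provider

print(inbound_provider("12.0.5.1", announcements()))    # inside 12.0.0.0/20
print(inbound_provider("12.0.20.1", announcements()))   # inside 12.0.16.0/20
print(inbound_provider("12.0.5.1", announcements(link_to_as1_up=False)))
```

In normal operation the two sample addresses resolve to AS 1 and AS 2, respectively. With the link to AS 1 down, the address inside “12.0.0.0/20” matches only the covering /19, so it is reached via AS 2.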
Despite the advantages of load balancing and fault tolerance, this approach has the drawback of potentially increasing the number of prefixes announced to the global Internet.

Fig. 6.10 An example of splitting prefixes

Another approach to controlling the route of inbound traffic is AS path prepending. An AS can prepend its AS number one or several times when announcing to the backup provider. This can “discourage” other ASes from selecting the route via the backup provider. Note that this approach cannot ensure that all inbound traffic will go through the primary provider. It is possible for an AS to use the longer backup path rather than the shorter primary path if the backup path has a higher local preference. In fact, most providers prefer routes via customers over routes via providers. Consider the example network in Fig. 6.9: AS 2 learns paths to reach prefixes in AS 0 from both its direct connection and its upstream connections, but AS 2 will prefer the direct connection, although AS 0 intends it to be a backup path.

In summary, multihoming techniques aim to provide redundant connectivity. Nevertheless, the extent to which these techniques can ensure continuous connectivity hinges on how long it takes the routing protocol, BGP, to fail over to backup routes. In Section 6.4.2, we will discuss how BGP can recover from a failure and how long it takes BGP to discover alternate routes.

6.4 Challenges in Interdomain Routing

Failures and changes in topology or routing policy are fairly common in the Internet due to various causes such as maintenance, router crashes, fiber cuts, and misconfiguration [4, 17, 18]. Ideally, when such changes occur, routing protocols should be able to react quickly to find alternate paths. However, BGP is a policy-based routing protocol, and it is not guaranteed to converge to a stable state in which all routers agree on a stable set of routes.
Persistent route oscillation can significantly degrade the end-to-end performance of the Internet. Furthermore, even if BGP converges, it is known to be slow to react to and recover from network changes. During routing convergence, there are three potential routing states from the perspective of any given router: path exploration, during which an alternate route instead of the final stable route is used; transient failures, during which there is no route to a destination but a route will eventually be discovered; and transient forwarding loops, in which routes to a destination form a forwarding loop that will eventually disappear. Path exploration does not lead to packet drops, while transient failures and transient loops do. In this section, we describe how persistent route oscillation, routing failures, and routing loops can occur.

6.4.1 Persistent Route Oscillation

The BGP routing protocol provides great flexibility in the routing policies that can be set by each AS. However, arbitrary setting of routing policies can lead to persistent route oscillation. For example, Fig. 6.11 shows the “bad gadget” example used in [9]. In this example and all of the following examples, we focus, without loss of generality, on a single destination prefix that originates from AS 0. In this example, ASes 1, 2, and 3 receive only the direct path to AS 0 and the indirect path via their clockwise neighbor, and prefer to route via their clockwise neighbor over the direct path to AS 0. For example, AS 2 receives only paths (2 3 0) and (2 0) and prefers route (2 3 0) over route (2 0).

Fig. 6.11 An example of BGP routing policy that leads to persistent route oscillation. The AS paths around a node represent a set of permissible paths, which are ordered in descending order of local preference (AS 1: (1 2 0), (1 0); AS 2: (2 3 0), (2 0); AS 3: (3 1 0), (3 0))

This routing policy configuration will lead to persistent route oscillation.
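That this configuration has no stable route assignment can be checked by brute force. The sketch below enumerates every combination of permissible paths for the bad gadget and tests whether each AS is already using its most preferred available path. The names `BAD_GADGET`, `is_stable`, and `stable_assignments` are hypothetical, and the check assumes that the direct links to AS 0 never fail. It is an illustration of the claim, not the proof given in [9].

```python
from itertools import product

# Permissible paths of the "bad gadget" (Fig. 6.11), most preferred first.
BAD_GADGET = {
    1: [(1, 2, 0), (1, 0)],
    2: [(2, 3, 0), (2, 0)],
    3: [(3, 1, 0), (3, 0)],
}

def is_stable(assignment, permissible):
    """True if every AS already uses its most preferred available path."""
    for asn, prefs in permissible.items():
        # A path is available if it is direct, or if the next-hop AS
        # currently uses exactly the remainder of the path.
        available = [p for p in prefs
                     if p[1:] == (0,) or assignment[p[1]] == p[1:]]
        if assignment[asn] != available[0]:
            return False
    return True

def stable_assignments(permissible):
    """Enumerate all path assignments and keep the stable ones."""
    ases = sorted(permissible)
    stable = []
    for combo in product(*(permissible[a] for a in ases)):
        assignment = dict(zip(ases, combo))
        if is_stable(assignment, permissible):
            stable.append(assignment)
    return stable

print(stable_assignments(BAD_GADGET))
```

For the bad gadget the list is empty: whichever of the eight assignments the ASes start from, at least one AS has an incentive to switch routes.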
In fact, it can be proved [9] that no matter what route an AS chooses initially, it will keep changing its route and never reach a stable route. For example, the following sequence of route changes shows how a persistent route oscillation can occur.

1. Initially, ASes 1, 2, and 3 choose paths (1 2 0), (2 0), and (3 0), respectively.
2. After AS 2 receives path (3 0) from AS 3, it changes from its current path (2 0) to the higher-preference path (2 3 0), which in turn forces AS 1 to change its path from (1 2 0) to (1 0) because path (1 2 0) is no longer available.
3. When AS 3 notices that AS 1 uses path (1 0), it changes its path from (3 0) to (3 1 0). This in turn forces AS 2 to change its path to (2 0).
4. After AS 2 sends path (2 0) to AS 1, AS 1 changes its path from (1 0) to (1 2 0), which in turn forces AS 3 to change its path from (3 1 0) to (3 0), and the oscillation begins again.

In practice, however, routing policies are typically set according to commercial contractual agreements between ASes. Typically, there are two types of AS relationship: provider-to-customer and peer-to-peer. In the first case, a customer pays the provider to be connected to the Internet. In the second case, two ASes agree to exchange traffic on behalf of their respective customers free of charge. Note that the contractual agreement between peering ASes typically requires that the traffic in the two directions of the peering link stay within a ratio negotiated between the peering ASes. In addition to these two common types of relationship, an AS may have a backup relationship with a neighboring AS. Having a backup relationship with a neighbor is important when an AS has limited connectivity to the rest of the Internet. For example, two ASes could establish a bilateral backup agreement to provide a connection to the Internet in case one AS’s link to its primary provider fails.

Typically, provider-to-customer relationships among ASes are hierarchical.
The hierarchical structure arises because an AS typically selects a provider with a network of larger size and scope than its own. An AS serving a metropolitan area is likely to have a regional provider, and a regional AS is likely to have a national provider. It is very unlikely that a nationwide AS would be a customer of a metropolitan-area AS.

It is common for an AS to adopt an import routing policy, referred to as the prefer customer routing policy, in which routes received from an AS’s customers are always preferred over those received from its peers or providers. Such a partial order on the set of routes is compatible with economic incentives. Each AS has an economic incentive to prefer routes via a customer link to those via peer or provider links, since it does not have to pay for the traffic via customer links. On the other hand, the AS has to pay for traffic via provider links, and traffic sent to its peer has to be “balanced out” with traffic from its peer.

It is also common for an AS to adopt an export routing policy, referred to as the no-valley routing policy, in which an AS does not announce a route from a provider or peer to another provider or peer. In Fig. 6.12 and the following examples, an arrowed line between two nodes represents a provider-to-customer relationship, with the arrow ending at the customer. A dashed line represents a peer-to-peer relationship. We visualize a sequence of customer-to-provider links as an uphill path; for example, path (1 3 5) is an uphill path. We define a sequence of provider-to-customer links as a downhill path; for example, path (5 4 1) is a downhill path. A peer-to-peer link is defined as a horizontal path. The no-valley routing policy ensures that no path contains a valley, where a downhill path is followed by either a peer-to-peer link or an uphill path, or a peer-to-peer link is followed by an uphill path or another peer-to-peer link.
That is, an AS path may take one of the following forms: (1) an uphill path followed by one or no peer-to-peer link, (2) a downhill path, (3) a peer-to-peer link followed by a downhill path, (4) an uphill path followed by a downhill path, or (5) an uphill path followed by a peer-to-peer link, followed by a downhill path. For example, in Fig. 6.12, paths (3 5 4) and (1 3 5 6 4 2) are no-valley paths while AS paths (3 1 4) and (3 1 2 6) are not no-valley paths. ASes adopt these rules since there is no economic incentive for an AS to transit traffic between its providers and peers. Note that we name it the no-valley routing policy since such an export policy ensures that no route traverses a provider-to-customer link and then a customer-to-provider link, or a provider-to-customer link and then a peer-to-peer link, or a peer-to-peer link and then another peer-to-peer link, or a peer-to-peer link and then a customer-to-provider link, all of which are valley paths if there is a hierarchical structure in provider-to-customer relationships.

Fig. 6.12 Paths (3 5 4) and (1 3 5 6 4 2) are no-valley paths while AS paths (3 1 4) and (3 1 2 6) are not no-valley paths

It has been proved that under hierarchical provider-to-customer relationships, these common routing policies can indeed ensure route convergence [8]. Furthermore, these policies ensure route convergence under router or link failures and under changes in routing policy. Note that each AS can configure its routers with the prefer customer routing policy without knowing the policies applied in other ASes. Each AS therefore has an economic incentive to follow the prefer customer routing policy, and it is practical to implement the policy since ASes can set their routing policies without coordinating with other ASes.
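The no-valley condition amounts to a simple rule over the sequence of link types: a path may climb, then cross at most one peer-to-peer link, then descend, and it may never climb or cross a peering link after that. The sketch below encodes this rule. The function names are hypothetical, and the relationship sets are assumptions read off the example paths discussed for Fig. 6.12 (in particular, the 1-2 peering and the 6-to-2 provider link are guesses, since the figure itself is not reproduced here).

```python
# Assumed relationships for illustration (provider, customer) pairs and peerings.
PROVIDER_TO_CUSTOMER = {(3, 1), (5, 3), (5, 4), (4, 1), (6, 4), (4, 2), (6, 2)}
PEER_TO_PEER = {frozenset((5, 6)), frozenset((1, 2))}

def link_type(a, b):
    """Classify the link traversed from AS a to AS b."""
    if (b, a) in PROVIDER_TO_CUSTOMER:
        return "up"      # customer-to-provider
    if (a, b) in PROVIDER_TO_CUSTOMER:
        return "down"    # provider-to-customer
    if frozenset((a, b)) in PEER_TO_PEER:
        return "peer"
    raise ValueError(f"no known relationship between AS {a} and AS {b}")

def is_no_valley(path):
    """Check that a path has the form: uphill*, at most one peer link, downhill*."""
    phase = 0  # 0 = still climbing, 1 = peer link taken, 2 = descending
    for a, b in zip(path, path[1:]):
        kind = link_type(a, b)
        if kind == "up":
            if phase != 0:
                return False          # climbing after a peer or downhill link
        elif kind == "peer":
            if phase != 0:
                return False          # second peer link, or peer after downhill
            phase = 1
        else:
            phase = 2                 # downhill is always allowed
    return True

for p in [(3, 5, 4), (1, 3, 5, 6, 4, 2), (3, 1, 4), (3, 1, 2, 6)]:
    print(p, is_no_valley(p))
```

With the assumed relationships, the first two paths are accepted and the last two are rejected, matching the classification given in the text.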
In addition to local preference settings, it has been observed that certain iBGP configurations may result in persistent route oscillation [2, 10]. Figure 6.13 shows an example of a route reflector and policy configuration that can lead to persistent route oscillation. AS 1 consists of two route reflectors, A and B. A has two clients, C1 and C2, while B has one client, C3. The IGP cost of the link between two nodes is indicated beside the link, and the MED value of each route is indicated in parentheses. It can be proved that no matter what the initial route is for each router, it is not possible for the routers to reach a stable route. As an example, we show below a possible sequence of route changes that leads to persistent oscillation.

Fig. 6.13 An example route reflector configuration that leads to persistent oscillation

1. Route reflector A selects path p2 and route reflector B selects path p3.
2. Route reflector A receives p3 and selects p1 because p3 has a lower MED than p2 and p1 has a lower IGP metric than p3.
3. Route reflector B receives p1, selects p1 as the best path (due to a lower IGP cost), and withdraws p3.
4. Route reflector A selects p2 over p1 (due to a lower IGP cost) and withdraws p1.
5. Route reflector B selects p3 over p2 (due to a lower MED). Now both A and B have returned to their initial routes.

One of the reasons that this route reflector configuration can lead to persistent route oscillation is that MED is compared only among routes learned over links to the same neighboring AS. It is possible to enforce a rule that MED is always compared, even among routes learned over links to different ASes. Other guidelines have also been proposed to prevent route reflector configurations from causing persistent oscillation. These guidelines include exploiting the hierarchical structure of the route reflector configuration [10], similar to that proposed in [8].
That is, if a route reflector configuration ensures that a route reflector chooses a route from its client over one from another route reflector (e.g., through IGP cost settings), then it can ensure route convergence.

6.4.2 Transient Routing Failures

Even when BGP eventually converges to a set of stable routes, network failures, maintenance events, and router configuration changes can cause BGP to reconverge. Ideally, when such an event occurs, routing protocols should be able to react quickly to find alternate paths. However, BGP is known to be slow in reacting to and recovering from network events. Previous measurement studies have shown that BGP may take tens of minutes to reach a consistent view of the network topology after a failure [17–19]. During the convergence period, a router might hold routing information that lags behind the state of the network. For example, a router may eventually discover an alternate path when one of the links in its original path fails, yet during the discovery process it might lose all of its paths before the alternate path is discovered. Such a transient loss of reachability is referred to as a transient routing failure.

Figure 6.14 shows an example of a policy configuration and link failure scenario that can lead to a transient routing failure. In this example, AS 1 and AS 2 are providers of AS 3, AS 0 is a customer of AS 1, and AS 1 is a peer of AS 2. Note that the import and export policies are realistic in the sense that they follow the prefer-customer and no-valley routing policies. When the link between AS 3 and AS 0 fails, AS 3 temporarily loses its connection to the destination AS 0. AS 3 has to send a withdrawal message to cause its neighbor AS 1 to select a new best path. Before AS 3 receives the new path from AS 1, it will experience a transient loss of reachability to AS 0. In addition, the timing of sending withdrawal and announcement messages is determined by the expiration of MRAI timers, which can take from several seconds to tens of seconds. During this period, all packets destined to AS 0 at AS 3 will be dropped.

Fig. 6.14 An example illustrating routing failure at AS 3. The text around a node represents a set of permissible paths and their ordering in local preference (higher preference first)

Fig. 6.15 Transient routing failures take place in a typical eBGP system. The AS paths around a node represent a set of permissible paths, which are ordered in descending order of local preference

When ASes follow the prefer-customer and no-valley routing policies, as is typical, they are quite likely to experience transient failures. In fact, when an event causes an AS to change from a customer route to a provider route and all of its providers use it to reach a destination, the AS will definitely experience a transient failure. This is because the AS has to withdraw the customer route first before its provider can discover an alternate path and send that path to it. Please refer to [30] for a proof. Figure 6.15 shows an example to illustrate this point. Suppose that before the link between AS 1 and AS 0 fails, AS 1, AS 3, and AS 6 all have only one path, via their customers, to reach the destination. When the link failure occurs, these ASes will experience a transient failure before they can learn the route via their providers. AS 2 may experience the failure (depending on whether the withdrawal from AS 6 is suppressed by the MRAI timer), but AS 7 does not experience any transient routing failure.

In the previous section, we showed that multihoming technology can provide redundant underlying connections.
Here, we use several examples to discuss whether BGP can fully exploit the redundancy to recover quickly from failures. In fact, BGP fails to take advantage of this redundancy to provide a high degree of path diversity. The reason lies in the iBGP configuration. A typical hierarchical iBGP system consists of a core of fully meshed core routers, i.e., route reflectors, and edge routers that are the clients of the relevant route reflectors. Transient routing failures can occur within a hierarchical iBGP system.

Figure 6.16 shows an example that illustrates how routing failures can occur due to iBGP configuration. A multihomed AS, AS 0, has two providers: AS 1 and AS 2. AS 1 can reach a destination originated at AS 0 via one of two access routers, AR1 or AR2. According to the prefer-customer routing policy, the path via AR1 is assigned a higher local preference value than the path via AR2. As a result, all routers inside AS 1 except the access router AR2 will use the path via AR1 to reach the destination. Once the link between AR1 and AS 0 fails, all routers except AR2 might experience transient routing failures before failing over to the path via AR2.

Fig. 6.16 An AS with a hierarchical iBGP configuration can experience transient failures

Fig. 6.17 An AS with multiple connections to a destination prefix can experience transient failures

Our second example, shown in Fig. 6.17, illustrates the reliability issue for an AS with multiple connections to a single provider. In this example, AS 0 has two connections to AS 1. Suppose that AS 0 considers the connection via AS 1’s AR1 as the primary link, and the other connection via AR2 as the backup link. Suppose that AS 0 uses AS path prepending to implement this configuration.
AS 0’s BoR2 advertises its prefix with AS path (0 0 0). As a result, all routers inside AS 1 except router AR2 have only a single route to reach the destination. If the link between AS 0’s BoR1 and AS 1’s AR1 fails, all routers within AS 1 except AR2 will experience transient failures.

Our third example, shown in Fig. 6.18, illustrates the reliability issue for an AS with multiple geographically separate connections to a single provider. In this example, we assume that AS 0 considers the connection via AS 1’s AR2 as the primary link, and the connection via AR1 as the backup link. As in the previous example, suppose that AS 0 uses AS path prepending to implement this configuration. As a result, all routers inside AS 1 except router AR1 have only a single route to reach the destination. If the link between AS 0’s BoR2 and AS 1’s AR2 fails, all routers within AS 1 except AR1 will experience transient failures.

Fig. 6.18 An AS with geographical connections to a destination prefix can experience transient failures

Our last example shows that load balancing can avoid transient routing failures. In Fig. 6.19, AS 0’s inbound traffic is distributed over the two connections through hot-potato routing. That is, the backbone routers within AS 1 select the best route according to the IGP costs to the egress routers, AR1 and AR2. As a result, all backbone routers have two different routes to reach the destination. This configuration can avoid single points of failure for backbone routers and for the links between AS 1 and AS 0.

Fig. 6.19 Load balancing configuration can avoid transient failures

Fig. 6.20 A transient failure experienced by router RT1 when the link between AS 0 and AS 1 is added or recovered

So far we have focused on scenarios in which a route is lost. In fact, it is also possible to experience transient routing failures when gaining a route. For example, Fig.
6.20 shows a scenario where a router can experience a transient routing failure due to iBGP configuration. In this example, AS 1 and AS 2 are providers of AS 0, and AS 1 and AS 2 have a peer-to-peer relationship. When the link between AS 1 and AS 0 is added or recovers from a failure, AS 1 prefers the direct path to destination AS 0. Before the link is recovered, all routers within AS 1 select the path via AS 2 as their best path. After the recovery event, all routers within AS 1 use the path through the recovered link. During the route convergence process, router RT3 first selects the direct path to AS 0 and then sends the new route to routers RT2 and RT1. Once router RT2 receives the direct route from router RT3, it selects that route and withdraws its route through AS 2 from router RT1, since it cannot announce its currently selected route, learned via router RT3, to router RT1 (a router in a fully meshed iBGP configuration does not re-advertise a route learned from one iBGP peer to another). If router RT1 receives the withdrawal message from router RT2 before receiving the announcement message from router RT3, it will experience a transient routing failure.

6.4.3 Transient Routing Loops

During the route convergence process, it is possible to have not only transient routing failures but also transient routing loops. A topology or routing policy change can lead the routers to recompute their best routes and update their forwarding tables. During this process, the routers can be in an inconsistent forwarding state, causing transient routing loops. Measurement studies have shown that transient loops can last for more than several seconds [13, 29, 31].

Fig. 6.21 An example of transient routing loop between AS 2 and AS 3. The list of AS paths shown beside each node is the set of permissible paths for the node, and the permissible paths are ordered in descending order of local preference
Figure 6.21 shows a scenario where a transient routing loop can occur. In this example, when the link between AS 1 and AS 0 fails, AS 2 and AS 3 receive a withdrawal message from AS 1. These two ASes will each select the path via the other to reach the destination because the local preference value of a path via a peer is higher than that of a path via a provider. As a result, there is a routing loop. After AS 2 and AS 3 exchange their new routes, AS 2 will remove the path from AS 3 and select the path from AS 4 as the best path. Finally, all ASes will use the path via AS 4.

6.5 Impact of Transient Routing Failures and Loops on End-to-End Performance

In this section, we aim to understand the impact that transient routing failures and loops have on end-to-end path performance. We describe an extensive measurement study that involves both controlled routing updates of a prefix and active probes from a diverse set of end hosts to the prefix.

6.5.1 Controlled Experiments

The infrastructure for the controlled experiments is shown in Fig. 6.22. It includes a BGP Beacon prefix from the Beacon routing experiment infrastructure [21]. The BGP Beacon is multihomed to two tier-1 providers, which we refer to as ISP 1 and ISP 2. We control routing events by injecting well-designed routing updates from the BGP Beacon at scheduled times to emulate link failures and recoveries. To understand the impact of routing events on data plane performance, we select geographically and topologically diverse probing locations from the PlanetLab experiment testbed [25] to conduct active probing while routing changes are in effect.

Fig. 6.22 Measurement infrastructure

Fig. 6.23 Time schedule (GMT) for injecting routing events from BGP beacon

Every 2 hours, the BGP Beacon sends a route withdrawal or announcement to one or both providers according to the time schedule shown in Fig. 6.23.
Each circle denotes a state, indicating the providers offering transit service to the Beacon. Each arrow represents a routing event and state transition, marked by the time at which the routing event (either a route announcement or a route withdrawal) occurs. For example, at midnight the Beacon withdraws the route through ISP 1, and at 2:00 a.m., the Beacon announces the route through ISP 1. There are 12 routing events every day. Only eight routing events keep the Beacon connected to the Internet; the other four serve the purpose of resetting the Beacon connectivity. These eight beacon events are classified into two categories: failover beacon events and recovery beacon events. In a failover beacon event, the Beacon changes from the state of using both providers to the state of using only a single provider. In a recovery beacon event, the Beacon changes from the state of using a single provider for connectivity to the state of using both providers. These two classes of routing changes emulate the control plane changes that a multihomed site may experience in terms of losing and restoring a link to one or more of its providers. For example, between midnight and 2:00 a.m., the BGP Beacon is connected only to ISP 2; at 2:00 a.m., it announces the Beacon prefix to ISP 1, leading to connectivity to both ISPs. This event emulates a link recovery event. At 4:00 a.m., the Beacon sends a withdrawal to ISP 1 so that the Beacon is connected only to ISP 2. This event emulates a failover event.

A set of geographically diverse sites in the PlanetLab infrastructure probe a host within the Beacon prefix by using three probing methods: UDP packet probing, ping, and traceroute. Probing is performed every hour, both at hours with injected routing events and at hours without routing events, so as to calibrate the results.
At every hour, every probing source sends a stream of UDP packets marked with sequence numbers to the BGP Beacon host at 50 ms intervals. The probe starts 10 min before each hour and ends 10 min after that hour (i.e., the probing duration is 20 min for each hour). Upon the arrival of each UDP packet, the Beacon host records the timestamp and sequence number of the packet. In addition, ping and traceroute are sent from the probe hosts toward the Beacon host to measure round-trip time (RTT) and IP-level path information during the same 20 min period. Each ping or traceroute probe is started as soon as the previous one completes. Thus, their probing frequency is limited by the round-trip delay and the probe response time from routers.

6.5.2 Overall Packet Loss

In this section, we present data plane performance during failover and recovery beacon events. Packet loss and loss burst length are used to measure the impact of routing events on end-to-end path performance. We refer to a series of consecutively lost packets during a routing event as a loss burst. Since several loss bursts can be observed during a routing event, we take the loss burst length of the event to be the maximum number of consecutive lost packets, i.e., the length of the longest burst, which represents the worst-case scenario during the event.

Figure 6.24a shows the number of loss bursts over all probing hosts during failover beacon events for the entire duration of measurement. The x-axis represents the start time of a loss burst, measured in seconds relative to the injection of withdrawal messages. We observe that the majority of loss bursts occur right after time 0, i.e., the time when a withdrawal message is advertised. Figure 6.24b shows the number of loss bursts during recovery beacon events across all probe hosts undergoing path changes. We observe that loss bursts occur right after time 0, and can last for 10 s.
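The loss-burst metric defined above can be made concrete with a short sketch. It is not the authors' tooling: it assumes that probes are numbered from 0 at the start of the 20 min window, that the receiver log is simply the list of sequence numbers that arrived, and that the routing event is injected 10 min into the window. All function and constant names are illustrative.

```python
from typing import List, Tuple

PROBE_INTERVAL_MS = 50             # one UDP probe every 50 ms (from the text)
WINDOW_MS = 20 * 60 * 1000         # 20 min probing window (from the text)
EVENT_OFFSET_MS = 10 * 60 * 1000   # assumption: event injected mid-window

# Number of probes one source sends per window under these settings.
PROBES_PER_WINDOW = WINDOW_MS // PROBE_INTERVAL_MS


def loss_bursts(received: List[int], total: int) -> List[Tuple[int, int]]:
    """Return (first_lost_seq, length) for each run of consecutive lost probes.

    `received` holds the sequence numbers logged by the Beacon host and
    `total` is the number of probes sent (sequence numbers 0..total-1).
    """
    seen = set(received)
    bursts = []
    run_start = None
    for seq in range(total):
        if seq not in seen:
            if run_start is None:
                run_start = seq            # a new burst begins here
        elif run_start is not None:
            bursts.append((run_start, seq - run_start))
            run_start = None
    if run_start is not None:              # burst runs to the end of the window
        bursts.append((run_start, total - run_start))
    return bursts


def loss_burst_length(bursts: List[Tuple[int, int]]) -> int:
    """Worst case for the event: the longest run of consecutive lost probes."""
    return max((length for _, length in bursts), default=0)


def start_time_ms(first_lost_seq: int) -> int:
    """Burst start in milliseconds relative to the (assumed) event time."""
    return first_lost_seq * PROBE_INTERVAL_MS - EVENT_OFFSET_MS


if __name__ == "__main__":
    toy_log = [0, 1, 2, 5, 6, 9]           # toy receiver log for 11 probes
    toy_bursts = loss_bursts(toy_log, total=11)
    print(toy_bursts, loss_burst_length(toy_bursts))
```

Under these assumptions a source sends 24,000 probes per window, so a burst of several hundred packets corresponds to tens of seconds without connectivity.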
Figure 6.25a shows the distributions of loss burst length before, during, and after a path change for failover beacon events. The x-axis is shown in log scale. We find that a loss burst during a path change can contain as many as 480 consecutive packets. Compared with the loss burst length during a path change, the loss burst lengths before and after a path change are quite short. Figure 6.25b shows the loss burst length during recovery beacon events. We observe that the loss burst length during a routing change does not show a significant difference compared with those before or after the routing change. In addition, loss burst length can be as long as 140 packets for recovery beacon events. Such loss is most likely caused by routing failures.

Fig. 6.24 Number of loss bursts starting at each second, plotted against starting time (seconds): (a) Failover, (b) Recovery [31] (Copyright 2006 Association for Computing Machinery, Inc. Reprinted by permission)

Fig. 6.25 The cumulative distribution of loss burst length before, during, and after a path change: (a) Failover, (b) Recovery [31] (Copyright 2006 Association for Computing Machinery, Inc. Reprinted by permission)

6.5.3 Packet Loss Due to Transient Routing Failures or Loops

From the measurement results, we see that many packet loss bursts occur during both types of events. Packet loss can be attributed to network congestion or routing failures. In order to identify routing failures, ICMP response messages, as measured by traceroutes and pings, are used.
After the loss bursts are derived, unreachable responses from traceroutes and pings are correlated with them. Since hosts in PlanetLab are NTP time synchronized, the loss bursts are correlated with ICMP messages using the time window [-1 s, 1 s]. When a router does not have a route entry for an incoming packet, it will send an ICMP network unreachable error message back to the source to indicate that the destination is unreachable, if it is allowed to do so. Based on the ICMP response message, we can determine when and which router does not have a route entry to the Beacon host. Loss bursts that have corresponding unreachable ICMP messages are attributed to routing failures. In addition, if a packet is trapped in a forwarding loop, its TTL value will decrease until the value reaches 0 at some router. The router will send a “TTL exceeded” message back to the source. Thus, from traceroute data, we can observe forwarding loops.

Table 6.2 shows the number of failover beacon events, the number of loss bursts, and the number of lost packets that can be verified as caused by routing failures or loops. We verify that 23% of the loss bursts, corresponding to 76% of lost packets, are caused by routing failures or loops. We are unable to verify the remaining 77% of loss bursts, which correspond to only 24% of packet loss. These loss bursts may be caused either by congestion or by routing failures for which traceroute or ping is not sufficient for verification (due to either insufficient probe frequency or lack of ICMP messages). Similar to our analysis of failover events, we correlate ICMP unreachable messages with loss bursts occurring during recovery events. Table 6.3 shows that 26% of packet loss is verified to be caused by routing failures or loops.
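The correlation step can be sketched as follows. This is a minimal illustration, not the study's code. It assumes that each loss burst is given as a (start, end) pair of timestamps in seconds, that the traceroute and ping output has already been reduced to (timestamp, kind) events with kind equal to "unreachable" or "loop", and that an event matches a burst when it falls within 1 s of the burst interval. The precedence used when both kinds match is an arbitrary choice made for this sketch.

```python
from typing import List, Tuple

WINDOW_S = 1.0                   # the [-1 s, 1 s] correlation window from the text

Burst = Tuple[float, float]      # (time of first lost probe, time of last lost probe)
IcmpEvent = Tuple[float, str]    # (timestamp, "unreachable" or "loop"), hypothetical format


def classify_burst(burst: Burst, events: List[IcmpEvent],
                   window: float = WINDOW_S) -> str:
    """Attribute a loss burst to a cause using nearby ICMP evidence."""
    start, end = burst
    # Collect the kinds of all events that fall inside the widened interval.
    kinds = {kind for ts, kind in events
             if start - window <= ts <= end + window}
    if "unreachable" in kinds:   # assumption: unreachable takes precedence
        return "routing failure"
    if "loop" in kinds:
        return "routing loop"
    return "unknown"             # congestion, or a failure with no ICMP evidence


if __name__ == "__main__":
    evidence = [(100.4, "unreachable"), (250.0, "loop")]
    for b in [(99.8, 100.0), (249.5, 251.0), (10.0, 10.5)]:
        print(b, classify_burst(b, evidence))
```

A burst with no matching event stays "unknown", which is why filtered ICMP traffic leads to an underestimate rather than a misattribution.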
Since routers in the Internet may filter out ICMP packets, it is possible that some loss bursts do not have corresponding ICMP messages even if they are caused by routing failures or routing loops. As a result, the number of loss bursts caused by routing failures or routing loops might be larger than what our methodology can identify.

Table 6.2 Overall packet loss caused by routing failures or loops during failover events

Causes            Failover beacon events   Loss bursts    Lost packets
Routing failures  451 (38%)                607 (16%)      37,751 (42%)
Routing loops     208 (18%)                239 (7%)       30,592 (34%)
Unknown           539 (44%)                2,875 (77%)    21,948 (24%)

Table 6.3 Packet loss caused by routing changes during recovery events

Causes            Recovery beacon events   Loss bursts    Lost packets
Routing failures  17 (5%)                  39 (2%)        480 (11%)
Routing loops     24 (7%)                  37 (2%)        640 (15%)
Unknown           290 (88%)                1,714 (96%)    3,266 (74%)

We measure the duration of a loss burst as the time interval between the latest packet received before the loss and the earliest packet received after the loss. Figure 6.26a shows the duration of loss bursts that can and cannot be verified as caused by routing failures or routing loops during failover events. We observe that the loss bursts that are verified as caused by routing failures or routing loops last longer than the unverified loss bursts. Figure 6.26b further shows that loss bursts caused by routing loops last longer than those caused by routing failures. Figure 6.27a shows the cumulative distribution of the duration of loss bursts that are verified and unverified as caused by routing failures or routing loops during recovery events. We observe that verified loss bursts on average are longer than those unverified.
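The duration definition above translates directly into a few lines of code. The sketch below assumes a receiver log of (sequence number, arrival time in milliseconds) pairs, sorted by sequence number and free of reordering. The log format and the function names are hypothetical.

```python
from typing import List, Tuple


def burst_durations(arrivals: List[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Return (first_lost_seq, lost_count, duration_ms) for every loss burst.

    The duration is the gap between the last packet received before the
    burst and the first packet received after it.
    """
    bursts = []
    for (prev_seq, prev_t), (next_seq, next_t) in zip(arrivals, arrivals[1:]):
        lost = next_seq - prev_seq - 1     # probes missing between two arrivals
        if lost > 0:
            bursts.append((prev_seq + 1, lost, next_t - prev_t))
    return bursts


def fraction_shorter_than(durations_ms: List[int], limit_ms: int) -> float:
    """One point of an empirical CDF: share of bursts shorter than the limit."""
    if not durations_ms:
        return 0.0
    return sum(d < limit_ms for d in durations_ms) / len(durations_ms)


if __name__ == "__main__":
    toy_log = [(0, 0), (1, 50), (4, 500), (5, 550)]   # probes 2 and 3 were lost
    print(burst_durations(toy_log))
    print(fraction_shorter_than([450, 5200, 1000, 300], 5000))
```

Because the duration is bounded by received packets on both sides, a burst that runs past the end of the probing window has no defined duration in this sketch and is simply not reported.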
In addition, during recovery events, more than 98% of routing failures or routing loops last less than 5 seconds, while during failover events, about 80% of routing failures or routing loops last less than 5 seconds, as shown in Fig. 6.26. This means that loss bursts caused by routing failures during recovery events are much shorter than those during failover events. We also observe that unverified loss bursts last less than 4 seconds. Figure 6.27b shows the duration of verified loss bursts that are caused by routing failures and loops during recovery events. We observe that 57% of packet loss is due to forwarding loops, which is slightly higher than that for failover events (47%). This implies that forwarding loops are also quite common during recovery events.

Fig. 6.26 Duration for verified vs. unverified loss bursts during failover events, shown as CDFs over duration (seconds): (a) Loss bursts verified vs. unverified, (b) Routing loops vs. routing failures [31] (Copyright 2006 Association for Computing Machinery, Inc. Reprinted by permission.)

Fig. 6.27 Duration of verified loss bursts during recovery events, shown as CDFs over duration (seconds): (a) Loss bursts verified vs. unverified, (b) Routing loops vs. routing failures [31] (Copyright 2006 Association for Computing Machinery, Inc. Reprinted by permission.)

6.6 Research Approaches

We have seen from the measurement study in the previous section that routing failures and routing loops contribute significantly to degraded end-to-end path performance. Several approaches have been proposed to address the problem of routing failures and routing loops.
These approaches can be broadly classified into three categories: convergence-based solutions, path protection-based solutions, and multipath-based solutions.

Convergence-Based Solutions. These approaches focus on reducing BGP convergence delay. In particular, they aim to reduce convergence delay by eliminating invalid routes quickly. Reducing convergence delay may indirectly shrink the periods of routing failures or routing loops since it takes less time to converge to a stable route.

Path Protection-Based Solutions. These approaches focus on preestablishing recovery paths before potential network events. These preestablished paths supplement the best path selected by BGP. When there is a routing outage, the recovery path is used to route traffic. The recovery path could be a preestablished protection tunnel, or an alternate AS path.

Multipath-Based Solutions. The goal of these approaches is to exploit path diversity to provide fault tolerance. To increase path diversity, multipath routes are discovered. For example, multiple routing trees can be created on the same underlying topology. When one of the routes fails, other routes can be probed and then used, if valid, to route traffic.

6.6.1 Convergence-Based Solutions

BGP is a path vector protocol. Each BGP-speaking router has to rely on its neighbors’ announcements to select its best route. Since a BGP-speaking router does not have topology information, it is possible that an AS explores many AS paths before eventually reaching the final stable path. Figure 6.28 shows an example of the path exploration process during BGP convergence. Suppose the link between AS 1 and AS 0 fails. This failure event makes the destination unreachable at each AS. We refer to this type of event as a fail-down event. The following potential sequence of route changes shows how path exploration can occur.

Fig. 6.28 An example of path exploration during BGP convergence.
The list of AS paths shown beside each node is the set of permissible paths for the node, and the permissible paths are ordered in the descending order of local preference. (Permissible paths in the figure: AS 4: (4 3 1 0), (4 2 1 0), (4 2 3 1 0); AS 2: (2 1 0), (2 3 1 0), (2 4 3 1 0); AS 3: (3 1 0), (3 2 1 0), (3 4 2 1 0); AS 1: (1 0).)

1. AS 1 sends a withdrawal message to AS 2 and AS 3, respectively.
2. When AS 2 receives the withdrawal, it removes path (2 1 0) from its routing table, selects path (2 3 1 0) as its new best path, and advertises the new path to all neighbors.
3. After AS 3 receives the withdrawal from AS 1, it will use path (3 2 1 0), and advertise it to its neighbors.
4. When AS 2 and AS 3 learn the new paths (2 3 1 0) and (3 2 1 0) from each other, they will remove their best paths, and use path (2 4 3 1 0) and path (3 4 2 1 0), respectively.
5. Since both AS 2 and AS 3 use the paths from AS 4, they will send AS 4 withdrawal messages to withdraw their previously advertised paths. As a result, AS 4 loses all its paths, and sends a withdrawal message to AS 2 and AS 3, respectively.
6. After AS 2 and AS 3 receive the withdrawals from AS 4, their routing tables do not have any route to the destination.

This example shows that each node literally has to try several AS paths that traverse the failed link or node before it finally chooses the best valid path or determines that there is no valid path. For instance, AS 2 might explore the sequence of AS paths (2 1 0) → (2 3 1 0) → (2 4 3 1 0) before it removes all paths from its routing table. Previous measurement studies have shown that BGP may take tens of minutes to reach a consistent view of the network topology after a failure [17–19]. Note that although this example shows a fail-down scenario, it can be extended to a fail-over scenario in which an AS has to explore many invalid paths before settling on a stable valid path.

Several solutions have been proposed to rapidly indicate and remove invalid routes to suppress the exploration of obsolete paths [5, 7, 23, 24].
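The exploration sequence at AS 2 in the six steps above can be traced with a small sketch. It is a deliberately simplified model rather than a BGP implementation: a node holds its permissible paths in preference order, learns one at a time that a path is gone, and always falls back to the most preferred path it still believes in. The function names and the representation of paths as tuples are illustrative assumptions.

```python
from typing import List, Optional, Tuple

Path = Tuple[int, ...]

# Permissible paths of AS 2 in Fig. 6.28, most preferred first.
AS2_PATHS: List[Path] = [(2, 1, 0), (2, 3, 1, 0), (2, 4, 3, 1, 0)]
FAILED_LINK = (1, 0)


def uses_link(path: Path, link: Tuple[int, int]) -> bool:
    """True if the path traverses the given directed AS-level link."""
    return any((a, b) == link for a, b in zip(path, path[1:]))


def exploration_sequence(permissible: List[Path],
                         invalidated_in_order: List[Path]) -> List[Optional[Path]]:
    """Best path after each withdrawal; None means no route is left.

    `permissible` must be non-empty and ordered by descending preference.
    """
    available = list(permissible)
    sequence: List[Optional[Path]] = [available[0]]
    for path in invalidated_in_order:
        available.remove(path)             # the node learns this path is gone
        sequence.append(available[0] if available else None)
    return sequence


if __name__ == "__main__":
    # Withdrawals reach AS 2 in preference order, as in steps 2, 4 and 6.
    for best in exploration_sequence(AS2_PATHS, AS2_PATHS):
        print(best)
    # Every explored path already contained the failed link.
    print(all(uses_link(p, FAILED_LINK) for p in AS2_PATHS))
```

Every path AS 2 tries traverses the failed link (1 0), so a node that knew the location of the failure could have discarded all three at once. That observation motivates the proposals discussed next.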
Consistency Assertions (CA) [24] tries to achieve this goal by examining path consistency based solely on the AS path information carried in BGP announcements. Suppose that an AS has learned two paths to a destination from neighbor N1 and neighbor N2, respectively. N1 advertises path (N1 A B C 0) and N2 advertises path (N2 B X Y 0). CA assumes that each AS can only use one path. Thus, by comparing these two paths, it can detect that the two paths attributed to AS B, (B C 0) and (B X Y 0), are not consistent. We use the example shown in Fig. 6.28 to show how an AS can take advantage of consistency checking to accelerate route convergence. A router can use a withdrawal received directly from a neighbor to check path consistency. When the link between AS 1 and AS 0 fails, AS 1 sends withdrawals to AS 2 and AS 3. Once AS 2 and AS 3 notice that their neighbor AS 1 has withdrawn its path to the destination, they check whether AS 1 appears in any existing path. Since the two paths (2 3 1 0) and (2 4 3 1 0) contain path (1 0), neither can be selected and AS 2 removes them from its routing table. Similarly, AS 3 removes paths (3 2 1 0) and (3 4 2 1 0). Eventually, AS 2 and AS 3 will withdraw their paths to the destination. As a result, CA eliminates the paths that would otherwise be explored.

However, AS path consistency alone might not provide sufficient information about invalid paths. It is hard to accurately detect invalid routes based solely on the AS path information. For example, in Fig. 6.28, after AS 2 and AS 3 receive the withdrawals sent by AS 1 due to the failure of link (1 0), AS 2 and AS 3 send withdrawals to AS 4 since all of their paths go through AS 1. Now suppose that AS 2’s withdrawal reaches AS 4 before AS 3’s does. In this case, AS 4 cannot consider path (4 3 1 0) as an invalid path since the path does not contain the withdrawn path (2 1 0).
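Both checks described for CA can be sketched in a few lines. This is a simplified reading of the text, not the algorithm of [24]: paths are tuples of AS names, a withdrawn path invalidates any known path that contains it as a contiguous segment, and two announcements are inconsistent at an AS when the remainder of the path from that AS onward differs. The helper names are hypothetical.

```python
from typing import List, Set, Tuple

Path = Tuple[str, ...]


def contains_subpath(path: Path, sub: Path) -> bool:
    """True if `sub` occurs in `path` as a contiguous segment."""
    n = len(sub)
    return any(path[i:i + n] == sub for i in range(len(path) - n + 1))


def inconsistent_ases(p1: Path, p2: Path) -> Set[str]:
    """ASes that appear on both paths but continue to the destination differently."""
    conflicts = set()
    for i, asn in enumerate(p1):
        if asn in p2:
            j = p2.index(asn)
            if p1[i:] != p2[j:]:           # same AS, two different onward paths
                conflicts.add(asn)
    return conflicts


def surviving_paths(paths: List[Path], withdrawn: Path) -> List[Path]:
    """Paths that remain eligible after a neighbor withdraws `withdrawn`."""
    return [p for p in paths if not contains_subpath(p, withdrawn)]


if __name__ == "__main__":
    print(inconsistent_ases(("N1", "A", "B", "C", "0"), ("N2", "B", "X", "Y", "0")))
    # AS 2 after AS 1 withdraws (1 0): both alternatives are discarded.
    print(surviving_paths([("2", "3", "1", "0"), ("2", "4", "3", "1", "0")], ("1", "0")))
    # AS 4 after AS 2 withdraws (2 1 0): path (4 3 1 0) cannot be ruled out.
    print(contains_subpath(("4", "3", "1", "0"), ("2", "1", "0")))
```

The last call returns False, which is the limitation raised in the text: the withdrawn path (2 1 0) says nothing about (4 3 1 0).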
AS 4 cannot determine if the withdrawal of path (2 1 0) is due to the failure of link (2 1) or link (1 0).

To remove invalid paths quickly, Ghost Flushing [5] aggressively sends explicit withdrawals, thereby reducing convergence delay. Whenever an AS’s current best path is replaced by a less preferred route, Ghost Flushing allows the AS to immediately generate and send explicit withdrawal messages to all its neighbors before sending the new path. The withdrawal messages flush out the path previously advertised by the AS. For example, in Fig. 6.28, after AS 2 receives the withdrawal sent by AS 1 due to the failure of link (1 0), AS 2 will use the less preferred path (2 3 1 0). Before sending the path (2 3 1 0) to its neighbors, AS 2 sends extra withdrawal messages to its neighbors AS 3 and AS 4. Because BGP withdrawal messages are not subject to the MRAI timer, invalid paths can potentially be deleted quickly at the AS’s neighbors. For example, the withdrawal sent by AS 2 will help AS 3 to remove the invalid path (3 2 1 0). From this example, we see that Ghost Flushing does not really prevent path exploration, but instead attempts to speed up the process.

To further identify invalid routes quickly, additional information can be incorporated into BGP route updates. BGP-RCN and EPIC [7, 23] propose to use location information about failures, or root cause information, to identify invalid routes. When a link failure occurs, the nodes adjacent to the link will detect the change. Such a node, referred to as the root cause node (RCN), will attach its name to the routing update it sends out. The RCN is propagated to other ASes along each impacted path. Thus, an AS can use the RCN to remove all the invalid paths at once. Figure 6.28 can be used to illustrate the basic idea of BGP-RCN. When the link between AS 1 and AS 0 fails, root cause notification is sent with a withdrawal by AS 1.
When AS 2 receives the withdrawal, it uses the root cause notification to find invalid paths that contain AS 1. Thus, path (2 3 1 0) is considered an invalid path and will be removed. Similarly, at AS 3, path (3 2 1 0) is detected as an invalid route. AS 2 and AS 3 send withdrawals to AS 4, and piggyback the root cause in the withdrawals. After receiving the withdrawal messages with the root cause, AS 4 removes all its routes because all paths contain the root cause node AS 1.

EPIC [7] further extends the idea of root cause notification so that it can be applied to a router rather than an AS. In general, a failure can occur at a router or on a link between a pair of routers. A failure on a link between two ASes does not necessarily mean that all links between the two ASes fail. The root cause notification in BGP-RCN can only indicate failures of an AS or of the links between a pair of ASes. EPIC further allows routing updates to carry failure information about a router or a link between a pair of routers.

Table 6.4 Properties of convergence-based solutions. M is the MRAI timer value. n is the number of ASes in the network. D is the diameter of the network. |E| is the number of AS level links. h is the processing delay for a BGP update message to traverse an AS hop

Protocols       Convergence delay  Messages     Modification to  Modification to      eBGP  iBGP
                (fail-down)        (fail-down)  BGP messages     BGP route selection
Standard BGP    M n                |E| n        N/A              N/A                  N/A   N/A
CA              M n                |E|          No               Yes                  Yes   No
Ghost Flushing  h n                2|E|n Mh     No               Yes                  Yes   Yes
BGP-RCN         h D                |E| n C 1    Yes              Yes                  Yes   No
EPIC            h D                |E| 1        Yes              Yes                  Yes   Yes

We summarize important properties of the four approaches in Table 6.4. We consider the upper bound of convergence time and the number of messages during a fail-down event. We also compare those approaches in terms of the modifications they require to standard BGP.
For example, we consider whether an approach needs to modify BGP’s message format or BGP route selection, and whether it can be applied to eBGP or iBGP.

6.6.2 Path Protection-Based Solutions

The convergence-based approaches focus on rapidly removing invalid routes to accelerate the BGP convergence process. They are effective in reducing convergence delay. However, simply applying those methods might not necessarily lead to reliable routing. In fact, accelerating the process of identifying invalid routes might sometimes exacerbate routing outages. Figure 6.29 shows such an example. We first consider the case of running standard BGP. When the link between AS 1 and AS 0 fails, AS 1 sends a withdrawal to AS 2 and AS 3 immediately, and AS 2 sends a withdrawal to AS 3 right after. Upon receiving the withdrawal, AS 3 will quickly switch to the path (3 4 0). At the same time, when AS 2 receives the withdrawal message, it selects path (2 3 1 0). Even though this path is invalid, AS 2 still reroutes traffic to a valid next-hop AS, which has a valid path. Therefore, in this case, AS 2 can reroute traffic to the destination before it receives the valid path (3 4 0).

Fig. 6.29 An example showing transient routing failures at AS 2 when RCN is used. The list of AS paths shown beside each node is the set of permissible paths for the node, and the permissible paths are ordered in the descending order of local preference. (Permissible paths in the figure: AS 4: (4 0), (4 3 1 0), (4 3 2 1 0); AS 3: (3 1 0), (3 2 1 0), (3 4 0); AS 2: (2 1 0), (2 3 1 0), (2 3 4 0); AS 1: (1 0), (1 3 4 0), (1 2 3 4 0).)

In contrast, if the root cause information is sent with the withdrawal by AS 1, AS 2 will remove path (2 3 1 0), and temporarily lose its reachability to AS 0 until receiving the new path from AS 3. The temporary loss of reachability could last longer than in the case of standard BGP.
With root cause information, the duration for which AS 2 loses its reachability depends on the delay in obtaining the alternate path from AS 3, i.e., the time it takes to receive the announcement of path (3 4 0) from AS 3, which is subject to the MRAI timer. Without root cause information, the duration for which AS 2 loses its reachability depends on the propagation delay of the withdrawal from AS 1 to AS 2, which is not subject to the MRAI timer [26].

The path protection-based solutions are designed specifically for improving the reliability of interdomain routing. The major idea is that local protection paths are identified before failures. When the primary path fails, local protection paths are temporarily used. Many approaches have been proposed for link-state intradomain routing protocols to protect against intradomain link failures [6, 14, 16, 27, 33]. However, BGP-speaking routers do not have knowledge of the global network topology. They have routing information from neighbors only. Therefore, there are two challenges in implementing path protection in BGP: first, one needs to find local preplanned protection paths; second, one needs to decide how and when to use the protection paths. Next, we present several path protection-based approaches. We first focus on how they address the first challenge. We then discuss how they address the second challenge.

Bonaventure et al. [3] have proposed a fast reroute technique, referred to as R-Plink, to protect direct interdomain links. The basic idea is that each router precomputes a recovery path for each of its BGP peering links, which is used to reroute traffic when the protected BGP link fails. In order to discover an appropriate recovery path, each edge router inside an AS advertises its currently active eBGP sessions by using a new type of iBGP update message. After receiving the other routers’ routing information, an edge router chooses, from all recovery routes, a path to protect its currently active eBGP session.
Figure 6.30 shows an example to illustrate this approach. In this example, AS 2 advertises the same destination to AS 1’s two routers A and C. Suppose that the routing policies on AS 1 are configured to select the path via router A as the best path. However, router A cannot learn any route via router C through BGP because of the local-preference settings on this router.

Fig. 6.30 A precomputed protection path is used to protect the interdomain link between AS 1 and AS 2

To automatically discover the alternate path, routers A and C advertise their active eBGP sessions. Thus, router A will know an alternate path via routers C and E, and choose the path to protect its current path to the destination. Once the link (A D) fails, router A can forward the packets affected by the failure through the alternate path via link (C E).

In contrast to R-Plink, R-BGP aims to solve the transient routing failure problem for any interdomain link failure, not just for the failure of a directly attached interdomain link [15]. R-BGP precomputes an alternate path for each AS to protect interdomain links. In particular, an AS first checks all paths it knows, and then selects the one most disjoint from its current best path, which is defined as the failover path. Finally, the AS advertises the failover path only to the next-hop AS along its best path. Note that in standard BGP, an AS should not advertise its best path to the neighbor currently used to reach that destination, since this path would generate a loop. Advertising a failover path guarantees that, whenever a link goes down, the AS immediately upstream of the failed link knows a failover path and can avoid unnecessary packet drops. One limitation of this approach is that it is guaranteed to avoid routing failures only under hierarchical provider-customer relationships and the common routing policy, i.e., the no-valley and prefer-customer routing policy.
Further, it does not address the routing failures caused by iBGP configuration.

The Backup Route Aware Routing Protocol (BRAP) aims to achieve fast transient failure recovery while considering both eBGP routing policy and iBGP configurations [28]. To achieve this, BRAP requires that a router be able to advertise an alternate path if its best path is not allowed to be advertised due to loop prevention or routing policies. The general idea of BRAP is as follows: a router should advertise the following policy-compliant paths in addition to the best path: (1) a failover path to the next-hop router along the best path; and (2) a loop-free alternate path, defined as a temporary backup path, to its upstream neighbors. BRAP extends BGP to distribute the alternate routes along eBGP and iBGP sessions.

Table 6.5 Comparing path protection-based solutions. |E| is the number of AS level links, |E_r| is the number of router level links

Protocols  Messages    Modification to  Modification to     eBGP  iBGP
           (failover)  BGP messages     other parts of BGP
R-Plink    N/A         Yes              Yes                 Yes   Yes
R-BGP      |E|         Yes              Yes                 Yes   No
BRAP       |E_r|       Yes              Yes                 Yes   Yes

Now, we describe how to use a protection path. When a router needs to use a protection path, the router needs to inform the other routers along the path of the change. Otherwise, redirecting traffic to the protection path could cause forwarding loops. For example, in Fig. 6.30, when router A sends traffic along the alternate path via routers B and C, their routing tables still consider router A as the next hop. Protection tunnels on the data plane have been proposed to avoid such forwarding loops [3]. Protection tunnels can be implemented by using encapsulation schemes such as MPLS over IP. With MPLS over IP, only the ingress border router consults its BGP routing table to forward a packet, and it encapsulates the packet with an IP header whose destination is set to the IP address of the egress border router.
All the other routers inside the AS will rely on their IGP routing tables or their label forwarding tables to forward the packet. R-BGP utilizes “virtual” connections to avoid forwarding loops. There are two “virtual” connections between each pair of BGP-speaking routers, one for the primary path traffic, and the other for the failover traffic. The virtual connection can be implemented by using virtual interfaces when the two routers are physically connected, or MPLS or IP tunnels if they are not. Similarly, BRAP uses a protection path through MPLS or IP tunnels.

We summarize the features of the three path protection-based solutions in Table 6.5. We consider the upper bound of the number of messages during a failover event, the modifications to BGP, and whether those approaches can be applied to eBGP or iBGP.

6.6.3 Multiple Path-Based Solution

A straightforward solution to improve route reliability is to discover multiple paths. There are two proposals for multiple path interdomain routing. The first one is MIRO [32], which allows routers to advertise multiple routes to their neighbors instead of only the best one. Thus, MIRO can allow ASes to have more control over the flow of traffic in their networks, as well as enable quick reaction to path failures. The second one is Path Splicing [22], which aims to take advantage of alternate paths in the BGP routing table to discover multiple paths. Instead of using only the best path in the BGP routing table, a packet can select any path in the BGP routing table by indicating which one to use in its header. Clearly, probing has to be deployed before multiple paths can be discovered since arbitrary selection of alternate paths can lead to routing loops.

6.7 Conclusion and Future Directions

Interdomain routing is the glue that binds thousands of networks in the Internet together. Its reliability plays a decisive role in end-to-end path performance.
In this chapter, we have presented the challenges in designing and implementing a reliable interdomain routing protocol. Specifically, through measurement studies, we have given an overview of the impact of transient routing failures and transient routing loops on end-to-end path performance. Finally, we have critically reviewed the existing proposals in this field, highlighting the pros and cons of each approach.

While certain efforts have been made to enhance interdomain routing reliability, the issue remains open. We believe that the development of new routing infrastructure, for example, multipath routing, is one promising direction for future research. Reliability enhancement through multiple path advertisement is not a new idea. Many efforts have been made to extend BGP to allow the advertisement of multiple paths [12, 20]. However, designing scalable interdomain routing through multiple path advertisement is challenging. One challenge is to understand whether the degree of path diversity provided by multiple path advertisement is sufficient to overcome network failures. At the same time, this challenge highlights the need for new path diversity metrics. Metrics such as the number of node-disjoint and link-disjoint paths can be used to compute inter-AS path diversity. However, new path diversity metrics need to be devised that take into account performance, reliability, and stability.

Acknowledgments The authors would like to thank the editors, Chuck Kalmanek and Richard Yang, for their comments and encouragement. This work is partially supported by NSF grants CNS-0626617 and CNS-0626618.

References

1. Akella, A., Maggs, B., Seshan, S., Shaikh, A., & Sitaraman, R. (2003). A measurement-based analysis of multihoming. In Proceedings of ACM SIGCOMM, August 2003.
2. Basu, A., Ong, L., Shepherd, B., Rasala, A., & Wilfong, G. (2002). Route oscillations in I-BGP with route reflection. In Proceedings of ACM SIGCOMM.
3. Bonaventure, O., Filsfils, C., & Francois, P. (2007). Achieving sub-50 milliseconds recovery upon BGP peering link failures. IEEE/ACM Transactions on Networking (TON), 15(5), 1123–1135.
4. Boutremans, C., Iannaccone, G., Bhattacharyya, S. C., Chuah, C., & Diot, C. (2002). Characterization of failures in an IP backbone. In Proceedings of ACM SIGCOMM Internet Measurement Workshop, November 2002.
5. Bremler-Barr, A., Afek, Y., & Schwarz, S. (2003). Improved BGP convergence via ghost flushing. In Proceedings of IEEE INFOCOM 2003, vol. 2, San Francisco, CA, Mar. 30–Apr. 3, 2003, pp. 927–937.
6. Bryant, S., Shand, M., & Previdi, S. (2009). IP fast reroute using not-via addresses. draft-ietf-rtgwg-ipfrr-notvia-addresses-04.
7. Chandrashekar, J., Duan, Z., Zhang, Z. L., & Krasky, J. (2005). Limiting path exploration in BGP. In Proceedings of IEEE INFOCOM 2005, vol. 4, Miami, Florida, March 13–17, 2005, pp. 2337–2348.
8. Gao, L., & Rexford, J. (2001). A stable internet routing without global coordination. IEEE/ACM Transactions on Networking, 9(6), 681–692.
9. Griffin, T. G., & Wilfong, G. (1999). An analysis of BGP convergence properties. In Proceedings of ACM SIGCOMM, Boston, MA, September 1999, pp. 277–288.
10. Griffin, T. G., & Wilfong, G. (2002). On the correctness of IBGP configuration. In Proceedings of ACM SIGCOMM, Pittsburgh, PA, August 2002, pp. 17–29.
11. Griffin, T. G., Shepherd, B. F., & Wilfong, G. (2002). The stable paths problem and interdomain routing. IEEE/ACM Transactions on Networking (TON), 10(2), 232–243.
12. Halpern, J. M., Bhatia, M., & Jakma, P. (2006). Advertising equal cost multipath routes in BGP. draft-bhatia-ecmp-routes-in-bgp-02.txt.
13. Hengartner, U., Moon, S., Mortier, R., & Diot, C. (2002). Detection and analysis of routing loops in packet traces. In Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement, Marseille, France, pp. 107–112.
14. Iselt, A., Kirstädter, A., Pardigon, A., & Schwabe, T. (2004). Resilient routing using ECMP and MPLS. In Proceedings of HPSR 2004, Phoenix, Arizona, USA, April 2004, pp. 345–349.
15. Kushman, N., Kandula, S., Katabi, D., & Maggs, B. (2007). R-BGP: staying connected in a connected world. In 4th USENIX Symposium on Networked Systems Design & Implementation, Cambridge, MA, April 2007, pp. 341–354.
16. Kvalbein, A., Hansen, A. F., Cicic, T., Gjessing, S., & Lysne, O. (2006). Fast IP network recovery using multiple routing configurations. In Proceedings of IEEE INFOCOM, Barcelona, Spain, Mar. 2006, pp. 23–26.
17. Labovitz, C., Malan, G. R., & Jahanian, F. (1998). Internet routing instability. IEEE/ACM Transactions on Networking, 6(5), 515–528.
18. Labovitz, C., Ahuja, A., Bose, A., et al. (2001). Delayed internet routing convergence. IEEE/ACM Transactions on Networking, 9(3), June 2001, pp. 293–306.
19. Labovitz, C., Ahuja, A., Wattenhofer, R., et al. (2001). The impact of internet policy and topology on delayed routing convergence. In Proceedings of IEEE INFOCOM’01, Anchorage, AK, USA, April 2001, pp. 537–546.
20. Mohapatra, P., Fernando, R., Filsfils, C., & Raszuk, R. (2008). Fast connectivity restoration using BGP add-path. draft-pmohapat-idr-fast-conn-restore-00.
21. Morley Mao, Z., Bush, R., Griffin, T., & Roughan, M. (2003). BGP Beacons. In Proceedings of IMC, October 27–29, 2003, Miami Beach, Florida, USA, pp. 1–14.
22. Motiwala, M., Feamster, N., & Vempala, S. (2008). Path splicing. In SIGCOMM 2008, Seattle, WA, August 2008.
23. Pei, D., Azuma, M., Massey, D., & Zhang, L. (2005). BGP-RCN: improving BGP convergence through root cause notification. Computer Networks, 48(2), 175–194.
24. Pei, D., Zhao, X., Wang, L., Massey, D., Mankin, A., Wu, S. F., & Zhang, L. (2002). Improving BGP convergence through consistency assertions. In Proceedings of IEEE INFOCOM 2002, vol. 2, New York, NY, June 23–27, 2002, pp. 902–911.
25. PlanetLab, http://www.planet-lab.org
26. Rekhter, Y., Li, T., & Hares, S. (2006). A border gateway protocol 4 (BGP-4). RFC 4271.
27. Stamatelakis, D., & Grover, W. D. (2000). IP layer restoration and network planning based on virtual protection cycles. IEEE Journal on Selected Areas in Communications, 18(10), Oct. 2000, pp. 1938–1949.
28. Wang, F., & Gao, L. (2008). A backup route aware routing protocol – fast recovery from transient routing failures. In Proceedings of IEEE INFOCOM Mini-Conference, Phoenix, Arizona, April 2008.
29. Wang, F., Gao, L., Spatscheck, O., & Wang, J. (2008). STRID: Scalable trigger-based route incidence diagnosis. In Proceedings of IEEE ICCCN 2008, St. Thomas, U.S. Virgin Islands, August 3–7, 2008, pp. 1–6.
30. Wang, F., Gao, L., Wang, J., & Qiu, J. (2009). On understanding of transient interdomain routing failures. IEEE/ACM Transactions on Networking, 17(3), June 2009, pp. 740–751.
31. Wang, F., Mao, Z. M., Gao, L., Wang, J., & Bush, R. (2006). A measurement study on the impact of routing events on end-to-end internet path performance. In Proceedings of ACM SIGCOMM 2006, Pisa, Italy, September 11–15, 2006, pp. 375–386.
32. Xu, W., & Rexford, J. (2006). MIRO: multi-path interdomain routing. In Proceedings of ACM SIGCOMM 2006, Pisa, Italy, pp. 171–182.
33. Zhong, Z., Nelakuditi, S., Yu, Y., Lee, S., Wang, J., & Chuah, C.-N. (2005). Failure inferencing based fast rerouting for handling transient link and node failures. In Proceedings of IEEE Global Internet, Miami, FL, USA, Mar. 2005, pp. 2859–2863.

Chapter 7
Overlay Networking and Resiliency

Bobby Bhattacharjee and Michael Rabinovich

7.1 Introduction

An “overlay” is a coordinated collection of processes that use the Internet for communication. The overlay uses the connectivity provided by the network to form whatever overlay topology and information flows fit its application, irrespective of the topology of the underlying network infrastructure.
In a broad sense, every distributed system and application forms an overlay. Certainly, routing protocols form overlays, as does the interconnection of NNTP servers that forms the Usenet. We use the term “overlay networks” in a narrower sense: an application uses an overlay only if processes on end-hosts are used for routing and relaying messages. The overlay network is layered atop the physical network, which enables additional flexibility. In particular, the overlay topology can be tailored to application requirements (e.g., overlay topologies can be set up to provide low-latency lookup on flat name spaces), overlay routing may choose application-specific policies (e.g., overlay routing meshes can find paths that contradict the policies exported by BGP), and overlay networks can emulate functionality not supported by the underlying network (e.g., overlays can implement application-layer multicast over a unicast network).

The flexibility enabled by overlay networks can be both a blessing and a curse. On the one hand, it gives application developers the control they need to implement sophisticated measures to improve the resilience of their application. On the other hand, overlay networks are built over end-hosts, which are inherently less stable, reliable, and secure than the lower-layer network components comprising the Internet fabric. This presents significant challenges in overlay network design.

B. Bhattacharjee
Department of Computer Science, University of Maryland, College Park, MD 20742, USA
e-mail: bobby@cs.umd.edu

M. Rabinovich
Electrical Engineering and Computer Science, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, Ohio 44106–7071, USA
e-mail: misha@eecs.case.edu

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_7, © Springer-Verlag London Limited 2010

In this chapter, we concentrate on the former aspect of overlay networks and present a survey of overlay applications with a focus on how they are used to increase network resilience. We begin with a high-level overview of some issues that can hamper network operation and how overlay networks can help address them. In particular, we consider how overlay networks can make a distributed application more resilient to flash crowds and overload, to component failures and churn, to network failures and congestion, and to denial-of-service attacks.

7.1.1 Resilience to Flash Crowds and Overload

The emergence of the Web has led to a new phenomenon where Internet resources are exposed to potentially unlimited demand. It is difficult (and indeed inefficient) for content providers to provision sufficient capacity for the worst-case load (which is often hard to predict). The inability to predict worst-case load leaves content providers susceptible to flash crowds: rapid surges of demand that exceed the provisioned capacity.

Approaches to address flash crowds differ by resource type. It is useful to distinguish the following types of Internet resources:

- Large files, exemplified by software packages and media files, with file sizes on the order of megabytes for audio tracks, going up to tens or even hundreds of megabytes for software packages and gigabytes for full-length movies.
- Web objects, consisting of typical text and pictures on Web pages, with sizes ranging from one to hundreds of kilobytes.
- Streaming media, where the download (often at bounded bit rates) continues over the duration of content consumption.
- Internet applications, where a significant part of the service demand to process a client request is due to computation at the server rather than to delivering content from the server to the client.

IP multicast is a mechanism at the IP level that could potentially address the flash crowd problem in the first three of these resource types.
At a high level, IP multicast creates a tree with the content source as the root and the content consumers as the leaves. The source sends only one copy of a packet, and routers inside the network forward and duplicate packets as necessary to reach all receivers. IP multicast thus decouples the resource requirements at the source from the number of simultaneous receivers of identical data. However, IP multicast cannot help when different content needs to be sent to different clients, when the same content needs to be sent at different times, or when one needs to scale up an Internet application. Furthermore, although IP multicast is widely implemented, access to the IP multicast service is enabled only within the confines of individual ISPs and only for selected applications.

Overlay networks can help overcome these limitations. Content delivery networks are an overlay-based approach widely used for streaming, large-file, and Web content delivery. A content delivery network (CDN) is a third-party infrastructure that content providers employ to deliver their data. In a sense, it emulates multicast at the application level, with content providers’ sites acting as roots of the multicast trees and servers within the CDN infrastructure as internal multicast tree nodes. What distinguishes a CDN from IP multicast is that, as with any overlay, its deployment does not rely on additional IP services beyond the universal IP unicast service, and that CDN nodes have long-term storage capability, allowing the distribution trees to encompass clients consuming content at different times. A CDN derives economy of scale from the fact that its infrastructure is shared among multiple content providers who subscribe to the CDN’s service.
Indeed, because flash crowds are unlikely to occur at the same time for multiple content providers, a CDN needs much less overprovisioning of its infrastructure than an individual content provider: a CDN can reuse the same capacity slack to satisfy peak demands for different content at different times.

Another overlay approach, called peer-to-peer (P2P) delivery, provides resilience to flash crowds by using client bandwidth to deliver content. By integrating clients into the delivery infrastructure, P2P approaches promise to scale organically with the demand surge: the more clients want to obtain certain content, the more resources are added to the delivery infrastructure. The P2P paradigm has been explored in various contexts, but the most widely used P2P approaches are those for large-file downloads and streaming content.

Peer-to-peer or peer-assisted delivery of streaming content is particularly compelling because streaming taxes the capacity of the network and at the same time imposes stringent timing requirements. Consider, for example, a vision for a future Internet TV service (IPTV), where viewers can seamlessly switch between tens of thousands of live broadcast channels from around the world, millions of video-on-demand titles, and tens of millions of videos uploaded by individual users using capabilities similar to those provided by today’s YouTube-type applications. Consider a global carrier providing this service in high definition to 500 million subscribers, with 200 million simultaneous viewers at peak demand watching different streams: either distinct titles or the same titles shifted in time. Assume conservatively that a high-definition stream requires a streaming rate of 6 Mbps (it is currently close to 10 Mbps but is projected to decrease with improvements in coding). The aggregate throughput needed to deliver these streams to all the viewers is 1.2 Petabits per second.
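The aggregate-throughput estimate can be checked with a few lines of arithmetic. The sketch below uses only the viewer count and streaming rate stated in the text; the helper `servers_needed` and its per-server capacity parameter are illustrative names introduced here, not part of any cited system.

```python
# Back-of-the-envelope check of the IPTV estimate (decimal SI units).
PEAK_VIEWERS = 200_000_000      # simultaneous viewers at peak demand
STREAM_RATE_BPS = 6_000_000     # 6 Mbps per high-definition stream

# Every viewer watches a different stream, so naive unicast adds the rates up.
aggregate_bps = PEAK_VIEWERS * STREAM_RATE_BPS
aggregate_pbps = aggregate_bps / 10**15   # bits per second -> Petabits per second


def servers_needed(total_bps: int, per_server_bps: int) -> int:
    """Servers required if each one can sustain per_server_bps (ceiling division)."""
    return -(-total_bps // per_server_bps)


print(f"aggregate throughput: {aggregate_pbps} Pbps")
```

The per-server capacity is deliberately left as a parameter, since the number of servers depends entirely on what one assumes a single video server can sustain.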
Even if a video server could deliver 10 Gbps of content, the carrier would need to deploy 120,000 video servers to satisfy this demand through naive unicast. Given these demands on network and server capacity, overlay networks, and peer-to-peer networks in particular, are important technologies for enabling IPTV on a massive scale.

7.1.2 Resilience to Component Failures and Churn

A distributed application needs to be able to operate when some of its components fail. For example, we discussed how P2P networks promise resiliency to flash crowds. However, because they integrate users’ computers into the content delivery infrastructure, they are especially prone to component failures (e.g., when a user kills a process or terminates a program) and to peer churn (as users join and leave the P2P network).

The flexibility afforded by overlay networks can be exploited to incorporate a range of redundancy mechanisms. These mechanisms allow system designers to use many failure-prone components (often user processes on end-hosts) to craft highly resilient applications. Existing P2P networks have proven this resiliency by functioning successfully despite constant peer churn. Besides traditional file-sharing P2P networks, other examples of churn-resistant overlay network designs include a peer-to-peer Web caching system [36] and a churn-resistant distributed hash table [52].

7.1.3 Resilience to Network Failures and Congestion

Overlay networks can mitigate the effects of network outages and hotspots. Two end-hosts communicating over an IP network have little control over path selection or quality. The end-to-end path is a product of the IGP routing metrics used within the involved domains and the BGP policies (set by administrators of these domains) across the domains.
These metrics and policies are often entirely nonresponsive to transient congestion; in some cases, two nodes may fail to find a path (due to BGP policies) even when a path exists. Overlay networks give end-users finer-grained control over routing and thus can react quickly to the underlying network conditions.

Consider a hypothetical voice-over-IP communication between hosts at the University of Maryland (in College Park, Maryland) and Case Western Reserve University (in Cleveland, Ohio). The default path may traverse an Internet2 router in Pennsylvania. However, if this router is congested, an overlay-based routing system that is sensitive to path latency could try to route around the congestion. For instance, the routing overlay could tunnel the packets through overlay nodes at the University of Virginia and the University of Illinois, which might bypass the temporary congestion on the default path.

Systems such as RON [4], Detour [55, 56], and Peerwise [38] create such routing overlays that route around adverse conditions in the underlying IP network. These systems build meshes for overlay routing and make autonomous routing decisions. RON builds a fully connected mesh and continually monitors all edges. When the direct path between two nodes fails or shows degraded performance, communication is rerouted through the other overlay nodes. Not all systems build a fully connected mesh: Nakao et al. [44] use topology information and geography-based distance prediction to build a mesh that is representative of the underlying physical network. Peerwise creates overlay links only between nodes that can provide shortcuts to each other. Experiments with all of these systems show that it is indeed possible to reduce end-to-end latency and improve connectivity using routing overlays.

7.1.4 Resilience to DoS Attacks

Overlay networks can be used to protect content providers from Distributed Denial-of-Service (DDoS) attacks.
During a DDoS attack, an attacker directs a set of compromised machines to flood the victim’s incoming links. DDoS attacks are effective because (1) the content provider often cannot distinguish an attacking connection from a legitimate client connection, (2) the number of attacking hosts can be large enough that it is difficult for the victim’s network provider to set up static address filters, and (3) the attackers may spoof their source IP addresses.

Over the last decade, DDoS attacks have interrupted service to many major Internet destinations and, in some cases, have been the root cause of the termination of service [31]. Networking researchers have developed many elegant approaches to mitigating the effect of DDoS attacks and tracing their origin; unfortunately, almost all of them require changes to the core Internet protocols.

Overlay services can be used to provide resiliency without changing protocols or infrastructure. SOS [28] and Mayday [3] are overlay services that “hide” the address of the content-providing server. Instead, the server is “protected” by an overlay, and access to the server may require strong authentication or captchas (which can distinguish attackers from legitimate clients). The protective overlay is large enough that it is not feasible or profitable to attack the entire overlay. The content provider’s ISP blocks all access to the server except by a small set of (periodically changing) trusted nodes, which relay legitimate requests to the server.

7.1.5 Chapter Organization

We have discussed various ways in which overlay networks can improve the resiliency of networked applications. In the rest of this chapter, we discuss some of these applications in more detail. We begin by introducing a foundational concept used in many overlay applications, the distributed hash table, in Section 7.2. We then discuss representative overlay applications, including streaming media systems in Section 7.3 and Web content delivery networks in Section 7.4.
Section 7.5 describes an overlay approach to improving the resiliency of Web services against DDoS attacks. We discuss swarming protocols for bulk transfer in Section 7.6, and conclude in Section 7.7.

7.2 A Common Building Block: DHTs

Distributed applications often maintain large sets of identifiers or keys, such as names of files, IDs of game players, or addresses of chat rooms. For scalability, resilience, and load balance, the task of maintaining these keys is divided among the nodes participating in the system. This approach scales because each node deals with only a limited subset of keys; it is resilient because a single key can be replicated onto more than one node; and it balances load because lookups and storage overhead are distributed (relatively) evenly over all the participants. A node responsible for a key may perform various application-specific actions related to this key: store the corresponding data, act as a control server for a named group, and so forth.

A fundamental capability such a system must support is to allow each participating node to identify the node(s) responsible for a given key. Once a seeking node locates the node(s) that store a key, it may initiate the corresponding actions. Distributed Hash Tables (DHTs) are a technique for efficiently distributing keys among nodes. DHTs provide this capability while limiting the knowledge each node must maintain about the other nodes in the system: instead of directly determining a responsible node (as would be the case with regular hashing), a node can only determine some nodes that are “closer” (by some metric) to the responsible node. The node then sends its request to one of the closer nodes, which in turn forwards the request toward a responsible node until the request reaches its target. Good DHTs ensure that requests traverse only a small number of overlay hops en route to a responsible node.
In a system with $n$ nodes, many DHT protocols limit this hop count to $O(\log n)$ while storing only $O(\log n)$ routing state at each node for forwarding requests. Newer designs reduce some of the overheads to constants [23, 41, 50].

DHTs are a common building block for many types of distributed services, including distributed file systems [18], publish–subscribe systems [14, 58], cooperative Web caching [25], and name service [6]. They have even been proposed as a foundation for general Internet infrastructures [58].

DHTs can be built using a structured network, in which the DHT protocol chooses which nodes in the network are linked (and uses the structure inherent in these connections to reduce lookup time), or an unstructured network, in which the node interconnection is either random or specified by an external agent (as can be the case if links are constrained, as in a wireless network, or have specific semantics such as trust). We next describe prototypical DHT systems that are designed for cooperative environments.

7.2.1 Chord: Lookup in Structured Networks

Chord [59] was one of the first DHTs that routed requests in $O(\log n)$ overlay hops while requiring each node to store only $O(\log n)$ routing state. The routing state at each node contains pointers to some other nodes and is called the node’s finger table. Nodes responsible for a key store a data item associated with this key; the DHT can be used to look up data items by key.

Chord assigns an identifier (uniformly at random) to each node from a large ID space ($2^N$ IDs, where $N$ is usually set to 64 or 128). Each item to be stored in the DHT is also assigned an ID from the same space. Chord orders IDs onto a ring modulo $2^N$. An item is mapped to the node with the smallest ID larger than the item’s ID modulo $2^N$. Using this definition, we say that each item is mapped onto the node “closest” to the item in the ID space.
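The mapping rule above can be sketched in a few lines. This is a minimal illustration rather than Chord's actual implementation: the function names and the small 6-bit example ring are assumptions made here, and an item whose ID equals a node's ID is assumed to map to that node.

```python
import random
from bisect import bisect_left


def random_node_id(rng: random.Random, bits: int) -> int:
    """Draw an ID uniformly at random from the 2**bits ID space."""
    return rng.getrandbits(bits)


def responsible_node(item_id: int, node_ids: list[int], bits: int) -> int:
    """Return the node an item maps to on the ring modulo 2**bits.

    The item goes to the first node ID at or after the item's ID; if no such
    node exists, the search wraps around to the smallest node ID.
    """
    space = 2 ** bits
    ring = sorted(n % space for n in node_ids)
    idx = bisect_left(ring, item_id % space)   # first position with ring[idx] >= item_id
    return ring[idx % len(ring)]               # idx == len(ring) wraps to ring[0]


# Illustrative ring in a tiny 6-bit ID space (64 IDs).
NODES = [2, 4, 8, 15, 21, 22, 28, 31, 34, 43, 46, 52, 55, 62]
print(responsible_node(3, NODES, bits=6))    # item 3 sits between nodes 2 and 4
print(responsible_node(63, NODES, bits=6))   # item 63 wraps around the ring
```

Sorting the ring and using binary search mirrors the global view of the mapping; an actual node never has this global view and must route toward the responsible node instead.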
A node with ID $x$ stores a “finger table”, which consists of references to the nodes closest to IDs $x + 2^i$, $i \in \{0, \ldots, N-1\}$. The successor of $i$, denoted as $s(i)$, is the node whose ID is immediately greater than $i$’s ID modulo $2^N$. Likewise, the predecessor of $i$, $p(i)$, is the node whose ID is immediately less than $i$’s (Fig. 7.1). Each Chord node is responsible for the half-open interval consisting of its predecessor’s ID (noninclusive) and its own ID (inclusive).

When a new node joins, it finds its “place” on the ring by routing to its own ID (say $x$), and can populate its own routing table by successively querying for nodes with the appropriate IDs ($x + 1, x + 2, x + 4, \ldots$). In the worst case, this incurs $O(\log^2 n)$ overhead.

A node returns the data (if any) upon receiving a lookup for a key in the range of IDs it stores. For other lookups, it “routes” (forwards) the query to the node in its finger table with the highest ID (modulo $2^N$) smaller than the key. This process iterates until the item is found or it is determined that there is no item corresponding to the lookup. Figure 7.2 shows two examples of lookups in Chord. In the first case, the data corresponding to key value 3 is looked up (starting from node 52); in the second, 42 is looked up starting from node 46. The figure shows the nodes visited by the queries in each case, and also the interval (part of the Chord space) each node is responsible for.

[Fig. 7.1 Finger table state for Node 2]

[Fig. 7.2 Two lookups on the Chord ring]
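The finger table and the forwarding rule can be sketched as follows. This is an illustrative model under stated assumptions, not Chord's implementation: it uses a 6-bit ID space with the node IDs that label the ring in Figs. 7.1 and 7.2, computes fingers from a global node list (an actual node learns them through the protocol), and assumes that a node with no finger preceding the key hands the query to its immediate successor. The hop sequences it produces are those of this sketch and are not claimed to reproduce Fig. 7.2.

```python
from bisect import bisect_left

BITS = 6                      # illustrative ID space of 2**6 = 64 IDs
SPACE = 2 ** BITS
NODES = sorted([2, 4, 8, 15, 21, 22, 28, 31, 34, 43, 46, 52, 55, 62])


def successor(ident: int) -> int:
    """First node whose ID is at or after ident, wrapping around the ring."""
    idx = bisect_left(NODES, ident % SPACE)
    return NODES[idx % len(NODES)]


def predecessor(node: int) -> int:
    """Node immediately before `node` on the ring."""
    return NODES[(NODES.index(node) - 1) % len(NODES)]


def distance(a: int, b: int) -> int:
    """Clockwise distance from ID a to ID b."""
    return (b - a) % SPACE


def finger_table(node: int) -> list[int]:
    """Nodes closest to IDs node + 2**i for i = 0 .. BITS - 1."""
    return [successor(node + 2 ** i) for i in range(BITS)]


def owns(node: int, key: int) -> bool:
    """True if key lies in the interval (predecessor(node), node]."""
    pred = predecessor(node)
    return 0 < distance(pred, key) <= distance(pred, node)


def next_hop(node: int, key: int) -> int:
    """Finger with the highest ID that still precedes the key on the ring."""
    preceding = [f for f in finger_table(node)
                 if 0 < distance(node, f) < distance(node, key)]
    if preceding:
        return max(preceding, key=lambda f: distance(node, f))
    return successor(node + 1)   # assumption: fall back to the immediate successor


def lookup(start: int, key: int) -> list[int]:
    """Return the list of nodes visited, ending at the responsible node."""
    path = [start]
    while not owns(path[-1], key):
        path.append(next_hop(path[-1], key))
    return path


print(finger_table(2))
print(lookup(52, 3))
print(lookup(46, 42))
```

Each hop either moves strictly closer to the key or lands on the responsible node, so the loop in `lookup` always terminates.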
In practice, Chord nodes inherit most of their routing table from their neighbors (and avoid the $O(\log^2 n)$ work to populate tables). Nodes periodically search the ring for “better” finger table entries. As nodes leave and rejoin, the Chord ring is kept consistent using a stabilize protocol, which ensures eventual consistency of successor pointers. More details about Chord, including the details of the stabilization protocol, can be found in [60].

7.2.2 LMS: Lookup on Given Topologies

As we saw in the previous section, Chord imposes on its nodes an overlay topology that is stipulated by node IDs, and lookup queries traverse routes in this topology. Such networks are often referred to as structured. In contrast, some overlay networks allow participating nodes to form arbitrary topologies, irrespective of their node IDs. These networks are called unstructured. The simplest form of lookup on an unstructured topology is to flood the query. Flooding searches, while adequate for small networks, quickly become infeasible as networks grow larger.

LMS (Local Minima Search [43]) is a protocol designed for unstructured networks that scales better than flooding. In LMS, the owner of each object places replicas of the object on several nodes. As in a DHT, LMS places replicas onto nodes that have IDs “close” to the object. Unlike in a DHT, however, in an unstructured topology there is no deterministic mechanism to route to the node that is closest to an item. Instead, LMS introduces the notion of a local minimum: a node $u$ is a local minimum for an object if and only if the ID of $u$ is the closest to the item’s ID in $u$’s neighborhood (those nodes within $h$ hops of $u$ in the network, where $h$ is a parameter of the protocol, typically 1 or 2). In general, for any object there are many local minima in a graph, and replicas are placed onto a subset of these.
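The definition of a local minimum can be made concrete with a short sketch. It is only an illustration of the definition, not the LMS protocol itself: the adjacency-dictionary graph representation, the circular ID distance, the tie-breaking rule (a node tied for closest still counts as a minimum), and all names are assumptions introduced here.

```python
from collections import deque


def h_hop_neighborhood(graph: dict, u, h: int) -> set:
    """Nodes within h hops of u (including u), found by breadth-first search."""
    seen = {u}
    frontier = deque([(u, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == h:
            continue                      # do not expand beyond h hops
        for v in graph[node]:
            if v not in seen:
                seen.add(v)
                frontier.append((v, depth + 1))
    return seen


def id_distance(a: int, b: int, space: int) -> int:
    """Assumed metric: shortest distance between two IDs on a ring of size `space`."""
    d = abs(a - b) % space
    return min(d, space - d)


def is_local_minimum(graph: dict, ids: dict, u, item_id: int, h: int, space: int) -> bool:
    """True if no node within h hops of u has an ID closer to item_id than u's."""
    mine = id_distance(ids[u], item_id, space)
    return all(mine <= id_distance(ids[v], item_id, space)
               for v in h_hop_neighborhood(graph, u, h))


# A five-node path a - b - c - d - e with hypothetical IDs in a 64-ID space.
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
IDS = {"a": 10, "b": 40, "c": 20, "d": 50, "e": 5}

for h in (1, 2):
    minima = [u for u in GRAPH if is_local_minimum(GRAPH, IDS, u, 12, h, 64)]
    print(h, minima)
```

Increasing `h` enlarges each neighborhood and therefore can only reduce the set of local minima for a given item.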
During a search, random walks are used to locate minima for a given object, and a search succeeds when a local minimum holding a replica is located. While DHTs typically provide a worst-case bound of $O(\log n)$ steps for lookups in a network of size $n$, LMS provides a worst-case bound of $O(T(G) + \log n)$, where $T(G)$ is the mixing time of $G$ (the time by which a random walk on the topology $G$ approaches its stationary distribution). $T(G)$ is $O(\log n)$ or polylogarithmic in $n$ for a wide range of randomly grown topologies. This $O(T(G) + \log n)$ quantity is typically in the 6–15 range in networks of size up to 100,000. Let $d_h$ be the minimum size of the $h$-hop neighborhood of any node in $G$. LMS achieves its performance by storing $O(\sqrt{n/d_h})$ replicas, with a message complexity (in its lookups) of $O(\sqrt{n/d_h}\,(T(G) + \log n))$. This is notably worse than DHTs, but it is a considerable improvement over other (essentially linear-time) lookup techniques in networks that cannot support a structured protocol, and a vast improvement over flooding-based searches [43].

The use of local minima in LMS provides a high assurance that object replicas are distributed randomly throughout the network. This means that even if the lookup part of the LMS protocol is not used (such as for searches on object attributes, which consequently cannot use the virtualized object identifier), flooding searches will succeed with high probability even with relatively small bounded propagation distances. Finally, LMS also provides a high degree of fault tolerance.

7.2.3 Case Study: OpenDHT

Since many distributed applications can benefit from a lookup facility, a logical step is to develop a DHT substrate. OpenDHT is an example of such a substrate [53]. An application using a DHT may need to execute application-specific actions at each node along DHT routing paths or at the node responsible for a given key.
However, to satisfy a range of applications, OpenDHT takes a minimalist approach: it only allows applications to associate a data item with a given key and store it in the substrate (at a node or nodes that OpenDHT selects to be responsible for this key), and to retrieve it from the substrate. The DHT routing is done “under the covers” within the substrate and is not exposed to the application. In other words, OpenDHT is an external storage platform for third-party applications. While OpenDHT is itself a peer-to-peer overlay network, application end-hosts do not participate in it directly. Instead, it runs on PlanetLab [16] nodes; applications that use OpenDHT may or may not use PlanetLab.

OpenDHT provides two simple primitives to applications: put(key, data), which is used to store a data item and an associated key, and get(key), which retrieves previously stored data given its key.¹ Multiple puts with the same key append their data items to the already existing ones, so a subsequent get retrieves all these data. OpenDHT, therefore, implements an application-agnostic shared storage facility. Due to its open nature, OpenDHT includes special mechanisms to prevent resource hoarding by any given user. It also limits the size of data items to 1 KB and times out deposited data items that are not explicitly renewed by the application. Renewal is done by issuing an identical “put” before the original data item expires.

The shared storage provided by OpenDHT allows end-hosts in a distributed application to conveniently share state, without any administrative overhead. This capability turned out to be powerful enough to support a growing number of applications. In fact, OpenDHT primitives can be used to implement an application that employs its own DHT routing among the application’s end-hosts [53].
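The put/get semantics described above (appending on repeated puts, a 1 KB size limit, expiry, and renewal by an identical put) can be modeled with a small in-memory class. This is a hypothetical single-process stand-in written for illustration: the class name, the `ttl` and `now` parameters, and the reading of 1 KB as 1024 bytes are assumptions, and none of it is OpenDHT's actual API.

```python
class SharedStore:
    """Toy model of an append-style put/get store with expiring items."""

    MAX_BYTES = 1024  # assumed reading of the 1 KB limit

    def __init__(self) -> None:
        # key -> {data item -> expiry time}; dicts keep insertion order
        self._items: dict[str, dict[bytes, float]] = {}

    def put(self, key: str, data: bytes, ttl: float, now: float) -> None:
        """Store data under key; an identical put renews the item's expiry."""
        if len(data) > self.MAX_BYTES:
            raise ValueError("data item exceeds the size limit")
        self._items.setdefault(key, {})[data] = now + ttl

    def get(self, key: str, now: float) -> list[bytes]:
        """Return every unexpired data item stored under key."""
        entries = self._items.get(key, {})
        return [data for data, expires in entries.items() if expires > now]


store = SharedStore()
store.put("room", b"alice", ttl=10, now=0)
store.put("room", b"bob", ttl=10, now=5)     # appended, not overwritten
store.put("room", b"alice", ttl=10, now=9)   # identical put renews alice
print(store.get("room", now=12))             # both items still live
print(store.get("room", now=16))             # bob has expired by now
```

Passing the clock explicitly keeps the example deterministic; a deployed service would of course use its own notion of time and spread the items over many nodes.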
While a great deal of engineering ingenuity ensures that OpenDHT nodes’ resources are shared fairly among competing applications, OpenDHT’s resiliency and scalability come from its overlay network architecture. Besides demonstrating these benefits of overlays, OpenDHT has shown the generality of the DHT concept by using it as a foundation of a substrate that has proved useful for a number of diverse applications.

Footnote 1: The actual API includes additional primitives and parameters, which are beyond the scope of our discussion.

7.2.4 Securing DHTs

Chord and LMS are only two of many different contemporary lookup protocols. These two protocols assume that nodes are cooperative and altruistic. While these protocols are highly resilient to random component failures, it is more difficult to protect them against malicious attacks. This is especially a concern since DHTs may be built using public, non-centrally administered nodes, some of which may be corrupt or compromised. There are several ways in which adversarial nodes may attempt to subvert a DHT. Malicious nodes may return incorrect results, route requests to other incorrect nodes, provide incorrect routing updates, prevent new nodes from joining the system, and refuse to store or return items. There are several DHT designs that provide resilience to these types of attacks. We describe one in detail next.

7.2.5 Case Study: NeighborhoodWatch

The NeighborhoodWatch DHT [11] provides security against malicious users that attempt to subvert a DHT instance by misrouting or dropping queries, refusing to store items, preventing new nodes from joining, and similar attacks. NeighborhoodWatch employs the same circular ID space as Chord [59], and also maps its nodes into neighborhoods as in [20]. However, in NeighborhoodWatch, each node has its own neighborhood that consists of itself, its k successors, and its k predecessors, where k is a system parameter.
NeighborhoodWatch’s security guarantees hold if and only if for every sequence of k + 1 consecutive DHT nodes, at least one is alive and honest. NeighborhoodWatch employs an on-line trusted authority, the Neighborhood Certification Authority (NCA), to attest to the constituents of neighborhoods. The NCA has a globally known public key. The NCA may be replicated, and the state shared between NCA replicas is limited to the NCA private key, a list of malicious nodes, and a list of complaints of non-responsive nodes. The NCA creates, signs, and distributes neighborhood certificates, or nCerts, to each node. Nodes need a current and valid nCert in order to participate in the system. Upon joining, nodes receive an initial nCert from the NCA. nCerts are not revoked; instead nodes must renew their nCerts on a regular basis by contacting the NCA. nCerts list the current membership of a neighborhood, accounting for any recent changes in membership that may have occurred. Using signed nCerts, any node can identify the set of nodes that are responsible for storing an item with a given ID. NeighborhoodWatch employs several mechanisms that detect and prove misbehavior (described in detail in [11]). The NCA removes a malicious node from the DHT by refusing to sign a fresh nCert for that node. Nodes maintain and update their finger tables as in Chord. The join procedure is shown in Fig. 7.3. For each of node n’s successors, predecessors, and finger table entries, n stores a full nCert (instead of only the node ID and IP address as in Chord). When queried as part of a lookup operation, nodes return nCerts rather than information about a single node. Routing is iterative: if a node on the path fails (or does not answer), the querier can contact another node in the most recently obtained nCert.

Fig. 7.3 The join process in the NeighborhoodWatch DHT [11]. Here k = 3. (1) Node n requests to join by contacting an NCA replica. (2) The NCA returns an nCert to n, who uses it to find owner(n.id). (3) Node n returns the nCert of owner(n.id) to the NCA. (4) The NCA requests neighborhood certificates from the k predecessors and k successors of n. (5) Nodes return current certificates and the NCA verifies their consistency. (6) The NCA issues fresh certificates to all affected nodes.

Recall that NeighborhoodWatch assumes that every sequence of k + 1 consecutive nodes in the DHT contains at least one node that is alive and honest. The insight is that if nodes cannot choose where they are placed in the DHT, malicious nodes would have to corrupt a large fraction of the nodes in the DHT in order to obtain a long sequence of consecutive, corrupt nodes. By making routing depend on long sequences of nodes (neighborhoods), nodes are guaranteed to know of at least one other honest node that is “near” a given point in the DHT. In order to protect against a given fraction f of malicious nodes, the system operator chooses a value of k such that this assumption holds with high probability. Items published to the DHT are self-certifying. In addition, when a node stores an item, it returns a signed receipt to the publisher. This receipt is then stored back in the DHT. This prevents nodes from lying about whether they are storing a given item: if a querier suspects that a node is refusing to return an item, it can look for a receipt. If it finds a receipt, it can petition the NCA to remove the misbehaving node from the DHT.
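To give a feel for how an operator might choose k for a given fraction f, the following back-of-the-envelope sketch uses a simplifying assumption that is ours and not taken from [11]: each node is independently "bad" (malicious or not alive) with probability f, regardless of its position on the ring. Under that assumption, a particular run of k + 1 consecutive nodes is entirely bad with probability $f^{k+1}$, and a union bound over the n possible starting positions gives a failure probability of at most $n f^{k+1}$. The function names and the target `epsilon` are hypothetical.

```python
import math


def failure_bound(n, f, k):
    """Union bound on the chance that some run of k + 1 consecutive nodes
    is entirely bad, assuming independent per-node badness (our assumption)."""
    return min(1.0, n * f ** (k + 1))


def min_neighborhood_k(n, f, epsilon):
    """Smallest k for which the union bound n * f**(k + 1) is at most epsilon."""
    if n < 1 or not (0.0 < f < 1.0) or not (0.0 < epsilon < 1.0):
        raise ValueError("need n >= 1, 0 < f < 1, and 0 < epsilon < 1")
    # Solve n * f**(k + 1) <= epsilon for the run length k + 1.
    run_length = math.ceil(math.log(n / epsilon) / math.log(1.0 / f))
    return max(run_length - 1, 0)


if __name__ == "__main__":
    k = min_neighborhood_k(n=10_000, f=0.25, epsilon=1e-6)
    print(k, failure_bound(10_000, 0.25, k))
```

Because the bound shrinks geometrically in k, the required k grows only logarithmically with the network size n, which is consistent with the text's point that an adversary must corrupt a large fraction of nodes to obtain a long corrupt run. The analysis in [11] should be consulted for the actual guarantees.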
7.2.6 Summary and Further Reading

In this section, we have described the basic functionality provided by DHTs, and provided case studies that demonstrate different flavors of DHTs and lookup protocols. We have described how DHTs attain their lookup performance, and also described how DHT protocols can be subverted by attackers. Finally, we have presented a DHT design that is more resilient to noncooperative and malicious behavior. Our review is not comprehensive; there are many other interesting DHT designs. We point the interested reader to [12, 20, 23, 41, 50, 51, 54, 66].

7.3 Resilient Overlay-Based Streaming Media

Overlay-based streaming media systems can be decomposed into three broad categories depending on their data delivery mechanism (Fig. 7.4). Participants in a single-tree system arrange themselves into a tree. By definition, this implies that there is a single, loop-free path between any two tree nodes. The capacity of each tree link must be at least the streaming rate. Content is forwarded (i.e., pushed) along the established tree paths. The source periodically issues a content packet to its children in the tree. Upon receiving a new content packet, each node immediately forwards a copy to its children. The uplink bandwidth of leaf nodes remains unused (except by recovery protocols) in a single-tree system.

Fig. 7.4 Decomposition of Streaming Media Protocols (base dataplanes: Single-Tree, Multi-Tree, Mesh; hybrids: Single-Tree Mesh, Multi-Tree Mesh)

Examples of single-tree systems include ESM [15], Overcast [26], ZIGZAG [61], and NICE [8]. In a multi-tree system, each participating node joins k different trees and the content is partitioned into k stripes. Each stripe is then disseminated in one of the trees, just as in a single-tree system. In a multi-tree protocol, each member node can be an interior node in some tree(s) and a leaf node in other trees.
Further, each stripe requires only 1/k of the full stream bandwidth, enabling multi-trees to utilize forwarding bandwidths that are a fraction of the stream rate. These two properties enable multi-tree systems to utilize available bandwidth better than a single tree. SplitStream [13], CoopNet [45], and Chunkyspread [62] are examples of multi-tree systems. In mesh-based or swarming overlays, the group members construct a random graph. Often, a node’s degree in the mesh is proportional to the node’s forwarding bandwidth, with a minimum node degree (typically five [69]) sufficient to ensure that the mesh remains connected in the presence of churn. The source periodically makes a new content block available, and each node advertises its available blocks to all its neighbors. A missing block can then be requested from any neighbor that advertises the block. Examples of mesh-based systems are CoolStreaming [69], Chainsaw [46], PRIME [39], and PULSE [47]. As Fig. 7.4 shows, the base dataplanes can be combined to form hybrid dataplanes. Hybrid dataplanes combine tree- and mesh-based systems by employing a tree backbone and an auxiliary mesh structure. Typically, blocks are “pushed” along the tree edges (as in a regular tree protocol) and missing blocks are “pulled” from mesh neighbors (as in a regular mesh protocol). Prototypical examples of single-tree-mesh systems are mTreeBone [65] and Pulsar [37]. Bullet [29] is also a single-tree mesh, but instead of relying on the primary tree backbone to deliver the majority of blocks, it pushes random subsets of blocks along a given tree edge, and nodes recover the missing blocks via swarming. PRM [9] is a probabilistic single-tree mesh system. Chunkyspread [62], GridMedia [68], and Coolstreaming+ [33, 34] are multi-tree-mesh systems. CPM [22] is a server-based system that combines server multicast and peer-uploads.

7.3.1 Recovery Protocols

Tree-based delivery is fragile, since a single failure disconnects the data delivery until the tree is repaired. Existing protocols have added extra edges to a tree (thus approximating a mesh) for reducing latency [40] and for better failure recovery [9, 67]. These protocols are primarily tree-based, but augment tree delivery (or recovery) using these extra links. Multi-tree protocols are more resilient, since a single failure often affects only one (of k) trees. Mesh delivery is robust by design; single-node or even multiple failures are not of high consequence since the data is simply pulled along surviving mesh paths. We next describe in detail different delivery protocols with a focus on their recovery behavior.

7.3.2 Case Study: Recovery in Trees Using Probabilistic Resilient Multicast (PRM)

PRM [10] adds three new mechanisms to tree delivery: randomized forwarding, triggered NAKs, and ephemeral guaranteed forwarding. We discuss randomized forwarding in detail. In randomized forwarding, each overlay node, with a small probability, proactively sends a few extra transmissions along randomly chosen overlay edges. Such a construction interconnects the data delivery tree with some cross edges and is responsible for fast data recovery in PRM under high failure rates of overlay nodes. We explain the details of proactive randomized forwarding [10] using the example shown in Fig. 7.5. In the original data delivery tree (Panel 0), each overlay node forwards data to its children along its tree edges. However, due to network losses on overlay links (e.g., ⟨A, D⟩ and ⟨B, F⟩) or failure of overlay nodes (e.g., C, L, and Q), a subset of existing overlay nodes do not receive the packet (e.g., D, F, G, H, J, K, and M). We remedy this as follows. When any overlay node receives the first copy of a data packet, it forwards the data along all other tree edges (Panel 1).
It also chooses a small number (r) of other overlay nodes and forwards data to each of them with a small probability, β. For example, node E chooses to forward data to two other nodes using cross edges to F and M. Note that as a consequence of these additional edges some nodes may receive multiple copies of the same packet (e.g., node T in Panel 1 receives the data along the tree edge ⟨B, T⟩ and cross edge ⟨P, T⟩). Therefore, each overlay node needs to detect and suppress such duplicate packets. Each overlay node maintains a small duplicate suppression cache, which temporarily stores the set of data packets received over a small time window. Data packets that miss the latency deadline are dropped. Hence the size of the cache is limited by the latency deadline desired by the application. In practice, the duplicate suppression cache can be implemented using the playback buffer already maintained by streaming media applications. It is easy to see that each node on average sends or receives up to 1 + βr copies of the same packet. The overhead of this scheme is βr, where we choose β to be a small value (e.g., 0.01) and r to be between 1 and 3. In PRM, nodes discover other random nodes by employing periodic random walks.

Fig. 7.5 The basic idea behind PRM (Panels 0 and 1). The circles represent the overlay nodes. The crosses indicate link and node failures. The arrows indicate the direction of data flow. The curved edges indicate the chosen cross overlay links for randomized forwarding of data. [10]

It is instructive to understand why such a simple, low-overhead randomized forwarding technique is able to increase packet delivery ratios with high probability, especially when many overlay nodes fail. Consider the example shown in Fig. 7.6, where a large fraction of the nodes have failed in the shaded region. In particular, the root of the subtree, node A, has also failed.
So if no forwarding is performed along cross edges, the entire shaded subtree is partitioned from the data delivery tree. No overlay node in this entire subtree would receive data packets until the partition is repaired. However, using randomized forwarding along cross edges, a number of nodes from the unshaded region will have random edges into the shaded region as shown (⟨M, X⟩, ⟨N, Y⟩, and ⟨P, Z⟩). The overlay nodes that receive data along such randomly chosen cross edges will subsequently forward data along regular tree edges and any chosen random edges. Since the cross edges are chosen uniformly at random, a large subtree will have a higher probability of cross edges being incident on it. Thus as the size of a partition increases, so does its chance of repair using cross edges. Triggered NAKs are the reactive components of PRM. An overlay node can detect missing data using gaps in received sequence numbers. This information is used to trigger NAK-based retransmissions. PRM further includes an Ephemeral Guaranteed Forwarding technique, which is useful for providing uninterrupted data service when the overlay construction protocol is detecting and repairing a partition in the data delivery tree. Here, when the tree is being repaired, the root of an affected subtree receives a stream of data from a “random” peer. More details about PRM are available in [10].

Fig. 7.6 PRM provides successful delivery with high probability because large subtrees affected by a node failure get randomized recovery packets with high probability [10]. (The shaded region marks an overlay subtree with a large number of node failures.)

7.3.3 Case Study: Multi-Tree Delivery Using SplitStream

In SplitStream, the media is divided into k stripes, using a coding technique such as multiple description coding (MDC).
All of the stripes in aggregate provide perfect quality, but each stripe can be used independently of the others and each received stripe progressively improves the stream quality. SplitStream forms k trees, such that, ideally, each node is an interior node in only one tree. The source multicasts stripes onto different trees, and each node receives all stripes and forwards only one stripe. When a node departs, at most one tree is affected since every node is a leaf in all but one tree. Therefore, node departures do not affect delivery quite as much as in a single-tree system. Further, the forwarding bandwidth of every node is now used, since each node is an interior node in at least one stripe tree. Finally, since each stripe is approximately 1/k of the bandwidth of the original stream, each node can serve more children, which results in a shorter tree (higher average outdegree) and lower latency. SplitStream is built atop Scribe, which itself is an overlay multicast protocol built using the Pastry DHT. Due to bandwidth constraints on individual nodes, it is not always feasible to form the ideal interior-disjoint trees such that each node is an interior node in only one tree. In particular, a stripe tree may run out of forwarding bandwidth (because all of its leaf nodes are interior nodes in some other tree). To solve this problem, SplitStream maintains a “Spare Capacity Group (SCG),” which contains nodes with extra capacity that can forward onto more than one stripe. In bandwidth-scarce deployments, nodes may have to use the SCG to locate a parent. In extreme cases, it may be impossible to form a proper SplitStream forest; however, this condition is rare and is analysed in detail in [13].

7.3.4 Case Study: Recovery Using a Mesh in CoolStreaming/DONet

In CoolStreaming, a random mesh connects the members of the data overlay, and random blocks are “pulled” from different mesh neighbors.
Each node maintains an mCache, which is a partial list of other active nodes in the overlay. A new node initially contacts the source; the source selects a random “deputy” from its mCache, and the deputy supplies the new node with currently active nodes. Each node periodically percolates a message (announcing itself) onto the overlay using a gossip protocol. The media stream is divided into fixed-size segments; each segment has a sequence number and each node maintains a bitmap, called the buffer map, to represent the availability of segments. In CoolStreaming, the default buffer map contains 120 bits. Each node maintains a number of neighbors (called partners) proportional to its forwarding bandwidth, while still maintaining a minimum number of partners (typically 5). Nodes periodically (usually every second) exchange their buffer maps with their partners, and use a scheduling heuristic to exchange blocks. The scheduling algorithm must select a block to request, and an eligible node to request the block from. The block requested is the scarcest block (the one supplied by the fewest nodes). The block is requested from the eligible node (one that has advertised the scarce block) with the most bandwidth. The origin node serves only as a supplier and publishes a new content block every second. Partners can be updated from the node’s mCache as needed, and the mCache is updated using the periodic gossip. Individual node failures have very little effect on the delivery since a node can simply select a different partner to receive a block. However, the trade-off is control overhead (bitmap exchange) and latency (which is now proportional to the product of buffer map size and overlay diameter).

7.4 Web Content Delivery Networks

Resource provisioning is a fundamental challenge for Internet content providers.
With too much provisioning, the infrastructure will simply depreciate without generating return on investment; with too little, the web site may lose business and potentially steer users to competitors. A content delivery network (CDN) offers a service to content providers that helps address this challenge. A typical CDN provider deploys a number of CDN servers around the globe and uses them as a shared resource to deliver content from multiple content providers that subscribe to the CDN’s service. The CDN servers are also known as edge servers because they are often located at the edges of the networks in which they are deployed. Content delivery networks represent a type of overlay network because they route content between the origin sites and the clients through edge servers. A CDN improves resiliency and performance of subscribing web sites in several ways. As already mentioned in Section 7.1.1, a CDN can reuse capacity slack to absorb demand peaks for different content providers at different times. By sharing a large slack across a diverse pool of content providers, CDNs improve resiliency of the subscribing web sites to flash crowds. A CDN promises a degree of protection against denial of service attacks because of the enormous capacity the attacker would need to saturate to cause any noticeable performance degradation. A CDN improves the performance of content delivery under normal load because it can process client requests from a nearby edge server. CDNs are used to deliver a variety of content, including static web objects, software packages, multimedia files, and streaming content, both video-on-demand and live. For video-on-demand, edge servers deliver streams to viewers from their cached files; typically, these files are pre-loaded to the edge server caches from origin sites as they become available.
However, if a requested file is not cached, the edge server will typically obtain the stream from the origin and forward it to the viewer, while also storing the content locally for future requests. In the case of live streaming (“Webcasts”), content flows form a distribution tree, with viewers as leaves, edge servers as intermediate nodes, and the origin as the root. Often, however, CDN servers form deeper trees. In either case, Webcast delivery through a CDN can benefit from various tree-based approaches to streaming media systems such as those discussed in Section 7.3. In the rest of this section, we will limit our discussion to how CDNs deliver static files, including static web objects, software packages, multimedia files, etc.

7.4.1 CDN Basics

A CDN must interpose its infrastructure transparently between the content provider and the user. Furthermore, unlike P2P networks where users run specialized peer software, a CDN must serve clients using standard web browsers. Thus, a fundamental building block in a CDN is a mechanism to transparently reroute user requests from the content provider’s site (known as the “origin site” in the CDN parlance) to the CDN platform. The two main techniques that have been used for this purpose are DNS outsourcing and URL rewriting. Both techniques rely on the domain name system (DNS), which maps human-readable names, such as www.firm-x.com, to numeric Internet protocol (IP) addresses. A browser’s HTTP request is preceded by a DNS query to resolve the host name from the URL. The DNS queries are sent by browsers’ local DNS servers (LDNS) and processed by the web sites’ authoritative DNS servers (ADNS). In URL rewriting, a content provider rewrites its web pages so that embedded links use host names belonging to the CDN domain.
For example, if a page www.firm-x.com contains an image picture.jpg that should be delivered by the CDN, the image URL would be rewritten to a form such as http://images.firm-x.com.cdn-foo.net/real.firm-x.com/picture.jpg. In this case, the DNS query for images.firm-x.com.cdn-foo.net would arrive at the CDN’s DNS server in a normal way, without redirection from firm-x.com’s ADNS. Note that URL rewriting only works for embedded and hyperlinked content. The container pages (i.e., the entry points to the web sites) would have to be delivered from the origin site directly.

Fig. 7.7 A high-level view of a CDN architecture

DNS outsourcing refers to techniques that exploit mechanisms in the DNS protocol that allow a query to be redirected from one DNS server to another. Besides responses containing IP addresses, the DNS protocol allows two response types that can be used for redirection. An NS-type response specifies a different DNS server that should be contacted to resolve the query. A CNAME-type response specifies a canonical name, a different host name that should be used instead of the name contained in the original query. Either response type can be used to implement DNS outsourcing. Figure 7.7 depicts a high-level architecture of a CDN utilizing DNS outsourcing. Consider a content provider (firm-x.com in the example) that subscribes to CDN services to deliver its content from the images.firm-x.com subdomain. (Content from other subdomains, such as www.firm-x.com, might be delivered independently, perhaps by the provider’s origin server itself.) When a client wants to access a URL with this hostname, it first needs to resolve this hostname into the IP address of the server.
To this end, it sends a DNS query to its LDNS (step 1), which ultimately sends it to the ADNS server for firm-x.com (step 2). ADNS now engages the CDN by redirecting LDNS’s query to the DNS server operated by the CDN provider (CDN DNS in the figure). ADNS does this by returning, in the exchange of step 2, an NS record specifying CDN DNS. LDNS now sends the query for images.firm-x.com to CDN DNS, which can now choose an appropriate edge server and return its IP address to LDNS (step 3). The LDNS server forwards the response to the client (step 4), which now downloads the file from the specified server (step 5). When the request arrives at the edge server, the server may or may not have the requested file in its local cache. If it does not, it obtains the file from the origin server (step 6) and sends it to the client; the edge server can also cache this file for future use, depending on the cache-controlling headers that came with the file from the origin server. With either DNS outsourcing or URL rewriting, when a DNS query arrives at the CDN’s DNS server, the latter has the discretion to select the edge server whose IP address it would return in the DNS response. This provides the CDN with an opportunity to spread the content delivery load among its edge servers (by resolving different DNS queries to different edge servers) and to localize content delivery (by resolving a given DNS query to an edge server that is close to the requesting client, according to some metric). There are a number of sometimes conflicting factors that can affect edge server selection. The mechanisms and policies for server selection are a large part of what distinguishes different CDNs from one another. The much-simplified architecture described above is fully workable except for one detail: how does the edge server receiving a request know which origin server to contact for the requested file? CDNs use two basic approaches to this issue. In the example of Fig.
7.7, assuming the client uses HTTP/1.1, the client will include an HTTP Host header “Host: images.firm-x.com” with its request to the edge server. This gives the edge server the necessary information. Another approach, which does not rely on the Host header, involves embedding the provider identity into the path portion of the URL. This technique is used in particular with URL rewriting. For example, with the above URL http://images.firm-x.com.cdn-foo.net/real.firm-x.com/picture.jpg, the client’s request to the edge server will be for file “real.firm-x.com/picture.jpg”, providing the edge server with the information about the origin server.

7.4.2 Bag of DNS Tricks

Looking at Fig. 7.7, an immediate concern with this architecture is the CDN DNS server. First, it is a centralized component that can become the bottleneck in the system. Second, it undermines localized data delivery to some degree because all DNS queries must travel to this centralized component no matter where they come from. These issues are exacerbated by the fact that, in order to retain fine-grained control over edge server selection, CDN DNS must limit the amount of time its responses can be cached and reused by clients. It does so by assigning a low time-to-live (TTL) value to its responses, a standard DNS protocol feature for controlling response caching. This increases the volume of DNS queries that CDN DNS must handle. Moderate-sized CDNs sometimes disregard these concerns because DNS queries usually take little processing, with a single server capable of handling several thousand queries per second. With the additional consideration that DNS server load is easily distributed in a server cluster, centralized DNS resolution can handle large amounts of load before becoming the bottleneck in practice. Furthermore, the overhead of nonlocalized DNS processing only becomes noticeable in practice for delivering small files.
For large file downloads, such as software packages or multimedia files, a few hundred milliseconds of initial delay will be negligible compared to several minutes of the download itself. Large CDNs, however, deal with extraordinary total loads and provide content delivery services for all file sizes. Thus, they implement their DNS service as a distributed system in its own right. One approach to implementing a distributed DNS service again utilizes DNS redirection mechanisms. For example, the Akamai CDN [1] implements a two-level DNS system. The top-level DNS server is a centralized component and is registered as the authoritative DNS server for the accelerated content. Thus, initial DNS queries arrive at this server. The top-level DNS server responds to queries with an NS-type response, redirecting the requester to a nearby low-level DNS server. Moreover, these redirecting responses are given a long TTL, in effect pinning the requester to the selected low-level DNS server. The actual name resolution occurs at the low-level DNS servers. Because most DNS interactions occur between clients and low-level CDN DNS servers, the DNS load is distributed and the interactions are localized in the network. Another approach uses a flat DNS system, and utilizes IP anycast to spread the load among the DNS servers. A CDN using this approach deploys a number of CDN DNS servers in different Internet locations but assigns them the same IP address. Then, it relies on the underlying IP routing infrastructure to deliver clients’ DNS queries destined to this IP address to the closest CDN DNS server. In this way, DNS processing load is both distributed and localized among the flat collection of DNS servers. The Limelight CDN [35] utilizes this technique. Besides DNS service scalability, Limelight further leverages the above technique to sidestep the decision about which of the data centers would be the closest to the client.
In particular, Limelight deploys a DNS server in every data center; then each given request will be delivered by the anycast mechanism to its closest data center. The DNS server receiving a request then simply picks one of the edge servers co-located in the same data center for the subsequent download. This approach, however, is not without drawbacks. One limitation is that it relies exclusively on the proximity notion reflected in Internet routing; there are other considerations, such as network congestion and costs. Another limitation is due to the originator problem discussed in the next subsection.

7.4.3 Issues

The basic idea behind CDNs might seem simple, but many technical challenges lurk. An obvious challenge is server selection, which is an open-ended issue. There are a number of factors that may affect the selection. A basic factor is proximity: one of the key promises of CDN technology is that it can deliver content from a nearby network location. But what does “nearby” mean? To start with, there are a number of proximity metrics one could use, which differ in how closely they correlate with end-to-end performance and how hard they are to obtain. Geographical distance, autonomous system hops, and router hops could be used as relatively static proximity metrics. Static metrics may incorporate domain knowledge, such as maps of private peering points among network providers, since private peering points can be more reliable than public network access points. Then, one could consider dynamic path characteristics, such as packet loss, network packet travel delay (one-way or round-trip), and available path bandwidth. Obtaining these dynamic metrics and keeping them fresh is much more challenging. Further, a CDN may account for economic factors, such as the preference for utilizing certain network carriers even at the expense of a slight performance degradation.
Once the proximity metrics are figured out, the next question is how to combine them with server load metrics, since in the end we need to pick a certain edge server for a given request. Server loads are inherently dynamic. They raise a number of questions of their own, with their own research literature. How long a history of past data should be considered, and which load characteristics should be measured? One can consider a variety of characteristics, including CPU usage, network utilization, memory, and disk IO. How frequently should load measurements be collected, and how frequently should load metrics be recomputed? How can one avoid a “herd effect” [19], where a CDN sends too much demand to an underloaded server, only to overload it in the next cycle?

The next set of questions is architectural in nature. As we discussed earlier, the prevalent mechanism in CDNs for routing requests to a selected edge server is based on DNS. DNS-based routing raises the so-called originator and hidden load problems [49]. The originator problem is due to the fact that CDN proximity-based server selection can only be done relative to the originator of the DNS query, which is the client’s local DNS server (LDNS), and not the actual host that will be downloading the content. Thus, the quality of any proximity-based server selection is limited by how close the actual client is to the LDNS it is using. While there has been some work on determining the distance between clients and their LDNSs [42, 57], the end-to-end effect of this issue on user-perceived performance is not yet fully known.

One way to sidestep the originator problem is to utilize IP anycast for the HTTP interaction [2]. Similar to the anycast-based DNS interactions considered previously, different edge servers in this case would advertise the same IP address. This address would be returned to the clients by CDN DNS, and packets from a given browser machine would be delivered to the closest edge server naturally thanks to IP routing.
Anycast was previously considered unsuitable for HTTP downloads for two reasons. First, unlike DNS, which uses the UDP transport protocol by default, HTTP runs on top of TCP. TCP is a stateful connection-oriented protocol, and if a routing path changes in the middle of an ongoing download, the browser may attempt to continue the download from a different edge server, leading to a broken TCP connection. Second, IP anycast selects among end-points for packet delivery without consideration for the routing path quality or end-point load. However, recent insights into anycast behavior [7] and network traffic engineering [63] alleviate these concerns, especially when a CDN is deployed within one autonomous system. ICDS – a CDN service by AT&T [5] – is currently pursuing a variant of this approach.

The hidden load problem arises because of the drastically different numbers of clients behind different LDNS servers. A large ISP likely has thousands of clients sharing the same LDNS. A single DNS query from this LDNS can then result in a large amount of demand for the selected edge server. At the same time, a single query from the LDNS of a small academic department will impose a much smaller load. Because a CDN distributes load at the granularity of DNS queries, potentially drastic and unknown imbalances of load resulting from single queries complicate proper load balancing.

Another architectural issue relates to the large number of edge servers a CDN maintains. When new popular content appears and generates a large number of requests, these requests will initially miss in the edge server caches and will be forwarded to the origin server. These missed requests may overload the origin server at the beginning of a flash crowd, until edge servers populate their caches [27]. CDNs often pre-load new content to the edge servers when the content is known to be popular. However, unpredictable flash crowds present a danger.
Consequently, CDNs sometimes deploy peer-to-peer cooperation among their edge servers, with edge servers forwarding missed requests to each other rather than directly to the origin server. This gives rise to more complex overlay network topologies than the one-hop overlay routing in the basic CDN architecture described here. In fact, the underlying mechanisms can be even more complex: the complex overlay topologies add overhead due to application-level processing at each hop. Thus, one could try to use a simple one-hop topology under normal load and add more complex request routing dynamically once the danger of a “miss storm” is detected. This in turn opens a range of interesting algorithmic questions involved in deciding when to start forming a complex topology and how to form it.

This overview is necessarily brief. Its goal is only to convey the fact that content delivery networks represent an important aspect of Internet infrastructure and a rich environment for research and innovation. We refer the reader to more targeted literature, such as [24, 49, 64].

7.5 Attack-Resilient Services

We have seen that overlay systems provide resilience by design: the lack of centralized entities naturally provides a measure of resilience against component failures. Overlay systems can also form the building block for systems that are resilient to malicious attacks. SOS [28] and a subsequent derivative, Mayday [3], are two overlay systems that provide denial-of-service protection for Internet services. We discuss SOS next.

7.5.1 Case Study: Secure Overlay Service (SOS)

Secure Overlay Services (SOS) is an overlay network designed to protect a server (the target) from distributed denial of service attacks. SOS enables a “confirmed” user to communicate with the protected service. Conceptually, the service is protected by a “ring” of SOS overlay nodes, which are able to confirm incoming requests as valid.
Once a request is validated, it is forwarded on to the service. Users, by themselves, are not able to directly communicate with the service (initially); in fact, the protected server’s address may be hidden or changing.

SOS forms a distributed firewall around the target server. The server advertises the SOS overlay nodes (called Secure Overlay Access Points [SOAPs]) as its initial point of contact. Users initiate contact with the server by connecting to one of the SOS overlay nodes. Malicious users may attack overlay nodes, but by assumption are not able to bring down the entire overlay. The server’s ISP filters all packets to the server’s address, except for those from a chosen few nodes (which are allowed to traverse this firewall). These privileged nodes are called “secret servlets”. Secret servlets designate a few SOAP nodes (called Beacons) as the rendezvous point between themselves and incoming connections. Regular SOAP nodes use an overlay routing protocol (such as Chord) to route authenticated requests to the Beacons. Beacons know of and forward requests to the secret servlets. Only secret servlets are allowed through the ISP firewall around the target, and the servlets finally forward the authenticated request to the protected server.

7.6 File-Sharing Peer-to-Peer Networks

Consider the task of distributing a large file (e.g., on the order of hundreds of MB) to a large number of users. We already discussed one overlay approach – CDNs – targeting such an application. However, the CDN approach requires the source of the file to subscribe to CDN services (and pay the resultant service fees). Furthermore, this approach requires a CDN company to be vigilant in provisioning enough resources to keep up with the potential scale of downloads involved. Peer-to-peer networks provide an appealing alternative, which organizes users themselves into an overlay distribution platform. This approach is appealing to content providers because it does not require a CDN subscription.
It also scales naturally with the popularity of a download: the more users are downloading a file, the more resources take part in the overlay distribution network, adding capacity to the delivery platform. Some peer-to-peer networks also provide administrative resiliency, as they have no special centralized administrative component. In fact, the utilization of the client upload bandwidth and CPU capacity in content delivery can also make P2P techniques interesting as an adjunct (rather than an alternative) to a CDN service.

In this section, we will concentrate on the unique challenges that arise when a P2P system downloads a large file (e.g., on the order of hundreds of MB). In particular, we will consider the following two challenges:

Block Distribution. Imagine a flash crowd downloading a 100 MB software package. A naive approach (pursued by early P2P networks) would let each peer download the entire file and then make itself available as a source of this file for other peers. This approach, however, would not be able to sustain a flash crowd. Indeed, each peer would take a long time – tens of minutes over a typical residential broadband connection – to download this file, and in the meantime the initial file source would have no help in coping with the demand. The solution is to chop the file into blocks and distribute different blocks to different peers, so that they can start using each other sooner for block distribution. But this creates an interesting challenge. Obviously, the system needs to make a diverse set of blocks available as quickly as possible, so that each peer has a better chance of finding another peer from which to obtain missing blocks. But achieving this diversity is difficult when no peer possesses global knowledge about block distribution at any point in time.
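A back-of-the-envelope calculation illustrates why chopping the file into blocks lets peers help each other sooner. The sketch below is only arithmetic under stated assumptions: the 1 Mbps access rate and the 0.25 MB block size are hypothetical values picked for illustration, not figures from the chapter or from any particular P2P client.

```python
import math


def transfer_seconds(size_mb: float, rate_mbps: float) -> float:
    """Time to move size_mb megabytes over a link of rate_mbps megabits per second."""
    return size_mb * 8.0 / rate_mbps


def block_count(file_mb: float, block_mb: float) -> int:
    """Number of blocks needed to cover the file."""
    return math.ceil(file_mb / block_mb)


FILE_MB = 100.0   # the 100 MB package from the example in the text
BLOCK_MB = 0.25   # hypothetical block size
RATE_MBPS = 1.0   # hypothetical residential access rate

# Naive approach: a peer can serve others only after it holds the whole file.
whole_file_wait = transfer_seconds(FILE_MB, RATE_MBPS)

# Block-based approach: a peer can serve others once it holds one complete block.
first_block_wait = transfer_seconds(BLOCK_MB, RATE_MBPS)

if __name__ == "__main__":
    print(f"blocks in file:           {block_count(FILE_MB, BLOCK_MB)}")
    print(f"wait with whole file (s): {whole_file_wait:.0f}")
    print(f"wait with one block (s):  {first_block_wait:.0f}")
```

Under these assumptions a peer becomes useful to others after seconds rather than many minutes, which is what allows a swarm to absorb a flash crowd instead of leaving the original source to cope alone.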
Free Riders. A particularly widespread phenomenon is that of selfish peers: peers that attempt to make use of the peer content delivery without contributing their own resources. These peers are called “free riders”. More generally, a peer may try to bypass fairness mechanisms in the P2P network and obtain more than its share of resources, thus getting better service at the expense of other users.

We will consider these two challenges in the context of the mesh model of content distribution. Using the terminology of BitTorrent – a popular P2P network – the key components of a mesh P2P network are seeds, trackers, and peers (or leechers). Originally the file exists at the source server (or servers) called seeds. There is a special tracker node that keeps track of at least some subset of the peers who are in the process of downloading the file. A new peer joins the download (a swarm) by contacting the tracker, obtaining a random subset of existing peers, and establishing P2P connections (i.e., overlay network links) with them. The download makes collective progress by peers exchanging missing blocks along the overlay edges. Having completed the download, a peer may stay in the swarm as a seed, uploading without downloading anything more in return.

7.6.1 Block Distribution Problem

BitTorrent attempts to achieve a uniform distribution of blocks (or “pieces”: a set of blocks in BitTorrent) among the peers through localized decisions. Neighboring peers exchange lists of blocks that they already have. A peer determines which of the blocks it is missing are the rarest in its local neighborhood and requests these blocks first. Because the neighborhoods in the BitTorrent protocol evolve over time, the rarest-first block distribution leads to more uniform distribution of blocks in the network and to better chance of a peer finding a useful block without contacting the source.
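The local rarest-first decision described above can be sketched as follows. This is an illustrative simplification rather than the behavior of any specific BitTorrent client: the function name, the data layout (block indices as integers, neighbors keyed by a string identifier), and the random tie-breaking are assumptions made for the example.

```python
import random
from collections import Counter
from typing import Dict, Optional, Set


def rarest_first(my_blocks: Set[int],
                 neighbor_blocks: Dict[str, Set[int]],
                 rng=random) -> Optional[int]:
    """Pick a missing block held by the fewest neighbors.

    my_blocks       -- indices of blocks this peer already has
    neighbor_blocks -- block lists announced by each neighbor
    Returns None if no neighbor has a block this peer is missing.
    """
    # Count, for each block we lack, how many neighbors announce it.
    availability = Counter()
    for blocks in neighbor_blocks.values():
        availability.update(blocks - my_blocks)

    if not availability:
        return None

    # Keep only the blocks with the lowest availability in the neighborhood.
    lowest = min(availability.values())
    rarest = sorted(b for b, n in availability.items() if n == lowest)

    # Assumed tie-break: choose uniformly at random among equally rare blocks.
    return rng.choice(rarest)


if __name__ == "__main__":
    have = {0}
    neighbors = {"a": {0, 1, 2}, "b": {1, 2, 3}, "c": {1}}
    print(rarest_first(have, neighbors))
```

In the usage example, block 3 is announced by a single neighbor, so it is requested before blocks 1 and 2. Note that the decision uses only the peer's own neighborhood, which matches the point that no peer has global knowledge of block distribution.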
Recently, an ingenious alternative to the BitTorrent protocol has been proposed, which removes the issue of choosing the blocks completely [21]. This new approach, called Avalanche, follows the same mesh model with seeds, trackers, and peers as BitTorrent. However, Avalanche makes virtually every block useful to any peer through network coding as follows. Peers no longer choose a single, original block to download from their neighbors at a time. Instead, every time a peer uploads a block to a neighbor, it simply computes a linear combination of all the blocks it currently has from a given file using random coefficients, and uploads the result along with auxiliary information, derived from the coefficients it used and those previously received with its own downloaded blocks. Once a peer collects enough encoded blocks (usually the same number as the number of blocks in the file), it can reconstruct the original file by solving a system of linear equations. A system implementing these ideas has been publicly available as Microsoft Secure Content Downloader since 2007, although the original author of BitTorrent raised questions about the importance of the removal of the block distribution problem in practice and the possible performance overhead involved [17]. These concerns have been reflected in recent empirical studies demonstrating that BitTorrent’s rarest-first piece selection strategy effectively provides block uniformity [30].

7.6.2 Free Riders Problem: Upload Incentives

To improve its resiliency to free riding, BitTorrent utilizes an incentives mechanism. The goal of this mechanism is to ensure that peers who contribute more to content upload receive better download service. Just like its approach to the block distribution problem, BitTorrent implements its incentives mechanism largely through localized decisions by each peer, using a round-based unchoking algorithm to decide how much to send to its neighbors.
When a peer learns a set of other peers from the file’s tracker (usually around 30–50), the peer starts by establishing connections to these peers, some of which will agree to send blocks to the peer. At the end of every unchoking round (10 s in most BitTorrent clients), the peer decides which of the peers it should upload blocks to in the next round. To this end, the peer considers the throughput of its download from the peers in the previous round and selects a small number (four in Azureus, a popular BitTorrent client implementation) of peers to which it will upload blocks in the next round. Selecting a peer for uploading is called “unchoking” a peer. In addition to unchoking the top four peers from which it has received the most, a peer also unchokes another peer at random in each round. This helps the peer to bootstrap new peers, to discover potentially higher-performing peers, and to ensure that every peer, even with poor connectivity, makes some progress; without this “optimistic unchoking,” these impoverished peers would end up choked by everybody. Except for optimistic unchoking, a peer only uploads to other peers if they have blocks that it does not. If two peers have blocks that the other lacks, the peers are said to be interested in one another.

This protocol works because a free rider will end up being choked by most of its neighbors, relying only on random unchokes to make any progress. However, recent work [48] has found that the BitTorrent protocol penalizes high-capacity peers: as the upload performance of a peer increases, its download performance grows, but less than proportionally to the upload contribution. In other words, the protocol is not entirely tit-for-tat in the usual sense of the term. Consequently, a new BitTorrent client called BitTyrant has been implemented that improves the download performance of high-capacity peers [48]. BitTyrant achieves this goal by exploiting the following observation.
Regular BitTorrent peers allocate their upload capacity equally among their unchoked neighbors. Because of this, a strategic peer does not need to upload to regular peers at its maximum capacity: it only needs to upload faster than most of its peers’ other neighbors, so that its peers would keep it unchoked. Thus, the key idea behind the BitTyrant client is to keep an estimate of the individual upload rates to its neighbors that is sufficient to stay in the neighbors’ unchoked set most of the time, and to upload to each neighbor at just that rate. Then, BitTyrant uses the spared upload capacity to unchoke more peers and hence to increase its download performance. Furthermore, the BitTyrant client selects only the peers with the highest return-on-investment: those peers whose data capacity can be obtained “cheaply.” The authors of BitTyrant observed significant reduction in file download times by their modified client. However, if all clients adopted selfish BitTyrant behavior with cut-off of expensive peers as mentioned above, the overall performance for all clients would decrease, especially for low-capacity clients. Thus, while discouraging free riding, BitTorrent still relies on altruistic contribution of high-capacity peers to achieve its performance. Although BitTorrent’s unchoking algorithm of giving to the top-four contributors has been broadly described as being tit-for-tat, recent work has shown that it is more accurately represented as an auction [32]. Each unchoking round can be viewed as an auction, where the “bids” are other peers’ uploads in previous rounds, and the “good” being auctioned is the peer’s upload bandwidth. Viewed this way, BitTyrant’s strategy of “coming in the last (winning) place” is easily seen as the clear winning strategy. Also by reframing BitTorrent as an auction, a solution to strategic attempts like BitTyrant arises: change the way peers “clear” their auction. 
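The auction view of a standard unchoking round can be sketched as follows: the bids are the rates received from neighbors in the previous round, the top bidders win, one additional peer is unchoked optimistically, and the upload capacity is split equally among those unchoked. This is a simplified illustration, not the code of any BitTorrent client; the function name, the parameter names, and the decision to give the optimistic slot the same share as the winners are assumptions.

```python
import random
from typing import Dict


def unchoke_round(received: Dict[str, float],
                  upload_capacity: float,
                  slots: int = 4,
                  rng=random) -> Dict[str, float]:
    """Decide next round's upload allocation from last round's 'bids'.

    received        -- rate received from each interested neighbor last round
    upload_capacity -- total upload rate this peer is willing to spend
    slots           -- number of top contributors to unchoke (four in the text)
    Returns a mapping from unchoked neighbor to its allotted upload rate.
    """
    # Rank neighbors by what they gave us; the top `slots` win the auction.
    ranked = sorted(received, key=lambda peer: received[peer], reverse=True)
    unchoked = ranked[:slots]
    choked = ranked[slots:]

    # Optimistic unchoke: one otherwise choked neighbor, chosen at random.
    if choked:
        unchoked.append(rng.choice(choked))

    if not unchoked:
        return {}

    # Equal split: a winner's share does not depend on how much it bid.
    share = upload_capacity / len(unchoked)
    return {peer: share for peer in unchoked}


if __name__ == "__main__":
    bids = {"a": 100.0, "b": 90.0, "c": 80.0, "d": 70.0, "e": 10.0, "f": 5.0}
    print(unchoke_round(bids, upload_capacity=50.0))
```

In the usage example, neighbor "d" bids 70 and neighbor "a" bids 100, yet both receive the same share. This is the property BitTyrant exploits: finishing in the last winning place costs less and pays the same.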
A new client has been introduced that replaces BitTorrent’s top-four strategy with a proportional share response. Proportional share is a simple strategy: if a peer has given some fraction, say 10%, of all of the blocks you received in the previous round, then allot to that peer the same fraction, 10%, of your upload bandwidth. Note that this does not necessarily result in peers providing the same number of blocks in return, but rather the same fraction of bandwidth. This results in what turns out to be a very robust form of fairness: the more a peer gives, the more that peer gets. Even highly provisioned peers therefore have an incentive to contribute as much of their bandwidth as possible. The authors of this PropShare client have demonstrated that proportional share is resilient to a wide array of strategic manipulation. Further, PropShare outperforms BitTorrent and BitTyrant, and as more users adopt the PropShare client, the overall performance of the system improves. This work demonstrates the importance of an accurate model of incentives in a complex system such as BitTorrent.

A strategic peer can achieve higher download performance by manipulating the list of blocks it announces to its neighbors [32]. Suppose node p in a BitTorrent swarm possesses some rare blocks. Since p has rare blocks, it is going to be interesting to many of its neighbors, who will all want to upload blocks to p in exchange for these rare blocks. However, once p announces these rare blocks, p’s neighbors will download these blocks from p and exchange them amongst themselves. Node p can sustain interest amongst its neighbors longer by under-reporting its block map, in particular, by strategically revealing the rare blocks one by one. This strategy guarantees that p remains interesting for longer, since p’s neighbors, who all get the same rare block from p, cannot benefit by exchanging amongst themselves. This observation suggests a general under-reporting strategy.
A node can remain interesting to its neighbors longest by announcing only the blocks necessary to maintain interest but no more. Similar to an all-BitTyrant strategy, when all peers strategically under-report their blocks in this manner [32], the overall performance of the system degrades.

In general, BitTorrent’s incentives mechanisms have come under intense scrutiny. Through rich empirical studies and analyses that incorporate various economic principles, BitTorrent continues to grow more robust to cheating clients. Whether a system as complex as BitTorrent can be made fully robust to such users remains open.

7.7 Conclusion

This chapter considers ways by which overlay-based techniques improve application resiliency. We have described how applications can utilize overlay networks to better cope with challenges such as flash crowds, the need to scale to often unpredictable loads, network failures and congestion, and denial of service attacks. We have considered a representative sample of these applications, focusing on their use of overlay network concepts. This sample included distributed hash tables, network storage, large file distribution by peer-to-peer networks, streaming content delivery, content delivery networks, and web services. It is simply not feasible to comprehensively cover overlay applications and research within one chapter. Instead, we hope that this chapter conveys sufficient information to give the reader a sampling of the various application domains where overlays are useful, and a sense for the flexibility that overlay networks provide to an application designer.

Acknowledgments The authors thank Katrina LaCurts, Dave Levin, and Adam Bender for their comments on this chapter. The authors are grateful to the editors, Chuck Kalmanek and Richard Yang, for their comments and encouragement.

References

1. Akamai Technologies. Retrieved from http://www.akamai.com/html/technology/index.html
2. Alzoubi, H. A., Lee, S., Rabinovich, M., Spatscheck, O., & Van der Merwe, J. (2008). Anycast cdns revisited. In Proceedings of WWW ’08 (pp. 277–286). New York, NY: ACM. DOI http://doi.acm.org/10.1145/1367497.1367536
3. Andersen, D. G. (2003). Mayday: Distributed filtering for Internet services. In USITS.
4. Andersen, D. G., Balakrishnan, H., Kaashoek, M. F., & Morris, R. (2001). Resilient overlay networks. In Proceedings of 18th ACM SOSP, Banff, Canada.
5. ATT ICDS. Retrieved from http://www.business.att.com/service fam overview.j-sp?serv fam=eb intelligent content distribution
6. Balakrishnan, H., Lakshminarayanan, K., Ratnasamy, S., Shenker, S., Stoica, I., & Walfish, M. (2004). A layered naming architecture for the Internet. In Proceedings of the ACM SIGCOMM, Portland, OR.
7. Ballani, H., Francis, P., & Ratnasamy, S. (2006). A measurement-based deployment proposal for IP anycast. In Proceedings of the ACM IMC, Rio de Janeiro, Brazil.
8. Banerjee, S., Bhattacharjee, B., & Kommareddy, C. (2002). Scalable application layer multicast. In Proceedings of ACM SIGCOMM, Pittsburgh, PA.
9. Banerjee, S., Lee, S., Bhattacharjee, B., & Srinivasan, A. (2003). Resilient multicast using overlays. In Proceedings of the Sigmetrics 2003, Karlsruhe, Germany.
10. Banerjee, S., Lee, S., Bhattacharjee, B., & Srinivasan, A. (2006). Resilient overlays using multicast. IEEE/ACM Transactions of Networking, 14(2), 237–248.
11. Bender, A., Sherwood, R., Monner, D., Goergen, N., Spring, N., & Bhattacharjee, B. (2009). Fighting spam with the NeighborhoodWatch DHT. In INFOCOM.
12. Castro, M., Druschel, P., Ganesh, A. J., Rowstron, A. I. T., & Wallach, D. S. (2002). Secure routing for structured peer-to-peer overlay networks. In OSDI.
13. Castro, M., Druschel, P., Kermarrec, A., Nandi, A., Rowstron, A., & Singh, A. (2003). Splitstream: High-bandwidth multicast in a cooperative environment. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003), Lake Bolton, NY.
14. Castro, M., Druschel, P., Kermarrec, A. M., & Rowstron, A. (2002). Scribe: A large-scale and decentralized application-level multicast infrastructure. IEEE Journal on Selected Areas in Communication, 20(8), 1489–1499. DOI 10.1109/JSAC.2002.803069
15. Chu, Y., Ganjam, A., Ng, T., Rao, S., Sripanidkulchai, K., Zhan, J., & Zhang, H. (2004). Early experience with an Internet broadcast system based on overlay multicast. In Proceedings of USENIX Annual Technical Conference, Boston, MA.
16. Chun, B., Culler, D., Roscoe, T., Bavier, A., Peterson, L., Wawrzoniak, M., & Bowman, M. (2003). Planetlab: An overlay testbed for broad-coverage services. SIGCOMM Computer Communication Review, 33(3), 3–12.
17. Cohen, B. Avalanche. Retrieved from http://bramcohen.livejournal.com/20140.html
18. Dabek, F., Kaashoek, M. F., Karger, D. R., Morris, R., & Stoica, I. (2001). Wide-area cooperative storage with cfs. In SOSP (pp. 202–215).
19. Dahlin, M. (2000). Interpreting stale load information. IEEE Transactions on Parallel and Distributed Systems, 11(10), 1033–1047.
20. Fiat, A., Saia, J., & Young, M. (2005). Making chord robust to Byzantine attacks. In ESA.
21. Gkantsidis, C., & Rodriguez, P. (2005). Network coding for large scale content distribution. In INFOCOM (pp. 2235–2245).
22. Gopalakrishnan, V., Bhattacharjee, B., Ramakrishnan, K. K., Jana, R., & Srivastava, D. (2009). Cpm: Adaptive video-on-demand with cooperative peer assists and multicast. In Proceedings of INFOCOM, Rio De Janeiro, Brazil.
23. Gupta, I., Birman, K. P., Linga, P., Demers, A. J., & van Renesse, R. (2003). Kelips: Building an efficient and stable p2p dht through increased memory and background overhead. In IPTPS (pp. 160–169).
24. Hofmann, M., & Beaumont, L. R. (2005). Content networking: Architecture, protocols, and practice. San Francisco, CA: Morgan Kaufmann.
25. Iyer, S., Rowstron, A. I. T., & Druschel, P. (2002). Squirrel: A decentralized peer-to-peer web cache. In PODC (pp. 213–222).
26. Jannotti, J., Gifford, D., Johnson, K. L., Kaashoek, M. F., & O’Toole, J. W., Jr. (2000). Overcast: Reliable multicasting with an overlay network. In Proceedings of the Fourth Symposium on Operating System Design and Implementation (OSDI), San Diego, CA.
27. Jung, J., Krishnamurthy, B., & Rabinovich, M. (2002). Flash crowds and denial of service attacks: Characterization and implications for cdns and web sites. In WWW (pp. 293–304).
28. Keromytis, A. D., Misra, V., & Rubenstein, D. (2002). SOS: Secure overlay services. In SIGCOMM.
29. Kostic, D., Rodriguez, A., Albrecht, J., & Vahdat, A. (2003). Bullet: High bandwidth data dissemination using an overlay mesh. In Proceedings of SOSP (pp. 282–297), Lake George, NY.
30. Legout, A., Urvoy-Keller, G., & Michiardi, P. (2006). Rarest first and choke algorithms are enough. In IMC.
31. Lemos, R. Blue security folds under spammer’s wrath. Retrieved from http://www.securityfocus.com/news/11392
32. Levin, D., LaCurts, K., Spring, N., & Bhattacharjee, B. (2008). Bittorrent is an auction: Analyzing and improving bittorrent’s incentives. In SIGCOMM (pp. 243–254).
33. Li, B., Xie, S., Qu, Y., Keung, G., Lin, C., Liu, J., & Zhang, X. (2008). Inside the new coolstreaming: Principles, measurements and performance implications. In Proceedings of the INFOCOM 2008, Phoenix, AZ (pp. 1031–1039).
34. Li, B., Yik, K., Xie, S., Liu, J., Stoica, I., Zhang, H., & Zhang, X. (2007). Empirical study of the coolstreaming system. Proceedings of the IEEE Journal on Selected Areas in Communication (Special Issues on Advance in Peer-to-Peer Streaming Systems), 25(9), 1627–1639.
35. Limelight Networks. Retrieved from http://www.limelightnetworks.com/network.htm
36. Linga, P., Gupta, I., & Birman, K. (2003). A churn-resistant peer-to-peer web caching system. In 2003 ACM Workshop on Survivable and Self-Regenerative Systems (pp. 1–10).
37. Locher, T., Meier, R., Schmid, S., & Wattenhofer, R. (2007). Push-to-pull peer-to-peer live streaming. In Proceedings of the International Symposium of Distributed Computing, Lemesos, Cyprus.
38. Lumezanu, C., Baden, R., Levin, D., Spring, N., & Bhattacharjee, B. (2009). Symbiotic relationships in internet routing overlays. In Proceedings of NSDI, Boston, MA.
39. Magharei, N., & Rejaie, R. (2007). PRIME: Peer-to-peer receiver-drIven MEsh-based streaming. In Proceedings of the INFOCOM 2007, Anchorage, Alaska (pp. 1424–1432).
40. Magharei, N., Rejaie, R., & Guo, Y. (2007). Mesh or multiple-tree: A comparative study of live p2p streaming approaches. In Proceedings of the INFOCOM 2007, Anchorage, Alaska.
41. Malkhi, D., Naor, M., & Ratajczak, D. (2002). Viceroy: A scalable and dynamic emulation of the butterfly. In PODC (pp. 183–192).
42. Mao, Z. M., Cranor, C. D., Douglis, F., Rabinovich, M., Spatscheck, O., & Wang, J. (2002). A precise and efficient evaluation of the proximity between web clients and their local dns servers. In USENIX Annual Technical Conference (pp. 229–242).
43. Morselli, R., Bhattacharjee, B., Marsh, M. A., & Srinivasan, A. (2007). Efficient lookup on unstructured topologies. IEEE Journal on Selected Areas in Communications, 25(1), 62–72.
44. Nakao, A., Peterson, L., & Bavier, A. (2006). Scalable routing overlay networks. SIGOPS Operating Systems Review, 40(1), 49–61.
45. Padmanabhan, V., Wang, H., Chou, P., & Sripanidkulchai, K. (2002). Distributing streaming media content using cooperative networking. In NOSSDAV, Miami Beach, FL, USA.
46. Pai, V., Kumar, K., Tamilmani, K., Sambamurthy, V., & Mohr, A. (2005). Chainsaw: Eliminating trees from overlay multicast. In IPTPS 2005, Ithaca, NY, USA.
47. Pianese, F., Perino, D., Keller, J., & Biersack, E. (2007). PULSE: An adaptive, incentive-based, unstructured p2p live streaming system. IEEE Transactions on Multimedia, 9(8), 1645–1660.
48. Piatek, M., Isdal, T., Anderson, T. E., Krishnamurthy, A., & Venkataramani, A. (2007). Do incentives build robustness in bittorrent? (awarded best student paper). In NSDI.
49. Rabinovich, M., & Spatscheck, O. (2001). Web caching and replication. Boston, MA: Addison-Wesley Longman Publishing Co., Inc.
50. Ramasubramanian, V., & Sirer, E. G. (2004). Beehive: O(1) lookup performance for power-law query distributions in peer-to-peer overlays. In NSDI (pp. 99–112).
51. Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (2001). A scalable content-addressable network. In SIGCOMM.
52. Rhea, S., Geels, D., Roscoe, T., & Kubiatowicz, J. (2004). Handling churn in a dht. In USENIX Annual Technical Conference.
53. Rhea, S. C., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S., Stoica, I., & Yu, H. (2005). Opendht: A public dht service and its uses. In SIGCOMM (pp. 73–84).
54. Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM Middleware 2001, Heidelberg, Germany.
55. Savage, S., Anderson, T., Aggarwal, A., Becker, D., Cardwell, N., Collins, A., Hoffman, E., Snell, J., Vahdat, A., Voelker, G., & Zahorjan, J. (1999). Detour: A case for informed internet routing and transport. IEEE Micro, 19(1), 50–59.
56. Savage, S., Collins, A., Hoffman, E., Snell, J., & Anderson, T. (1999). The end-to-end effects of Internet path selection. In SIGCOMM.
57. Shaikh, A., Tewari, R., & Agrawal, M. (2001). On the effectiveness of DNS-based server selection. In Proceedings of IEEE Infocom, Anchorage, Alaska.
58. Stoica, I., Adkins, D., Zhuang, S., Shenker, S., & Surana, S. (2002). Internet indirection infrastructure. In SIGCOMM (pp. 73–86).
59. Stoica, I., Morris, R., Karger, D. R., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM (pp. 149–160).
60. Stoica, I., Morris, R., Liben-Nowell, D., Karger, D. R., Kaashoek, M. F., Dabek, F., & Balakrishnan, H. (2003). Chord: A scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking, 11(1), 17–32.
61. Tran, D., Hua, K., & Do, T. (2003). ZIGZAG: An efficient peer-to-peer scheme for media streaming. In Proceedings of the INFOCOM 2003, San Francisco, CA.
62. Venkataraman, V., Francis, P., & Calandrino, J. (2006). Chunkyspread: Multi-tree unstructured peer-to-peer multicast. In Proceedings of the 1st International Workshop on Peer-to-Peer Systems (IPTPS ’06), Santa Barbara, CA.
63. Verkaik, P., Pei, D., Scholl, T., Shaikh, A., Snoeren, A., & Van der Merwe, J. (2007). Wresting control from BGP: Scalable fine-grained route control. In 2007 USENIX Annual Technical Conference.
64. Verma, D. C. (2001). Content distribution networks: An engineering approach. New York: Wiley.
65. Wang, F., Xiong, Y., & Liu, J. (2007). mTreebone: A hybrid tree/mesh overlay for application-layer live video multicast. In Proceedings of the ICDCS 2007, Toronto, Canada.
66. Wang, P., Hopper, N., Osipkov, I., & Kim, Y. (2006). Myrmic: Secure and robust DHT routing. Technical Report, University of Minnesota.
67. Yang, M., & Fei, Z. (2004). A proactive approach to reconstructing overlay multicast trees. In Proceedings of the IEEE Infocom 2004, Hong Kong.
68. Zhang, M., Luo, J., Zhao, L., & Yang, S. (2005). A peer-to-peer network for live media streaming – Using a push-pull approach. In Proceedings of the ACM Multimedia, Singapore.
69. Zhang, X., Liu, J., Li, B., & Yum, T. (2005). Donet: A data-driven overlay network for efficient live media streaming. In Proceedings of the INFOCOM 2005, Miami, FL.

Part IV Configuration Management

Chapter 8 Network Configuration Management

Brian D. Freeman

8.1 Introduction

This chapter will discuss network configuration management by presenting a high-level view of the software systems that are involved in managing a large network of routers in support of carrier class services.
It is meant to be an overview, highlighting the major areas that a network operator should assess while designing or buying a configuration management system, and not the final source of all information needed to build such a system.

When a service and its network are small, network configuration management is typically done manually by a knowledgeable technician, with some form of workflow to get the data needed for configuration tasks from the sales group. Inventory tracking may be handled by simply inserting comments into the interface description fields on the router and perhaps by maintaining some spreadsheets on a file server. The technician might or might not use an element management system (EMS) to make the configuration changes. If the network is new, for example, supporting the needs of a small company or the network needs of an “Internet startup,” most of the configuration tasks represent a “new order.” Configuration requests occur at low volume, and the technician probably has a great deal of flexibility in how he or she goes about meeting the needs of the new network service.

As the number of users of the service grows, the expectations placed on the network operator to meet a certain level of reliability and performance grow accordingly. In time, because of growth in the sheer volume of orders, the single knowledgeable worker becomes a department, and “change orders” that modify the configuration associated with an existing customer of the network become a larger and larger share of the effort. At this point, the network may contain multiple types of routers purchased from different vendors, each of which has different features and resource limits. Changes made to a router configuration to support one customer can now affect another customer. For example, if one customer’s configuration change causes a router resource such as table size to be exceeded, multiple customers might be affected. In addition, other departments or areas within the business now need data on the installed inventory to drive customer reporting, usage-based billing, ticketing, etc. Finally, as the volume grows, there is a need for automation or “flow through provisioning” to both reduce cost/time and protect against mistakes. The simple, manual approaches no longer work: an end-to-end view is needed for network configuration management so that all the pieces required to support the business can be integrated.

B.D. Freeman, AT&T Labs, Middletown, NJ, USA, e-mail: bdfreeman@att.com
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_8, © Springer-Verlag London Limited 2010

This chapter provides an overview of the elements of a robust network configuration management system. There are many goals for such a system, but the primary goal of any network configuration management system is to protect the network while providing the ordered service for the customers. Since changing the network configuration can cause outages if not done correctly, a key requirement of a network configuration management system is to ensure that the configuration changes do not destabilize the network. The system must provide the ordered service for the customer without affecting other customers, other ports associated with the customer being provisioned, or the network at large.

The network configuration management system is also typically the primary source of data (the source of truth) used by many business systems and processes that surround the network.
The functions that depend on configuration data range from the mundane, such as trouble ticketing and spare part tracking, to more sophisticated capabilities like traffic reporting, for which the association of ports to customers must be obtained so that traffic reports can be properly displayed on the customer service portal.

Finally, the network configuration management system is the enforcer of the engineering rules that specify the maximum safe resources to be consumed on the routers for various features. As such, in addition to protecting the network, the system also impacts profitability, since inventory is used either efficiently or inefficiently. This depends on how good the configuration management system is at implementing the engineering rules as well as how good it is at processing service cancellation or disconnect requests in a timely fashion. If the configuration management system does not properly return a port that is no longer in service to the inventory available for new requests, expensive router hardware can be stranded indefinitely.

In summary, the primary goal of a network configuration management system is to manage router configurations to support customer service, subject to three key secondary goals:

• Protect the network
• Be the source of truth about the network
• Enforce the business and engineering rules

To explore this topic further, Section 8.2 first reviews some key concepts that help structure the types of data items the system must deal with. Section 8.3 describes the subcomponents of the system and the unique requirements of each subcomponent. This section also discusses the two approaches that are commonly used for router configuration, policy-based and template-based, since this is a key aspect of the problem to be solved.
Section 8.3 also touches on the differences between provider-edge (PE) and customer-edge (CE) router configuration tasks and the differences between consumer and enterprise IP router services in their typical approaches to configuration management. We present a brief overview of provisioning audits, which are discussed in more detail in Chapter 9. Provisioning audits are important to ensure that the network configuration management system remains a good source of truth for the other systems and business processes that need data about the network. Finally, one of the key challenges in a large network is handling changes, ranging from an isolated change to a setting on an individual customer’s interface to more complex changes such as bulk changes to a large number of routers and interfaces. To illustrate these issues, Section 8.4 discusses the data model and process issues associated with moving a working connection from one configuration to the next. This section also touches on some typical network maintenance activities that impact a system in different ways than customer provisioning does. Section 8.5 shows a complete step-by-step example of provisioning a port order.

8.2 Key Concepts

There are two important types of data that a network configuration management system must handle: physical inventory data and logical inventory data. In addition to these data types, the system has to be designed to appropriately handle and resolve data discords between the state of the network (“What it is”) and the view of the network that is contained in the network inventory database (“What it should be”). This section introduces these concepts.

8.2.1 Physical Inventory

The physical inventory database, as the name implies, records the network hardware that is deployed in the field.
The basic unit is usually a chassis with a set of components, including common elements like route processor cards or power supplies, and line cards with transport interfaces that support one or more customer “ports.” These ports carry the customer-facing and backbone-facing traffic. Line cards that support multiple customer ports are often referred to as channelized interfaces (e.g., channelized T3 cards or channelized OC48 cards). The physical inventory database keeps track of whether the subchannels on these line cards are assigned to a customer, with a state for each channel of “assigned” or “unassigned.” The data model for physical inventory often reflects the physical world, in which cards are contained in a chassis and a chassis is contained in a cabinet. Each customer port is associated with a subchannel on a physical interface.

8.2.2 Logical Inventory

The logical inventory database includes the inventory data that are not physical. This is a broad and less rigid category of information, since it includes multiple database entities with ephemeral ties to the physical inventory. An IP address is a good example of a database entity with an ephemeral tie. IP addresses exist on an interface, but we can move addresses to ports on another router; hence, an address is not permanently tied to a single piece of physical equipment. Many logical components are inventoried as database entities and assigned as needed by the carrier. IP addresses, VLAN tags, BGP community strings [1], and Autonomous System Numbers (ASNs) [2] are all examples of logical data that need to be tracked and managed.

Generally, logical inventory assigned to a customer is associated with a particular piece of physical inventory. However, the association can change over time. A good example of a change in the association between physical and logical inventory occurs when a customer’s connection is upgraded from a T1 to a T3.
The physical inventory will change drastically, but the logical inventory in terms of the IP address, BGP routing, and QoS settings may not change. It is also useful to understand that some logical inventory, like an IP address, is associated with a single piece of equipment, while other logical inventory, like MPLS Route Distinguishers and Route Targets, is “network wide” and is associated with multiple pieces of equipment.

8.2.3 Discords: What It Is Versus What It Should Be?

Data discords are a fact of life in production systems. Through a variety of means, the data in the network and the data in the inventory system get out of sync. In plain language, a situation is created where the inventory view of the world, “what it should be,” does not match the truth, that is, the network view of the world, “what it is.”

Both physical and logical inventory can contain discords. Generally, physical inventory discords arise when card replacements or initial installation errors occur without a corresponding update of the database. For example, a discord would occur if a 4-port Ethernet card was replaced with an 8-port Ethernet card, but the database was not updated. Autodiscovery of hardware components can greatly assist in reducing the data discords in the physical inventory. Many production systems back up the router configuration daily and use commands from the vendor to collect detailed firmware and hardware data from the equipment. The command “show diag” dumps this kind of detailed information, and the output can be saved to a file. Very accurate physical inventory information can be obtained by parsing the output of hardware-reporting commands run on the router, such as “show diag,” or by using various SNMP MIB queries. Automatic discovery of physical inventory can reduce the physical discords to zero.
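As a minimal sketch of how physical discords might be reported, the following Python compares an inventory view (“what it should be”) with a discovered view (“what it is”). The slot-to-card dictionaries and the card model names are hypothetical; a real system would populate the discovered view by parsing router output such as “show diag” or the results of SNMP queries, which is not shown here.

```python
def find_discords(should_be, it_is):
    """Compare inventory ("what it should be") with discovery ("what it is").

    Both arguments map a slot identifier to a card model string.
    Returns a list of (slot, expected, actual) tuples ordered by slot,
    where None marks a card that is absent from one of the two views.
    """
    discords = []
    for slot in sorted(set(should_be) | set(it_is)):
        expected = should_be.get(slot)
        actual = it_is.get(slot)
        if expected != actual:
            discords.append((slot, expected, actual))
    return discords


# Hypothetical data: slot 2 was swapped from a 4-port to an 8-port Ethernet
# card, and slot 4 holds a card that was never entered in the database.
inventory = {"1": "RP-A", "2": "ETH-4PORT", "3": "CT3"}
discovered = {"1": "RP-A", "2": "ETH-8PORT", "3": "CT3", "4": "CT3"}

for slot, expected, actual in find_discords(inventory, discovered):
    print(f"slot {slot}: inventory={expected} network={actual}")
```

Detecting the mismatch is the easy part; deciding which view is correct, and updating the other, remains a process decision.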
Many spare part tracking processes depend on the ability to automatically discover changes in serial numbers on components so that failure rates on cards can be tracked and replacement parts restocked as needed. Maintaining control of “What it is” is part of the physical inventory audit process.

Logical inventory discords also happen frequently but are harder to resolve. As an example, if a customer port that is running in the network has static routing and the inventory database indicates that it should be BGP routing, which is correct? Another example of logical inventory discord is a mismatch between the service that the customer currently has and the ordered service. In general, it is easier to detect logical inventory discords than to resolve them. Given their impact on the external support processes and billing, detecting, reporting, and correcting these situations is important.

Another key concept that the industry uses is that “the network is the database.” This concept results from a desire by network operators to use the network configuration as ground truth to drive processes. Most equipment has some mechanism for querying configuration data. However, practical matters require externally accessible views of those data. Fault management, for example, cannot query the network in real time on every SNMP trap that gets generated (this can be thousands per second), so a copy of the configuration data has to reside in a database. Consequently, a process/program to audit and synchronize that data with the network has to be part of the overall network configuration management system.

With these key concepts in mind, we will discuss the elements of a network configuration management system.

8.3 Elements of a Network Configuration Management System

Figure 8.1 provides a high-level view of the elements that make up a network configuration management system.
The external interfaces are to technicians and Operations Support Systems/Business Support Systems (OSS/BSS) at the top and to the Network Elements at the bottom. Each of the major elements inside the system will be addressed in subsequent sections.

Fig. 8.1 High-level view of network configuration management system (blocks shown: Technicians; OSS/BSS (Ordering); GUI; API; OSS/BSS (Maintenance/Inventory); Reports and Feeds; Design & Assign; Physical Inventory Management; Logical Inventory Management; Router Audit with Mediation Layer; Router Configuration with Mediation Layer; Inventory Database; Network Elements)

8.3.1 Inventory Database

A database of the physical and logical inventory is the core of the system. This database will consist of both the real assets purchased and deployed by the corporation (the physical inventory discussed in Section 8.2.1) and the logical assets that need to be tracked (e.g., WAN and LAN IP address assignments, number of QoS connections per router, maximum assigned Virtual Route Forwarding (VRF) tables [3] on the router, etc.).

The database entities have parent/child relationships that form a tree as items are placed in the schema. For example, a complex is a site with a set of cabinets. A cabinet within a site may have multiple chassis or routers. A router has multiple cards, each in a slot on the chassis. A card can have multiple ports. When viewed graphically, this parent/child relationship is a tree with the single complex at the top and the ports at the “leaves” of the tree.

A robust inventory database will have a schema with multiple “regions” of data with linkages between them as needed. One major ISP has an inventory database with over 1,000 tables to handle the inventory and the various applications that deal with the inventory. The two main regions are the physical equipment tree of data (e.g., complex/cabinet/router/slot/port) and the logical inventory tree of data (e.g., customer, premise, service, and connection).
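The two regions can be illustrated with a small Python sketch. It is only an assumption about how such a schema might be modeled in memory: the class and field names are hypothetical, and a production schema would be relational and far larger. The point it shows is that a port refers to a connection only by identifier, so the logical record can be moved to another port without being rewritten.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Port:
    name: str
    assigned_to: Optional[str] = None  # connection id, or None if unassigned


@dataclass
class Card:
    slot: int
    model: str
    ports: List[Port] = field(default_factory=list)


@dataclass
class Router:
    hostname: str
    cards: List[Card] = field(default_factory=list)


@dataclass
class Cabinet:
    name: str
    routers: List[Router] = field(default_factory=list)


@dataclass
class Complex:
    site: str
    cabinets: List[Cabinet] = field(default_factory=list)


@dataclass
class Connection:
    """Leaf of the logical region (customer/premise/service/connection)."""
    customer: str
    premise: str
    service: str
    connection_id: str


def unassigned_ports(complex_):
    """Walk the physical tree and yield (hostname, slot, port name) for free ports."""
    for cabinet in complex_.cabinets:
        for router in cabinet.routers:
            for card in router.cards:
                for port in card.ports:
                    if port.assigned_to is None:
                        yield (router.hostname, card.slot, port.name)


def move_connection(old_port, new_port):
    """Re-home a connection: only the physical-to-logical link changes."""
    if new_port.assigned_to is not None:
        raise ValueError("target port is already assigned")
    new_port.assigned_to, old_port.assigned_to = old_port.assigned_to, None


# Hypothetical example data
conn = Connection("ACME", "ANYTOWN NJ", "svc-1", "conn-1")
card = Card(4, "CT3", [Port("4/0:1", assigned_to=conn.connection_id), Port("4/0:2")])
site = Complex("n54", [Cabinet("cab1", [Router("n54ny01zz1", [card])])])
print(list(unassigned_ports(site)))
```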
The service database entity (one node up from the connection entity in the tree) typically contains the linkage to other logical assignments like Serial IP address, VRF labels, Route Distinguishers [3], Route Targets [3], etc. The reason the data are separated into these regions is to permit the movement of logical assets to different ports (i.e., connections) and to support changes in the physical assets associated with a customer as a result of changes in technology or network-grooming activities. Changes in technology, such as a new router with lower port costs, and network grooming, which moves connections from one router/circuit to another to improve efficiency, are examples of carrier changes that may also affect the data model. These carrier decisions are sometimes even more complex to handle correctly in the inventory database than customer-initiated changes. Without separation of the regions, the ongoing life-cycle management of the service is difficult. For example, at certain points in time, we need to have multiple assets available for testing and move the “active” connection to the new assets only after satisfactory testing has completed. This means that we maintain multiple “services” for the same physical port, both the old service and the future service. The inventory database stores the “What it should be” view for the corporation and the current and future state of the equipment and connections for a customer.

Many subsystems of a configuration management system are dependent on the inventory database. One of the major dependencies is the audit subsystem. The audit subsystem must store information for the physical “What it is” form of the network in a schema. Typically, since audit or discovery starts with the physical assets, the physical inventory model at the router/component level is reused for the “What it is” model.
It is interesting to note that cabinet and location of equipment data are typically not discoverable, so those are usually inferred through naming conventions like the encoding of the router hostname. For example, a router might have a hostname like “n54ny01zz1” where the “n54” indicates a particular office in New York City and “ny” is for New York State. The “01” indicates that it is the first router in the office and the “zz1” would indicate the type or model of the router. The encoding is not an industry standard, but most carriers use something similar.

The logical “What it is” model is also based on the rich “What it should be” model. It is again interesting to note that logical discovery does not have non-network data items like the street address of the customer or other business information. A prudent network operation puts processes in place to encode pertinent information in the interface description line so that linkages to business support systems can be maintained and audited. For example, large carriers tend to automatically encode a customer name and pointers to location records to make it easier to manage events pertinent to the interface in customer care and ticketing systems. The example below shows an active port in maintenance (MNX), for a customer, ACME MARKETING, that is located in ANYTOWN, NJ, on circuit DHEC.123456..ATI. Various database keys are also encoded.

interface Serial4/0.11/8:0
 description MNX | ACME TECHNICAL MARKETING | ANYTOWN | NJ | DHEC.123456..ATI | 19547 | 3933470 | 4151940 | USA | MIS | |

The two main inputs to the inventory database are the physical and logical inventory on the router and the customer order data. The physical and logical router data are typically inserted through the GUI during network setup by the capacity management organization as assets are installed, tested, and made ready for service.
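The hostname and interface-description conventions described above lend themselves to mechanical decoding during an audit. The sketch below is illustrative only: the regular expression assumes the four-part layout of the example hostname, the field separator is assumed to be a vertical bar, and the field names are hypothetical labels for the items the text identifies (status, customer, town, state, circuit). The remaining database keys are kept as an unnamed list because the text does not say what they are.

```python
import re

# Assumed layout, taken from the example hostname "n54ny01zz1".
HOSTNAME_RE = re.compile(
    r"^(?P<office>[a-z]\d{2})"   # office code, e.g. "n54"
    r"(?P<state>[a-z]{2})"       # state, e.g. "ny"
    r"(?P<index>\d{2})"          # router number within the office, e.g. "01"
    r"(?P<model>[a-z0-9]+)$"     # router type or model, e.g. "zz1"
)


def parse_hostname(hostname):
    """Return the decoded hostname fields, or None if the name does not fit."""
    match = HOSTNAME_RE.match(hostname.lower())
    return match.groupdict() if match else None


def parse_description(line, sep="|"):
    """Split an interface description line into named fields plus extra keys."""
    text = line.strip()
    if text.startswith("description"):
        text = text[len("description"):]
    fields = [f.strip() for f in text.split(sep)]
    names = ["status", "customer", "city", "state", "circuit"]
    record = dict(zip(names, fields))
    # Anything after the named fields is kept verbatim, minus empty slots.
    record["keys"] = [f for f in fields[len(names):] if f]
    return record


print(parse_hostname("n54ny01zz1"))
print(parse_description(
    "description MNX | ACME TECHNICAL MARKETING | ANYTOWN | NJ | "
    "DHEC.123456..ATI | 19547 | 3933470 | 4151940 | USA | MIS | |"
))
```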
Another practice in use is to install the equipment and then use the autodiscovery tools to “learn” the equipment’s physical inventory. Logical assets are entered into the system as appropriate, since they are not necessarily tied to the equipment in all cases. The customer order data are usually created through an API from the OSS/BSS during the ordering phase of a customer’s request for service and updated as the order progresses through the business processes from an order to an installed and tested connection.

A note of caution: the amount of customer order data that is replicated into the network configuration management system should be minimized. A good design incorporates just enough to make it easier for people to deal with problems encountered in provisioning and with activities, like custom features, that the upstream OSS/BSS may not have the capability to manage. The more customer order data stored in the network configuration management system, the more the management of that data alone becomes a problem. Customer contact data are an example of data that should not be in the network configuration management system, since they are volatile and in fact may pertain to broader applications than the network service.

8.3.2 Router Configuration Subsystem

The second subsystem we will discuss is the router configuration subsystem. This subsystem takes the information from the inventory database and creates configuration changes for the installed router. The inventory database typically provides data needed to drive configuration details like the types and versions of commands to use for configuration (these can vary by make and model), the IP addresses/hostnames and passwords for access to the routers, and the customer order data for the specific configuration. The generation of the specific router configuration commands is the more difficult aspect.
There are numerous approaches to the creation of the configuration changes, but the two main ones large carriers use today are policy management and templates.

8.3.2.1 Policy Management Approach

The policy management approach attempts to break down the router configuration into a set of conditions and actions (i.e., policies) and generates the combined configuration on the router by evaluating the conditions and actions in a set of policies. For example, QoS settings fit nicely into the policy management approach, since the router typically has a configuration statement to define the condition and action for applying QoS. The configuration statement can be shared by multiple ports, and any interface can be assigned to that policy. A QoS policy that assigns 20% of the bandwidth to a high-priority class (e.g., voice traffic) and the remainder to a best-effort class could be reused by many ports on a router. One condition/action definition (i.e., policy) reused multiple times is easy to implement and maintain.

Some configurations are more difficult to implement in a policy management system since they do not adhere nicely to a condition/action policy format. An example of this is IP addressing (or address management), which typically uses fairly complex rules to determine which address to assign to an interface. Large policy management systems do exist, but the linkage between different policies can be subject to scaling issues when dealing with the application of a large number of network and customer policies, as in a VPN with a large number (e.g., thousands) of end points. Configuration auditing (described later) in particular becomes difficult to manage in a policy management system because the policy view of the data is sometimes not readily apparent to the knowledgeable network engineer when looking at the more detailed CLI commands in the backup configuration file used for audits.
Finally, testing of policy-based systems is complicated, since it is not always clear what the resulting policy-based configuration will be in the CLI. The number of test cases increases to make sure the policy engine generates all the configuration change options that the network certification process has confirmed as working correctly.

8.3.2.2 Template Management Approach

Template management uses a simpler approach. The details of tested sets of configurations are documented in a template, and the data to drive a particular template are pulled from the inventory database. The benefit of a template approach is that only the configurations that are known to be valid are put into the network. This approach is a more reliable method of ensuring that the network is always configured to operate in a configuration supported by the testing and certification program. Policy management systems have a more difficult time ensuring that they are always configuring the router into a condition that matches the certified configurations. The challenge is building the template from the set of features ordered by the customer. Generally, the template languages have a nesting structure so that the range of templates can be kept under control. As the set of templates grows, there is some complication in applying the correct template, but the resulting router configuration tends to be cleaner and more optimized (since each template is a test case) than the policy-based configurations.

Both approaches have merit, and a growing set of functions can be handled more readily with policies, so the likely system for a large carrier is a mixture of these techniques, with templates for the basic configurations like basic IP conditions and routing and policies for the more advanced functions like QoS configuration on CPE routers. Large ISPs will have hybrid approaches to provide the best-fit tool for each problem.
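To make the template idea concrete, here is a minimal sketch using Python's standard string.Template. The template text, the field names, and the addresses are hypothetical and stand in for a configuration that has already passed certification; the CLI lines are illustrative rather than a vendor-accurate configuration, and real template languages with nesting are considerably richer.

```python
from string import Template

# Hypothetical certified template; the CLI lines are illustrative only.
SERIAL_PORT_TEMPLATE = Template(
    "interface $interface\n"
    " description $description\n"
    " ip address $ip_address $netmask\n"
    " no shutdown\n"
)


def render_port_config(inventory_record):
    """Fill the certified template with data pulled from the inventory.

    substitute() raises KeyError when a field is missing, so an incomplete
    inventory record never produces a partial configuration.
    """
    return SERIAL_PORT_TEMPLATE.substitute(inventory_record)


# Hypothetical inventory record for one customer port
record = {
    "interface": "Serial4/0.11/8:0",
    "description": "ACME TECHNICAL MARKETING",
    "ip_address": "192.0.2.1",
    "netmask": "255.255.255.252",
}
print(render_port_config(record))
```

Because only the placeholders vary, every rendered configuration has the same shape as the tested one, which is the property the template approach relies on.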
An important aspect of the router configuration subsystem is the interaction between the users of current inventory (processes like ticketing and fault management) and the need to deal with future changes. Growing from a 512 kb/s link to a full T1 and growing from a single T1 to Multilink PPP (MLPPP) [4] are examples with very different degrees of complexity, but both need to track the current connection data and the future connection data. The router configuration system has to be able to handle modifying the current configuration to move an active connection to the new connection configuration. To handle failure conditions properly, this subsystem has to deal with roll forward and roll back of the configuration. Sometimes the template approach is cleaner, since the “before” configuration can be captured directly from the router and re-applied even if the original data for it are not readily available.

There are some key differences in managing provider-edge and customer-edge configurations that influence the choice of template-based or policy-based configuration management, which we discuss here.

Provider-edge (PE) routers tend to have a large number of interfaces (100 or more) with many interfaces of the same basic type. Generally, the configurations are relatively simple since the router’s primary role is stability, reliability, and fast packet forwarding. Since large carrier router configurations tend to be less variable, we tend to see template-based configuration management systems on the PE. However, since MPLS VPNs have the added complexity of multiple router configurations being involved to correctly implement the VPN, usage of policy-based configuration management is growing.

Customer-edge (CE) routers tend to have a much smaller number of interfaces (fewer than 10) with a wide variation in configurations depending on the business/industry of the customer.
For example, the CE router may need advanced traffic-shaping rules to ensure that performance-sensitive traffic has priority on the customer’s internal network over access to the Internet proxy/firewall. Other customers might need to do video streaming for training and thus need QoS settings that give video priority over other data traffic. Some customers may even be running Internet applications that require prioritization of the HTTP/FTP traffic to/from their router to provide service to their own customers. The CE router is closer to the customer and thus gets the burden of handling more customer-specific applications like firewalls, packet shaping, and complicated internal routing policies. Policy-based router configuration management systems are commonly used on CE routers because they are a better fit for the disparate customer needs in the edge environment.

Finally, for the network carrier it is important to understand the different challenges that a mass-market consumer broadband Internet access service places on the configuration management system. Mass-market configuration tends to have a very small set of routing configuration options. The most obvious variable in the configurations is the access speed. While one might think that setting up QoS and ACLs would increase the configuration options, it really only adds complexity and not much variation, since the configurations tend to be similar across large sets of connections. Although the number of different configurations is small, the rate of change is large. Not only are initial provisioning rates much larger than in the enterprise space, but the volumes of change orders are large as well. An enterprise Internet access service might typically need to process several thousand orders a week with a similar magnitude of change orders. A mass-market service might need to process thousands of orders per day and tens of thousands of change orders per day.
Mass-market router configuration systems tend toward template-based approaches because of the simplicity of the configuration, the smaller range of features, and the performance advantages of the template approach for large-scale processing.

8.3.2.3 Mediation Layer

Most service providers have multiple vendor platforms in their network, but even a single-vendor network will have multiple models and versions of the router operating system. The router configuration subsystem that writes data to the routers usually has a mediation layer to deal with the router-specific commands. The mediation layer also exists when reading data for the audit layer, to turn the vendor-specific commands/output into a common syntax for use by the audit application. The mediation layer will also handle nuances of the security model for accessing the routers, which may vary based on vendor and region of the globe.

8.3.3 GUI/API

The GUI/API subsystem deals with the typical functions of retrieval, display, and data input for the system. The technology of this subsystem is typical of large-scale systems. This subsystem uses HTTP Web server technology with an HTML-based GUI and a SOAP/XML-based API. A critical aspect for a large carrier is that the API becomes the predominant flow into the system. At scale, the API is used to handle the large-volume order flow from the business support systems (BSS), both to electronically transfer the data and to trigger the various automated functions in the router configuration management system. The GUI is used infrequently for customer provisioning and is used primarily for correcting any fallout that might have occurred. Having a robust set of APIs is critical to business success. Obviously, the APIs must also keep pace as new features are added to the router, so that the automated processes can trigger them.
The GUI comes into play for manual interaction and maintenance activities and various other tasks that are not economic to automate through APIs. The other important aspect of the GUI is the implementation of a robust authentication/authorization layer: some user groups should not have access to the router configuration change functions, to prevent unintended changes that could cause a service outage.

One aspect of the GUI that is also worth mentioning is read access to the “What it is” state of the router. Typically, there are sets of read-only CLI commands that the customer care organization depends on for responding to customer-reported problems. Most router platforms support a limited number of connections, so it is problematic to give a large customer care team direct access to the router CLI. The solution large carriers typically use is to put a web-based GUI in place with a limited set of functions that can be selected by the customer care agents. The GUI then acts as a proxy through the router configuration subsystem to execute these commands on the router. These commands include the various “show” commands as well as options to run limited repair functions like “clear counters” and/or “shutdown”/“no shutdown” on the interface. Exposing these functions through the GUI reduces the impact on the router and provides a mechanism for throttling and audit rules to be applied to prevent a negative impact. The edit checks that occur before commands are executed on the router also help to prevent unintended effects.

8.3.4 Design and Assign

This subsystem applies the engineering rules to select a port for a customer’s service and can accept or reject a request for service based on available inventory. The subsystem has an API that takes the service request parameters and other customer network information and generates an assignment to a particular port on a router. That assignment is typically called a Tie Down and the data set is Tie Down Information (TDI).
The API can be called either through the GUI or directly by the BSS. Assignment is nontrivial, since the function must ensure that all engineering rules that help protect the network are satisfied (for example, finding a port on a card with sufficient resources) while also satisfying the business rules, which seek to limit transport costs and latency by picking a router close to the customer. For example, the engineering rules may limit the number of QoS-configured ports on certain card types. As an example of router assignment, a poor assignment would be to pick a router in California for a customer in New York.

The assignment function calculates both an optimal assignment and the current assignment. The optimal assignment is the first-choice router location that minimizes backhaul cost (e.g., ideally a customer in Ohio will be homed to a router in Ohio). However, it could be that the Ohio router complex does not have a router with sufficient capacity (bandwidth, QoS ports, etc.). The design and assign subsystem needs to implement the appropriate business rules for this case. One possible business rule is to “home” the customer on an alternate router in a different location. Alternatively, the business rule could be to reject the order. Typically, the “reject the order” business rule applies in mass-market situations. Business rules for enterprise markets usually accept higher backhaul costs rather than reject the order. In the enterprise market, the business rule might select a router in an alternate location like Indiana if no routers in Ohio had sufficient resources. The business would like the flexibility to be able to move the port from the Indiana router to an Ohio router in the future without impacting the customer.
Consequently, the “assign” function will allocate a Serial IP address from a logical inventory pool associated with Ohio’s router complex, assign it to the interface on the router in Indiana, and “exception route” that address to Indiana. This assignment permits the CE/PE connection to be re-homed from Indiana to Ohio later without affecting the customer’s router configuration, since their WAN IP address would not change. The exception route for Indiana can then be removed, yielding both a more optimal network routing configuration and a reduced backhaul. The tracking of the optimal and current assignment data adds complexity to this subsystem, the inventory database, and the router configuration system (for the exception routes), but it is a good example of the types of business decisions that can ripple back into the router configuration management system requirements.

8.3.5 Physical Inventory Management

Physical inventory management deals with the entering and tracking of data about the router equipment. It deals not only with equipment configuration details, like what cards are installed in the routers, but also with where those routers are located for maintenance dispatch. The physical database also contains the parameters for the engineering rules that vary by equipment make and model. These parameters come either from the router vendor documentation or from certification testing. The parameters and the associated rules can range from simple rules like maximum bandwidth per line card to complex rules like the maximum number of VPN routes with QoS on all line cards on the router with version 3 of the line card firmware. As new routers or cards are added to the network, this subsystem tracks all the associated data for these assets, including whether a router or port is “in service” and available for assignment.
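The design-and-assign behaviour of Section 8.3.4, combined with the per-card engineering limits just described, can be illustrated with a small sketch. It is only a minimal illustration under stated assumptions: the `Router` and `TieDown` records, the single QoS-port limit, and the `allow_alternate` flag (standing in for the enterprise versus mass-market business rule) are hypothetical names, not part of any system described in the chapter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Router:
    name: str
    complex_name: str      # router complex, e.g. "OH" (hypothetical field)
    free_ports: int        # ports still available for assignment
    qos_ports_used: int
    qos_port_limit: int    # engineering limit taken from reference data


@dataclass
class TieDown:
    router: str
    optimal_complex: str   # first-choice location (lowest backhaul)
    current_complex: str   # where the port was actually assigned
    exception_route: bool  # WAN IP from the optimal pool, routed elsewhere


def passes_engineering_rules(router: Router, needs_qos: bool) -> bool:
    """Apply the (assumed) engineering rules to one candidate router."""
    if router.free_ports < 1:
        return False
    if needs_qos and router.qos_ports_used >= router.qos_port_limit:
        return False
    return True


def assign(routers: List[Router], ranked_complexes: List[str],
           needs_qos: bool, allow_alternate: bool = True) -> Optional[TieDown]:
    """Try complexes in backhaul order; return None to reject the order."""
    optimal = ranked_complexes[0]
    candidates = ranked_complexes if allow_alternate else ranked_complexes[:1]
    for complex_name in candidates:
        for router in routers:
            if router.complex_name != complex_name:
                continue
            if passes_engineering_rules(router, needs_qos):
                router.free_ports -= 1          # reserve the port
                if needs_qos:
                    router.qos_ports_used += 1
                return TieDown(router.name, optimal, complex_name,
                               exception_route=(complex_name != optimal))
    return None  # no port satisfies the rules
```

Recording both the optimal and the current complex in the result is what later allows a re-home back to the optimal location; a real subsystem would track many more limits than the single QoS rule assumed here.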
As ports are assigned to customers, the physical inventory removes those ports from the assets that are available for assignment. The physical inventory also deals with the tracking of serial numbers of cards so that as cards are replaced or upgraded, the new parameters can be used for the engineering rules. For example, a card with 256 MB of memory could be upgraded to 512 MB and thus be able to support more QoS connections. The physical inventory subsystem keeps track of these engineering parameters (sometimes called reference data) about vendor equipment for use by other subsystems. Here are a few of the typical parameters tracked:

Maximum logical ports
Maximum aggregate bandwidth
Maximum card assignment
Maximum PVCs
Logical channel limits
IDB limit
VRF limit
BGP limit
COS limit
Routes limit

8.3.6 Logical Inventory Management

Logical inventory management deals with the entering and tracking of data about the logical assets (IP addresses, ACLs, Route Distinguishers, Route Targets, etc.). This can be a large subsystem depending on the different features available, but the hardest item in the category is IP address management. IP address management deals with the efficient assignment of address blocks to their various intended uses. Typically, the engineering rules require different blocks of addresses to be used for infrastructure connections, WAN IP address blocks, and customer LAN address blocks. This requires not only higher-level IP address block management functions, so that access control lists can be managed efficiently, but also functions to deal with external systems like the ARIN registry. Service Providers typically update the ARIN “Who Is” database through an API so that LAN IP blocks assigned to enterprise customers appear as being assigned to those customers. This aids the service provider in obtaining additional IP address blocks from the registry if needed.
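As a rough illustration of the block management described above, the sketch below carves a pool into fixed-size WAN blocks and ties each block to a port. It uses Python's standard `ipaddress` module; the `WanPool` class, the /30 block size, and the port identifiers are assumptions made for the example, and the address range is the 192.0.2.0/24 documentation range rather than a real allocation.

```python
import ipaddress


class WanPool:
    """A pool of equal-sized WAN address blocks (hypothetical helper)."""

    def __init__(self, cidr: str, prefixlen: int = 30):
        network = ipaddress.ip_network(cidr)
        # Pre-compute every block of the requested size inside the pool.
        self._free = list(network.subnets(new_prefix=prefixlen))
        self.assigned = {}  # port id -> assigned block

    def allocate(self, port_id: str):
        if port_id in self.assigned:
            return self.assigned[port_id]   # idempotent for retries
        if not self._free:
            raise RuntimeError("address pool exhausted")
        block = self._free.pop(0)
        self.assigned[port_id] = block
        return block

    def release(self, port_id: str) -> None:
        # Return the block to the pool, e.g. when an order is cancelled.
        self._free.insert(0, self.assigned.pop(port_id))


# Separate pools per intended use, as the engineering rules require.
pools = {
    "wan": WanPool("192.0.2.0/28"),
    "infrastructure": WanPool("192.0.2.128/28"),
}
block = pools["wan"].allocate("oh-pe1:Serial0/0")
pe_side, ce_side = block.hosts()  # the two usable addresses of a /30
```

Keeping one pool per router complex and per use is what makes it possible to draw a WAN address from the optimal complex even when the port is temporarily homed elsewhere.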
The tracking of per-router elements like ACL numbers is simpler but has its own nuances and complexity, since the goal is to reuse ACL numbers wherever possible to reduce the load on the router. Typically, memory is consumed for every ACL on a router. The ACLs for different ports of the same customer tend to be identical, so memory utilization (and processing time on the ACL) can be reduced by compressing the disparate ACLs into a single ACL that is shared among the customer’s ports. Numerous other items have to be tracked in logical inventory and assigned during the assignment function, depending on the feature or service being provided, and the logical inventory management system grows in complexity as more logical features are added to the service.

8.3.7 Reports and Feeds

The reports and feeds subsystem is responsible for distributing inventory data to the users and systems required to run the business. The main users of this subsystem are the fault/service assurance system and the ticketing system. The fault/service assurance system needs data about the in-service assets so that alarms can be processed correctly. Its source of truth is usually the “What it is” data from the inventory database. The ticketing system is more concerned with the data about the customer, since it is notified of an event by the fault/service assurance system and needs to determine, for a given port/card/router problem, which customer or customers are affected. Fault and ticketing systems tend to get feeds of the inventory data, since their query volume can be quite high and the load can best be managed with a local cache of the data rather than by directly querying the inventory database. Generally, the inventory data does not change rapidly, so a local cache is sufficient, and alarms/tickets do not need these data until after test and turn up of the interface.
Other users need various reports and feeds from the inventory database, and generally these are pulled either as a report from the GUI or through APIs. A GUI-based reporting application can easily be deployed on the inventory database for items like port utilization reports for capacity management. APIs can be created as needed for generating bulk files or responding to simple queries.

8.3.8 Router Audit

The router audit subsystem is responsible both for discovering the “What it is” state of the router and for comparing the “What it is” with the “What it should be” in the inventory database. The audit function described in this section is designed to detect differences with the inventory data. There are other mechanisms that can be applied to look at the larger set of configuration rules. Some of these are covered in Chapter 9.

Discovery is typically done with an engine that parses the router configurations into database attributes. As described before, the parsed router configuration data are stored in the inventory database but in a separate set of tables from the physical and logical inventory. The schemas of the audit tables are similar to the physical and logical inventory tables, but they lack some attributes that do not exist in the router configuration; the major attributes are the same so that they can be compared with the “What it should be” tables. After storage, the compare or audit function does an item-by-item comparison, tracking any discords. The audit is CPU- and disk-intensive and typically is only done across the entire network data set on a daily basis. The discovery/audit process is also used to pick up changes like card replacements. It is typical for this audit function to take 4–6 hours to complete across a large network even when high-end servers are employed.
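The item-by-item comparison can be pictured with a short sketch. It assumes, purely for illustration, that both data sets have already been loaded as dictionaries keyed by port and that only the attributes present in both schemas are compared; the function and attribute names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# (port, attribute, "what it should be" value, "what it is" value)
Discord = Tuple[str, str, object, object]


def audit(should_be: Dict[str, dict],
          what_it_is: Dict[str, dict],
          compared_attrs: Iterable[str]) -> List[Discord]:
    """Return one discord per attribute that differs between the data sets."""
    attrs = list(compared_attrs)
    discords: List[Discord] = []
    # Ports missing from either side are compared against an empty record,
    # so an unexpected or vanished port shows up as a set of discords.
    for port in sorted(set(should_be) | set(what_it_is)):
        expected = should_be.get(port, {})
        actual = what_it_is.get(port, {})
        for attr in attrs:
            if expected.get(attr) != actual.get(attr):
                discords.append((port, attr,
                                 expected.get(attr), actual.get(attr)))
    return discords


inventory = {"Se0/0": {"ip": "192.0.2.1", "state": "shutdown",
                       "customer": "acme"}}   # business data is not audited
discovered = {"Se0/0": {"ip": "192.0.2.1", "state": "no shutdown"}}
found = audit(inventory, discovered, compared_attrs=("ip", "state"))
```

In this example the only discord is the interface state; the customer name is ignored because it has no counterpart in the router configuration.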
The good news is that the process can typically be run using the backup copies of the router configuration files so that there is no impact on the network and limited impact on the users of the system. Incremental audits can also be done on a port or card basis on demand as part of the router configuration process.

It is worth noting that the tracking of discords requires a historical view: when a discord was first detected and when it was last detected. New discords could correlate with an alarm or customer-reported problem. Old discords might indicate a data integrity error from a manual correction that was implemented to repair a customer problem but not appropriately reflected back into the inventory database.

While perhaps less visible to the overall router configuration management process than other aspects of the configuration workflow, audit is a key step. Real-time validations must be implemented for a change order so that if there is a discord, the process will stop the change order from being applied to prevent a problem. It is important to subsequently find and fix these discords so that future change orders are not affected.

8.4 Dealing with Change

An important aspect of a configuration management system is dealing with changes to an existing service. For example, the initial configuration of an interface can be done in various phases and with little concern for timing until the interface is moved from the shutdown state to the active state. However, an active interface has a different set of rules. Generally, the timing associated with configuration changes is more critical and the set of checks on the data and the configuration is more involved.

First, a robust network configuration management system will validate the current configuration of the interface (“What it is”) against the “What it should be” data and if there is a mismatch it should stop the change.
The reason is probably obvious: unless the “What it is” and “What it should be” data sets are in agreement, we run the risk of changing to a configuration that will not work for the customer because of a previous data inconsistency. For instance, if there have been problems with a previous re-home and the ACLs are not the same between the old configuration and the new configuration, it could prevent the customer from accessing their network services.

Second, for the intended change, the configuration management system should validate the data set against the interface data, the global configuration of the router, and to the extent possible the larger network for the customer to ensure that the change is consistent with other “What it is” data. This usually consists of a set of rules applied by the configuration management subsystem to ensure that a successful change will be applied. A good example is again a re-home. If the old port is still advertising its WAN IP address, you cannot bring up the same WAN IP address on a different router or instabilities can be introduced (duplicate IP address detection is an important validation rule).

8.4.1 Test and Turn Up

Bringing up a new connection involves testing that the connection works correctly as ordered and then turning up the port for full service. Turning up a large connection like a 10 Gb Ethernet connection is done carefully because, if misconfigured, it could either drive large amounts of traffic into a customer’s network before they are prepared for it or remove traffic from a customer’s network by mistake. For most changes against a running configuration, the process of applying the change has to be coordinated with a maintenance window (footnote 1), since service could be impacted. Some changes may also require changes on the customer’s side of the connection, so proper scheduling with the customer’s staff is required.
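The two pre-change validations of Section 8.4 (stop on a discord, and reject a duplicate WAN IP address during a re-home) can be sketched as a single gate that returns the reasons a change order must not proceed. This is a simplified illustration: the dictionaries standing in for the inventory and discovered data, the `advertised_by` map, and the function name are all assumptions.

```python
from typing import Dict, List


def validate_change(port: str,
                    should_be: Dict[str, dict],
                    what_it_is: Dict[str, dict],
                    new_wan_ip: str,
                    advertised_by: Dict[str, str]) -> List[str]:
    """Return the list of problems; an empty list means the change may run."""
    problems: List[str] = []

    # Rule 1: inventory and router must agree before anything is changed.
    if should_be.get(port) != what_it_is.get(port):
        problems.append(f"discord on {port}: inventory and router disagree")

    # Rule 2: the WAN IP must not still be advertised by another port,
    # which would happen if the old port of a re-home is still up.
    owner = advertised_by.get(new_wan_ip)
    if owner is not None and owner != port:
        problems.append(f"{new_wan_ip} is still advertised by {owner}")

    return problems


new_port = "in-pe1:Se0/0"
inventory = {new_port: {"acl": 101}}
discovered = {new_port: {"acl": 101}}
still_up = {"192.0.2.1": "oh-pe1:Se1/0"}   # old port not yet shut down
blocked = validate_change(new_port, inventory, discovered,
                          "192.0.2.1", still_up)
```

Returning every failed rule, rather than stopping at the first, gives operations staff the full picture when a change order is rejected.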
For changes that involve the physical connection (speed changes and re-homes), typically two ports are in assignment at the same time and operations would like to test all or parts of the new port before swinging the customer’s connection over. This “testing phase” creates database complexity, since the new port has to be reserved for the customer but it is not the “in-service” port from an alarming/ticketing standpoint. Both the old and the new port have to be tracked until the connection is fully migrated to the new configuration. This requires the concept of “Pending” port assignments/connections and database transactions to move a port from “Pending” to “Active” and from “Active” to “Disconnected,” after which the old record is finally deleted from the database.

The router configuration system has to maintain the ability to generate router configurations for each of the interim steps in moving an active connection from one port to another. There are configurations to bring up the new interface on temporary information (e.g., temporary serial IP addresses and/or RD/RT/VRF information for testing), steps to “shutdown” the old interface, steps to “no shutdown” the new interface, and steps to reverse the entire process to roll back to the old interface. All of these need to be drivable through the API for relatively straightforward changes, such as automated PE-side re-homes that do not affect the customer premise router, and via the GUI for more complicated changes that require coordination with the customer. It is in dealing with change that the entire system is stressed the most, since it must not only ensure that the network is protected but also respond fast enough to meet the human- or machine-driven process requirements.

Footnote 1: The Maintenance Window is a time period when there is expected to be low traffic and is used by an operator for planned activities that could impact service. Usually it is in the late night/early morning of the time zone of the router, like 3–6 a.m.

Another attribute of change that is worth mentioning is changes to active interfaces that are infrastructure connections (e.g., two or more backbone links that connect network routers). A routine task is to change the OSPF metric on one link to “cost it out” (footnote 2) of use so that maintenance on the connection can be done. A problem exists if the link is left in the “costed out” state after maintenance. Failure of the now single primary link then causes isolation, since one link is hard failed and the other link is out of service by being “costed out.” A robust configuration management system also has maintenance functions to permit the operations staff to cost out a link, to record that the link is “costed out,” and to generate an alarm condition if the link stays “costed out” for a period of time.

Finally, a type of change that is of growing importance in large networks is the ability to apply changes in bulk. The complexity of modern routers leads to situations where a latent bug or security vulnerability is found in a router that can only be repaired by changing the configuration on a large number of ports in the network. This requires special update processes to handle the updates in a bulk fashion. Typically, this is a customized application on the router configuration subsystem that is targeted at dealing with the bulk processing. This gets complicated not only because the system must track that all the changes are applied (routers sometimes refuse administrative requests under heavy load) but also because the updates to specific routers must be throttled so as not to overload them.

8.5 Example of Service Provisioning

This section will tie all the pieces together in an example of service provisioning for a simple Internet access service.
Once all the order data are collected and optionally entered into an automated order management system, the provisioning steps can occur, including downloading the configurations to the router. The individual configurations are called configlets, since they are usually incremental changes to an interface or pieces of the global configuration, and not an entire router configuration. They are outlined below.

1. Create customer
2. Create premise/site
3. Create service instance
4. Create connection and reserve inventory
5. Download initial configuration
6. Download loopback test configlet
7. Download shutdown configlet
8. Download final configlet with “no shutdown”
9. Run daily audit

Footnote 2: When OSPF costs on a set of links are adjusted to shift traffic off of one link and onto another link, the process is informally called “costing out” the link.

1. Create customer

This task is simply to group all the customer data into one high-level account by creating (or using a previously created) customer entity in the database. Sometimes, it relates to an enterprise, but oftentimes, because of mergers and acquisitions or even departmental billing arrangements, the “customer” at this level does not uniquely identify a corporation. There can even be complicated arrangements with wholesalers that must be reflected in various customer attributes.

2. Create premise/site

This task creates a database entity corresponding to the physical site at which the access circuit terminates on the customer’s side. Street address, city, state/province, country, etc. are typical parameters. Corporations can have multiple services at an address, so we track the address not only to make it easier to work with the customer but also because these data will impact the selection of the optimum router to reduce backhaul costs.

3. Create service instance

This task collects the parameters about the intended service on this connection.
It will define the speed, any service options like quality of service, and all the other logical connection parameters. These data directly affect the set of engineering rules that will be applied to actually find an available port on an optimum router.

4. Create connection and reserve inventory

This task combines the above data into an assignment. The selection of a router complex is done first, using the address parameters to look for a complex with a short backhaul. This is called “Homing.” After a preliminary complex is assigned, the routers in the complex are checked for available port capacity and, if there is port capacity, the engineering rules for this connection on that router are tested. For example, a router may have available ports, but there may be insufficient resources for additional QoS or MPLS VPN routes on the cards. The system will examine all routers in the complex in turn to look for an available port that satisfies the engineering rules. If no router is found, the system will examine the next best complex and repeat the search. This assignment function can take a substantial amount of system resources to complete and is not guaranteed to find a solution due to resource or other business rule constraints. Once a complex, router, and port have been selected, the logical inventory will be tied to the physical inventory and this Tie Down Information (TDI) will be returned to the ordering system so that it can order the layer 1 connection from the router to the customer premise. It is important to note that at this point the inventory database must set a state on the port so that no other customer can use that router port. If the customer’s order is cancelled, the business process must ensure that the port assignment is deleted as well to avoid stranded inventory. At this point, the inventory database would show the port as “PENDING,” since the inventory has been assigned but it is not in service.
All the logical data needed to configure the interface are in the database and any provider inventory items have been assigned (serial IP addresses, ACL numbers, etc.).

5. Download initial configuration

After the inventory has been assigned, an initial configuration of the port is downloaded to the router to define the basic interface. This configlet typically only includes the serial IP address and default routing and defines the interface in a shutdown state. This is also the first real-time audit step. This audit will confirm that the assigned port is not used by some other connection. While rare, data discords of this type do occur. This download need not occur in real time, since it will typically be some amount of time before the Layer 1 connection is ready.

6. Download the loopback test configlet

This step depends on the Layer 1 connection being installed, so it can occur days, weeks, or months after step 5. In addition, after Layer 1 is installed, this step typically occurs 24 h before the scheduled turn up date for a customer. This configlet contains all the routing and configuration data for the connection. Downloading a configlet to do loopback testing on the network side of the connection provides a final check of the provider’s part of the work. Just before the configlet is downloaded, a series of real-time audits is again conducted, since the initial configlet audits could have been months ago. These audits check both the static order data against the running router and attributes on other ports on the router. For example, there is a verification that any new ACL number is not already in use on another port for another customer. This check makes sure that a port was not manually configured in error. There is also a verification that any new VRF does not already exist on the router, which detects whether another order has been processed in parallel. There are numerous other validations as well.
This real-time audit is more detailed than the audit done for the initial configlet, since it covers all the routing, QoS, and VPN data. If all validations are successful, the configuration is downloaded and activated for testing with Layer 1 in loopback.

7. Download shutdown configlet

After successful pretesting, the router port is left in a shutdown state. It can remain in this configuration for some period of time, but because routing instances may have been defined even though the port is shut down, operators typically do not leave a shutdown interface in the router configuration for more than 48 hours or so. A shutdown interface is still discoverable from an SNMP network management perspective, so a large number of admin-down interfaces simply adds load to the fault management system without adding value. If the port is not successfully turned up, the configuration will be rolled back to the initial configuration.

While the Layer 1 circuit is being ordered/installed, there will likely be many daily audits that run. These audits will find the port in the router in shutdown state. The discord analysis will compare the “What it is” configuration and state with the “What it should be” configuration and state and report any problems. For our example, there is no problem, but the audit might find that the port is in a “no shutdown” state in the network, indicating that perhaps a test and turn up occurred but was not completed in the inventory database. The daily audit would also find if the router card had been replaced for some reason and update tracking data like serial numbers, etc.

8. Download final configlet with “no shutdown”

At activation, the system will download the final router configuration with “no shutdown” of the interface. Final testing may occur with the customer.
The testing for single-link static routed interfaces is usually automated, but for advanced configurations with multiple links or BGP routing, manual testing procedures are typical. It is at this point that the inventory database will update its status on the port to active and mark the port “In service” for downstream systems like the Fault Management and Ticketing systems.

9. Run daily audit

The daily audit will find the new state of the port to be active, so that the “What it should be” state of “ACTIVE” matches the “What it is” state in the network.

8.6 Conclusion

Hopefully, we have provided a useful overview of a robust router configuration management system and helped to tie the key functions and subsystems back to the business needs that drive complexity. From inventory management to provisioning the customer’s service to handling changes to dealing with bulk security updates, a large carrier cannot provide reliable service without a robust router configuration management system. Here is a summary of some “best practice” principles that will be helpful when designing a Network Configuration Management system.

Recognize data discords as a fact of life.
Separate “What it is” and “What it should be” data in the inventory database
Configuration management is the source of truth for the business about the current network using the “What it is” data
Protect the network through real-time validation and auditing of the running network
Design for change so that logical data are not permanently tied to physical data
Separate the schema for physical inventory and logical inventory
Use templates to make configuration, discord detection, and testing easier
Track port history, and not just the current state
Design for multiple configurations of a port to handle the current port configuration and the pending port configuration
Design the system to support testing a port before it is turned up and rollback to an earlier configuration when tests fail
Limit the amount of business data in the network-facing system so that you do not create a problem of maintaining consistency

References

1. Chandra, R., Traina, P., & Li, T. IETF Request for Comments 1997, BGP Communities Attribute, August 1996.
2. Hawkinson, J., & Bates, T. IETF Request for Comments 1930, Guidelines for creation, selection, and registration of an Autonomous System (AS), March 1996.
3. Rosen, E., & Rekhter, Y. IETF Request for Comments 4364, BGP/MPLS IP Virtual Private Networks (VPNs), February 2006.
4. Sklower, K., Lloyd, B., McGregor, G., Carr, D., & Coradetti, T. IETF Request for Comments 1990, The PPP Multilink Protocol (MP), August 1996.

Chapter 9
Network Configuration Validation (footnote 1)

Sanjai Narain, Rajesh Talpade, and Gary Levin

9.1 Introduction

To set up network infrastructure satisfying end-to-end requirements, it is not only necessary to run appropriate protocols on components but also to correctly configure these components. Configuration is the “glue” for logically integrating components at and across multiple protocol layers. Each component has configuration parameters, each of which can be set to a definite value.
However, today, the large conceptual gap between end-to-end requirements and configurations is bridged manually. This causes large numbers of configuration errors whose adverse effects on the security, reliability, and deployment cost of network infrastructure are well documented. For example:

“Setting it [security] up is so complicated that it’s hardly ever done right. While we await a catastrophe, simpler setup is the most important step toward better security.” – Turing Award winner Butler Lampson [42].
“... human error is blamed for 50 to 80 percent of network outages.” – Juniper Networks [40].
“The biggest threat to increasingly complex systems may be systems themselves.” – John Schwartz [61].
“Things break and complex things break in complex ways.” – Steve Bellovin [61].
“We don’t need hackers to break systems because they’re falling apart by themselves.” – Peter Neumann [61].

S. Narain, R. Talpade, and G. Levin
Telcordia Technologies, Inc., 1 Telcordia Drive, Piscataway, NJ 08854, USA
e-mail: narain@research.telcordia.com; rrt@research.telcordia.com; glevin@research.telcordia.com

Footnote 1: This material is based upon work supported by Telcordia Technologies, and Air Force Research Laboratories under contract FA8750-07-C-0030. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Telcordia Technologies or of Air Force Research Laboratories. Approved for Public Release; distribution unlimited: 88ABW-2009-3797, 27 August 09.

C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_9, © Springer-Verlag London Limited 2010

Thus, it is critical to develop validation tools that check whether a given configuration is consistent with the requirements it is intended to implement.
Besides checking consistency, configuration validation has another interesting application, namely, network testing. The usual invasive approach to testing has several limitations. It is not scalable. It consumes resources of the network and network administrators and has the potential to unleash malware into the network. Some properties such as absence of single points of failure are impractical to test as they require failing components in operational networks. A noninvasive alternative that overcomes these limitations is analyzing configurations of network components. This approach is analogous to testing software by analyzing its source code rather than by running it. This approach has been evaluated for a real enterprise. Configuration validation is inherently hard. Requirements can be on connectivity, security, performance, and reliability and span multiple components and protocols. A real infrastructure can have hundreds of components. A component’s configuration file can have a couple of thousand configuration commands, each setting the value of one or more configuration parameters. In general, the correctness of a component’s configuration cannot be checked in isolation. One needs to evaluate global relationships into which components have been logically integrated. Configuration repair is even harder, since changing configurations to make one requirement true may falsify another. The configuration change needs to be holistic in that all requirements must concurrently hold. This chapter motivates the need for configuration validation in the context of a realistic collaboration network, proposes an abstract design of a configuration validation system, surveys current technologies for realizing this design, outlines experience with deploying such a system in a real enterprise, and outlines future research directions. 
Section 9.2 discusses the challenges of configuring a realistic, decentralized collaboration network, the vulnerabilities caused by configuration errors, and the benefits of using a validation system. Requirements on this network are complex to begin with. Their manual implementation can cause a large number of configuration errors. This number is compounded by the lack of a centralized configuration authority.

Section 9.3 proposes a design of a system that can not only validate the above network but also evolve to validate even more complex ones. This design consists of four subsystems. The first is a Configuration Acquisition System for extracting configuration information from components in a vendor-neutral format. The second is a Requirement Library capturing best practices and design patterns that simplify the conceptualization of end-to-end requirements. The third is a Specification Language whose syntax simplifies the specification of requirements. The fourth is an Evaluation System for efficiently evaluating requirements, for suggesting configuration repair when requirements are false, and for creating visualizations of logical relationships.

Section 9.4 discusses the Telcordia® IP Assure product [38] and the choices it has made to realize this design. It uses a parser generator for configuration acquisition. Its Requirement Library consists of requirements on integrity of logical structures, connectivity, security, performance, reliability, and government policy. Its specification language is one of visual templates. Its evaluation system uses algorithms from graph theory and constraint solving. It computes visualizations of several types of logical topologies.

Section 9.5 discusses logic-based techniques for realizing the above validation system design. Their use is particularly important for configuration repair. They simplify configuration acquisition and specification.
They allow firewall subsumption, equivalence, and rule redundancy analysis. These techniques are the languages Prolog, Datalog, and arithmetic quantifier-free forms [51, 53, 67], the Kodkod [41, 69] constraint solver for first-order logic of finite domains, the ZChaff [27, 46, 73] minimum-cost SAT solver for Boolean logic, and Ordered Binary Decision Diagrams (OBDDs) [12].

Section 9.6 outlines related techniques for realizing the above validation system design. These are type inference for configuration acquisition [47], symbolic reachability analysis [72], its implementation [3] with symbolic model checking [48], and finally, validation techniques for the Border Gateway Protocol (BGP), the Internet-wide routing protocol and one of the most complex.

Section 9.7 contains a summary and outlines future research directions.

9.2 Configuration Validation for a Collaboration Network

This section discusses the challenges of configuring a realistic, multi-enterprise collaboration network, the types of vulnerabilities caused by configuration errors, the reasons why these arise, and the benefits that can be derived from using a configuration validation system. Multiple communities of interest (COIs) are set up as logically partitioned virtual private networks (VPNs) overlaid on a common IP network backbone. The “nodes” of this VPN are the gateway routers at each enterprise that participates in the COI. An enterprise can participate in more than one COI, in which case it would have one gateway router for each COI. For each COI, agreement is reached between participating network administrators on the top-level connectivity, security, performance, and reliability requirements governing the COI. Configuration of routers, firewalls, and other network components to implement these requirements is up to administrators. There is no centralized configuration authority. The administrators at different enterprises in a COI negotiate with each other to ensure configuration consistency.
Such decentralized networks exist in industry, academia, and government and are clear candidates for the application of configuration validation tools.

Typical COI requirements are now described. The connectivity requirement is that every COI site must be reachable from every other COI site. The security requirement is twofold. First, all communication between sites must be encrypted. Second, no packets from one COI can leak into another COI. This requirement is especially important since collaborating enterprises have limited mutual trust. A site can be part of more than one COI, but the information that the site is willing to share with partners in one COI is distinct from the information it shares with partners in another COI. The performance requirement specifies the bandwidth, delay, jitter, and packet loss for various types of applications. The reliability requirement specifies that connectivity be maintained in the face of link or node failure.

Since these requirements are complex, large numbers of configuration errors can be made. This number is compounded by the lack of a centralized configuration authority. The complexity has the further consequence that less experienced administrators, especially in an emergency, tend to statically route traffic directly over the IP backbone rather than correctly set up dynamic routing. When the emergency passes, however, the static routes are not removed for fear of breaking the routing. Over time, this causes the COIs to become brittle in that routes cannot be automatically recomputed in the face of link or node failure. While administrators are well aware of configuration errors and their adverse effects on the global network, they lack the tools to identify them, much less remove them. The decentralized nature of the network prevents them from obtaining a picture of the global architecture.
A validation system that could identify configuration errors, make recommendations for repairing them, and help administrators understand the global relationships would be of immense value.

Figure 9.1 shows the architecture of a typical COI with four collaborating sites A, B, C, D. Each site contains a host, an internal router, and a gateway router. The first two items are shown only for sites A and C. Each gateway router is physically connected to the IP backbone network (WAN). Overlaid on this backbone is a network of IPSec [41] tunnels interconnecting the gateway routers. An IPSec tunnel is used to encrypt packets flowing between its endpoints. Overlaid on the IPSec network is a network of GRE [22] tunnels. A GRE tunnel provides the appearance of two routers being directly connected even though there may be many physical hops between them. The two overlay networks are “glued” together in such a way that all packets through GRE tunnels are encrypted. A routing protocol, e.g., BGP [33, 36], is run over the GRE network to discover routes on this overlay. If a link or node in this network fails, BGP discovers an alternate route if possible.

Fig. 9.1 Community of interest architecture (labels: RA, RB, RC, RD, IA, IC, HA, HC, WAN, Physical Link, IPSec Tunnel, GRE Tunnel)

A packet originating at host HA destined to host HC is first directed by its internal router IA to the gateway router RA. RA encrypts the packet, then finds a path to HC on the GRE network. When the packet arrives at RC, it is decrypted, decapsulated, and forwarded to IC. IC then forwards it to HC. All routers also run the internal routing protocol called OSPF [42]. OSPF discovers routes to destinations that are internal to a site. The OSPF process at the gateway router redistributes or injects internal routes into the BGP process. The BGP process then informs its peers at other gateway routers about these routes.
Eventually, all gateway routers learn how to route packets to any reachable internal destination at any site.

In summary, connectivity, security, and reliability requirements are satisfied by the use, respectively, of GRE, of IPSec, and of BGP together with OSPF. The security requirement that data from one COI not leak into another is satisfied implicitly. GRE reachability to a different COI is disallowed, static routes to destinations in different COIs are not set up, gateway routers at the same enterprise but belonging to different COIs are not directly connected, and BGP sessions across different COIs are not set up. The performance requirement is satisfied by ensuring that GRE tunnels are mapped to physical links of the proper bandwidth, delay, jitter, and packet loss properties, although this is not always in the control of COI administrators. Avoiding one cause of packet loss is, however, in their control. This cause is the blocking of Maximum Transmission Unit (MTU) mismatch messages. If a router receives a packet whose size is larger than the router’s configured MTU, and the packet’s Do Not Fragment bit is set, the router will drop the packet. The router will also warn the sender in an ICMP message that it has dropped the packet. Then, the sender can reduce the size of the packets it sends. However, since ICMP is the same protocol used to carry ping messages, firewalls at many sites block ICMP. The result is that the sender will continue to send packets without reducing their size and they will all be dropped by the router [68]. Packets increase in size beyond an expected MTU because GRE and IPSec encapsulations add new headers to packets. To avoid such packet loss, the MTU at all routers is set to some fixed value accounting for the encapsulation. Alternatively, ICMP packets carrying MTU mismatch messages are not blocked.

This design is captured by the following requirements:

Connectivity Requirements
1. Each site has a gateway router connected to the WAN.
2. There is a full-mesh of GRE tunnels between gateway routers.
3. Each gateway router is connected to an internal router at the same site.

Security Requirements
1. There is a full-mesh network of IPSec tunnels between all gateway routers.
2. Packets through every GRE tunnel are encrypted with an IPSec tunnel.
3. No gateway router in a COI has a static route to a destination in a different COI.
4. No cross-COI physical, GRE, BGP connectivity, or reachability is permitted.

Reliability Requirements
1. BGP is run on the GRE tunnel network to discover routes to destinations in different sites.
2. OSPF is run within a site to discover routes to internal destinations.
3. OSPF and BGP route redistribution is set up.

Performance Requirements
1. MTU settings on all interfaces are set to be less than the expected packet size after taking into account GRE and IPSec encapsulation.
2. Alternatively, access-control lists at each gateway router permit ICMP packets carrying MTU messages.

Configuration parameters that must be correctly set to implement the above requirements include:
1. IP addresses and mask of physical and GRE interfaces
2. IP address of the local and remote BGP session end points and the autonomous system (AS) number of the remote end point
3. Names of GRE interface and IP address of associated local and remote physical tunnel end points
4. IP addresses of local and remote IPSec tunnel end points, encryption and hash algorithms to apply to protected packets, and the profile of packets to be protected
5. Destination, destination mask, and next hop of static routes
6. Interfaces on which OSPF is enabled and the OSPF areas to which they belong
7. Source and destination address ranges, protocols, and port ranges of packets for access-control lists
8. Maximum transmission units for router interfaces

As can be imagined, a large number of errors can be made in manual computation of configuration parameter values implementing these requirements.
GRE tunnels may be configured in only one direction or not at all. IPSec tunnels may be configured in only one direction or not at all. GRE and IPSec tunnels may not be “glued” together. GRE tunnels or sequences of tunnels may link routers in distinct COIs. A COI gateway router may contain static routes to a different COI, so packets could be routed to that COI via the WAN. BGP sessions may be set up between routers in different COIs, so these routers may come to know about destinations behind each other. BGP sessions may be configured in only one direction or not at all. BGP sessions may not be supported by GRE tunnels, so these sessions will not be established. There may be single points of failure in the GRE and BGP networks. Finally, MTU settings on routers in a COI may differ, leading to the possibility of packet loss.

Such errors can be visualized by mapping various logical topologies. Two of these are shown below. In Fig. 9.2, nodes represent routers and edges represent GRE edges between routers. These edges have to be set up in both directions for a GRE tunnel to be established. This graph shows three problems. First, the edge labeled “Asymmetric” has no counterpart in the reverse direction. Second, the dotted line indicates a missing tunnel. Third, the hatched router indicates a single point of GRE failure. All GRE packets to destinations to the right of this router pass through this router. In Fig. 9.3, nodes represent routers and links represent BGP sessions between nodes. This graph shows two problems. First, there is no full-mesh of BGP sessions within COI 1. Second, there is a BGP session between routers in two distinct COIs.

Fig. 9.2 GRE tunnel topology (labels: Single point of Failure, Missing, Asymmetric)
Fig. 9.3 BGP neighbor topology (labels: COI 1, COI 2)
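The first two GRE problems described above (asymmetric and missing tunnels in what should be a full mesh) can be detected mechanically once the configured tunnel directions are known. The following is a minimal Python sketch, not the method used by any particular product. The function name `check_gre_mesh`, the representation of a tunnel direction as a `(local, remote)` pair, and the sample data are all assumptions made for illustration; the sample data is hypothetical and is not taken from Fig. 9.2.

```python
from itertools import combinations

def check_gre_mesh(routers, gre_edges):
    """Classify router pairs whose GRE tunnel is one-way or absent.

    routers:   names of the gateway routers that should form a full mesh
    gre_edges: (local, remote) pairs, one per configured tunnel direction
               (a hypothetical representation chosen for this sketch)
    """
    edges = set(gre_edges)
    asymmetric, missing = [], []
    for a, b in combinations(sorted(routers), 2):
        forward = (a, b) in edges
        backward = (b, a) in edges
        if forward != backward:
            asymmetric.append((a, b))   # configured in only one direction
        elif not forward:
            missing.append((a, b))      # configured in neither direction
    return asymmetric, missing

# Hypothetical four-site COI: RA->RD is one-way, RB<->RD is absent.
routers = ["RA", "RB", "RC", "RD"]
gre_edges = [
    ("RA", "RB"), ("RB", "RA"),
    ("RA", "RC"), ("RC", "RA"),
    ("RA", "RD"),
    ("RB", "RC"), ("RC", "RB"),
    ("RC", "RD"), ("RD", "RC"),
]
asymmetric, missing = check_gre_mesh(routers, gre_edges)
print("asymmetric:", asymmetric)
print("missing:", missing)
```

The same pairwise check would apply to IPSec tunnels and BGP sessions, which the text says may also be configured in only one direction or not at all.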
9.3 Creating a Configuration Validation System

This section outlines the design of a system that can not only validate the network of the previous section but also evolve to validate even more complex ones. As shown in Fig. 9.4, this consists of a Configuration Acquisition system to acquire configuration information in a vendor-neutral format, a Requirement Library containing fundamental requirements simplifying the task of conceptualizing administrator intent, an easy-to-use Specification Language in which to specify requirements, and an Evaluation System to efficiently evaluate specifications in this language. These subsystems are now described.

Fig. 9.4 Validation system architecture (labels: Configuration Files, Configuration Acquisition System, Configuration Database, Administrator, Requirement Library, Specification Language, End-to-End Requirements in Specification Language, Evaluation System, Root-Cause of Non-Compliance, Visualizations, Suggestions for Repair)

9.3.1 Configuration Acquisition System

Each component has associated with it a configuration file containing commands that define that component’s configuration. These commands are entered by the network administrator. The most reliable method of acquiring a device’s configuration information is to acquire this file, manually or automatically. Other less-reliable methods are accessing the device’s SNMP agent and querying configuration databases. SNMP agents often do not store all of the configuration information one might be interested in. The correctness and completeness of a configuration database vary from enterprise to enterprise.

If configuration information is acquired from files, then these files have to be parsed. Configuration languages have a simple syntax and semantics, since they are intended to be used by network administrators who may not be expert programmers. Different vendors offer syntactically different configuration languages.
However, the abstract configuration information stored in these files is the same, barring nonstandard features that vendors sometimes implement. This information is associated with standardized protocols. Examples of it from the previous section are IP addresses, OSPF area identifiers, BGP neighbors, and IPSec cryptographic algorithms. This information needs to be extracted from files and stored in a vendor-neutral format database. Then, algorithms for evaluating requirements can be written just once against this database, and not once for every vendor configuration language.

However, configuration languages are vast, each with a very large set of features. Their syntax can change from one product release to another. Some vendors do not supply APIs to extract the abstract information. It should be possible to extract configuration information without having to understand all features of a configuration language. Extraction algorithms should be resilient to inevitable changes in configuration language syntax.

9.3.2 Requirement Library

The Requirement Library is analogous to libraries implementing fundamental algorithms in software development. The Library should capture design patterns and best practices for accomplishing fundamental goals in connectivity, security, reliability, and performance. Examples of these for security can be found in [18] and for routing in [33]. These patterns can be expressed as requirements. The administrator should be easily able to conceptualize end-to-end requirements as compositions of Library requirements.

9.3.3 Specification Language

The specification language should provide an easy-to-use syntax for expressing end-to-end requirements. Specifications should be as close as possible in their forms to their natural language counterparts. The syntax can be text-based or visual.
Since requirements are logical concepts, the syntax should allow specification of objects, attributes, and constraints between these and compositions of constraints via operators such as negation, conjunction, disjunction, and quantification. For example, all of these constructs appear in the Section 9.2 requirement “No gateway router in a COI has a static route to a destination in a different COI.”

9.3.4 Evaluation System

The Requirement Evaluation system should contain efficient algorithms to evaluate a requirement against configuration. These algorithms should output not just a yes/no answer but also explanations or counterexamples to guide configuration repair. Configuration repair is harder than evaluation. A set of requirements can be independently evaluated but if some are false, they cannot be independently made true. Changing the configuration to make one requirement true may falsify another. To provide further insight into reasons for truth or falsehood of requirements, this system should compute visualizations of logical relationships that are set up via configuration, analogous to visualizations of quantitative data [70].

9.4 IP Assure Validation System

This section describes the Telcordia® IP Assure product and discusses the choices made in it to implement the above abstract design of a validation system. This product aims to improve the security, availability, QoS, and regulatory compliance of IP networks. It uses a parser generator for configuration acquisition. Its Requirement Library consists of well over 100 requirements on integrity of logical structures, connectivity, security, performance, reliability, and government policy. Its specification language is one of visual templates. Its evaluation system uses algorithms from graph theory and constraint solving. It also computes visualizations of several types of logical topologies.
If a requirement is false, IP Assure computes a root cause, although this computation is hand-crafted for each requirement. IP Assure does not compute a repair that concurrently satisfies all requirements.

9.4.1 Configuration Acquisition System

Section 9.3 raised three challenges in the design of a configuration acquisition system. The first was the design of a vendor-neutral database schema for storing configuration information. The second was extracting information from configuration files without having to know the entire configuration language for a given vendor. The third was making the extraction algorithms robust to inevitable changes in the configuration language. This section describes IP Assure’s configuration acquisition system and sketches how well it meets these challenges.

IP Assure has defined a schema loosely modeled after DMTF [20] schemas. It uses the ANTLR [5] system to define a grammar for configuration files. The parser generated by ANTLR reads the configuration file and if successful returns an abstract syntax tree exposing the structure of the file. This tree is then analyzed by algorithms implemented in Java to create and populate tables in its schema. Often, information in a table is assembled from information scattered in different parts of the file. The system is illustrated in the context of a configuration file containing the following commands in Cisco’s IOS configuration language:

    hostname router1
    !
    interface Ethernet0
     ip address 1.1.1.1 255.255.255.0
     crypto map mapx
    !
    crypto map mapx 6 ipsec-isakmp
     set peer 3.3.3.3
     set transform-set transx
     match address aclx
    !
    crypto ipsec transform-set transx esp-3des hmac
    !
    ip access-list extended aclx
     permit gre host 3.3.3.3 host 4.4.4.4

A configuration file is a sequence of command blocks consisting of a main command followed by zero or more indented subcommands. The first command specifies the name router1 of the router. It has no subcommands.
Any line beginning with ! is a comment line. The second command specifies an interface Ethernet0. It has two subcommands. The first specifies the IP address and mask of this interface. The second specifies the name mapx of an IPSec tunnel originating from this interface. The parameters of the IPSec tunnel are specified in the next command block. The main command specifies the name of the tunnel, mapx. The subcommands specify the address of the remote endpoint of the IPSec tunnel, the set transx of cryptographic algorithms to be used, and the profile aclx of the traffic that will be secured by this tunnel. The next command block defines the set transx as consisting of the encryption algorithm esp-3des and the hash algorithm hmac. The last command block defines the traffic profile aclx as any packet with protocol, source address, and destination address equal to gre, 3.3.3.3, and 4.4.4.4, respectively. Part of an ANTLR grammar for recognizing the above file is:

    commands: command NL (rest=commands | EOF)
                -> ^(COMMAND command $rest?);
    command: ('interface') => interface_cmd
           | ('crypto') => crypto_cmd
           | ('ip') => ip_cmd
           | unparsed_cmd;
    interface_cmd: 'interface' ID (LEADINGWS interface_subcmd)*
                -> ^('interface' ID interface_subcmd*);
    interface_subcmd: 'ip' 'address' a1=ADDR a2=ADDR -> ^('address' $a1 $a2)
           | 'crypto' 'map' ID -> ^(CRYPTO_MAP ID)
           | unparsed_subcmd;

The first grammar rule states that commands is a sequence of one or more command blocks. The ^ symbol is a directive to construct the abstract syntax tree whose root is the symbol COMMAND, whose first child is the command block just read, and whose second child is the tree representing the sequence of subsequent command blocks. The next rule states that a command block begins with the keywords interface, crypto, or ip. The symbol => means no backtracking. The last line in this rule states that if a command block does not begin with any of these identifiers, it is skipped. Skipping is done via the unparsed_cmd symbol.
Grammar rules defining it skip all tokens till the beginning of the next command block. The last two rules define the structure of an interface command block. ANTLR produces a parser that processes the above file and outputs an abstract syntax tree. This tree is then analyzed to create the tables below. Note that the ipsec table assembles information from the interface, crypto map, crypto ipsec, and ip access-list command blocks.

    ipAddress Table
    Host      Interface   Address   Mask
    router1   Ethernet0   1.1.1.1   255.255.255.0

    ipsec Table
    Host      SrcAddr   DstAddr   EncryptAlg   HashAlg   Filter
    router1   1.1.1.1   3.3.3.3   esp-3des     hmac      aclx

    acl Table
    Host      Filter   Protocol   SrcAddr   DstAddr   Perm
    router1   aclx     gre        3.3.3.3   4.4.4.4   permit

IP Assure’s vendor-neutral schema captures much of the configuration information for protocols it covers. Its skipping idea allows one to parse a file without recognizing the structure of all possible commands and command blocks. However, the idea is quite hard to get right in the ANTLR framework. One is trying to avoid writing a grammar for the skipped part of the language, yet the only method one can use is to write rules defining unparsed_cmd.

9.4.2 Requirement Library

9.4.2.1 Requirements on Integrity of Logical Structures

A very useful class of requirements is on the integrity of logical structures associated with different protocols. Before a group of components executing a protocol can accomplish an intended joint goal, various logical structures spanning these components must be set up. These structures are set up by making component configurations satisfy definite constraints. For example, before packets flowing between two interfaces can be secured via IPSec, the IPSec tunnel logical structure must be set up. This is done by setting IPSec configuration parameters at the two interfaces and ensuring that their values satisfy definite constraints.
For example, the two interfaces must use the same hash and encryption algorithms, and the remote tunnel endpoint at each interface must equal the IP address of its counterpart.

A Hot Standby Router Protocol (HSRP) [44] router cluster is another example of a logical structure. It allows two or more routers to behave as a single router by offering a single virtual IP address to the outside world, on a given subnet. This address is mapped to the real address of an interface on the primary router. If this router fails, another router takes over the virtual address. Before the cluster correctly functions, however, the same virtual address and HSRP group identifier must be configured on all interfaces and the virtual and all physical addresses must belong to the same subnet.

Much more complex logical structures are set up for BGP. Different routers in an autonomous system (AS) connect to different neighboring ASes, giving each router only a partial view of BGP routes. To allow all routers in an AS to construct a complete view of routes, routers exchange information between themselves via iBGP (internal BGP) sessions. The simplest logical structure for accomplishing this exchange is a full-mesh of iBGP sessions, one for each pair of routers. But a full-mesh is impractical for a large AS, since the number of sessions grows quadratically with the number of routers. Linear growth is accomplished with a hub-and-spoke structure. All routers exchange routes with a hub called a route reflector. If these structures are incorrectly set up, protocol oscillations, forwarding loops, traffic blackholes, and violation of business contracts can arise [6, 31, 74]. See Section 9.6.4 for more discussion of BGP validation.

IP Assure evaluates requirements on integrity of logical structures associated with all common protocols.
These structures include IP subnets, GRE tunnels, IPSec tunnels, MPLS [60] tunnels, BGP full-mesh or hub-and-spoke structures, OSPF subnets and areas, and HSRP router clusters.

9.4.2.2 Connectivity Requirements

Connectivity (also called reachability) is a fundamental requirement of a network. It means the existence of a path between two nodes in the network. The most obvious network is an IP network whose nodes represent subnets and routers and whose links represent direct connections between these. But as noted in Section 9.2, connectivity requirements are also meaningful for many other types of networks such as GRE, IPSec, and BGP. IP Assure evaluates connectivity for IP, VLANs, GRE, IPSec, BGP, and MPLS networks.

IP Assure also evaluates reachability in the presence of access-control policies, or lists, configured on routers or firewalls. An access-control list is a collection of rules specifying the IP packets that are permitted or denied based on their source and destination address, protocol, and source and destination ports. These rules are order-dependent. Given a packet, the rules are scanned from the top down and the permit or deny action associated with the first matching rule is taken. Even if a path exists, a given packet may fail to reach a destination because an access-control list denies that packet.

9.4.2.3 Reliability Requirements

Reliability in a network means the ability to maintain connectivity in the presence of failures of nodes or links. A single point of failure for connectivity between two nodes in a network is said to exist if a single failure causes connectivity between the two nodes to be lost. Reliability is achieved by provisioning backup resources and setting up a reliability protocol. This protocol monitors for failures and when one occurs, finds backup resources and attempts to restore connectivity using those.

Configuration errors may prevent backup resources from being provisioned.
For example, in Section 9.2, some GRE tunnels were configured in only one direction, so they were unavailable for rerouting. Even if backup resources have been provisioned, configuration errors in the routing protocol can prevent these resources from being found. For example, in Section 9.2, BGP was simply not configured to run over some GRE tunnels, so it would not find these links to reroute over.

The architecture of the fault-tolerance protocol itself can introduce a single point of failure. For example, a nonzero OSPF area may be connected to OSPF area zero by a single area-border-router. If that router fails, then OSPF will fail to discover alternate routes to another area [36] even if these exist. Similarly, unless BGP route reflectors are replicated, they can become single points of failure [7].

Furthermore, redundant resources at one layer must be mapped to redundant resources at lower layers. For example, if all GRE tunnels originate at the same physical interface on a router, then if that interface fails, all tunnels would simultaneously fail. Ideally, all GRE tunnels originating at a router must originate at distinct interfaces on that router.

Single points of failure can also arise out of the dependence between security and reliability. As shown in Fig. 9.5, routers R1 and R2 together constitute an HSRP cluster with R1 as the primary router. This cluster forms the gateway between an enterprise’s internal network on the right and the WAN on the left. For security, an IPSec tunnel is configured from R1 to the gateway router C of a collaborating site. However, this tunnel is not replicated on R2. Consequently, if R1 fails, then R2 would take over the cluster’s virtual address; however, IPSec connectivity to C would be lost.
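The HSRP example above suggests a simple check: every IPSec tunnel configured on one member of a cluster should have a counterpart on every other member. The following Python sketch illustrates that idea under stated assumptions. It reuses the six-field ipsec tuple layout of Section 9.4.1 (Host, SrcAddr, DstAddr, EncryptAlg, HashAlg, Filter), while the function name `unreplicated_tunnels`, the list of cluster members, and the addresses are hypothetical. It compares only remote endpoints and does not check algorithms or traffic filters.

```python
def unreplicated_tunnels(cluster_members, ipsec_tuples):
    """Return (remote_endpoint, router) pairs where `router` belongs to the
    cluster but lacks an IPSec tunnel to `remote_endpoint` that some other
    cluster member has."""
    remotes_by_host = {host: set() for host in cluster_members}
    for host, _src, dst, *_rest in ipsec_tuples:
        if host in remotes_by_host:
            remotes_by_host[host].add(dst)
    # Every remote endpoint that any cluster member has a tunnel to.
    all_remotes = set().union(*remotes_by_host.values())
    return sorted(
        (remote, host)
        for host, remotes in remotes_by_host.items()
        for remote in all_remotes - remotes
    )

# Hypothetical data mirroring the situation described for Fig. 9.5:
# R1 has a tunnel to gateway C (3.3.3.3 here), R2 does not.
cluster = ["R1", "R2"]
ipsec = [("R1", "10.0.0.1", "3.3.3.3", "esp-3des", "hmac", "aclx")]
gaps = unreplicated_tunnels(cluster, ipsec)
print(gaps)
```

An empty result would mean that, as far as remote endpoints are concerned, a failover inside the cluster does not by itself lose IPSec connectivity.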
Reliability requirements that IP Assure evaluates include absence of single points of failure in IP networks, with and without access-control policies; absence of single OSPF area-border-routers; and replication of IPSec tunnels in an HSRP cluster.

Fig. 9.5 HSRP cluster (labels: C, WAN, R1, R2, HSRP Cluster, IPSec Tunnel 1, IPSec Tunnel 2, X, Internal network)

9.4.2.4 Security Requirements

Typical network security requirements are about data confidentiality, data integrity, authentication, and access-control. IPSec is commonly used to satisfy the first three requirements and access-control lists are used to satisfy the last one. Access-control lists were discussed in Section 9.4.2.2. Components dedicated just to processing access-control lists are called firewalls. IP Assure evaluates requirements for both these technologies. For IPSec, it evaluates the tunnel integrity requirements in Section 9.4.2.1. For access-control lists, IP Assure evaluates two fundamental requirements. The first is that one access-control list subsumes another, that is, any packet permitted by the second is also permitted by the first. A related requirement is that one list is equivalent to another in that any packet permitted by one is permitted by the other. Two lists are equivalent if each subsumes the other. An enterprise may have multiple egress firewalls. Access-control lists on these may have been set up by different administrators over different periods of time. It is useful to check that the policies governing packets that leave the enterprise are equivalent.

The second requirement that IP Assure evaluates on access-control lists is that a firewall has no redundant rules. A rule is redundant if deleting it will not change the set of packets a firewall permits. Deleting redundant rules makes lists compact and easier to understand and maintain.
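The definitions of first-match evaluation, subsumption, equivalence, and redundancy can be made concrete with a toy model. The Python sketch below checks them by brute-force enumeration over a deliberately tiny address domain, which is only workable for illustration; it is not how a production tool would evaluate real policies. The rule format `(src_lo, src_hi, dst_lo, dst_hi, action)`, the function names, the sample lists, and the assumption that a packet matching no rule is denied are all choices made for this sketch.

```python
from itertools import product

def permits(acl, src, dst):
    """First-match semantics: scan rules top down, take the first match.
    Assumption for this sketch: a packet matching no rule is denied."""
    for s_lo, s_hi, d_lo, d_hi, action in acl:
        if s_lo <= src <= s_hi and d_lo <= dst <= d_hi:
            return action == "permit"
    return False

def permitted_set(acl, domain):
    """All (src, dst) pairs in the toy domain that the list permits."""
    return {(s, d) for s, d in product(domain, repeat=2) if permits(acl, s, d)}

def subsumes(first, second, domain):
    """True if every packet permitted by `second` is permitted by `first`."""
    return permitted_set(second, domain) <= permitted_set(first, domain)

def equivalent(a, b, domain):
    """Two lists are equivalent if each subsumes the other."""
    return subsumes(a, b, domain) and subsumes(b, a, domain)

def redundant_rules(acl, domain):
    """Indices of rules whose deletion leaves the permitted set unchanged."""
    return [i for i in range(len(acl))
            if equivalent(acl[:i] + acl[i + 1:], acl, domain)]

domain = range(8)                      # toy address space 0..7
acl_a = [(1, 2, 3, 4, "deny"), (0, 7, 0, 7, "permit")]
acl_b = [(5, 6, 5, 6, "permit")]
acl_c = [(1, 2, 3, 4, "deny"), (1, 1, 3, 3, "deny"), (0, 7, 0, 7, "permit")]

print(subsumes(acl_a, acl_b, domain))  # acl_a permits everything acl_b does
print(redundant_rules(acl_c, domain))  # second rule is shadowed by the first
```

In `acl_c` the second rule can never be the first match, because the first rule already covers its whole range, so deleting it leaves the permitted set unchanged.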
9.4.2.5 Performance Requirements

The DiffServ [19] protocol allows one to specify policies for partitioning packets into different classes, and then for according them differentiated performance treatment. For example, a packet with a higher DiffServ class is given transmission priority over one with a lower class. Typically, voice packets are given highest priority because of the high sensitivity of voice quality to end-to-end delays. Performance requirements that IP Assure evaluates are that all DiffServ policies on all routers are identical, and that any policy that is defined is actually used by being associated with an interface. IP Assure also evaluates the requirement that ICMP packets are not blocked. This is a sufficient condition for avoiding the packet loss, discussed in Section 9.2, that is due to mismatched MTU sizes and the setting of Do Not Fragment bits.

9.4.2.6 Government Regulatory Requirements

Government regulatory requirements represent “best practices” that have evolved over a period of time. Compliance with these is deemed essential for connectivity, reliability, security, and performance of an organization’s network. Compliance with certain regulations such as the Federal Information Security Management Act (FISMA) [26] is mandatory for government organizations. Two examples of FISMA requirements are (a) alternate communications services do not share a single point of failure with primary communication services, and (b) all access between nodes internal to an enterprise and those external to it is mediated by a proxy server. IP Assure allows specification of a large number of FISMA requirements.

9.4.3 Specification Language

IP Assure’s specification language is that of graphical templates. It offers a menu of more than 100 requirements in different categories. A user can select one or more of these to be evaluated. For each requirement, one can specify its parameters. For example, for a reachability requirement, one can specify the source and destination.
For an access-control list equivalence requirement, one can specify the two lists. One cannot apply disjunction or quantification operators to requirements. The only way to define new requirements is to program in Java and SQL. Figure 9.6 shows a few requirement classes that can be evaluated. These are QoS (DiffServ), HSRP, OSPF, BGP, and MPLS.

Fig. 9.6 IP Assure requirement specification screen

9.4.4 Evaluation System

Structural integrity requirements are evaluated with algorithms specialized to each requirement. In IP Assure, these algorithms are implemented with SQL and Java. The relevant tuples from the configuration database are extracted with SQL and analyzed by Java programs. For example, to evaluate whether an IPSec tunnel between two addresses local1 and local2 is set up, one checks that there are tuples ipsec(h1, local1, remote1, ea1, ha1, filter1) and ipsec(h2, local2, remote2, ea2, ha2, filter2) in the configuration database, and that local1 = remote2, remote1 = local2, ea1 = ea2, ha1 = ha2 and filter1 is a mirror image of filter2.

Reachability and reliability requirements for a network are evaluated by extracting the relevant graph information from the configuration database with SQL queries, then applying graph algorithms [63]. For example, given the tuple ipAddress(host, interface, address, mask), one creates two nodes, the router host and the subnet whose address is the bitwise-and of address and mask, and then creates directed edges linking these in both directions. This step is repeated for all such tuples to compute an IP network graph. To evaluate whether a node or a link is a single point of failure between two given nodes, one removes it from the graph and checks whether the two nodes can still reach each other. If not, then the deleted node or link is a single point of failure.
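The graph construction and single-point-of-failure check just described can be sketched in a few lines. This is a hedged Python illustration, not IP Assure's SQL and Java code. The sample ipAddress tuples, host names, and function names are hypothetical, and only node removal is shown (link removal would be analogous).

```python
from collections import deque
from ipaddress import IPv4Address

def build_graph(ip_address_tuples):
    """One router node and one subnet node per tuple, linked in both directions."""
    graph = {}
    for host, _interface, address, mask in ip_address_tuples:
        subnet_int = int(IPv4Address(address)) & int(IPv4Address(mask))
        router_node = ("router", host)
        subnet_node = ("subnet", str(IPv4Address(subnet_int)))
        graph.setdefault(router_node, set()).add(subnet_node)
        graph.setdefault(subnet_node, set()).add(router_node)
    return graph

def reachable(graph, src, dst, removed=None):
    """Breadth-first search from src to dst, ignoring the removed node."""
    if removed in (src, dst):
        return False
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def single_points_of_failure(graph, src, dst):
    """Nodes whose removal disconnects src from dst."""
    return sorted(n for n in graph
                  if n not in (src, dst)
                  and not reachable(graph, src, dst, removed=n))

# Hypothetical configuration database rows: (host, interface, address, mask).
MASK = "255.255.255.0"
tuples = [
    ("r1", "e0", "10.0.1.1", MASK),
    ("r2", "e0", "10.0.1.2", MASK), ("r2", "e1", "10.0.2.1", MASK),
    ("r3", "e0", "10.0.1.3", MASK), ("r3", "e1", "10.0.2.2", MASK),
    ("r4", "e0", "10.0.2.3", MASK), ("r4", "e1", "10.0.3.1", MASK),
    ("r5", "e0", "10.0.3.2", MASK),
]
graph = build_graph(tuples)
spofs = single_points_of_failure(graph, ("router", "r1"), ("router", "r5"))
```

In this toy network r2 and r3 back each other up, so neither is reported, while r4 and the three subnets are single points of failure between r1 and r5.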
To check reachability in the presence of access-control lists, all edges at which these lists block a given packet are deleted, and then reachability analysis is repeated for the remaining graph.

Firewall requirements cannot be evaluated by enumerating all possible packets and checking for subsumption, equivalence, or redundancy. The number of combinations is astronomical: for IPv4, the source and destination addresses, source and destination ports, and protocol together span 2^104 combinations (32 + 32 + 16 + 16 + 8 bits). Instead, symbolic techniques are used. Each policy is represented as a constraint on the following fields of a packet: source and destination address, protocol, and source and destination ports. The constraint is true precisely for those packets that are permitted by the firewall, taking rule ordering into account. Let P1 and P2 be two policies and C1 and C2 be, respectively, the constraints representing them. Each constraint can be constructed in time linear in the number of rules. Then, P1 is subsumed by P2 if there is no solution to the constraint C1 ∧ ¬C2. To check that a rule in P1 is redundant, delete it from P1 and check that the resulting policy is equivalent to P1. For example, let a firewall contain the following rules that, for simplicity, only check whether the source and destination addresses are in definite ranges:

    1, 2, 3, 4, deny
    5, 6, 7, 8, permit
    10, 15, 15, 20, permit

The first rule states that any packet with source address between 1 and 2 and destination address between 3 and 4 is denied. Similarly, for the second and third rules. These are represented by the following constraint C1 on the variables src and dst. : (1=, >D, and bitwise logic operators. This QFF is then efficiently solved by Kodkod. If ConfigAssure is unable to find a solution, it outputs a proof of unsolvability, inherited from Kodkod.
This proof is interpreted as a root-cause and guides configuration repair. Arithmetic quantifier-free forms constitute a good intermediate language between Boolean logic and first-order logic. Not only is it easy to express requirements in it, but it can also be efficiently compiled into Boolean logic. ConfigAssure was designed to avoid, where possible, the generation of very large intermediate constraints in Kodkod’s transformation of first-order logic into Boolean logic.

If the fields that are responsible for making a requirement false are known, then one way to repair these is as follows: replace these fields with variables and use ConfigAssure to find new values of these variables that make the requirement true. Two approaches can be used to narrow down these fields. The first exploits the proof of unsolvability of the falsified requirement to compute a type of root-cause. The second exploits properties of Datalog proofs and ZChaff to compute that set of fields whose cost of change is minimal. The second approach has been developed in the MulVAL [35, 55, 56] system. More generally, MulVAL is a system for enterprise security analysis using attack graphs.

Ordered Binary Decision Diagrams are an alternative to SAT solvers for evaluating firewall policy subsumption and rule redundancy with a method conceptually similar to that in Section 9.4.4. The use of these techniques for building different parts of a validation system is now illustrated with concrete examples based on the case study in Section 9.2.

9.5.1 Configuration Acquisition by Querying

When the structure of a configuration file is simple, as it is for Cisco’s IOS, then it is not necessary to write a grammar with ANTLR or PADS/ML [47]. Instead, the structure can be put into a command database and then queried to construct the configuration database. The query needs to refer only to that part of the command database necessary to construct a given table. All other parts are ignored.
This idea provides substantial resilience to insertion of new command blocks, insertion of new subcommands in a known command block, and insertion of new keywords in subcommands. This idea is illustrated using Prolog, although any database engine could be used. Each command block is transformed into an ios_cmd tuple or Prolog fact, with the structure

    ios_cmd(FileName, MainCommand, ListOfSubCommands)

where MainCommand and each item in ListOfSubCommands is of the form [NestingLevel | ListOfTokens]. [A|B] means the list with head A and tail B. For example, the IOS file of Section 9.4.1, named f here, is transformed into the following Prolog tuples:

    ios_cmd(f, [0, hostname, router1], []).
    ios_cmd(f, [0, interface, 'Ethernet0'],
        [[1, ip, address, '1.1.1.1', '255.255.255.0'],
         [1, crypto, map, mapx]]).
    ios_cmd(f, [0, crypto, map, mapx, 6, 'ipsec-isakmp'],
        [[1, set, peer, '3.3.3.3'],
         [1, set, 'transform-set', transx],
         [1, match, address, aclx]]).
    ios_cmd(f, [0, crypto, ipsec, 'transform-set', transx, 'esp-3des', hmac], []).
    ios_cmd(f, [0, ip, 'access-list', extended, aclx],
        [[1, permit, gre, host, '3.3.3.3', host, '4.4.4.4']]).

Note the close correspondence between the structure of command blocks in the IOS file and associated ios_cmd tuples. One can now write Prolog rules to construct the configuration database. For instance, to construct rows for the ipAddress table, one can use:

    ipAddress(H, I, A, M) :-
        ios_cmd(File, [0, hostname, H|_], _),
        ios_cmd(File, [0, interface, I|_], Args),
        member(SubCmd, Args),
        subsequence([ip, address, A, M], SubCmd).

The syntactic convention followed in Prolog is that identifiers beginning with capital letters are variables, otherwise they are constants. The :- symbol is a shorthand for if. All variables are universally quantified.
The rule states that ipAddress of an interface I on host H is A with mask M if there is a File containing a hostname command declaring host H, an interface command declaring interface I, and a subcommand of that command declaring its address and mask to be A and M, respectively.

Note that this definition is unaffected by subcommands of the interface command that are not of interest for computing ipAddress, or that are defined in a subsequent IOS release. It only tries to find a subcommand containing the sequence [ip, address, A, M]. It does not require that the subcommand be in a definite position in the block, or that the sequence address A, M appear in a definite position in the ip subcommand. Now, where H, I, A, M are variables, the query ipAddress(H, I, A, M) will succeed with the solution H = router1, I = 'Ethernet0', A = '1.1.1.1' and M = '255.255.255.0'. Here router1 is the host declared in file f, I is an interface on this host, and A and M its address and mask, respectively.

The ipsec table is more complex, but querying simplifies the assembly of information from different parts of a configuration file. For each interface, one finds the name of a crypto map Map applied to that interface, and then finds the corresponding crypto map command, from which one can extract the peer address Peer, the filter Filter, and transform-set Transform. These values are used to select the crypto ipsec command from which the Encrypt and Hash values are extracted. Thus, the tuple ipsec(H, Address, Peer, Encrypt, Hash, Filter) is constructed.
    ipsec(H, Address, Peer, Encrypt, Hash, Filter) :-
        ios_cmd(File, [0, interface, I|_], Args),
        member([_, crypto, map, Map|_], Args),
        ios_cmd(File, [0, hostname, H|_], _),
        ipAddress(H, I, Address, _),
        ios_cmd(File, [0, crypto, map, Map|_], CArgs),
        member([_, set, peer, Peer|_], CArgs),
        member([_, match, address, Filter|_], CArgs),
        member([_, set, 'transform-set', Transform|_], CArgs),
        ios_cmd(File, [0, crypto, ipsec, 'transform-set', Transform, Encrypt, Hash], _).

The ipAddress and ipsec tuples are constructed in all possible ways via Prolog backtracking. Together, these form the configuration database for these protocols.

9.5.2 Specification Language

This section shows how Prolog can be used to specify the types of requirements in the case study of Section 9.2. It has already been used to validate VPN and BGP requirements [50, 58]. As shown in Fig. 9.9, routers RA and RB are in the same COI but RX is in a different COI. RA’s configuration violates two security requirements and one connectivity requirement. First, RA has a GRE tunnel into RX. Second, RA has a default static route using which it can forward packets destined to RX, to the WAN. Third, RA does not have a GRE tunnel into RB. All these violations need to be detected and configurations repaired.

Fig. 9.9 Network violating security and connectivity requirements (figure labels: RA in COI1 with eth_0 address = 100, RB in COI1 with eth_0 address = 200, RX in COI2 with eth_0 address = 300, WAN, tunnel_0)

A configuration database for the above network is represented by the following Prolog tuples:

    static_route(ra, 0, 32, 400).
    gre(ra, tunnel_0, 100, 300).
    ipAddress(ra, eth_0, 100, 0).
    ipAddress(rb, eth_0, 200, 0).
    ipAddress(rx, eth_0, 300, 0).
    coi([ra-coi1, rb-coi1, rx-coi2]).

The first tuple states that router ra has a default static route with a next hop of address 400. Normally, a mask is a sequence of 32 bits containing a sequence of ones followed by a sequence of zeros.
In the static_route and ipAddress tuples, a mask is represented implicitly as the number of zeros at the end of the sequence. This simplifies the computations we need. The route is called “default” because any address matches it. The second tuple states that router ra has a GRE tunnel originating from GRE interface tunnel_0 with local physical address 100 and remote physical address 300. The third tuple states that router ra has a physical interface eth_0 with address 100 and mask 0. Similarly, for the fourth and fifth tuples. The last tuple lists the community of interest of each router. Requirements are defined with Prolog clauses, e.g.:

    good :- gre_connectivity(ra, rb).

    gre_connectivity(RX, RY) :-
        gre_tunnel(RX, RY),
        route_available(RX, RY).

    gre_tunnel(RX, RY) :-
        gre(RX, _, _, RemoteAddr),
        ipAddress(RY, _, RemoteAddr, _).

    route_available(RX, RY) :-
        static_route(RX, Dest, Mask, _),
        ipAddress(RY, _, RemotePhysical, 0),
        contained(Dest, Mask, RemotePhysical, 0).

    contained(Dest, Mask, Addr, M) :-
        Mask >= M,
        N is ((2^32-1) << Mask) /\ Dest,
        N is ((2^32-1) << Mask) /\ Addr.

    bad :- gre_tunnel(ra, rx).
    bad :- route_available(ra, rx).

The first clause states that good is true provided there is GRE connectivity between routers ra and rb since they are in the same COI. The second clause states that there is GRE connectivity between any two routers RX and RY provided RX has a GRE tunnel configured to RY and a route available to RY. The third clause states that a GRE tunnel to RY is configured on RX provided there is a GRE tuple on RX whose remote address is that of an interface on RY. The fourth clause states that a route to RY is available on RX provided an address RemotePhysical on RY is contained within the address range of a static route on RX. The fifth clause checks this containment. << is the left-shift operator and /\ is the bitwise-and operator, not to be confused with the conjunction operator.
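The containment check is easy to get wrong because masks here count trailing zero bits rather than leading ones. As a hedged illustration, the following Python function mirrors the Prolog contained clause. The function and constant names are chosen for this sketch, and the integer addresses are the toy values from the example database.

```python
ALL_ONES = 2**32 - 1  # 32 one-bits, as in the Prolog expression (2^32-1)

def contained(dest, mask, addr, m):
    """Mirror of the Prolog `contained` clause.

    `mask` and `m` count the trailing zero bits of a mask, so mask = 32
    ignores every bit (a default route) and mask = 0 compares all bits.
    """
    if mask < m:
        return False
    prefix_mask = ALL_ONES << mask  # clears the low `mask` bits of any 32-bit value
    return (prefix_mask & dest) == (prefix_mask & addr)

# The default route static_route(ra, 0, 32, 400) covers every address:
default_covers_rx = contained(0, 32, 300, 0)
default_covers_rb = contained(0, 32, 200, 0)
# A host route to 200 (mask 0) covers 200 but not 300:
host_covers_rb = contained(200, 0, 200, 0)
host_covers_rx = contained(200, 0, 300, 0)
```

The first three results are true and the last is false, which is why the default route makes route_available succeed for every router.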
The sixth clause states that bad is true provided there is a GRE tunnel between ra and rx since ra and rx are not in the same COI. The last clause states that bad is also true provided a route on ra is available for packets with a destination on rx.

We now show how to capture requirements containing quantifiers. To capture the requirement all_good that between every pair of routers in a COI there is GRE connectivity, we can write:

    all_good :- not(same_coi_no_gre).
    same_coi_no_gre :- same_coi(X, Y), not(gre_connectivity(X, Y)).
    same_coi(X, Y) :- coi(L), member(X-C, L), member(Y-C, L).

The first rule states all_good is true provided same_coi_no_gre is false. The second rule states that same_coi_no_gre is true provided there exist X and Y that are in the same COI but for which gre_connectivity(X, Y) is false. The last rule states that X and Y are in the same COI provided there is some COI C such that X-C and Y-C are in the COI association list L. Similarly, we can capture the requirement no_bad that no router contains a route to a router in a different COI.

As previously mentioned, the MulVAL system has proposed the use of Datalog for specification and analysis of attack graphs. Datalog is a restriction of Prolog in which arguments to relations are just variables or atomic terms, i.e., no complex terms and data structures. This restriction means, in particular, that predicates such as all_good and all_pairs_gre cannot be specified and neither can subnet_id since it needs bitwise operations. However, the first five Prolog tuples above and the first three rules can be specified. This restriction, however, permits MulVAL to perform fine-grained analysis of root-causes of configuration errors and to compute strategies for their repair. This is discussed in the next section.

9.5.3 Evaluation for Repair

If a configuration database and requirements are expressed in Prolog, then its query capability can be used to evaluate whether requirements are true.
For example, the query route_available(ra, rb) is evaluated to be true by clauses for route_available, static_route, and contained. The query bad succeeds for two reasons. First, the static route on ra is a default route. It forwards packets to any destination, including to destinations in a different COI. Second, a GRE tunnel to router rx is configured on ra even though rx is in a different COI. On the other hand, the query good fails. This is because the predicate gre_tunnel(ra, rb) fails. The only GRE tunnel configured on ra is to rx, not to rb.

If requirement evaluation against a configuration database is the only goal, then a Prolog-based validation system is practical on a realistic scale. However, if a requirement is false for a configuration database and the goal is to change some fields in some tuples so that the requirement becomes true, then Prolog is not adequate. The Prolog query (good,not(bad)), representing the conjunction of good and not(bad), will simply fail. Prolog will not return new values of these fields that make the query true. In order to efficiently compute new values of these fields, a constraint solver with the capability to compute a proof of unsolvability is needed. Such a capability is provided by the ConfigAssure system.

ConfigAssure allows one to replace some fields in some tuples in a configuration database with configuration variables. These variables are unrelated to Prolog variables. ConfigAssure also allows one to specify a requirement R as an equivalent QFF RC on these configuration variables. Solving RC would compute new values of these fields, in effect repairing the fields. For example, suppose we suspect that the query (good,not(bad)) fails because addresses and the static route mask are incorrect. We can replace all these with configuration variables to obtain the following database:

    static_route(ra, dest(0), mask(0), 400).
    gre(ra, tunnel_0, gre_a_local(0), gre_a_remote(0)).
    ipAddress(ra, eth_0, ra_addr(0), 0).
    ipAddress(rb, eth_0, rb_addr(0), 0).
    ipAddress(rx, eth_0, rx_addr(0), 0).
    coi([ra-coi1, rb-coi1, rx-coi2]).

Here, dest(0), mask(0), gre_a_local(0), gre_a_remote(0), ra_addr(0), rb_addr(0), rx_addr(0) are all configuration variables. In order that this database satisfy (good ∧ not(bad)), these configuration variables must satisfy the following constraint RC:

    ¬ gre_a_remote(0)=rx_addr(0) ∧
    ¬ contained(dest(0), mask(0), rx_addr(0), 0) ∧
    gre_a_remote(0)=rb_addr(0) ∧
    contained(dest(0), mask(0), rb_addr(0), 0) ∧
    ¬ ra_addr(0)=rb_addr(0) ∧ ¬ rb_addr(0)=rx_addr(0) ∧ ¬ rx_addr(0)=ra_addr(0)

The constraint on the first two lines is equivalent to not(bad). It states that ra should neither have a GRE tunnel nor a static route to rx. The constraint on the next two lines is equivalent to good. It states that ra should have both a GRE tunnel and a static route to rb. The constraint on the last line states that all interface addresses are unique. Solving this constraint would indeed find new values of configuration variables and hence repair the fields. However, one may change fields, such as ra_addr(0), unrelated to the failure of (good,not(bad)).

To change fields only related to failure, one can exploit the proof of unsolvability that ConfigAssure automatically computes when it fails to solve a requirement. This proof is typically a small, unsolvable part of the requirement, and can be taken to be a root-cause of unsolvability. The idea is to generate a new constraint InitVal that is a conjunction of equations of the form x = c where x is a configuration variable that replaced a field and c is the initial value of that field. Now try to solve RC ∧ InitVal. Since R is false for the database without variables, ConfigAssure will find RC ∧ InitVal to be unsolvable and return a proof of unsolvability. If, in this proof, there is an equation x = c that is also in InitVal, then relax the value of x by deleting x = c from InitVal to create InitVal'.
Reattempt a solution to RC ∧ InitVal' to find a new value of x. More than one such equation can be deleted in a single step. For example, the definition of InitVal for the above configuration variables is:

    dest(0)=0 ∧ mask(0)=32 ∧ gre_a_local(0)=100 ∧ gre_a_remote(0)=300 ∧
    ra_addr(0)=100 ∧ rb_addr(0)=200 ∧ rx_addr(0)=300

Submitting RC ∧ InitVal to ConfigAssure generates a proof of unsolvability that ra should have a tunnel to rb but instead has one to rx:

    gre_a_remote(0)=rb_addr(0) ∧ gre_a_remote(0)=300 ∧ rb_addr(0)=200

Deleting the second equation from InitVal to obtain InitVal' and solving RC ∧ InitVal' we obtain another proof of unsolvability that ra has a static route to rx:

    rx_addr(0)=300 ∧ dest(0)=0 ∧ mask(0)=32 ∧ ¬ contained(dest(0), mask(0), rx_addr(0), 0)

Deleting the second and third equations and solving, we obtain a solution that fixes both the GRE tunnel and the static route on ra:

    dest(0)=200
    mask(0)=0
    gre_a_remote(0)=200
    gre_a_local(0)=100
    ra_addr(0)=100
    rb_addr(0)=200
    rx_addr(0)=300

Values of just the first three variables needed to be recomputed. Values of the others did not. Note that ra_addr(0) never appeared in a proof of unsolvability even though it did in RC. Thus, its value definitely does not need to be recomputed. This is not obvious from RC. Note also that repair is holistic in that it satisfies both good and not(bad).

The remaining task is generation of the constraint RC. It is accomplished by thinking about specification as a method of computing an equivalent quantifier-free formula, i.e., defining the predicate eval(Req, RC) where Req is the name of a requirement and RC is a QFF equivalent to Req. The original Prolog specification of Req in Section 9.5.2 is no longer needed. It is replaced by a metalevel version as follows:

    eval(bad, or(C1, C2)) :-
        eval(gre_tunnel(ra, rx), C1),
        eval(route_available(ra, rx), C2).

    eval(gre_tunnel(RX, RY), RemoteAddr=Addr) :-
        gre(RX, _, _, RemoteAddr),
        ipAddress(RY, _, Addr, _).
    eval(route_available(RX, RY), C) :-
        static_route(RX, Dest, Mask, _),
        ipAddress(RY, _, RemotePhysical, _),
        C = contained(Dest, Mask, RemotePhysical, 0).

    eval(addr_unique, C) :-
        andEach([not(ra_addr(0)=rb_addr(0)),
                 not(rb_addr(0)=rx_addr(0)),
                 not(rx_addr(0)=ra_addr(0))], C).

    eval(topReq, C) :-
        eval(good, G),
        eval(bad, B),
        eval(addr_unique, AU),
        andEach([G, not(B), AU], C).

These rules capture the semantics of the Prolog rules. The first states that a QFF equivalent to bad is the disjunction of C1 and C2 where C1 is the QFF equivalent to gre_tunnel(ra, rx) and C2 is the QFF equivalent to route_available(ra, rx). The second rule states that the QFF equivalent to gre_tunnel(RX, RY) is RemoteAddr=Addr where RemoteAddr is the remote physical address of a GRE tunnel on RX and Addr is the address of an interface on RY. The third rule states that the QFF equivalent to route_available(RX, RY) is C provided C is the constraint that RX contains a static route for an address on RY. The fourth rule computes the QFF for all interface addresses being unique. The last rule computes the QFF for the top-level constraint topReq, which conjoins good, the negation of bad, and address uniqueness. Now, the Prolog query eval(topReq, RC) computes RC as above. As has been shown in [51], QFFs are much more expressive than Boolean logic, so it is not hard to write requirements using the eval predicate.

9.5.4 Repair with MulVAL

The MulVAL system proposes an alternative, precise method of computing the fields that cause the success of an undesirable requirement provided that requirement is expressed in Datalog. A requirement, such as bad, is said to be undesirable if it enables adversary success. This method is based on the observation that any tuple in a proof of an undesirable requirement is responsible for the truth of that requirement. These tuples contain all the fields that need to be replaced by configuration variables.
For example, one proof of bad with the original Prolog specification in Section 9.5.2 is:

    bad
    gre_tunnel(ra, rx)
    gre(ra, tunnel_0, 100, 300) ∧ ipAddress(rx, eth_0, 300, 0)

Here, each condition is implied by its successor by the use of a rule in the Prolog specification. The second proof of bad is:

    bad
    route_available(ra, rx)
    static_route(ra, 0, 32, 400) ∧ ipAddress(rx, eth_0, 300, 0) ∧ contained(0, 32, 300, 0)

The tuples that contribute to the proof of bad are:

    gre(ra, tunnel_0, 100, 300)    -- from the first proof
    ipAddress(rx, eth_0, 300, 0)   -- from the first proof
    static_route(ra, 0, 32, 400)   -- from the second proof

The following tuples do not contribute to the proof of bad:

    ipAddress(ra, eth_0, 100, 0).
    ipAddress(rb, eth_0, 200, 0).

The three tuples in the proof of bad contain all the fields that need to be replaced by configuration variables. Note that the addresses of interfaces at ra and rb do not need to be replaced.

The MulVAL system does not actually compute new values of fields. It only computes the set of tuples that should be disabled to disable all proofs of the undesirable property. A tuple can be disabled by changing its fields to different values or deleting it. But, MulVAL computes the set in an optimal way. It first derives a Boolean formula representing all the ways in which tuples should be disabled, then solves this with a minimum-cost SAT solver. A solution represents a set of tuples to disable. For example, the Boolean formula for the above two proofs is:

    (¬gre(ra, tunnel_0, 100, 300) ∨ ¬ipAddress(rx, eth_0, 300, 0)) ∧
    (¬ipAddress(rx, eth_0, 300, 0) ∨ ¬static_route(ra, 0, 32, 400))

The first formula states that to disable the first proof, either the gre tuple or the ipAddress tuple must be disabled. The second formula states that to disable the second proof, either the ipAddress or the static_route tuple must be disabled. Costs are associated with disabling each tuple.
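To make the optimization concrete, the sketch below finds a cheapest set of tuples that breaks every proof. It is an illustration in Python, not MulVAL's method: it substitutes exhaustive search for the minimum-cost SAT solver, which is feasible only for a handful of tuples, and the cost values are hypothetical.

```python
from itertools import combinations

# Each proof is the set of database tuples it depends on (from the two proofs of bad).
GRE = "gre(ra, tunnel_0, 100, 300)"
IP_RX = "ipAddress(rx, eth_0, 300, 0)"
ROUTE = "static_route(ra, 0, 32, 400)"
proofs = [{GRE, IP_RX}, {IP_RX, ROUTE}]

# Hypothetical costs: disabling rx's address is assumed to be expensive.
cost = {GRE: 1, ROUTE: 1, IP_RX: 5}

def min_cost_disabling(proofs, cost):
    """Cheapest set of tuples containing at least one tuple from every proof."""
    tuples = sorted(set().union(*proofs))
    best, best_cost = None, None
    for k in range(len(tuples) + 1):
        for chosen in combinations(tuples, k):
            chosen_set = set(chosen)
            # A proof is disabled when any one of its tuples is disabled.
            if all(proof & chosen_set for proof in proofs):
                total = sum(cost[t] for t in chosen)
                if best_cost is None or total < best_cost:
                    best, best_cost = chosen_set, total
    return best, best_cost

to_disable, total_cost = min_cost_disabling(proofs, cost)
```

With these costs the search disables the gre and static_route tuples at a total cost of 2. If disabling the ipAddress tuple cost only 1, that single tuple would be chosen instead, since it appears in both proofs.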
The minimum-cost SAT solver computes that set of tuples whose cost of disabling is a minimum. For example, the cost of disabling the ipAddress tuple may be high because many requirements depend on this tuple. The cost of disabling the static_route and gre tuples may be a lot lower.

It is not, in general, simple to assign cost to disabling a tuple. Furthermore, this approach only computes how to disable an undesirable requirement. It does not guarantee that disabling these tuples will not also disable desirable requirements, unless these latter requirements are also expressed in Boolean logic and the combined constraint is solved.

9.5.5 Evaluating Firewall Requirements with Binary Decision Diagrams

Hamed et al. [34] evaluate firewall subsumption and rule redundancy using Ordered Binary Decision Diagrams [12]. Their algorithm is conceptually the same as in Section 9.4.4. It first transforms firewall policies into Boolean constraints upon source and destination addresses, source and destination ports, and the protocol. These constraints are true only for those packets that are permitted by the firewall. These fields are represented as sequences of Boolean variables, e.g., an address field as a sequence of 32 variables and a port field as a sequence of 16 variables. The algorithm then checks whether combinations of constraints for evaluating subsumption and redundancy have a solution. Since constraints are represented as Ordered Binary Decision Diagrams, this check is straightforward. By contrast, ConfigAssure represents the above fields as integer variables and represents a policy as an arithmetic quantifier-free form constraint. It lets Kodkod transform this into a Boolean constraint and use a SAT solver to check satisfiability.

9.6 Related Work

9.6.1 Configuration Acquisition by Type Inference

Another approach to parsing configuration files is to use the PADS/ML system [47].
Based on the functional language ML, PADS/ML describes the accepted language as if it were a type definition. PADS/ML supports the generation of a parser, a printer, a data structure representation, and a generic interface to this representation. The generated code is in the OCaml [43] language and additional tools, written in OCaml, then manipulate the internal data structure. This internal data structure is traversed to populate the relational database in the same way that the ANTLR abstract syntax tree is traversed.

Adaptive parsers are reported in [17]. These can modify the language they recognize when given examples of legal input. The inference system recognizes commands that are only handled in the abstract, much as the ANTLR grammar of IP Assure skips over some commands. Repeated instances of commands are used to generate new PADS/ML types, which are then further refined to provide access to fields in the commands. This means that as the IOS language evolves, the parser can evolve to provide an ever richer internal representation.

9.6.2 Symbolic Reachability Analysis

Instead of performing reachability analysis for each packet, a system for reachability analysis for sets of packets is described in Xie et al. [72]. This makes it possible to evaluate a requirement such as “a change in static routes at one or more routers does not change the set of packets that can flow between two nodes.” It is not feasible to evaluate such a requirement by enumerating all packets and checking reachability. In this system, the reachability upper bound is defined to be the union of all packets permitted by each possible forwarding path from the source to the destination. This bound models a security policy that denies some packets (i.e., those outside the upper bound) under all conceivable operational conditions. The reachability lower bound is defined to be the common set of packets allowed by every feasible forwarding path from the source to the destination.
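The two bounds amount to a union and an intersection over per-path packet sets. The Python sketch below illustrates only that set algebra and is not the algorithm of Xie et al.: it uses small explicit sets where the real system reasons symbolically, it uses one list of paths for both bounds, and the packet labels and filters are hypothetical.

```python
def packets_on_path(path_filters, all_packets):
    """Packets that pass every filter along one forwarding path."""
    allowed = set(all_packets)
    for permitted in path_filters:
        allowed &= permitted
    return allowed

def reachability_bounds(paths, all_packets):
    """Return (upper, lower): union and intersection of per-path packet sets."""
    per_path = [packets_on_path(p, all_packets) for p in paths]
    if not per_path:
        return set(), set()  # assumption: no path means nothing is reachable
    upper = set().union(*per_path)
    lower = set(all_packets)
    for allowed in per_path:
        lower &= allowed
    return upper, lower

# Hypothetical packet classes and per-hop filters on two candidate paths.
all_packets = {"http", "ssh", "dns", "smtp"}
path_a = [{"http", "ssh", "dns"}, {"http", "dns"}]  # two filters in sequence
path_b = [{"http", "smtp"}]                         # one filter
upper, lower = reachability_bounds([path_a, path_b], all_packets)
```

Here `upper` is {"http", "dns", "smtp"}, the packets some path can carry, and `lower` is {"http"}, the reachability lower bound that every path carries.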
The lower bound models a resilience policy that assures the delivery of some packets despite network faults, as long as a backup forwarding path exists. Algorithms are created for estimating the reachability upper and lower bounds from a network’s packet filter configurations. Moreover, the work shows that it is possible to jointly reason about how packet filters, routing, and packet transformations affect reachability.

An interesting implementation of reachability analysis for sets of packets is found in the ConfigChecker [3] system. It represents the network’s packet forwarding behavior as a giant state machine in which a state defines what packets are at what routers. However, the state-transition relation is not represented explicitly but rather symbolically as a constraint that must be satisfied by two states for the network to transition between these. This constraint itself is represented as an Ordered Binary Decision Diagram and input to a symbolic model checker [48]. Reachability requirements such as that above are expressed in Computation Tree Logic [48] and the symbolic model checker is used to evaluate these. The transition-relation also takes into account features such as IPSec tunnels, multicast, and network address translation.

9.6.3 Alloy Specification Language

Alloy [2, 39] is a first-order relational logic system. It lets one specify object types and their attributes. It also lets one specify first-order logic constraints on these attributes. These are more expressive than Prolog constraints. Alloy solves constraints by compiling these into Kodkod and using Kodkod’s constraint solver. The use of Alloy for network configuration management was explored in [49]. Alloy’s specification language is very appropriate for specifying requirements. All the requirements in Section 9.2 can be compactly expressed in Alloy. However, its constraint solver is inappropriate for evaluating requirements.
This is because the compilation of first-order logic into Boolean logic leads to very large intermediate constraints. Kodkod addresses this problem by its partial-model optimization that exploits knowledge about parts of the solution. If the value of a variable is already known, it does not appear in the constraint that is submitted to the SAT solver. ConfigAssure follows a related approach but at a higher layer. The intuition is that given a requirement, many parts of it can be efficiently solved with non-SAT methods. Solving these parts and simplifying can yield a requirement that truly requires the power of a SAT solver. This plan is carried out by transforming a requirement into an equivalent quantifier-free form by defining the eval predicate for that requirement. QFFs have the property that not only is it easy to write eval rules, but also that QFFs are efficiently compiled and solved by Kodkod. Evaluation of parts of requirements and simplification are accomplished in the definition of eval.

9.6.4 BGP Validation

The Internet is, by definition, a “network of networks,” and the responsibility for gluing together the tens of thousands of independently administered networks falls to the Border Gateway Protocol (BGP) [59, 64]. A network, or Autonomous System (AS), uses BGP to tell neighboring networks about each block of IP addresses it can reach; in turn, neighboring ASes propagate this information to their neighbors, allowing the entire Internet to learn how to direct packets toward their ultimate destinations. On the surface, BGP is a relatively simple path-vector routing protocol, where each router selects a single best route among those learned from its neighbors, adds its own AS number to the front of the path, and propagates the updated routing information to its neighbors for their consideration; packets flow in the reverse direction, with each router directing traffic along the chosen path in a hop-by-hop fashion.
Yet, BGP is a highly configurable protocol, giving network operators significant control over how each router selects a “best” route and whether that route is disseminated to its neighbors. The configuration of BGP across the many routers in an AS collectively expresses a routing policy that is based on potentially complex business objectives [15]. For example, a large Internet Service Provider (ISP) uses BGP policies to direct traffic on revenue-generating paths through its own downstream customers, rather than using paths through its upstream providers. A small AS like a university campus or corporate network typically does not propagate a BGP route learned from one upstream provider to another, to avoid carrying data traffic between the two larger networks. In addition, network operators may configure BGP to filter unexpected routes that arise from configuration mistakes and malicious attacks in other ASes [14, 52]. BGP configuration also affects the scalability of the AS, where network operators choose not to propagate routes for their customers’ small address blocks to reduce the size of BGP routing tables in the rest of the Internet. Finally, network operators tune their BGP configuration to direct traffic away from congested paths to balance load and improve user-perceived performance [25].

The routing policy is configured as a “route map” that consists of a sequence of clauses that match on some attributes in the BGP route and take a specific action, such as discarding the route or modifying its attributes with the goal of influencing the route-selection process. BGP defines many different attributes, and the route-selection process compares the routes one attribute at a time to ultimately identify one “best” route. This somewhat indirect mechanism for selecting and propagating routes, coupled with the large number of route attributes and route-selection steps, makes configuring BGP routing policy immensely complicated and error-prone.
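To make the route-map idea concrete, the following is a minimal Python sketch, not any vendor's implementation. It models a route map as an ordered list of clauses, each with a match predicate and an action that either rewrites the route or discards it, followed by a two-step comparison of attributes. The names Route, Clause, apply_route_map, and best_route, the AS numbers, and the prefixes are all hypothetical. Two simplifications are assumed: unmatched routes are accepted unchanged, and only local preference and AS-path length are compared, whereas the real selection process has many more steps.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: Tuple[int, ...]   # AS numbers, nearest AS first
    local_pref: int = 100      # illustrative default value


@dataclass(frozen=True)
class Clause:
    matches: Callable[[Route], bool]
    action: Callable[[Route], Optional[Route]]   # returning None discards the route


def apply_route_map(clauses, route):
    """Apply the first matching clause (assumption: unmatched routes pass unchanged)."""
    for clause in clauses:
        if clause.matches(route):
            return clause.action(route)
    return route


def best_route(routes):
    """Compare one attribute at a time: higher local_pref first, then shorter AS path."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))


CUSTOMER_AS = 65001   # hypothetical private AS number of a downstream customer

import_map = [
    # Discard routes for a private address block (a simple filter).
    Clause(lambda r: r.prefix.startswith("10."), lambda r: None),
    # Prefer routes learned directly from the customer.
    Clause(lambda r: r.as_path[:1] == (CUSTOMER_AS,),
           lambda r: replace(r, local_pref=200)),
]

candidates = [
    Route("192.0.2.0/24", (65002, 65010)),          # via an upstream provider
    Route("192.0.2.0/24", (65001, 65020, 65010)),   # via the customer, longer path
]
accepted = [r for r in (apply_route_map(import_map, c) for c in candidates)
            if r is not None]
best = best_route(accepted)
print(best.as_path)   # the customer route wins despite its longer AS path
```

The example shows the indirection the text describes: the operator never says "prefer the customer" directly, but raises an attribute that the selection process happens to compare first.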
Network operators often use tools for automatically configuring their BGP-speaking routers [11, 21, 29]. These tools typically consist of a template that specifies the sequence of vendor-specific commands to send to the router, with parameters unique to each BGP session populated from a database; for example, these parameters might indicate a customer’s name, AS number, address block(s), and the appropriate route-maps to use. When automated tools are not used, the network operators typically have configuration-checking tools to ensure that the sessions are configured correctly, and that different sessions are configured in a consistent manner [16, 24].

Configuring the BGP sessions with neighboring ASes, while important, is not the only challenge in BGP configuration. In practice, an AS consists of multiple routers in different locations; in fact, a large ISP may easily have hundreds if not thousands of routers connected by numerous links into a backbone topology. Different routers connect to different neighbor ASes, giving each router only a partial view of the candidate BGP routes. As such, large ISPs typically run BGP inside their networks to allow the routers to construct a more complete view of the available routes. These internal BGP (iBGP) sessions must be configured correctly to ensure that each router has all the information it needs to select routes that satisfy the AS’s policy. The simplest solution is to have a “full-mesh” configuration, with an iBGP session between each pair of routers. However, this approach does not scale, forcing large ISPs to introduce hierarchy by configuring route reflectors or confederations that limit the number of iBGP sessions and constrain the dissemination of routes. Each route reflector, for instance, selects a single “best route” that it disseminates to its clients; as such, the route-reflector clients do not learn all the candidate routes they would have learned in a full-mesh configuration.
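The scaling argument can be checked with simple arithmetic. The Python sketch below counts iBGP sessions for a full mesh, where every pair of n routers has a session, giving n(n-1)/2, and for a single-level route-reflector design. The function names, the router counts, and the assumption that each client peers with exactly two reflectors are illustrative choices, not figures from the text.

```python
def full_mesh_sessions(n_routers: int) -> int:
    """One iBGP session per unordered pair of routers: n(n-1)/2."""
    return n_routers * (n_routers - 1) // 2


def route_reflector_sessions(n_reflectors: int, n_clients: int,
                             parents_per_client: int = 2) -> int:
    """Reflectors are fully meshed; each client peers only with its reflectors.

    Assumption for illustration: a single level of reflectors and a fixed
    number of reflector parents per client.
    """
    return full_mesh_sessions(n_reflectors) + n_clients * parents_per_client


# Hypothetical ISP with 500 routers, 10 of which act as route reflectors.
mesh = full_mesh_sessions(500)
hierarchy = route_reflector_sessions(n_reflectors=10, n_clients=490)
print(mesh, hierarchy)   # 124750 1025
```

The hierarchy cuts the session count by roughly two orders of magnitude in this example, at the cost noted above: clients see only the route each reflector selected.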
When the “topology” formed by these iBGP sessions violates certain properties, routing anomalies like protocol oscillations, forwarding loops, traffic blackholes, and violations of business contracts can arise [6, 31, 74]. Fortunately, static analysis of the iBGP topology, spread over the configuration of the routers inside the AS, can detect when these problems might arise [24]. Such tools check, for instance, that the top-level route reflectors are fully connected by a “full-mesh” of iBGP sessions. This prevents “signaling partitions” that could prevent some routers from learning any route for a destination. Static analysis can also check that route reflectors are “close” to their clients in the underlying network topology, to ensure that the route reflectors make the same routing decisions that their clients would have made with full information about the alternate routes. Finally, these tools can validate an ISP’s own local rules for ensuring reliability in the face of router failures. For instance, static analysis can verify that each router is configured with at least two route-reflector parents. Collectively, these kinds of checks on the static configuration of the network can prevent a wide variety of routing anomalies.

For the most part, configuration validation tools operate on the vendor-specific configuration commands applied to individual routers. Configuration languages vary from one vendor to another; for example, Cisco and Juniper routers have very different syntax and commands, even for relatively similar configuration tasks. Even within a single company, different router products and different generations of the router operating system have different commands and options. This makes configuration validation an immensely challenging task, where the configuration-checking tools must support a wide range of languages and commands.
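One way to see why a vendor-neutral representation helps is that, once the iBGP sessions have been extracted from the various configuration languages into a common model, two of the checks described above reduce to a few lines. The Python sketch below assumes such a model already exists; the router names, the data layout (a set of unordered session pairs and a client-to-reflectors map), and the function names are hypothetical, and the parsing step that would produce them is deliberately omitted.

```python
from itertools import combinations


def missing_mesh_sessions(top_level, sessions):
    """Return pairs of top-level route reflectors that lack an iBGP session."""
    return [
        (a, b)
        for a, b in combinations(sorted(top_level), 2)
        if frozenset((a, b)) not in sessions
    ]


def under_protected_clients(parents, minimum=2):
    """Return clients configured with fewer than `minimum` distinct reflectors."""
    return sorted(c for c, rrs in parents.items() if len(set(rrs)) < minimum)


# Hypothetical vendor-neutral model extracted from router configurations.
top_level = {"rr1", "rr2", "rr3"}
sessions = {frozenset(("rr1", "rr2")), frozenset(("rr2", "rr3"))}
parents = {"pe1": ["rr1", "rr2"], "pe2": ["rr3"]}

mesh_gaps = missing_mesh_sessions(top_level, sessions)
weak_clients = under_protected_clients(parents)
print(mesh_gaps)      # [('rr1', 'rr3')]
print(weak_clients)   # ['pe2']
```

The checks themselves are simple; in this sketch all of the vendor-specific difficulty is hidden in producing the three input structures.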
To address these challenges, research and standards activities have led to new BGP configuration languages that are independent of the vendor-specific command syntax [1, 71], particularly in the area of BGP routing policy. In addition to abstracting vendor-specific details, these frameworks provide some support for configuring entire networks rather than individual routers. For example, the Routing Policy Specification Language (RPSL) [1] is object-oriented, where objects contain AS-wide policy and administrative information that can be published in Internet Routing Registries (IRRs) [37]. Routing policy can be expressed in terms of user-friendly keywords for defining actions and groups of address blocks or AS numbers. Configuration-generation tools can read these specifications to generate vendor-specific commands to apply to the individual routers [37]. However, while RPSL is used for publishing information in the IRRs, many ISPs still use their own configuration tools (or manual processes) for configuring their underlying routers.

In summary, the configuration of BGP takes place at many levels: within a single router (to specify a single end point of a BGP session with the appropriate route-maps and addresses), between pairs of routers (to ensure consistent configuration of the two ends of a BGP session), across different sessions to the same neighboring AS (to ensure consistent application of the routing policy at each connection point), and across an entire AS (to ensure that the iBGP topology is configured correctly). In recent years, tools have emerged for static analysis of router-configuration data to identify potential configuration mistakes, and for automated generation of the configuration commands that are sent to the routers.
Still, many interesting challenges remain in raising the level of abstraction for configuring BGP, to move from the low-level focus on configuring individual routers and BGP sessions toward configuring an entire network, and from the specific details of the BGP route attributes and route-selection process to a high-level specification of an AS’s routing policy. As the Internet continues to grow, and the business relationships between ASes become increasingly complex, these issues will only become more important in the years ahead.

9.6.5 Other Validation Systems

Netsys was an early software product for configuration validation. It was first acquired by Cisco Systems and then by WANDL Corporation. It contained about 100 requirements that were evaluated against router configurations. OPNET offers validation products NetDoctor and NetMapper. These are not standalone but rather modules that need to be plugged into the base IT Sentinel system [54]. For more description of these, see [23]. None of these products offer configuration repair, reasoning about firewalls, or symbolic reachability analysis. The Smart Firewalls work [13] was an early attempt at Telcordia to develop a network configuration validation system. A survey of system, not network, configuration is found in [4]. Formal methods for jointly reasoning about IPSec and firewall policies are described in [32]. A high-level configuration language is described in [45].

9.7 Summary and Directions for Future Research

To set up network infrastructure satisfying end-to-end requirements, it is not only necessary to run appropriate protocols on components but also to correctly configure these components. Configuration is the “glue” for logically integrating components at and across multiple protocol layers. Each component has a finite number of configuration parameters, each of which can be set to a definite value.
However, today, the large conceptual gap between end-to-end requirements and configurations is manually bridged. This causes large numbers of configuration errors whose adverse effects on the security, reliability, and deployment cost of network infrastructure are well documented. See also [57, 62]. Thus, it is critical to develop validation tools that check whether a given configuration is consistent with the requirements it is intended to implement.

Besides checking consistency, configuration validation has another interesting application, namely network testing. The usual invasive approach to testing has several limitations. It is not scalable. It consumes resources of the network and network administrators and has the potential to unleash malware into the network. Some properties such as absence of single points of failure are impractical to test as they require failing components in operational networks. A noninvasive alternative that overcomes these limitations is analyzing configurations of network components. This approach is analogous to testing software by analyzing its source code rather than by running it. This approach has been evaluated for a real enterprise.

Configuration validation is inherently hard. Whether a component is correctly configured cannot be evaluated in isolation. Rather, the global relationships into which the component has been logically integrated with other components have to be evaluated. Configuration repair is even harder since changing configurations to make one requirement true may falsify another. The configuration change should be holistic in that it should ensure that all requirements concurrently hold.

This chapter described the challenges of configuring a typical collaboration network and the benefits of using a validation system. It then presented an abstract design of a configuration validation system.
It consists of four subsystems: configuration acquisition system, requirement library, specification language, and evaluation system. The chapter then surveyed technologies for realizing this design. Configuration acquisition systems have been built using three approaches: parser generator, type inference, and database query. Classes of requirements in the Requirement Library are logical structure integrity, connectivity, security, reliability, performance, and government regulatory. Specification languages include visual templates, Prolog, Datalog, arithmetic quantifier-free forms, and Computation Tree Logic. Evaluation systems have used graph algorithms, the Kodkod constraint solver for first-order logic constraints, the ZChaff SAT solver for Boolean constraints, Binary Decision Diagrams, and symbolic model checkers. Visualization of not just the IP topology but also of various other logical topologies provides useful insights into network architecture. Logic-based languages are very useful for creating a validation system, particularly for solving the hard problems of configuration repair and symbolic reasoning about requirements.

Future research needs to focus on all four components of a validation system. Robust configuration acquisition systems are critical to automated validation. The accumulated experience of building large networks is vast but largely unformalized. Formalizing this experience in a Requirement Library would not only raise the level of abstraction at which network requirements are written but also improve their precision. New classes of requirements, one on VLAN optimization and another on configuration complexity, are reported in [28, 65] and in [9], respectively. Specification languages that are easy to use by network administrators are also critical for broad adoption of validation systems. Logic-based languages are a good candidate despite the perception that these are too complex for administrators.
These are closest in form to the natural language requirements in network design documents. The configuration languages administrators use are already declarative in that they do not contain side-effects and the ordering of commands is unimportant. Introducing logical operators, data structures, and quantifiers into these is a natural step toward making these much more expressive. See [71] for a recent example of using the Haskell functional language for specifying BGP policies. High-level descriptions of component configurations could then again be composed by logical operators to describe network-wide requirements. In the nearer term, even making an implementation of the Requirement Library available as APIs in system administration languages like Perl or Python should vastly improve configuration debugging. Much greater understanding is needed of useful ways to visualize logical structures and relationships in networks. One might derive inspiration from works such as [70]. Finally, a good framework for repairing configurations was described in Section 9.5.3, but it needs to be further explored. For example, one needs to understand how the convergence of the repair procedure is affected by the choice of configuration variable to relax, and how ideas of MulVAL can be generalized and combined with those of ConfigAssure. Building enough trust among network administrators that they will allow automated repair of their component configurations is an open problem.

Acknowledgments

We are very grateful to Jennifer Rexford, Andreas Voellmy, Richard Yang, Chuck Kalmanek, Simon Ou, Geoffrey Xie, Yitzhak Mandelbaum, Ehab Al-Shaer, Sanjay Rao, Adel El-Atawy, and Paul Anderson for their contributions and comments.

References

1. Alaettinoglu, C., Villamizar, C., Gerich, E., Kessens, D., Meyer, D., Bates, T., et al. (1999). Routing Policy Specification Language. RFC 2622.
2. Alloy. http://alloy.mit.edu/
3. Al-Shaer, E., Marrero, W., El-Atawy, A., & ElBadawy, K. (2008). Towards global verification and analysis of network access control configuration. Technical Report, TR-08008, DePaul University, from http://www.mnlab.cs.depaul.edu/projects/ConfigChecker/TR08-008/paper.pdf
4. Anderson, P. (2006). System Configuration. In Short Topics in System Administration, ed. Rick Farrow. USENIX Association.
5. ANTLR v3. http://www.antlr.org/
6. Basu, A., Ong, C.H., Rasala, A., Shepherd, F.B., & Wilfong, G. (2002). Route oscillations in I-BGP with route reflection. ACM SIGCOMM.
7. Bates, T., Chandra, R., & Chen, E. (2000). BGP route reflection – an alternative to full mesh IBGP. RFC 2796. http://www.faqs.org/rfcs/rfc2796
8. Bellovin, R., & Bush, R. (2009). Configuration management and security. IEEE Journal on Selected Areas in Communications [special issue on Network Infrastructure Configuration], 27(Suppl. 3).
9. Benson, T., Akella, A., & Maltz, D. (2009). Unraveling the complexity of network management. USENIX Symposium on Network Systems Design and Implementation.
10. Berkowitz, H. (2000). Techniques in OSPF-Based Network. http://tools.ietf.org/html/draft-ietf-ospf-deploy-00
11. Bohm, H., Feldmann, A., Maennel, O., Reiser, C., & Volk, R. (2005). Network-wide interdomain routing policies: Design and realization. Unpublished report, http://www.net.t-labs.tu-berlin.de/papers/BFMRV-NIRP-05.pdf
12. Bryant, R. (1986). Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, C-35(Suppl. 8), 677–691.
13. Burns, J., Cheng, A., Gurung, P., Martin, D., Rajagopalan, S., Rao, P., et al. (2001). Automatic management of network security policy. Proceedings of DARPA Information Survivability Conference and Exposition (DISCEX II’01), volume 2, Anaheim, CA.
14. Butler, K., Farley, T., McDaniel, P., & Rexford, J. (2008). A survey of BGP security issues and solutions. Unpublished manuscript.
15. Caesar, M., & Rexford, J. (2005). BGP routing policies in ISP networks. IEEE Network Magazine [Special issue on Interdomain Routing], 19, 5–11.
16. Caldwell, D., Gilbert, A., Gottlieb, J., Greenberg, A., Hjalmtysson, G., & Rexford, J. (2003). The cutting EDGE of IP router configuration. ACM SIGCOMM HotNets Workshop.
17. Caldwell, D., Lee, S., & Mandelbaum, Y. (2008). Adaptive parsing of router configuration languages. Proceedings of the Internet Management Workshop.
18. Cheswick, W., Bellovin, S., & Rubin, A. (2003). Firewalls and Internet security: Repelling the Wily Hacker. Reading, MA: Addison-Wesley.
19. Cisco Systems. (2005). DiffServ – The Scalable End-to-End QoS Model.
20. Distributed Management Task Force, from http://www.dmtf.org/home
21. Enck, W., Moyer, T., McDaniel, P., Sen, S., Sebos, P., Spoerel, S., et al. (2009). Configuration management at massive scale: System design and experience. IEEE Journal on Selected Areas in Communications, 27(Suppl. 3), 323–335.
22. Farinacci, D., Li, T., Hanks, S., Meyer, D., & Traina, P. (2000). Generic routing and encapsulation. RFC 2784.
23. Feamster, N. (2006). Proactive techniques for correct and predictable Internet routing. Doctoral dissertation, Massachusetts Institute of Technology, Boston, MA.
24. Feamster, N., & Balakrishnan, H. (2005). Detecting BGP configuration faults with static analysis. Symposium on Networked Systems Design and Implementation.
25. Feamster, N., & Rexford, J. (2007). Network-wide prediction of BGP routes. IEEE/ACM Transactions on Networking, 15(2), 253–266.
26. Federal Information Security Management Act. (2002). National Institute of Standards and Technology.
27. Fu, Z., & Malik, S. (2006). Solving the minimum-cost satisfiability problem using branch and bound search. Proceedings of IEEE/ACM International Conference on Computer-Aided Design ICCAD.
28. Garimella, P., Sung, Y.W., Zhang, N., & Rao, S. (2007). Characterizing VLAN usage in an Operational Network. ACM SIGCOMM Workshop on Internet Network Management.
29. Gottlieb, J., Greenberg, A., Rexford, J., & Wang, J. (2003). Automated provisioning of BGP customers. IEEE Network Magazine.
30. Graphviz. http://www.graphviz.org/
31. Griffin, T.G., & Wilfong, G. (2002). On the correctness of IBGP configuration. Proceedings of ACM SIGCOMM.
32. Guttman, J. (1997). Filtering postures: local enforcement for global policies. Proceedings of the 1997 IEEE Symposium on Security and Privacy.
33. Halabi, B. (1997). Internet routing architectures. Indianapolis, IN: New Riders Publishing.
34. Hamed, H., Al-Shaer, E., & Marrero, W. (2005). Modeling and verification of IPSec and VPN security policies. Proceedings of IEEE International Conference on Network Protocols.
35. Homer, J., & Ou, X. (2009). SAT-solving approaches to context-aware enterprise network security management. IEEE JSAC [Special Issue on Network Infrastructure Configuration].
36. Huitema, C. (1999). Routing in the Internet. Upper Saddle River, NJ: Prentice Hall.
37. Internet Routing Registry Toolset Project, from https://www.isc.org/software/IRRtoolset
38. IP Assure. Telcordia Technologies, Inc., from http://www.telcordia.com/products/ip-assure/
39. Jackson, D. (2006). Software abstractions: Logic, language, and analysis. Cambridge, MA: MIT Press.
40. Juniper Networks. (2008). What is behind network downtime? Proactive steps to reduce human error and improve availability of networks, from http://www.juniper.net/solutions/literature/white papers/200249.pdf
41. Kodkod, from http://web.mit.edu/emina/www/kodkod.html
42. Lampson, B. (2000). Computer security in real world. Annual computer security applications conference, from http://research.microsoft.com/en-us/um/people/blampson/64securityinrealworld/acrobat.pdf
43. Leroy, X., Doligez, D., Garrigue, J., Rémy, D., & Vouillon, J. (2007). The objective caml system, release 3.10, documentation and user’s manual.
44. Li, T., Cole, B., Morton, P., & Li, D. (1998). Cisco Hot Standby Router Protocol. RFC 2281.
45. Lobo, J., & Pappas, V. (2008). C2: The case for network configuration checking language. Proceedings of IEEE Workshop on Policies for Distributed Systems and Networks.
46. Mahajan, Y., Fu, Z., & Malik, S. (2004). Zchaff2004, An Efficient SAT Solver. Proceedings of 7th International Conference on Theory and Applications of Satisfiability Testing.
47. Mandelbaum, Y., Fisher, K., Walker, D., Fernandez, M., & Gleyzer, A. (2007). PADS/ML: A functional data description language. ACM Symposium on Principles of Programming Language.
48. McMillan, K. (1992). Symbolic model checking. Doctoral dissertation, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA.
49. Narain, S. (2005). Network configuration management via model-finding. Proceedings of USENIX Large Installation System Administration (LISA) Conference.
50. Narain, S., Kaul, V., & Parmeswaran, K. (2003). Building autonomic systems via configuration. Proceedings of AMS Autonomic Computing Workshop.
51. Narain, S., Levin, G., Kaul, V., & Malik, S. (2008). Declarative infrastructure configuration synthesis and debugging. In E. Al-Shaer, C. Kalmanek, F. Wu (Eds), Journal of Network Systems and Management [Special issue on Security Configuration].
52. Nordstrom, O., & Dovrolis, C. (2004). Beware of BGP attacks. ACM SIGCOMM Computer Communications Review, 34(Suppl. 2), 1–8.
53. O’Keefe, R. (1990). The craft of prolog. Reading, MA: Addison-Wesley.
54. OPNET IT Sentinel, from http://www.opnet.com/solutions/network planning operations/it sentinel.html
55. Ou, X., Boyer, W., & McQueen, M. (2006). A scalable approach to attack graph generation. 13th ACM Conference on Computer and Communications Security (CCS).
56. Ou, X., Govindavajhala, S., & Appel, A. (2005). MulVAL: A logic-based network security analyzer. 14th USENIX Security Symposium, Baltimore, MD.
57. Pappas, V., Wessels, D., Massey, D., Terzis, A., Lu, S., & Zhang, L. (2009). Impact of configuration errors on DNS robustness. IEEE Journal on Selected Areas in Communication, 27(Suppl. 1), 275–290.
58. Qie, X., & Narain, S. (2003). Using service grammar to diagnose configuration errors in BGP-4. Proceedings of USENIX Systems Administrators Conference.
59. Rekhter, Y., Li, T., & Hares, S. (2006). A Border Gateway Protocol 4 (BGP-4). RFC 4271.
60. Rosen, E., Viswanathan, A., & Callon, R. (2001). Multiprotocol Label Switching Architecture. RFC 3031.
61. Schwartz, J. (2007). Who Needs Hackers? New York Times, http://www.nytimes.com/2007/09/12/technology/techspecial/12threat.html
62. Securing Cyberspace for the 44th Presidency. (2008). CSIS Commission On Cybersecurity.
63. Sedgewick, R. (2003). Algorithms in Java. Reading, MA: Addison-Wesley.
64. Stewart, J. (1999). BGP4: Inter-Domain Routing in the Internet. Reading, MA: Addison-Wesley.
65. Sung, E.Y., Rao, S., Xie, G., & Maltz, D. (2008). Towards systematic design of enterprise networks. ACM CoNEXT Conference.
66. SWI-Prolog Semantic Web Library, from http://www.swi-prolog.org/pldoc/package/semweb.html
67. SWI-Prolog, from http://www.swi-prolog.org/
68. TCP Problems with Path MTU discovery. RFC 2923.
69. Torlak, E., & Jackson, D. (2007). Kodkod: A Relational Model Finder. Tools and Algorithms for Construction and Analysis of Systems (TACAS ’07).
70. Tufte, E. (2001). The visual display of quantitative information. Cheshire, CT: Graphics Press.
71. Voellmy, A., & Hudak, P. Nettle: A domain-specific language for routing configuration, from http://www.haskell.org/YaleHaskellGroupWiki/Nettle
72. Xie, G., Zhan, J., Maltz, D., Zhang, H., Greenberg, A., Hjalmtysson, G., et al. (2005). On static reachability analysis of IP networks. IEEE INFOCOM.
73. ZChaff, from http://www.princeton.edu/chaff/
74. Zhang-Shen, R., Wang, Y., & Rexford, J. (2008). Atomic routing theory: Making an AS route like a single node. Princeton University Computer Science technical report TR-827-08.
Part V Network Measurement

Chapter 10 Measurements of Data Plane Reliability and Performance

Nick Duffield and Al Morton

N. Duffield, AT&T Labs, 180 Park Avenue, Florham Park, NJ 07901, USA, e-mail: duffield@research.att.com
Al Morton, AT&T Labs, 200 S Laurel Ave, Middletown, NJ 07748, USA, e-mail: acmorton@att.com
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_10, © Springer-Verlag London Limited 2010

10.1 Introduction

10.1.1 Service Without Measurement: A Brief History

Measurement was not a priority in the original design of the Internet, principally because it was not needed in order to provide Best Effort service, and because the institutions using the Internet were also the providers of this network. A technical strength of the Internet has been that endpoints have not needed visibility into the details of the underlying network that connects them in order to transmit traffic between one another. Rather, the functionality required for data to reach one host from another is separated into layers that interact through standardized interfaces. The transport layer provides a host with the appearance of a conduit through which traffic is transferred to another host; lower layers deal with routing the traffic through the network, and the actual transmission of the data over physical links. The Best Effort service model offers no hard performance guarantees to which conformance needs to be measured. Basic robustness of connectivity (the detection of link failures and rerouting traffic around them) was a task of the network layer, and so need not concern the endpoints.

The situation described above has changed over the intervening years; the complexity of networks, traffic, and the protocols that mediate them, the separation of network users from network providers, coupled with customer needs for service guarantees beyond Best Effort now require detailed traffic measurements to manage and engineer traffic, to verify that performance meets required goals, and to diagnose performance degradations when they occur. In the absence of detailed network monitoring capabilities integrated with the network, many researchers, developers, and vendors jumped into the void to provide solutions. As measurement methodologies become increasingly mature, the challenge for service providers becomes how to deploy and manage measurement infrastructure scalably. Indeed, to meet this need, sophisticated measurement capabilities are increasingly being found on network routers. Furthermore, all parties concerned with the provenance and interpretation of measurements (vendors of measurement systems, software and services, service providers and enterprises, network users and customers) need a consistent way to specify how measurements are to be conducted, collected, transmitted, and interpreted. Many of these aspects for both passive and active measurement are now codified by standards bodies.

We continue this introduction by briefly setting out the type of passive and active measurements that are the subject of this chapter, then previewing the broader challenges that face service providers in realizing them in their networks.

10.1.2 Passive and Active Measurement Methods

This chapter is concerned with two forms of dataplane measurement: passive and active measurements. These two types of measurement have generally focused on different aspects of network behavior, support different applications, and are accomplished by different technical means.

Passive measurement comprises recording information concerning traffic as it passes observation points in the network. We consider three categories of passive measurement:
– Link utilization statistics as provided by router interface counters; these are retrieved from a managed device by a network management station using the Simple Network Management Protocol (SNMP).
– Flow-level measurements comprising summaries of flows of packets with common network and transport header properties. These are commonly compiled by routers, then exported to a collector for storage and analysis. These statistics enable detailed breakdown of traffic volumes according to network and transport header fields, e.g., IP addresses and TCP/UDP ports.
– Inspection of packet payloads in order to provide application-level flow measurements, or to support other payload-dependent applications such as network security and troubleshooting.

In active measurement, probe traffic is inserted into the network, and the probe traffic, or the response of the network to it, is subsequently measured. Comparing the probe and response traffic provides a measure of network performance, as experienced by the probes. Active probing has been conducted by standalone tools such as ping and traceroute [53] that utilize or coerce IP protocols for measurement functionality. These and other methods are used for active measurement between hosts in special purpose measurement infrastructures, or between network routers, or from these to other endpoints such as application or other servers.

Although the correspondence between methods and applications (passive measurement for traffic analysis and active measurement for performance) has been the norm, it is not firm: passive measurement is used to observe probe packets, and there are purely passive approaches to performance measurement.

10.1.3 Challenges for Measurement Infrastructure and Applications

We now describe challenges facing design and deployment of active and passive measurement infrastructure by service providers and enterprises. As we discuss passive and active measurement methodologies in the following sections, we shall discuss their strengths and weaknesses in meeting these challenges.
As one would expect, weaknesses in some of the more mature methods that we discuss have often provided the motivation for subsequent methods.

Speed. Increasingly fast line rates challenge the ability of routers to perform complex per-packet processing, including updating flow statistics and packet content inspection.

Scale. The product of network speed and the large number of devices producing measurements gives rise to an immense amount of measurement data (e.g., flow statistics). In addition to consuming resources at the observation points, these data require transmission, storage, and processing in the measurement infrastructure and back-end systems.

Granularity. Service providers and their customers increasingly require a detailed picture of network usage and performance. This is both to support individualized routine reporting, and also to support detailed retrospective studies of network behavior. These requirements reduce the utility of aggregate usage measurements, such as link-level counters, and simple performance measurement tools, such as ping and traceroute.

Scope. For passive measurement: not all routers support granular measurement functionality, e.g., reporting flow statistics; or, the functionality may not be enabled due to resource constraints at the observation point or in the measurement collection infrastructure. When measurements are performed, information about protocol layers below IP (such as MPLS), or optical layer attributes (such as the physical link of an IP composite link) may be incompletely reported or even absent. Information above the network layer may be hidden as a result of endpoint encryption. For active measurement: not all network paths or links may be directly measured because of cost or other limitations in the deployment of active measurement hosts.

Timeliness. Measurement applications increasingly require short temporal granularity of measurements, either because it is desirable to measure events of short duration, such as traffic microbursts and sub-second timescale routing events, or because the reporting latency must be short, e.g., in real-time anomaly detection for security applications. The concomitant increase in measurement reporting or polling frequency increases load on measurement devices and increases the number of measurement data points.

Accuracy. In passive measurement, reduction of data volumes through sampling, in order to meet the challenges of speed and scale, introduces statistical uncertainty into measurements. In active measurement, bandwidth and scale constraints place a limit on active probing frequency, and hence measurement accuracy is inherently dependent on the duration of the measurement period.

Management. There are several challenges for the management and administration of measurement infrastructure.
– Reliability. Measurement infrastructure components are subject to failure or outage, resulting in loss or corruption of measurements. The effects of component failure can be mitigated (i) at the infrastructure level (providing redundant capacity with fast detection of failure resulting in failover to backup subsystems), (ii) by employing reporting paradigms (e.g., sequence numbers) that facilitate automated checking, flagging, or workarounds for missing data, and (iii) by reporting measurement uncertainty due to missing data or sampling to the consumer of the measurements.
– Correlation. Measurement applications may require correlation of measurements generated by different measurement subsystems, for example, passive and active traffic measurements, logs from application servers, and authentication, authorization, and accounting subsystems.
A common case is when measurements are to be attributed to an entity such as an end host, but the mapping between measurement identifier (such as source IP address) and entity is dynamic (e.g., dynamic DHCP mappings). Correlation of multiple data sets presents challenges for data management, e.g., due to data size, diverse provenance, physical locations, and access policies. The measurement infrastructure must facilitate correlation by measures including the synchronization of timestamps set by different measurement subsystems.
– Consistency: The methodologies, reporting, and interpretation of measurements must be consistent across different equipment and network management software vendors, service providers, and their customers.

In this chapter, Sections 10.2–10.6 cover passive measurement, including link-level aggregates, flow measurement, sampling, packet selection, and deep packet inspection (DPI). Sections 10.7–10.10 cover active measurements, including standardization of performance metrics, service level agreements, and deployment issues for measurement infrastructures. We conclude with an outlook on future challenges in Section 10.11. We shall make use of and refer to other chapters in this book that deal with specific applications of measurements, principally Chapter 5 on Network Planning and Chapter 13 on Network Security.

10.2 Passive Traffic Measurement

As previewed in Section 10.1.2, we consider three broad types of passive measurement: link statistics, flow measurements, and DPI. These encompass methods that are currently employed in provider networks; we also describe some newer approaches that have been proposed or may be deployed in the medium term. We now motivate and outline in more detail the material on passive measurement.
Section 10.3 describes SNMP measurements, or, more precisely, interface packet counters maintained in a router’s Management Information Base (MIB) that are retrieved using the Simple Network Management Protocol (SNMP). The remote monitoring capabilities supported by the RMON MIB are also discussed. SNMP measurements provide an undifferentiated view of traffic on a link. By contrast, measurement applications often need to classify traffic according to the values occurring in protocol header fields that occur at different levels of the protocol stack. They must determine the aggregate traffic volumes attributable to each such value, for example, to each combination of the network layer IP addresses and transport layer TCP/UDP ports. This information, and that relating to encapsulating protocols such as MPLS, has come to be known as “packet header” information. This is contrasted with “packet payload” or “packet content” information, which includes higher layer application and protocol information. This information may be spread across multiple network level packets. The major development in passive traffic measurement over the last roughly 20 years, that serves these needs, has been traffic flow measurement. Traffic flows are sets of packets with common network/transport header values observed locally in time. Routers commonly compile summary statistics of flows (total packets, bytes, timing information) and report them, together with the common header values and some associated router state – but without any payload information – in a flow record that is exported to a collector. Cisco’s NetFlow is the prime example. Flow records provide a relatively detailed representation of network traffic that supports many applications. 
Several of these are covered in detail in other chapters of this book: generation of traffic matrices and their use in network planning is described in Chapter 5; analysis of traffic patterns and anomalies for network security is described in Chapter 13. Related applications are the routine reporting of traffic matrices and trending of traffic volumes and application mix for customers and for service providers' network and business development organizations (see, e.g., [5]). Section 10.4 describes traffic flow measurement, including the operational formation of flow statistics; protocols for the standardization of flow measurement; flow measurement collection infrastructure; the sampling of both packets and flow records in order to meet the challenges of speed and scale, and its impact on measurement accuracy; and some recent proposals for traffic flow measurement and aggregation. It concludes with some applications of flow measurements.

Uniform packet sampling is one member of a more general class of packet selection primitives that also includes filtering and more general sampling operations. In Section 10.5, we describe standardization of packet selection operations, their realization in routers, and applications of combined selection primitives for network management. We describe in detail the hash-based selection primitive, which allows for consistent selection of the same packet at different observation points, and discuss new measurement applications that this enables.

Packet header-based flow measurements provide little visibility into properties of the packet payload. However, network- and transport-level packet headers provide only a partial indication of traffic properties for the purposes of application characterization, security monitoring and attack mitigation, and software and protocol debugging.
Section 10.6 reviews technologies for DPI of packet payload beyond the network- and transport-level headers, and shows how it serves these applications.

10.3 SNMP, MIBs, and RMON

In this section, we discuss traffic statistics that are maintained within routers and the methods and protocols for their recovery. A comprehensive treatment of these protocols and their realization can be found in [25].

10.3.1 Router Measurement Databases: MIBs

A MIB is a type of hierarchical database maintained by devices such as routers. MIBs have been defined by equipment vendors and standardized by the IETF. Currently, over 10,000 MIBs are defined. The MIB most relevant for traffic measurement purposes is MIB-II [60], which maintains counters for the total bytes and numbers of unicast and multicast packets received on an interface, along with discarded and errored packets. The Interface-MIB [59] further provides counts of multicast packets per multicast address. Protocol-specific MIBs, e.g., for MPLS [76], also provide counts of inbound and outbound packets per interface that use those protocols.

10.3.2 Retrieval of Measurements: SNMP

SNMP [77] is the Internet protocol used to manage MIBs. An SNMP agent in the managed device is used to access the MIB and communicate object values to or from a network management station. SNMP has a small number of basic command types. Read commands are used to retrieve objects from the MIB. Write commands are used to write object values to the MIB. Notify commands are used to set conditions under which the managed device will autonomously generate a report. The most recent version of SNMP, SNMPv3, offers security functionality, including encryption and authentication, that was weaker or absent in earlier versions. For traffic measurement applications, the MIB interface-level packet and byte counters are retrieved by periodic SNMP polling from the management station; a polling interval of 5 min is common.
The total packets and bytes transmitted between successive polls are then obtained by subtraction.

10.3.3 Remote Monitoring: RMON

The RMON MIB [81] supports a more detailed capability for remote monitoring than MIB-II, enabling aggregation and notification over relatively complex events, e.g., involving multiple packets. The original focus of RMON was the remote monitoring of LANs; resource limitations make RMON generally unsuitable for monitoring high rate packet streams in the WAN context, e.g., to supply greater detail than presented by SNMP/MIB-II measurements. Indeed, the limitations of RMON motivate the alternate flow and packet measurement paradigm in which samples or aggregates of packet header information are exported from the router to a collector which supports reporting, analysis, and alarming functionality, rather than the router performing these functions itself. We explore this paradigm in more detail in the following sections.

10.3.4 Properties and Applications of SNMP/MIB

We now review how SNMP/MIB measurements align with the general measurement challenges described in Section 10.1.3.

Scope: The major strength of SNMP measurements is their ubiquitous availability from router MIBs.

Scale: From the data management point of view, SNMP statistics have the advantage of being relatively compact, routinely comprising a fixed-length set of data collected per interface at each polling instant, commonly every 5 min.

Granularity: The main limitation of SNMP measurements is that they maintain packet and byte counters per interface only.

Timeliness: The externally chosen and relatively infrequent polling times for SNMP measurements limit their utility for real-time or event-driven measurement applications.

Historically, SNMP measurements have been a powerful tool in the management of networks with undifferentiated service classes.
SNMP statistics have been used to trend link utilization, and network administrators have used these trends to plan and prioritize link deployment and upgrades, on the basis of heuristics that relate link utilization to acceptable levels of performance. Active performance measurements using the ping and traceroute tools can also inform these decisions.

Although SNMP measurements do not directly report any constituent details within link aggregates, network topology and routing in practice constrain the set of possible edge-to-edge traffic flows that can give rise to the collection of measured traffic rates over all network links. This leads to the formulation of an inverse problem to recover the edge-to-edge traffic matrices from the link aggregates. A number of approaches have been proposed and some are sufficiently accurate to be of operational use; for further detail see Chapter 5. Knowledge of the traffic matrices provides powerful new information beyond simple trending, because it allows the prediction of link utilization under different scenarios for routing, topology, and spatially heterogeneous changes in demand.

10.4 Traffic Flow Measurement

This section describes traffic flow measurement, including the operational formation of flow statistics; protocols for the standardization of flow measurement; flow measurement collection infrastructure; the sampling of both packets and flow records in order to meet the challenges of speed and scale, and its impact on measurement accuracy; and some recent proposals for traffic flow measurement and aggregation. It concludes with some applications of flow measurements.

10.4.1 Flows and Flow Records

10.4.1.1 Flow and Flow Keys

A flow of traffic is a set of packets with a common property, known as the flow key, observed within a period of time. A set of interleaved flows is depicted in Fig. 10.1.
Many routers construct and export summary statistics on flows of packets that pass through them. A flow record can be thought of as summarizing a set of packets arising in the network through some higher-level transaction, e.g., a remote terminal session, or a web-page download. In practice, the set of packets that are included in a flow depends on the algorithm used by the router to assign packets to flows. The flow key is usually specified by fields from the packet header, such as the IP source and destination address and TCP/UDP port numbers, and may also include information from the packet's treatment at the observation point, such as router interface(s) traversed. Flows in which the key is specified by individual values of these fields are often called raw flows, as opposed to aggregate flows in which the key is specified by a range of these quantities. As we discuss further in Section 10.4.3.2, routers commonly create flow records from a sampled substream of packets.

Fig. 10.1 Flows of observed packets, key indicated by shading

10.4.1.2 Operational Construction of Flow Records

Flow statistics are created as follows. A router maintains a cache comprising entries for each active flow, i.e., those flows currently under measurement. Each entry includes the key and summary statistics for the flow such as total packets and bytes, and times of observation of the first and last packets. When the router observes a packet, it performs a cache lookup on the key to determine if the corresponding flow is active. If not, it instantiates a new entry for that key. The flow statistics are then updated accordingly. A router terminates the recording of a flow according to criteria described below; then the flow's statistics are exported in a flow record, and the associated cache memory is released for use by new flows.
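The cache operation just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any router's implementation: the class and field names (`FlowCache`, `FlowEntry`), the 15-second default, and the use of an inactivity timeout as the only termination rule are all hypothetical choices made for brevity.

```python
from dataclasses import dataclass


@dataclass
class FlowEntry:
    """Summary statistics kept per active flow (hypothetical layout)."""
    packets: int
    bytes: int
    first_seen: float
    last_seen: float


class FlowCache:
    """Toy flow cache: lookup on key, create or update, export on inactivity."""

    def __init__(self, inactive_timeout=15.0):  # timeout value is an assumption
        self.inactive_timeout = inactive_timeout
        self.active = {}     # flow key -> FlowEntry
        self.exported = []   # (key, FlowEntry) pairs standing in for flow records

    def observe(self, key, size, now):
        """Account one packet of `size` bytes seen at time `now`."""
        self.expire(now)
        entry = self.active.get(key)
        if entry is None:
            # No active flow for this key: instantiate a new cache entry.
            self.active[key] = FlowEntry(1, size, now, now)
        else:
            entry.packets += 1
            entry.bytes += size
            entry.last_seen = now

    def expire(self, now):
        """Export and release entries idle for longer than the timeout."""
        idle = [k for k, e in self.active.items()
                if now - e.last_seen > self.inactive_timeout]
        for k in idle:
            self.exported.append((k, self.active.pop(k)))
```

A key here might be a tuple such as `(src_ip, dst_ip, src_port, dst_port, protocol)`. A real exporter would also apply the other termination rules (protocol signals, memory pressure, active timeout) and would not scan the whole cache on every packet.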
Flow termination criteria include: (i) inactive flow or interpacket timeout: the time since the last packet observed for the flow exceeds some threshold; (ii) protocol-level information, e.g., a TCP FIN packet that terminates a TCP connection; (iii) memory management: termination to release memory for new flows; and (iv) active flow timeout: to prevent data staleness, flows are terminated after a given elapsed time since the arrival of the first packet of the flow.

The summary information in the flow record may include, as well as the flow key and summary statistics of packet timing and size, other information relating to the packet treatment in the router, such as interfaces traversed, next hop router, and routing state information. Additionally, lower layer protocol information from the packet header may be included. For example, Cisco's NetFlow has a partial ability to report the MPLS label stack: it can report up to three labels from the MPLS label stack, with position in stack configurable. NetFlow can in some cases report the loopback address of certain tunnel endpoints.

10.4.1.3 Commercial and Standardized Flow Reporting

The idea of modeling traffic as packets grouped by a common property seems first to have appeared in [54], and the idea was taken up in support of internet accounting in [62], and systematized as a general measurement methodology in [22]. Early standardization efforts within the Real Time Flow Measurement working group of the Internet Engineering Task Force (IETF) have now been supplanted by the work of the IP Flow Information eXport working group (IPFIX) [49]. In practice flow measurement has become largely identified with Cisco's NetFlow [18] due to (i) the large installed base; (ii) its emulation in other vendors' products; and (iii) its effective standardization by the use of NetFlow version 9 [23] as the starting point for the IPFIX protocol.
NetFlow v9 offers administrators the ability to define and configure flow keys, aggregation schemes, and the information reported in flow records.

An alternative reporting paradigm is provided by sFlow [71], in which header-level information from a subset of sampled packets is exported directly without aggregating information from packets bearing the same key. sFlow reports include a position count of the sampled packet within the original traffic stream; this facilitates estimating traffic rates.

10.4.2 Flow Measurement Infrastructure

10.4.2.1 Generation and Export of Flow Records

Cisco originated NetFlow as a by-product of IP route caching [17], but it has subsequently evolved as a measurement and reporting subsystem in its own right. Other router vendors now support the compilation of flow statistics, e.g., Juniper's J-Flow [55], with the flow information being exported using the NetFlow version 9 format or according to the IPFIX standard. Note that implementation differences may lead to different information being reported across different routers. Standalone monitoring devices as discussed in Section 10.6.2 may also compile and export flow records.

Cisco Flexible NetFlow [14] provides the ability to instantiate and separately configure multiple flow compilers that operate concurrently. This allows a single router to serve different measurement applications that may have different requirements: traffic can be selected by first filtering on header fields; parameters such as sampling granularity, spatial and temporal aggregation granularity, reporting detail and frequency, and collector destination can be specified for each instantiation. We discuss packet selection operations more generally in Section 10.5.

10.4.2.2 Collection and Mediation of Flow Records

Flow records are exported from the observation point, either directly to a collector, or through a mediation device.
NetFlow collection systems are available commercially [15] or as freeware [10], either in a basic form that receives and writes flow records to storage, or as part of a larger traffic analysis system to support network management functions [5, 69], or focused on specific applications such as security [68].

Although export of flow records may take place directly to the ultimate collector, there are two architectural reasons that favor inserting mediation devices in the export path: scalability and reliability.

The primary reason is scalability. Even with the compression of information that summarizes a set of packets in a fixed-length flow record, the volumes of flow records produced by large-scale network infrastructure are enormous. As a rough example, a network comprising 100 links of 10 Gb/s that are 50% loaded in each direction, and in which each flow traverses ten routers, each of which compiles flow statistics after packet sampling at a rate of 1 in several hundred (see Section 10.4.3.2), would produce 1 Gb/s of flow records, i.e., roughly 10 terabytes per day.

A secondary reason for using mediation boxes has been transmission reliability. Until recently, NetFlow has exclusively used UDP for export, in part to avoid the need to buffer flow records at the exporter, as would be required by a reliable transport protocol. But the use of UDP exposes flow records to potential loss in transit, particularly over long WAN paths. Due to skew in flow length distributions (see Section 10.4.3.3), uncontrolled loss of the records of long flows could severely reduce measurement accuracy.

Fig. 10.2 Flow measurement collection infrastructure: hardware elements, their resources, and sampling and aggregation operations that act on the measurements

Mediation devices can address these issues and provide additional benefits:

Data Reduction: The mediator aggregates and samples flow records, then exports the reduced data to a central collector.

Reliable Staging: The mediator can receive flow records over a LAN with controlled loss characteristics, then export flow records (or samples or aggregates) to the ultimate collector using a reliable transport protocol such as TCP. NetFlow v9 and the IPFIX protocol both support SCTP [78] for export, which gives administrators flexibility to select a desired trade-off between reliability and buffer resource usage at the exporter.

Distributed Query: The mediation devices may also support queries on the flow records that traverse them, and thus together constitute a distributed query system.

Selective Export: Multiple streams of flow records selected according to specified criteria may be exported to collectors serving different applications.

An example of such an architecture is illustrated in Fig. 10.2; see also [39]. In each of a number of geographically distributed router centers, a mediation device receives flow records from its colocated routers; aggregates and samples are then exported to the ultimate collector. Protocols for flow mediators are currently under standardization in the IPFIX working group of the IETF [49].

10.4.2.3 Collection and Warehousing of Flow Records

The final component of the collection infrastructure is the repository that receives and stores the flow records and serves as a database for reporting and query functions. Concerning the attributes of a data store:

Capacity: Must be extensive; even with packet and flow sampling, a large service provider network may generate many GB of flow records per day.

Database Management System: Must be well matched to the challenges of large datasets, including rapid ingestion and indexing, managing large tables, a high-level query language to support complex queries, transaction logging, and data recovery. The Daytona DBMS is an example of such a system in current use; see [44].

Data Sources: Interpretation of flow data typically requires joining with other datasets, which should also be present in the management system, including but not limited to topology and configuration data, control plane measurements (see Chapter 11 for a description of routing state monitoring), MIB variables acquired by SNMP polling, network element logs, logs from authentication, authorization, and accounting servers, and logs from DHCP and other network servers.

Data Quality: Data may be corrupt or missing due to failures in the collection and reporting systems. The complexity and volume of measured data necessitate automated mechanisms to detect, mark, and mitigate unclean data; see, e.g., [30].

Data Security and Customer Privacy: Flow measurements and other data listed should be considered as sensitive customer information. Service provider policies must specify practices to maintain the integrity of the data, including controlled and auditable access restricted to individuals needing to work with the data, encryption, anonymization, and data retention policies.

10.4.3 Sampling in Flow Measurement and Collection

10.4.3.1 Sampling as a Data Reduction Method

In the previous sections, we have touched on the fact that the speed of communications links provides a challenge for the formation of flow records at the router, and both speed and the scale of networks (the large number of interfaces that can produce flow records) provide a challenge for the collection and storage of flow records. Figure 10.2 illustrates the relevant resources at the router, mediator, and collector. To meet these challenges, data reduction must be performed.
The reduction method must be well matched to the uses to which the reduced data is put. Three reduction methods are usually considered:

Aggregation: Summarizing measurements that share common properties. In the context of traffic flow measurement, header-level information on packets with the same key is aggregated into flows. Subsequent aggregation of flow records into predefined aggregates (e.g., aggregate traffic to each routing prefix) is a powerful tool for routine reporting.

Filtering: Selection of a subset of measurements that matches a specified criterion. Filtering is useful for drill down (e.g., to a traffic subset of interest).

Sampling: Selection of data points according to some nondeterministic criterion.

A limitation for aggregation and filtering as general data reduction methods is the manner in which they lose visibility into the data: traffic not matching a filter is discarded; detail within an aggregate is lost (while flow records aggregate packets over time, they need not aggregate spatially, i.e., over packet header values). Of the three methods, only sampling retains the spatial granularity of the original data, and thus retains the ability to support arbitrary aggregations of the data, including those formulated after the measurements were made. This is important to support exploratory, forensic, and troubleshooting functions, where the traffic aggregates of interest are typically not known in advance. The downside of sampling is the statistical uncertainty in the resulting measurements; we address this further in Section 10.4.3.4.

We now discuss sampling operations used during the construction and recovery of flow measurements. As illustrated in Fig. 10.2, packet sampling (see Section 10.4.3.2) is used in routers in order to reduce the rate of the stream of packet header information from which flow records are aggregated.
The complete flow records are then subjected to further sampling (see Section 10.4.3.3) and aggregation within the collection infrastructure, at the mediator to reduce data volumes, or in the collector, for example, dynamically sampling from a flow record database in order to reduce query execution times, or permanently in order to select a representative set of flow records (or their aggregates) for archiving. We discuss the ramifications of sampling for measurement accuracy in Section 10.4.3.4, and some more recent developments in stateful sampling and aggregation that straddle the packet and flow levels in Section 10.4.3.5. Finally, we look ahead to Section 10.5, which sets random packet sampling in the broader context of packet selection operations and their applications, including filtering, both in the sense understood above, and also consistent packet selection as exemplified by hash-based sampling.

10.4.3.2 Random Packet Sampled Flows

The main resource constraint for forming flow records is at the router flow cache in which the keys of active flows are maintained. To look up packet keys at the full line rate of the router interfaces would require the cache to operate in fast, expensive memory (SRAM). Moreover, routers carry increasingly large numbers of flows concurrently, necessitating a large cache. By sampling the packet stream in advance of the construction of flow records, the cache lookup rate is reduced, enabling the cache to be implemented in slower, less expensive memory (DRAM).

A number of different sampling methods are available. Cisco's Sampled NetFlow samples every Nth packet systematically, where N is a configurable parameter. The Random Sampled NetFlow feature [21] employs stratified sampling based on arrival count: one packet is selected at random out of every window of N consecutive arrivals.
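The two selection rules just described can be contrasted with a short Python sketch. It is only an illustration of the count-based rules as stated in the text, operating on an in-memory list; the function names and the handling of an incomplete trailing window are assumptions, and it does not represent how any router implements these features.

```python
import random


def systematic_sample(packets, n):
    """Select every n-th packet in arrival order (deterministic 1-in-n)."""
    return [p for i, p in enumerate(packets, start=1) if i % n == 0]


def stratified_sample(packets, n, rng=None):
    """Select one packet uniformly at random from each window of n arrivals.

    An incomplete trailing window is ignored (an assumption of this sketch).
    """
    if rng is None:
        rng = random.Random()
    selected = []
    for start in range(0, len(packets) - n + 1, n):
        selected.append(packets[start + rng.randrange(n)])
    return selected
```

Both functions select one packet per N arrivals on average. With the stratified rule, the last packet of one window and the first packet of the next can both be chosen, which is how consecutive packets may be selected.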
Although these two methods have the same average sampling rate, there are higher-order differences in the way multiple packets are sampled; for example, consecutive packets are never selected in Sampled NetFlow, while they can be in Random Sampled NetFlow. However, the effect of such differences on flow statistics is expected to be small, except possibly for flows that represent a noticeable proportion (greater than 1/N) of the load, since the position of a given flow's packets in the packet arrival order at an interface is then effectively randomized by the remaining traffic. In distinction, Juniper's J-Flow [55] offers the ability to sample runs of consecutive packets.

Sampling and other packet selection methods have been standardized in the PSAMP working group of the IETF [24, 32, 33, 82]. We review these in greater detail in Section 10.5. PSAMP is positioned as a protocol to select packets for reporting at an observation point, with IPFIX as the export protocol. For example, selected packets could be reported on as single packet flow records, using zero active timeout for immediate reporting.

If sampling 1 out of N packets on average, then from a flow with far fewer than N packets, if any packets are sampled, typically only one packet will be sampled. In this case one might just as well sample packets without constructing flow records; this would save resources at the router since there would be no need to cache the single packet flows until expiration of the interpacket timeout. Indeed, there are many short flows: web traffic is a large component of Internet traffic, in which the average flow length is quite short, around 16 packets in one study [42]. However, there are several reasons to expect that longer flows will continue to account for much traffic. First, several prevalent applications and application classes predominantly generate long-lived flows, for example, multimedia downloads and streaming, and VoIP.
Secondly, tunneling protocols such as IPsec [56] may aggregate flows between multiple endpoints into a packet stream in which the endpoint identities are not visible in the network core; from the measurement standpoint, the stream will thus appear as a single longer flow. For these reasons, unless packet sampling periods become comparable with or larger than the number of packets in these flows, flow statistics will still afford useful compression of information.

10.4.3.3 Flow Record Sampling

Sampling flow records presents a challenge because of the highly skewed distribution of flow sizes found in network traffic. Experimental studies have shown that the distribution of flow lengths is heavy tailed; in particular, a large proportion of the total bytes and packets in the traffic stream occur in a small proportion of the flows; see, e.g., [42]. This makes the requirements for flow record sampling fundamentally different from those for packet sampling. Whereas packets have a bounded size, flows do not: uniform sampling, and uncontrolled sampling due to transmission loss, are far more problematic for flow records than for sampled packets, since omission of a single flow record can have a huge impact on measured traffic volumes. This motivates sampling dependent on the size of the flow reported on. A simple approach would be to discard flow records whose byte size falls below a threshold. This gives a conservative, and hence biased, measure of the total bytes, and is susceptible to subversion: an application or user that splits its traffic up into small flows could evade measurement altogether. This would be a weakness for accounting and security applications.

Smart Sampling can be used to avoid the problems associated with uniform sampling of flow records.
Smart Sampling is designed with the specific aim of achieving the optimal trade-off between the number of flow records actually sampled and the accuracy of estimates of underlying traffic volumes derived from those samples. In the simplest form of Smart Sampling, called Threshold Sampling [36], each flow record is sampled independently with a probability that depends on the reported flow bytes: all records that report flow bytes greater than a certain threshold $z$ are selected; those below threshold are selected with a probability proportional to the flow bytes. Thus, the probability to sample a flow record representing $x$ bytes is

$$p_z(x) = \min\{1, x/z\}.$$

The desired optimality property described above holds in the following sense. Suppose $X$ bytes are distributed over some number $m$ of flows of size $x_1, \ldots, x_m$ so that $X = \sum_{i=1}^{m} x_i$. We consider unbiased estimates $\hat{X}$ of $X$, i.e., $\hat{X}$ is a random quantity whose average value is $X$. Suppose $\hat{X}$ is an unbiased estimate of $X$ obtained from a random selection of a subset of $n < m$ of the original flows, having sizes $x_1, \ldots, x_n$, where selection is independent according to some size-dependent probability $p(x)$. A standard procedure to obtain unbiased estimates is to divide the measured value by the probability that it was sampled [47]. Thus in our case each sampled flow size is normalized by its sampling rate, so that $\hat{X} = \sum_{i=1}^{n} x_i / p(x_i)$ is an unbiased estimate of $X$. We express the optimal trade-off as trying to minimize a total “cost” that is a linear combination

$$C_z = z^2\,\mathrm{E}[n] + \mathrm{Var}[\hat{X}]$$

of the average number of samples and the estimation variance, where $z$ is a parameter that expresses the relative importance we attach to making the number of samples small versus making the variance small. For example, when $z$ is large, making $\mathrm{E}[n]$ small has a larger effect on reducing $C_z$.
It is proved in [36] that the cost $C_z$ is minimized for any set of flow sizes $x_1, \ldots, x_m$ by using the sampling probabilities $p(x) = p_z(x)$. With the probabilities $p_z$, each selected flow $x_i$ gives rise to an estimate $x_i / p_z(x_i) = \max\{x_i, z\}$. Although optimal as stated, Threshold Sampling does not control the exact number of samples taken. For example, if the number of flows doubles during a burst, then on average, the number of samples also doubles (assuming the same flow size distribution). However, exact control may be required in some applications, e.g., when storage for samples has a fixed size constraint, or for sampling a specified number of representative records for archiving. A variant of Smart Sampling, called Priority Sampling [37], is able to achieve a fixed sample of size $k < m$, as follows. Each flow of size $x_i$ is assigned a random priority $w_i = x_i / a_i$, where $a_i$ is a uniformly distributed random number in $(0, 1]$. Then the $k$ flows of highest priority are selected for sampling, and each of them contributes an estimate $\max\{x_i, z'\}$, where $z'$ is now a data-dependent threshold set to the $(k+1)$st largest priority. It is shown in [37] that this estimate is unbiased. Priority Sampling is well suited for back-end database applications serving queries that require estimation of total bytes in an arbitrary selection of flows (e.g., all those in a specific matrix element) over a specified time period. A random priority is generated once for each flow, and the records are stored in descending order of priority. Then an estimate based on $k$ flows proceeds by reading the $k + 1$ flow records of highest priority that match the selection criterion, forming an unbiased estimate as above. Because the flow records are already in priority-sorted order, selection is very fast (see [4]).
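The two Smart Sampling variants described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [36] or [37]: the function names, the in-memory list of flow byte counts, and the use of Python's `random.Random` as the source of randomness are all assumptions made here for the example.

```python
import random


def threshold_sample(flow_bytes, z, rng):
    """Threshold Sampling: keep a flow of x bytes with probability min(1, x/z).

    Returns the per-flow estimates x / p_z(x) = max(x, z) of the selected
    flows; their sum is an unbiased estimate of sum(flow_bytes).
    """
    estimates = []
    for x in flow_bytes:
        p = min(1.0, x / z)
        if rng.random() < p:
            estimates.append(x / p)
    return estimates


def priority_sample(flow_bytes, k, rng):
    """Priority Sampling: keep exactly k flows, with k < len(flow_bytes).

    Each flow gets priority x / a with a uniform on (0, 1]; the k flows of
    highest priority are kept and each contributes max(x, z'), where z' is
    the (k+1)st largest priority.
    """
    if not 0 < k < len(flow_bytes):
        raise ValueError("k must satisfy 0 < k < number of flows")
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1].
    prioritized = sorted(
        ((x / (1.0 - rng.random()), x) for x in flow_bytes), reverse=True
    )
    z_prime = prioritized[k][0]
    return [max(x, z_prime) for _, x in prioritized[:k]]


if __name__ == "__main__":
    rng = random.Random(2010)            # illustrative seed
    flows = [10, 20, 30, 40, 1000]       # hypothetical flow sizes in bytes
    print("true total:", sum(flows))
    print("threshold estimate:", sum(threshold_sample(flows, 50, rng)))
    print("priority estimate:", sum(priority_sample(flows, 3, rng)))
```

A single run gives a noisy estimate; averaging many independent runs should approach the true total, which is what unbiasedness means here. Note that the large flow is always retained by both schemes, which is the property that uniform sampling of flow records lacks.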
10.4.3.4 Estimation and the Statistical Impact of Sampling

Whether sampling packets or flow records, the measured numbers of packets, bytes, or flows must be normalized in order to give an unbiased estimate of the actual traffic from which they were derived; we saw how this was done for Threshold Sampling in Section 10.4.3.3. For 1 in $N$ packet sampling, byte estimates from selected packets are multiplied by $N$. The use of sampling for measuring traffic raises the question of how accurate estimates of traffic volumes will be. The statistical nature of estimates might be thought to preclude their use for some purposes. However, for many sampling schemes, including those described above, the frequency of estimation errors of a given size can be computed or approximated. This can help answer questions such as “if no packets matching a given key were sampled, then how likely is it that there were X or more bytes in packets with this key that were missed”. A rough indication of estimation error is the relative standard deviation (RSD), i.e., the standard deviation of the estimator $\hat{X}$ divided by the true value $X$. The RSD for estimating an aggregate of $X$ bytes of traffic using independent 1 in $N$ packet sampling is bounded above by $\sqrt{N x_{\max} / X}$, where $x_{\max}$ is the maximum packet size. For flow sampling with threshold $z$, the RSD is bounded above by $\sqrt{z / X}$. Observe that the RSD decreases as the aggregate size increases. In cases where multiple stages of sampling and aggregation are employed (for example, packet sampled NetFlow followed by Threshold Sampling of flow records) the sampling variance is additive. In the example, the bound on the RSD becomes

$$\sqrt{(z + N x_{\max}) / X}.$$

As an example, consider 1 in $N = 1{,}000$ sampling of packets of maximum size $x_{\max} = 1{,}500$ bytes with a flow sampling threshold of $z = 50$ MB. In this case $z \gg N x_{\max} = 1.5$ MB, and so Smart Sampling contributes most of the estimation error.
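The RSD bounds above are easy to evaluate numerically. The sketch below does so for the sampling parameters just given. The function name is hypothetical, and the aggregate size used in the usage lines (the bytes carried by a 1 Gb/s stream over 10 minutes, with 1 MB taken as 10^6 bytes) is an assumption chosen for illustration.

```python
import math


def rsd_bound(aggregate_bytes, packet_period=0, max_packet_bytes=0,
              flow_threshold_bytes=0):
    """Upper bound sqrt((z + N * x_max) / X) on the relative standard deviation.

    Setting packet_period (N) to 0 gives the flow-sampling-only bound
    sqrt(z / X); setting flow_threshold_bytes (z) to 0 gives the
    packet-sampling-only bound sqrt(N * x_max / X).
    """
    variance_scale = flow_threshold_bytes + packet_period * max_packet_bytes
    return math.sqrt(variance_scale / aggregate_bytes)


# Assumed aggregate: 1 Gb/s sustained for 600 s, expressed in bytes.
X = 1e9 / 8 * 600

packet_only = rsd_bound(X, packet_period=1000, max_packet_bytes=1500)
flow_only = rsd_bound(X, flow_threshold_bytes=50e6)
combined = rsd_bound(X, packet_period=1000, max_packet_bytes=1500,
                     flow_threshold_bytes=50e6)

print(f"packet sampling only: {packet_only:.4f}")
print(f"flow sampling only:   {flow_only:.4f}")
print(f"both stages:          {combined:.4f}")
```

Under these assumptions the combined bound is dominated by the flow sampling term, and the squares of the two single-stage bounds add up to the square of the combined bound, reflecting the additivity of the sampling variance.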
With these sampling parameters, estimating the 10 min average rate of a 1 Gb/s backbone traffic stream would incur a typical relative error of 3%. In fact, rigorous confidence intervals for the true bytes in terms of the estimated values can be derived (see [26, 79]), including for some cases of multistage sampling. Using an analysis of the sampling errors, the impact of flow sampling on usage-based charging, and ways to avoid or ameliorate estimation error, are described in [35]. The key idea is that a combination of (i) systematic undercounting of customer traffic by a small amount, and (ii) using sufficiently long billing periods, can reduce the likelihood of over-billing customers to an arbitrarily small probability.

10.4.3.5 Stateful Packet Sampling and Aggregation

The dichotomy between packet sampling on a router and flow sampling in the measurement infrastructure, while architecturally simple, does not necessarily result in the best trade-off between resource usage and measurement accuracy. We briefly review some recent research that proposed to maintain various degrees of router state in order to select and maintain flow records for subsets of packets.

Sample and Hold [41]: All packets arriving at the router whose keys are not currently in the flow cache are subjected to sampling; packets that are selected in this manner have a corresponding flow cache entry created, and all subsequent packets with the same key are selected (subject to timeout). Thus, long flows are preferentially sampled over short flows, since the flow cache tends to be populated only by the longer flows. This achieves similar aims to Smart Sampling but in a purely packet-based solution. While the cache can be made smaller than would be required to measure all flows, a cache lookup is still required for each packet.
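The Sample and Hold behavior described above can be sketched as follows. This is a simplified illustration, not the algorithm as specified in [41]: the class and method names are hypothetical, each uncached packet is sampled with a fixed per-packet probability, and flow timeouts are omitted.

```python
import random


class SampleAndHold:
    """Flow cache in which a flow is counted only once one of its packets is sampled."""

    def __init__(self, sampling_probability, rng=None):
        self.p = sampling_probability
        self.rng = rng or random.Random()
        self.cache = {}  # flow key -> [packet count, byte count]

    def observe(self, key, size):
        """Process one packet; return True if it was counted."""
        entry = self.cache.get(key)       # one cache lookup per packet
        if entry is None:
            # Key not held yet: the packet is subject to sampling.
            if self.rng.random() >= self.p:
                return False
            entry = self.cache[key] = [0, 0]
        # Key is held: every packet of the flow is counted from now on.
        entry[0] += 1
        entry[1] += size
        return True


if __name__ == "__main__":
    sh = SampleAndHold(0.01, random.Random(41))   # illustrative parameters
    for _ in range(10_000):
        sh.observe("long-flow", 1500)
    for i in range(1_000):
        sh.observe(f"short-flow-{i}", 100)        # one packet per short flow
    print("flows held:", len(sh.cache))
    print("long flow entry:", sh.cache.get("long-flow"))
```

With a small sampling probability the long flow is very likely to be captured early and then counted almost completely, while most single-packet flows never enter the cache. The counts for a held flow omit the packets that arrived before its first sampled packet, so estimators built on them need a correction that this sketch leaves out.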
Adaptive Sampling Methods: Both NetFlow and Sample and Hold can be made adaptive by adjusting their underlying sampling rate and flow termination criteria in response to resource usage, e.g., to control cache occupancy and flow record export rate. Now recall from Section 10.4.3.3 that construction of unbiased estimators required normalization of sample bytes and packet counts by dividing by the sampling rate. Adjustment of the sampling rate requires matching renormalization in estimators in order to maintain unbiasedness. Partial flow records may be resampled (and further renormalized) and may be discarded in some cases (see [40]). In one variant of this approach the router maintains and exports a strictly bounded number of flow records, providing unbiased estimates of the original traffic bytes.

Stepping Methods: Stepping is an extension of the adaptive method in which, when downward adjustments of the sampling rate occur, estimates of the total bytes in packets of a given key that arrived since the previous such adjustment (the steps) are sampled and exported from the flow cache. Such exports can take place from the flow cache into DRAM, where the steps can be aggregated. The payoff is higher estimation accuracy, because once exported, the steps are not subject to loss (see [27]).

Run-Based Estimation: In its simplest form, run-based estimation involves caching in SRAM only the key of the last observed packet. If the current packet matches the key, the run event is registered in a cache in DRAM. Using a time-series model, the statistics of the original traffic are estimated from those of the runs. A generalization of the approach can additionally utilize longer runs [45].

10.5 Packet Selection Methods for Traffic Flow Measurement

10.5.1 Packet Selection Primitives and Standards

In Section 10.4.3.2 random packet sampling was presented as a necessity for reducing packet rates prior to the formation of flow statistics; moreover, random sampling has significant advantages over filtering and aggregation as a continuously operating general data reduction method. In this section we shift the emphasis somewhat and consider a set of packet selection primitives, and their ability to serve a variety of specific measurement applications. Following [33] we classify selection primitives as follows:

Filtering: Selection of packets based deterministically on their content. There are two important subcases:
– Property Match Filtering: Selection of a packet if a field or fields match a predefined value.
– Hash-Based Selection: A hash of the packet is calculated and the packet is selected if the hash value falls in a certain range.
Sampling: Selection of packets nondeterministically.

Some primitives of this type are provided by Cisco Flexible NetFlow [14], which allows combinations of certain random sampling and property match filters. The framework above was standardized in the Packet Sampling (PSAMP) working group of the IETF [33]. A collection of sampling primitives is described in [82], including but not limited to the fixed rate sampling from Section 10.4.3.2. Property match filtering can be based on packet header fields (such as IP address and port) and the packet treatment by the router, including interfaces traversed, and the routing state in operation during the packet’s transit of the router. Hash-based selection, including specific hash functions, is also standardized in [82]. We describe the operation and applications of hash-based selection in Section 10.5.2.
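The three classes of selection primitive classified above can be sketched as simple predicates. The function names, the dictionary representation of a packet, and the choice of keyed BLAKE2b as the hash are assumptions made for convenience here; in particular, BLAKE2b is not taken from the PSAMP specifications, which standardize their own hash functions [82].

```python
import hashlib
import random


def property_match(packet, **required):
    """Property match filtering: select if every named field has the given value."""
    return all(packet.get(field) == value for field, value in required.items())


def hash_select(invariant_bytes, fraction, key=b""):
    """Hash-based selection: select if the hash falls in the lowest `fraction` of its range.

    `invariant_bytes` stands for packet content that does not change along
    the path; `key` is an optional private parameter for the hash.
    """
    digest = hashlib.blake2b(invariant_bytes, digest_size=8, key=key).digest()
    return int.from_bytes(digest, "big") < fraction * 2**64


def random_sample(probability, rng):
    """Sampling: nondeterministic selection with the given probability."""
    return rng.random() < probability


if __name__ == "__main__":
    rng = random.Random(10)
    packet = {"proto": 17, "dst_port": 53, "payload": b"example query"}

    # Primitives can be combined, e.g. a filter followed by random sampling.
    dns_sampled = property_match(packet, proto=17, dst_port=53) and \
        random_sample(0.1, rng)

    # Two observation points that share the hash, key, and range reach the
    # same decision for the same packet content.
    at_router_a = hash_select(packet["payload"], 0.25, key=b"shared-secret")
    at_router_b = hash_select(packet["payload"], 0.25, key=b"shared-secret")
    print(dns_sampled, at_router_a == at_router_b)
```

The filtering primitives are deterministic functions of the packet, which is why the two hash-based decisions in the usage lines always agree, whereas two independent calls to `random_sample` need not.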
From both the implementation and standards viewpoints, packet selection is positioned as a front-end process that passes selected packets to a process that compiles and exports flow statistics. Thus, a PSAMP packet selector passes packets to an IPFIX flow reporting process. A flow record can report on single selected packets by setting the inactive flow timeout to zero. A key development in support of network management is the ability of routers and other measurement devices to support simultaneous operation of multiple independent measurements, each of which is composed of combinations of packet selection primitives. This type of capability is already present in Cisco Flexible NetFlow [14] and standardized in PSAMP/IPFIX. Each packet selection process can, in principle, be associated with its own independently configurable flow reporting process. The ability to dynamically configure or reconfigure packet selection provides a powerful tool for a variety of applications, from low-rate sampling of all traffic to supply routine reporting for Network Operation Center (NOC) wallboard displays, to targeted high-rate sampling that drills down on an anomaly in real time (see Fig. 10.3).

Fig. 10.3 Concurrent combinations of sampling and filtering packet selection primitives

10.5.2 Consistent Packet Sampling and Hash-Based Selection

The aim of consistent packet sampling (also called Trajectory Sampling) is to sample a subset of packets at some or all routers that they traverse. The motivation is that new measurement applications are enabled or enhanced; see below. Consistent packet sampling can be implemented through hash-based selection. Routers calculate a hash of packet content that is invariant along the packet path, and the packet is selected for reporting if the hash value falls in a specified range.
When all routers use the same hash function and range, the sampling decisions for each packet are identical at all points along its path. Thus, each packet signals implicitly to the router whether it should be sampled. Information on the sampled packet can be reported in flow records, potentially one per sampled packet. In order to aid association of different reports on the same packet by the collector, the report can include not only packet header fields, but also a packet label or digest, taking the form of a hash (distinct from that used for selection) whose input includes part of the packet payload. An ideal hash function would provide the appearance of uniform random sampling over the possible hash input values. This is important both for accurate traffic estimation purposes, and for integrity: network attackers should not be able to predict packet sampling outcomes. Use of a cryptographic hash function with a private parameter provides the strongest conformance to the ideal. In practice, implementation constraints on computational resources may require weaker hash functions to be used. Hash-based packet selection has been proposed in [38], with further work on its applications to passive performance monitoring in [34, 83]. Security ramifications of different hash function choices are discussed in [43]. Hash-based sampling has been standardized as part of the PSAMP standard in the IETF [82].

Applications of consistent sampling include:

Route Troubleshooting: Direct measurements of packet paths can be used to detect routing loops and measure transient behavior of traffic paths under routing changes. This detailed view is not provided by monitoring routing protocols alone. Independent packet sampling at different locations does not provide such a fine timescale view in general, since a given packet is typically not sampled at multiple locations.
Passive Performance Measurement: Correlating packet samples at two or more points on a path enables direct measurement of the performance experienced by traffic on the path, such as loss (as indicated by packets present at one point on the path that are missing downstream) and latency (if reports on sampled packets include measurement timestamps from synchronized clocks). This is an attractive application for service providers since it can detect performance degradation at the level of individual customers, reflecting the same packet transit performance that customers themselves experience.

10.6 Deep Packet Inspection

Sections 10.4 and 10.5 are concerned with the measurement and characterization of traffic at the granularity of a flow key that depends on the packet only through header fields. However, there are important network management tasks that depend on knowledge of packet payloads, and hence for which traffic flow monitoring is insufficient. The term DPI denotes measurement and possible treatment of packets based on their payload. We describe some broad design and policy issues associated with the deployment of DPI in Section 10.6.1; specific technologies for DPI devices are described in Section 10.6.2, and three applications of DPI for network management in Section 10.6.3: application-specific bandwidth management, network security monitoring, and troubleshooting.

10.6.1 Design and Policy Issues for DPI Deployment

DPI functions are not uniformly featured in routers, and hence some uses will require additional infrastructure deployment. DPI is extremely resource intensive due to the need to access and process packet payload at line rate. This makes DPI expensive compared with flow measurement, which hinders its widespread deployment. A limited deployment may be restricted to important functional sites, or to a representative subset of different site types, e.g., a backbone link, an aggregation router, or in front of a datacenter.
Like all traffic measurements, DPI must maintain privacy and confidentiality of customer information throughout the measurement collection and analysis process. Although flow measurements already encode patterns of communications through source and destination IP addresses, DPI of packet payload may also encompass the content of the communications. Service provider policies must specify practices to maintain the privacy of the data, including controlled and auditable access restricted to individuals needing to work with the data, encryption, anonymization, and data retention policies. See also the discussion specific to DPI for security monitoring in Section 13.4. Furthermore, any use of DPI data must be conducted in accordance with legal regulations in force. Similar issues exist for providers of host-based services as opposed to communications services, where servers intrinsically have access to user-specific data that may be presented by the customer in the course of using those services, e.g., email, search, or e-commerce transactions.

10.6.2 Technologies for DPI

DPI functionality is realized in dedicated general-purpose traffic monitors [28], and within vendor equipment targeted at specific applications such as security monitoring [68] and application-specific bandwidth management [19]. As the value of DPI-based applications for service providers grows, DPI functionality has also appeared in some routers and switches [16]. General-purpose computing platforms have been used for DPI, e.g., using Snort [74], an open-source intrusion detection system. Some DPI devices operate in line where they perform network management functions directly, such as security-based filtering or application bandwidth management. Others act purely as monitors and require a copy of the packet stream to be presented at an interface.
There are several ways by which this can be accomplished: (i) by copying the physical signal that carries the packets, e.g., with an optical splitter; (ii) by attaching the monitor to a shared medium carrying the traffic; or (iii) by having a router or switch copy packets to an interface on the monitor. The architectural challenges for all DPI platforms are: (i) the high incoming packet rate; (ii) the large number of distinct signatures against which each packet is to be matched (Snort has several hundred); and (iii) signatures that match over multiple packets, and hence require flow-level state to be maintained in the measurement device. These factors have tended to favor the use of dedicated DPI devices ahead of router-based integration in the past. They also drive architectural design for DPI devices in which aggregation and analysis are pushed down as close to the data stream as possible. Coupled with general-purpose computational platforms, tcpdump [52] is public-domain software that captures packets at an interface of the host on which it executes. Tcpdump has been widely used both as a diagnostic tool and to capture packet header traces in order to conduct reproducible exploratory studies. However, the enormous byte rates of network data in comparison with storage and transmission resources generally preclude collecting packet header traces longer than a few minutes or perhaps hours. A number of anonymized packet header traces have been made available by researchers; see, e.g., [9]. Software for removal of confidential information from packet traces, including anonymization, is available (see [63]).

10.6.3 Applications of DPI

In this section, we motivate the importance of DPI by describing network management applications that require detail from packet payload: application characterization and management, network security, and network debugging.
10.6.3.1 Application Demand Characterization and Bandwidth Management

Applications place diverse service requirements on the network. For example, real-time applications such as VoIP require relatively small bandwidth but have stringent latency requirements. Video downloads require high throughput but are elastic in terms of latency. Service providers can differentiate resources among the different service classes according to the size of the demands in each class. Hence a crucial task for network planning is to characterize and track changes in the traffic mix across application classes. In the past, application and application class could be inferred reasonably well from TCP/UDP port numbers on the basis of IANA well-known port assignments [50]. However, purely port-based identification is becoming less reliable due to factors including (i) lack of adherence to port conventions by application designers; (ii) piggybacking of applications on well-known ports, such as HTTP port 80, in order to facilitate firewall traversal; and (iii) separation of control and data channels with dynamic allocation of the data port during control level handshaking (see Chapter 5 for further details). On the other hand, knowledge of application operation can be used to develop packet content-level signatures. In some cases, this would involve matching strings of an application-level protocol across one or more network packets. For applications that use separate data and control channels, this could entail (a) matching a signature of the control channel in the manner just described with further inspection, then (b) identifying the data channel port communicated in the control channel, and (c) using the identified data channel port to classify further packet or flow level measurements taken (see [80]). Application-based classification can be used purely passively. Knowledge of the mix and relative growth between different application classes is necessary for network planning.
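Steps (a) to (c) above can be illustrated with a small sketch. FTP passive mode is used here purely as a familiar example of a protocol with separate control and data channels; the source does not name it. In FTP, the server's `227` reply on the control channel carries the data channel address as six decimal numbers, with the port equal to `p1 * 256 + p2`. The class name, method signature, and class labels are hypothetical.

```python
import re

# (a) Control-channel signature: an FTP "227 Entering Passive Mode" reply.
PASV_REPLY = re.compile(rb"^227 .*\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)")


class ChannelClassifier:
    """Classify packets using data-channel endpoints learned from the control channel."""

    def __init__(self):
        self.data_endpoints = set()  # (ip, port) pairs announced by servers

    def classify(self, src_ip, src_port, dst_ip, dst_port, payload=b""):
        match = PASV_REPLY.match(payload)
        if match:
            # (b) Extract the data channel endpoint announced in the reply.
            h1, h2, h3, h4, p1, p2 = (int(group) for group in match.groups())
            self.data_endpoints.add((f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2))
            return "ftp-control"
        # (c) Use the learned endpoint to classify later packets.
        if (src_ip, src_port) in self.data_endpoints or \
                (dst_ip, dst_port) in self.data_endpoints:
            return "ftp-data"
        return "unknown"


if __name__ == "__main__":
    c = ChannelClassifier()
    print(c.classify("10.0.0.1", 40001, "10.0.0.2", 50000))
    print(c.classify("10.0.0.2", 21, "10.0.0.1", 40000,
                     b"227 Entering Passive Mode (10,0,0,2,195,80)\r\n"))
    print(c.classify("10.0.0.1", 40001, "10.0.0.2", 50000))
```

The first packet to port 50000 cannot be classified from its header alone; after the control channel reply has been inspected, the same 5-tuple is recognized as belonging to the data channel. A practical classifier would also expire learned endpoints and handle signatures split across packets, which this sketch does not attempt.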
Application-based classification can also be used actively to apply differentiated resource allocation policies to different application classes, concerning traffic shaping, dropping of out-of-profile packets, or restoration priority after failures. As an example, access to a customer access channel can be prioritized so that the performance of delay-sensitive VoIP traffic is not impaired by other traffic. A number of vendors supply equipment with such capabilities (see, e.g., [19, 75]).

10.6.3.2 Network Security

While some network attacks can be identified based on header-level information, this is not true in general. As an example of an attack that could be identified in this way, the well-known Slammer worm [64] was evident due to (i) its rapid growth, leading to sharp increases in traffic volume; (ii) the association of the increase with particular values of packet header fields; and (iii) contextual information that the application exploited predominantly exchanges traffic across LANs or intranets rather than across the WAN. This combination of factors made it relatively easy to identify the worm and block its spread by instantiating header-level packet filters, without significantly impacting legitimate traffic. However, these conditions do not hold in general. Many network attacks exploit vulnerabilities in common applications such as email, chat, p2p, and web-browsing mediated by network communications that, in contrast with the Slammer example [64], (i) are relatively stealthy, not exhibiting large changes in network traffic volume at least during the acquisition phase, (ii) are not distinguished from legitimate traffic by specific header field values, and hence (iii) blend into the background of legitimate traffic at the flow level. Examples include installation of malware such as keystroke loggers, or the acquisition and subsequent control of zombie hosts in botnets.
To detect and mitigate these and other attacks, packet inspection is a powerful tool to enable matching against known signatures of malware, including viruses, worms, trojans, and botnets. Indeed, a sizable proportion of the attack detection signatures commonly used in the public domain Snort packet inspection system [74] match only on the packet payload rather than the header. Similarly to Section 10.6.3.1, a network security tool may operate purely passively in order to gain information about unwanted traffic, or may be coupled to filtering functions that block specific flows of traffic (see Chapter 13 for further details).

10.6.3.3 Debugging for Software, Protocols, and Customer Support

Both networking hardware and software that implement services can contain subtle dependencies and display unexpected behavior that, despite pre-deployment testing, only becomes evident in the live network. DPI permits network operators to monitor, evaluate, and correct such problems. To troubleshoot specific network or service layer issues, DPI devices could be deployed at a concentration point where specific protocol exchanges or application-layer transactions can be monitored for correctness. Operators might also use portable DPI devices, which would allow them to deploy devices in specific locations to investigate suspected hardware or software bugs. Similarly, DPI enables technicians to assist customers in debugging customer equipment, and software installations and configurations. This can enable technicians to rapidly determine the nature of problems related to network transmissions, rather than rely on potentially incomplete knowledge derived from customer dialogs.

10.7 Active Performance Measurement

This section is concerned with the challenges and design aspects of providing active performance measurement infrastructures for service providers.
The four metric areas of common interest are:

Connectivity: Can a given host be reached from some set of hosts?
Loss: What proportion of a set of packets are lost on a path (or paths) between two hosts? Loss may be considered in an average sense (all packets over some period of loss) or granular in time (burst loss properties) or space (broken down, e.g., by customer or application).
Delay: The network latency over a path (or paths) between two hosts, viewed at the same granularity as for loss measurements.
Throughput: Bytes or packets successfully transmitted between two hosts, potentially broken down by application or protocol (e.g., TCP vs. UDP).

Historically, active measurement tools such as ping and traceroute have long been used to baseline roundtrip loss and delay and map IP paths, either as standalone tools, or integrated into performance measurement systems. Bulk throughput has been estimated using the treno tool [58], which creates a probe stream that conforms to the dynamics of TCP. There is a large body of more recent research work proposing improved measurement methods and analysis (see, e.g., [29]). However, the focus of the remainder of this chapter concerns more the design and deployment issues for the components of an active measurement and reporting infrastructure of the type increasingly deployed by service providers and enterprise customers. Specifically:

Performance Metric Standardization: This is required in order for all parties involved in the measurement, dissemination and interpretation of results to agree on the methods of acquiring performance measurements, and their meaning. Such parties include network service providers, their customers, third-party measurement service providers, and measurement system vendors. Performance metric standardization is described in Section 10.8.
Service Level Agreements: Service providers must offer specific performance targets to their customers, based upon agreed metrics.
Section 10.9 describes processes for establishing SLAs between service providers and customers.
Deployment of Active Measurement Infrastructures: Deployment issues for large-scale active measurement infrastructures are discussed in Section 10.10, together with some examples of different deployment modes.

10.8 Standardization of IP Performance Metrics

In this section, we give an overview of standardization activities on IP performance metrics. There are not one, but two standards bodies that provide the authoritative view of IP network performance and of packet performance metrics in general. They are the IETF (primarily the IP Performance Metrics (IPPM) working group), and the International Telecommunication Union, Telecommunication Standardization Sector, Study Group 12 (ITU-T SG 12, specifically the Packet Network Performance Question 17). Although there are some differences in the approaches and the metric specifications between these two bodies, they are relatively minor. The critical advantage of using standardized metrics is the same as for any good standard: the metrics can be implemented from unambiguous specifications, which ensure that two measurement devices will work the same way. They will assign timestamps at the same defined instants when a packet appears at the measurement point (such as first bit in, or the last bit out). They will use a waiting time to distinguish between packets with long delays and packets that do not arrive (because one cannot wait forever to report results, and for many applications a packet with extremely long delay is as good as lost). They will perform statistical summary calculations the same way, and when presented with identical network conditions to measure, they will produce the same results. The ITU-T has defined its IP performance metrics in one primary Recommendation, Y.1540.
The general approach is to define basic sections bounded by measurement points, which are
– Hosts at the source and destination(s)
– Network Sections (composed of routers and links, and usually defined by administrative boundaries)
– Exchange Links (between the other entities)
The next step is to define packet transfer reference events at the various section boundaries. There are two main types of reference events:
– Entry event to a host, exchange link, or network section
– Exit event from a host, exchange link, or network section
Then, the fundamental outcomes of successful packet transfer and lost packet are defined, followed by performance parameters that can be calculated on a flow of packets (referred to using the convention “population of interest”). ITU-T’s metrics are useful in either active or passive measurement, and do not specify sampling methods.

The IETF began work on network performance metrics in the mid-1990s, by first developing a comprehensive framework for active measurement [70]. The framework RFC established many important conventions and notions, including:
– The expanded use of the metric definition template developed in earlier IETF work on Benchmarking network devices [6].
– The general concept of “packets of Type-P” to reflect the possibility that packets of different types would experience different treatment, and hence, performance as they traverse the path. A complete specification of Type-P and the source and destination addresses are usually equivalent to the ITU-T’s “population of interest”.
– The notion of “wiretime”, which recognizes that physical devices are needed to observe packets at the IP-layer, and these devices may contribute to the observed performance as a source of error. Other important time-related considerations are detailed, too.
– The hierarchy of singletons (“atomic” results), samples (sets of singletons), and statistics (calculations on samples).
A series of RFCs followed over the next decade, one for each fundamental metric that was identified. The IETF wisely put the various metric RFCs (such as RFC 2679 [2] and RFC 2680 [3]) on the Standards Track, so that implementations could be compared with the specifications and used to improve the quality of the specifications (and narrow down some of their flexibility) over time. RFC 2330 [70] and RFC 3432 [72] specify Poisson and periodic sampling, respectively. Throughput-related definitions are in RFC 5136 [12].

One area in which the IETF was extremely flexible was its specification for delay variation, in RFC 3393 [31]. This specification applies to almost any form of delay variation imaginable, and was endowed with this flexibility after considerable discussion and comparisons between the ITU-T preferred form and other methods (some of which were adopted in other IETF RFCs). This flexibility was achieved using the “selection function” concept, which allows the metric designer to compare any pair of packets (as long as each is unambiguously defined from a stream of packets). Thus, this version of the delay variation specification encouraged practitioners to gain experience with different metric formulations on IP networks, and facilitated comparison between different forms by establishing a common framework for their definition. A common selection function uses adjacent packets in the stream, and this is called “Inter-Packet Delay Variation”.

In contrast, the ITU-T Recommendations of the early 1990s (for ATM networks) used essentially the same form of delay variation metric as in Y.1540 and as used today in Recommendations for the latest networking technologies. It is called the “2-point Packet Delay Variation” metric. This metric defines delay variation as the difference between a packet’s one-way delay and the delay for a single reference packet.
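The difference between the two forms can be sketched in a few lines. This is a minimal illustration, assuming that every packet in the sample arrived (lost packets are excluded beforehand) and that the reference packet for the 2-point form is the one with the minimum delay; the function names and the delay values are hypothetical.

```python
def inter_packet_delay_variation(delays):
    """Selection function on adjacent packets: delay of packet i minus delay of packet i-1."""
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


def packet_delay_variation(delays):
    """2-point form: each packet's delay minus the delay of one reference packet.

    The reference here is assumed to be the minimum delay in the sample.
    """
    reference = min(delays)
    return [d - reference for d in delays]


# Hypothetical one-way delays in milliseconds for five consecutive packets.
delays_ms = [20, 22, 21, 35, 20]

ipdv = inter_packet_delay_variation(delays_ms)
pdv = packet_delay_variation(delays_ms)
print(ipdv)  # [2, -1, 14, -15]
print(pdv)   # [0, 2, 1, 15, 0]
```

The inter-packet form yields signed values that depend on packet order, whereas the 2-point form yields non-negative offsets from the reference, so a single delayed packet appears twice (once with each sign) in the former and once in the latter.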
The recommended reference is the packet with the minimum delay in the test sample, which removes propagation delay from the delay distribution and emphasizes only the variation. This definition differs significantly from the inter-packet delay variation definition.

Fortunately, an IETF effort has thoroughly investigated the two main forms of delay variation metrics, and its results provide guidance on the appropriate form of metric for various tasks [66]. The comparison approach was to define the key tasks (such as de-jitter buffer sizing and queuing time estimation) and challenging measurement circumstances for delay variation measurements (such as path instability and packet loss), and to examine the relevant literature. In summary, the ITU-T definition of “2-point Packet Delay Variation” was the best match to all tasks and most circumstances, its only weakness being a requirement for more stable timing.

10.9 Performance Metrics in Service-Level Agreements

In this section, we discuss Service-Level Agreements (SLAs) and how the key metrics defined above contribute to a successful relationship between customers and their service providers.

10.9.1 Definition of a Service-Level Agreement (SLA)

For our purposes, we define a Service-Level Agreement as:

A binding contract between Customer and Service Provider that identifies all important aspects of the service being delivered, constrains those aspects to a satisfactory performance level which can be objectively verified, and describes the method and format of the verification report.

This definition makes the SLA-supporting role and design of active measurement systems quite clear. The measurement system must assess the service on each of the agreed aspects (metrics) according to the agreed reporting schedule and determine whether the performance thresholds have been met.
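The threshold check at the heart of such a measurement system can be sketched as follows. This is a simplified illustration, not a description of any deployed system: the metric names, the threshold values, and the measured values are hypothetical, every metric is assumed to be "lower is better", and statistical confidence in the result is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric name agreed in the SLA
    limit: float  # largest acceptable value (all metrics here are "lower is better")


def verify_sla(measured, thresholds):
    """Compare one reporting interval's statistics with the agreed thresholds."""
    report = {}
    for t in thresholds:
        value = measured.get(t.metric)  # None if the metric was not measured
        report[t.metric] = {
            "measured": value,
            "limit": t.limit,
            "met": value is not None and value <= t.limit,
        }
    return report


# Hypothetical SLA terms and one month of summarized active measurements.
sla = [
    Threshold("mean_delay_ms", 40.0),
    Threshold("loss_ratio", 0.001),
    Threshold("delay_variation_ms", 10.0),
]
monthly = {"mean_delay_ms": 35.2, "loss_ratio": 0.002, "delay_variation_ms": 4.1}

report = verify_sla(monthly, sla)
violations = [name for name, row in report.items() if not row["met"]]
print(violations)  # ['loss_ratio']
```

A metric that was agreed but not measured is reported as not met, which is a conservative design choice made for this sketch.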
The details of the SLA may even specify the points where the active measurement system will be connected to the network, the sending characteristics of the synthetic packets dedicated for verification testing, and the confidence interval beyond which the results conclusively indicate that the threshold was met/not met.

10.9.2 Process to Develop the Elements of an SLA

This section describes a process to develop the critical performance aspects of an SLA. Typically, a network operator establishes a standard set of SLAs for a network service by conducting this process internally, using a surrogate for the customer. The specific details of the SLA may differ for different services, e.g., an enterprise Internet access service might have a different SLA from a premium VPN service. An SLA might specify performance metrics such as data delivery (the inverse of packet loss), site-to-site latency by region or location, delay variation or jitter, availability, etc., as well as a number of nonperformance metrics such as provisioning intervals. There are also cases in which a network operator may develop a customized SLA for a particular customer (e.g., because the size of their network or other special circumstances demand it). The process that a service provider and the customer would go through to develop a customized SLA illustrates the issues that need to be addressed when developing an SLA. We present an example of such a process here.

In principle, the SLA represents a common language between the customer and service provider. The process involves collection of requirements and a meeting of peers to compare the view from each side of the network boundaries. One set of steps to create agreeable requirements is given below.

1. The customer identifies the locations where connectivity to the communications service is required (Customer–Service Interfaces), and the service provider compares the location list with available services.
2.
The customer and service provider agree on the performance metrics that will be the basis for the SLA. For example, a managed IP network provides a very basic service: packet transfer from source to destination. The SLA is based on packet transfer performance metrics, such as delay, delay variation, and loss ratio. If higher-layer functions are also provided (e.g., domain name to address resolution), then additional metrics can be included.
3. The customer must determine exactly how they plan to use a communications network to conduct business, and express the needs of their applications in terms of the packet performance metrics. The performance requirements may be derived from analysis of the component protocols of each customer application, from tests with simulated packet transfer impairments, or from prior experience. Sometimes, the service provider will consult on the application modeling.
4. In parallel, the service provider collects (or estimates) the levels of packet transfer performance that can be delivered between geographically dispersed service interfaces. Active measurements often serve this aspect of the process, by revealing the network performance possible under current conditions.
5. When the customer and service provider meet again, the requested and feasible performance levels for all of the performance metrics are compared. Where the requested performance levels cannot be met, the service provider may develop revised network designs or a plan to achieve interim and long-term objectives in combination with deployment of new infrastructure, the customer may relax specific requirements, or both.
6. Once the performance levels of the SLA are agreed upon, it remains to decide on the formal reporting intervals and how the customer might access the ongoing measurement results. This aspect is important because formal reporting intervals are often quite long, on the order of a month.
7.
If the customer needs up-to-date performance status to aid in their troubleshooting process, then monthly reports might be augmented with the ability to view a customized report of recent measurements. The active measurement system would communicate measured results on a frequent basis to support this monitoring function, as well as longer-term SLA reports.

There are several process complexities worth mentioning. First, the customer may be able to easily determine the performance requirements for a single application flow, but the service provider’s measurements will likely be based on a test flow, which experiences the same treatment as the rest of the flows. The test packet flow may not have the same sending characteristics as customer flows, and will certainly represent only a small fraction of the aggregate traffic. Thus, the active test flow performance will represent the customer flow performance only on a long-term basis. Second, active measurements of throughput may have a negative effect on live traffic while they are in progress. As a result, the throughput metric may be specified through other means, such as the information rate of the access link on each service interface, and not formally verified through active measurement.

10.10 Deployment of Active Measurement Infrastructures

In this section, we describe several ways in which active measurement systems can be realized. One of the key design distinctions is the measurement device topology. We describe and contrast several of the topologies that have seen deployment, as this will be an important consideration for any system the reader might devise. We categorize the topologies according to where the devices conducting measurements are physically located.
10.10.1 Geographic Deployment at Customer–Service Interfaces

In this topology, measurement devices (or measurement processes in multipurpose devices) are located as close as possible to the service interfaces. Figure 10.4a depicts this topology for a point-to-point service, with a Measurement Point (MP) at each end of the path.

Fig. 10.4 Deployment scenarios for active measurement infrastructure. MP = measurement point. (a) MP at ends of path in point-to-point service. (b) MP at network edge; no coverage of access links. (c) MP at central location with connectivity to remote locations

The Cisco Systems IP SLA™ product embeds an active measurement system in routers and switches that often reside in close proximity to the Customer–Service Interfaces. The measurement results can be collected by accessing specific MIB modules using SNMP. The utility of IP SLA™ capabilities was recognized for multi-vendor scenarios, and the Two-way Active Measurement Protocol (TWAMP) [46] standardizes a fundamental test control and operation capability.

The primary advantage of this topology is that the measurement path covers the entire service in a single measurement, so the active test packets will experience conditions very similar to customer traffic. However, the measurement device/process must be located at a remote (customer) site to provide such coverage, so its cost is not shared across multiple services and it must be managed (and have results collected) remotely. The scale of the measurement system is also an issue. The number of paths in a full mesh of two-way active measurements grows quadratically with the number of nodes, N, according to N(N - 1)/2; for example, 100 nodes require 4,950 measured paths.

10.10.2 Geographic Deployment at Network Edges

In Fig. 10.4b, the MPs move to intermediate nodes along the point-to-point path, at the edge of the network providing service.
In this scenario, the measurement devices/processes are located at the edge of the network providing service and the access links may not be covered by the measurements or the SLAs. We also show a third MP within the network cloud, which can be used to divide the path into segments. This topology makes it possible to share the measurement devices and the measurements they produce with overlapping paths that support different services, different customers, or parts of other point-to-point paths for the same customer. Of course, a process is needed to combine the results of segment measurements to estimate the edge-to-edge performance, and this problem has been successfully solved [51, 65, 67]. The key points to note are the following:
• The interesting cases are those where impairments are time-varying, thus we expect to estimate features of time distributions, and not specific values (singletons) at particular times.
• Some performance metric statistics lend themselves to combination, such as means and ratios, so these should be selected for measurement and SLAs. For example, measurements of the minimum delay of path segments can usually be taken as additive when estimating the complete path performance. Average one-way delay is also additive, but somewhat more prone to estimation errors when the segment distributions are bimodal or have wide variance (a long tail).
• There must be a reasonable case made that (for each metric used) performance on one path segment will be independent of the other, because correlation causes the estimation methods to fail. An obvious correlation example is any metric that evaluates packet spacing differences: the measurement is dependent on the original spacing, and that spacing will change when there is any delay variation present on the path segments.
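The combination of segment statistics described in the points above can be sketched as follows. This is a minimal illustration under the stated independence assumption: minimum and mean delays are summed, and segment loss ratios are combined by multiplying the per-segment delivery probabilities. The function names and segment values are hypothetical, and the loss formula is the standard result for independent segments rather than a quotation from [51, 65, 67].

```python
import math


def compose_min_delay(segment_min_delays):
    """Minimum delays of consecutive segments are taken as additive."""
    return sum(segment_min_delays)


def compose_mean_delay(segment_mean_delays):
    """Mean one-way delays of consecutive segments are additive."""
    return sum(segment_mean_delays)


def compose_loss_ratio(segment_loss_ratios):
    """Path loss ratio, assuming losses on different segments are independent."""
    for p in segment_loss_ratios:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each loss ratio must lie in [0, 1]")
    # A packet is delivered end to end only if every segment delivers it.
    delivered = math.prod(1.0 - p for p in segment_loss_ratios)
    return 1.0 - delivered


# Hypothetical statistics for three consecutive segments (delays in milliseconds).
min_delays_ms = [5.0, 12.0, 4.0]
mean_delays_ms = [6.5, 15.0, 5.5]
loss_ratios = [0.01, 0.02, 0.0]

path_min_ms = compose_min_delay(min_delays_ms)
path_mean_ms = compose_mean_delay(mean_delays_ms)
path_loss = compose_loss_ratio(loss_ratios)
print(path_min_ms, path_mean_ms, round(path_loss, 4))
```

If losses on the segments were correlated, the product form would no longer hold, which is why the independence argument has to be made for each metric before segment results are combined.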
We note that it is also possible to obtain complete path coverage using this topology, with assistance from low-cost test reflector devices/processes located at the service interfaces, such as those described in RFC 5357 [46] (see [13] for more details).

10.10.3 Centralized Deployment with Remote Connectivity

As an alternative to remote deployment of measurement devices/processes, Fig. 10.4c shows all MPs moved to a central location with connectivity to strategic locations in the network (such as the network edges in key cities). This topology offers the advantage of easy access to the measurement devices at the central location, thus affording rapid reconfiguration and upgrade. However, reliable remote access links are needed between this single location and every network node that requires testing. Also, even if the remote access links are transparent from a packet loss perspective, they will still introduce delay that is not present on the customer’s path through the network. The cost of the remote access links alone may make the remote device deployment of Fig. 10.4b more attractive. Thus, topologies like this have been deployed for remote connectivity monitors when the devices implementing a network technology do not have sufficient native support for remote device deployment (e.g., Frame Relay networks).

A system exploiting this approach is described in [8], where tunneling is used to steer measurement packets on round-trip paths from a central host, via the access links. In this sense, virtual measurements are conducted between different pairs of hosts in the network core. A related approach for multicast VPN monitoring is described in [7].

10.10.4 Collection for Infrastructure Measurements

When measurement devices are geographically dispersed, there must be a means to collect the results of measurements and make them available for monitoring, reporting, and SLA compliance verification.
This requires some form of protocol to fetch either the per-packet measurements, or the processed and summarized results for each intermediate measurement interval (e.g., 5–15 min). Once the measurement results have been collected at a central point, they should be stored in a database system and made available for ongoing display, detailed analysis, and SLA verification/reporting.

10.10.5 Other Types of Infrastructure Measurements

10.10.5.1 Independent Measurement Networks

Measurement service vendors, such as Keynote [57], station measurement devices at ISP locations that represent, e.g., typical customer access points. They conduct a variety of measurements between measurement devices or between them and service hosts, including web and other server response times, access bandwidth, VoIP, and other access performance. Comparative performance measures are published and detailed results are made available through subscription.

10.10.5.2 Cross Provider and Network-Wide Measurements

End-to-end paths commonly traverse multiple service providers. Thus, it is natural to measure the inter-provider components of performance. The most prominent example is the RIPE network [73], which has stationed measurement devices in a number of participating ISPs, conducts performance measurements between them, and disseminates selected views to the participants. Novel active measurement infrastructure is being deployed in advanced research and development networks (e.g., MeasurementLab/PlanetLab [61]), including work on architectures for managing access to and data recovery from measurement infrastructures.

10.10.5.3 Performance Measurement and Route Selection

Router measurement capabilities may also be coupled to the operation of routing protocols themselves. Cisco Performance Routing [20] enables routers in a multihomed domain to conduct performance measurements to external networks.
The measurements are then compared in order to determine the best egress to that network and adjust route parameters accordingly.

10.11 Outlook

The challenges described in Section 10.1.3 will grow with network size and complexity. The fundamental challenge for passive measurement, that of large data volumes caused by network scale and speed, is usually addressed by sampling. Going forward, there are three related trade-offs for the measurement infrastructure. Unless the capacity of the measurement infrastructure grows commensurately with network speed and scale, sampling rates must decrease in order to fit the measurements within the current infrastructure. But decreasing sampling rates reduces the ability to provide an accurate fine-grained view of the traffic. Although loss of detail and accuracy can be ameliorated by aggregation, that would go against the increasing demand for detailed measurements differentiated by customer, application, and service class. On the other hand, growing the infrastructure and retaining current sampling rates presents its own challenges, and not just in equipment and administration costs. Distributed measurement architectures are an attractive way to manage scale, enabling local analysis and aggregation rather than requiring recovery of data to a single central point. Then, the challenge becomes the design of distributed analysis and efficient communication methods between components of the measurement infrastructure. This is particularly challenging for network security applications, which need a network-wide view in order to identify stealthy unwanted traffic.

Active measurement presents analogous challenges in viewing network performance differentiated by, e.g., customer, application, traffic path, and network element. Aggregate performance measurements are no longer sufficient.
There are a number of approaches to measuring performance on particular paths: (i) the probe packet may be crafted so that network elements forward it along the desired path (this approach was taken in [7, 8]); or (ii) customer traffic may be measured passively and directly, e.g., by comparing timestamps between different points on the path to determine latency (see Section 10.5.2). Both these approaches require knowledge of the mapping between the entity to be measured (customer, service class) and the observable parts of the packets. A challenge is that this mapping may be difficult to elucidate, or may depend on network state that can become unstable precisely at the time a performance problem needs to be diagnosed.

Tomographic methods have been proposed to infer performance on links from performance on sets of measured paths that traverse them (see [1, 11]), typically under simplifying independence assumptions concerning packet loss, latency, and link failure. These approaches aim to supply, indirectly, performance measurements that are not available directly. It remains a challenge to bring the early promise of these methods to fruition in production-level tools under general network conditions (see, e.g., [48]). The relative utility of performance tomographic approaches will depend on the extent to which detailed network performance measurements can be provided directly by router-based measurements in the future.

This outlook stands in contrast to the state described in the opening section, where little measurement functionality was provided in the network infrastructure. As the best ideas in measurement research and development mature into standard equipment features, the challenge will be to manage the complexity and scale of the infrastructure and the data itself.

References

1. Adams, A., Bu, T., Caceres, R., Duffield, N., Friedman, T., Horowitz, J., Lo Presti, F., Moon, S. B., Paxson, V., & Towsley, D. (2000).
The use of end-to-end multicast measurements for characterizing internal network behavior. IEEE Communications Magazine, May 2000, 38(5), 152–159.
2. Almes, G., Kalidindi, S., & Zekauskas, M. (1999). A one-way delay metric for IPPM. RFC 2679, September 1999.
3. Almes, G., Kalidindi, S., & Zekauskas, M. (1999). A one-way packet loss metric for IPPM. RFC 2680, September 1999.
4. Alon, N., Duffield, N., Lund, C., & Thorup, M. (2005). Estimating arbitrary subset sums with few probes. In Proceedings of 24th ACM Symposium on Principles of Database Systems (PODS) (pp. 317–325). Baltimore, MD, June 13–16, 2005.
5. AT&T Labs. Application traffic analyzer. http://www.research.att.com/viewProject.cfm?prjID=125.
6. Bradner, S. (1991). Benchmarking terminology for network interconnection devices. RFC 1242, July 1991.
7. Breslau, L., Chase, C., Duffield, N., Fenner, B., Mao, Y., & Sen, S. (2006). Vmscope: a virtual multicast vpn performance monitor. In INM ’06: Proceedings of the 2006 SIGCOMM Workshop on Internet Network Management (pp. 59–64). New York, NY, USA: ACM.
8. Burch, H., & Chase, C. (2005). Monitoring link delays with one measurement host. SIGMETRICS Performance Evaluation Review, 33(3), 10–17.
9. CAIDA. The CAIDA anonymized 2009 internet traces dataset. http://www.caida.org/data/passive/passive_2009_dataset.xml.
10. CAIDA. cflowd: Traffic flow analysis tool. http://www.caida.org/tools/measurement/cflowd/.
11. Castro, R., Coates, M., Liang, G., Nowak, R., & Yu, B. (2004). Network tomography: recent developments. Statistical Science, 19, 499–517.
12. Chimento, P., & Ishac, J. (2008). Defining network capacity. RFC 5136, February 2008.
13. Ciavattone, L., Morton, A., & Ramachandran, G. (2003). Standardized active measurements on a tier 1 IP backbone. IEEE Communications Magazine, pp. 90–97, June 2003.
14. Cisco Systems. Cisco IOS Flexible NetFlow. http://www.cisco.com/web/go/fnf.
15. Cisco Systems. Cisco NetFlow Collector Engine.
http://www.cisco.com/en/US/products/sw/netmgtsw/ps1964/.
16. Cisco Systems. Delivering the next generation data center. http://www.cisco.com/en/US/products/ps9402/.
17. Cisco Systems. IOS switching services configuration guide. http://www.cisco.com/en/US/docs/ios/12_1/switch/configuration/guide/xcdipsp.html.
18. Cisco Systems. NetFlow. http://www.cisco.com/warp/public/732/netflow/index.html.
19. Cisco Systems. Optimizing application traffic with cisco service control technology. http://www.cisco.com/go/servicecontrol.
20. Cisco Systems. Performance Routing. http://www.cisco.com/web/go/pfr/.
21. Cisco Systems. Random Sampled NetFlow. http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/nfstatsa.html.
22. Claffy, K. C., Braun, H.-W., & Polyzos, G. C. (1995). Parameterizable methodology for internet traffic flow profiling. IEEE Journal on Selected Areas in Communications, 13(8), 1481–1494, October 1995.
23. Claise, B. (2004). Cisco Systems NetFlow Services Export Version 9. RFC 3954, October 2004.
24. Claise, B., Johnson, A., & Quittek, J. (2009). Packet sampling (psamp) protocol specifications. RFC 5476, March 2009.
25. Claise, B., & Wolter, R. (2007). Network management: accounting and performance strategies. Cisco.
26. Cohen, E., Duffield, N., Lund, C., & Thorup, M. (2008). Confident estimation for multistage measurement sampling and aggregation. In ACM SIGMETRICS. June 2–6, 2008, Maryland, USA: Annapolis.
27. Cohen, E., Duffield, N. G., Kaplan, H., Lund, C., & Thorup, M. (2007). Algorithms and estimators for accurate summarization of internet traffic. In IMC ’07: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (pp. 265–278). New York, NY, USA: ACM.
28. Cranor, C., Johnson, T., Spataschek, O., & Shkapenyuk, V. (2003). Gigascope: a stream database for network applications. In SIGMOD ’03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (pp. 647–651). New York, NY, USA: ACM.
29.
Crovella, M., & Krishnamurthy, B. (2006). Internet measurement: infrastructure, traffic and applications. New York, NY: Wiley.
30. Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. New York, NY, USA: Wiley.
31. Demichelis, C., & Chimento, P. (2002). Ip packet delay variation metric for ip performance metrics (ippm). RFC 3393, November 2002.
32. Dietz, T., Claise, B., Aitken, P., Dressler, F., & Carle, G. (2009). Information model for packet sampling export. RFC 5477, March 2009.
33. Duffield, N.G., Claise, B., Chiou, D., Greenberg, A., Grossglauser, M., & Rexford, J. (2009). A framework for packet selection and reporting. RFC 5474, March 2009.
34. Duffield, N.G., Gerber, A., & Grossglauser, M. (2002). Trajectory engine: A backend for trajectory sampling. In IEEE Network Operations and Management Symposium (NOMS) 2002. Florence, Italy, 15–19 April 2002.
35. Duffield, N.G., Lund, C., & Thorup, M. (2001). Charging from sampled network usage. In Proceedings of 1st ACM SIGCOMM Internet Measurement Workshop (IMW) (pp. 245–256). San Francisco, CA, November 1–2, 2001.
36. Duffield, N.G., Lund, C., & Thorup, M. (2005). Learn more, sample less: control of volume and variance in network measurements. IEEE Transactions on Information Theory, 51(5), 1756–1775.
37. Duffield, N.G., Lund, C., & Thorup, M. (2007). Priority sampling for estimation of arbitrary subset sums. Journal of ACM, 54(6), Article 32, December 2007. Announced at SIGMETRICS’04.
38. Duffield, N., & Grossglauser, M. (2001). Trajectory sampling for direct traffic observation. IEEE/ACM Transactions on Networking, 9(3), 280–292, June 2001.
39. Duffield, N., & Lund, C. (2003). Predicting resource usage and estimation accuracy in an IP flow measurement collection infrastructure. In Proceedings of Internet Measurement Conference. Miami, FL, October 27–29, 2003.
40. Estan, C., Keys, K., Moore, D., & Varghese, G. (2004).
Building a better netflow. In Proceedings of the ACM SIGCOMM 04. New York, NY, 12–16 June 2004.
41. Estan, C., & Varghese, G. (2002). New directions in traffic measurement and accounting. In Proceedings of ACM SIGCOMM ’2002. Pittsburgh, PA, August 2002.
42. Feldmann, A., Rexford, J., & Cáceres, R. (1998). Efficient policies for carrying web traffic over flow-switched networks. IEEE/ACM Transactions on Networking, 6(6), 673–685, December 1998.
43. Goldberg, S., & Rexford, J. (2007). Security vulnerabilities and solutions for packet sampling. In IEEE Sarnoff Symposium. Princeton, NJ, May 2007.
44. Greer, R. (1999). Daytona and the fourth-generation language cymbal. In SIGMOD ’99: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (pp. 525–526). New York, NY, USA: ACM.
45. Hao, F., Kodialam, M., & Lakshman, T.V. (2004). Accel-rate: a faster mechanism for memory efficient per-flow traffic estimation. In SIGMETRICS ’04/Performance ’04: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (pp. 155–166). New York, NY, USA: ACM.
46. Hedayat, K., Krzanowski, R., Morton, A., Yum, K., & Babiarz, J. (2008). A two-way active measurement protocol (twamp). RFC 5357, October 2008.
47. Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260), 663–685.
48. Huang, Y., Feamster, N., & Teixeira, R. (2008). Practical issues with using network tomography for fault diagnosis. SIGCOMM Computer Communication Review, 38(5), 53–58.
49. IETF. IP Flow Information Export (ipfix) charter. http://www.ietf.org/html.charters/ipfix-charter.html. Version of 16 December 2008.
50. Internet Assigned Numbers Authority. Port numbers. http://www.iana.org/assignments/port-numbers.
51. ITU-T Recommendation Y.1540. Network performance objectives for IP-based services, February 2006.
52. Jacobson, V., Leres, C., & McCanne, S.
tcpdump.
53. Jacobson, V. Traceroute. ftp://ftp.ee.lbl.gov/traceroute.tar.gz.
54. Jain, R., & Routhier, S. (1986). Packet trains – measurements and a new model for computer network traffic. IEEE Journal on Selected Areas in Communications, 4(6), 986–995, September 1986.
55. Juniper Networks. Junose 8.2.x ip services configuration guide: Configuring j-flow statistics. http://www.juniper.net/techpubs/software/erx/junose82/swconfig-ip-services/html/ip-jflow-stats-config.html.
56. Kent, S., & Atkinson, R. (1998). Security architecture for the Internet Protocol. RFC 2401, November 1998.
57. Keynote Systems. http://www.keynote.com.
58. Mathis, M., & Mahdavi, J. (1996). Diagnosing internet congestion with a transport layer performance tool. In Proceedings of INET 96. Montreal, Quebec, 24–28 June 1996.
59. McCloghrie, K., & Kastenholz, F. The interfaces group mib. RFC 2863, June 2000.
60. McCloghrie, K., & Rose, M. (1991). Management Information Base for Network Management of TCP/IP-based internets: MIB-II. RFC 1213, available from http://www.ietf.org/rfc, March 1991.
61. MeasurementLab. http://www.measurementlab.net/.
62. Mills, C., Hirsh, D., & Ruth, D. (1991). Internet accounting: background. RFC 1272, November 1991.
63. Minshall, G. tcpdpriv. http://ita.ee.lbl.gov/html/contrib/tcpdpriv.html.
64. Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., & Weaver, N. (2003). Inside the slammer worm. IEEE Security and Privacy, 1(4), 33–39.
65. Morton, A. (2008). Framework for metric composition, June 2009. draft-ietf-ippm-framework-compagg-08 (work in progress).
66. Morton, A., & Claise, B. (2009). Packet delay variation applicability statement. RFC 5481, March 2009.
67. Morton, A., & Stephan, E. (2008). Spatial composition of metrics, October 2009. draft-ietf-ippm-spatial-composition-10 (work in progress).
68. Narus, Inc. Narusinsight secure suite. http://www.narus.com/products/security.html.
69. Packetdesign. Traffic explorer.
http://www.packetdesign.com/products/tex.htm.
70. Paxson, V., Almes, G., Mahdavi, J., & Mathis, M. (1998). Framework for ip performance metrics. RFC 2330, May 1998.
71. Phaal, P., Panchen, S., & McKee, N. (2001). Inmon corporation’s sflow: A method for monitoring traffic in switched and routed networks. RFC 3176, September 2001. http://www.ietf.org/rfc/rfc3176.txt.
72. Raisanen, V., Grotefeld, G., & Morton, A. (2002). Network performance measurement with periodic streams. RFC 3432, November 2002.
73. RIPE. http://www.ripe.net.
74. Roesch, M. (1999). Snort – Lightweight Intrusion Detection for Networks. In Proceedings of USENIX Lisa ’99, Seattle, WA, November 1999.
75. Sandvine. http://www.sandvine.com/.
76. Srinivasan, C., Viswanathan, A., & Nadeau, T. (2004). Multiprotocol label switching (MPLS) label switching router (LSR) management information base (MIB). RFC 3813, June 2004.
77. Stallings, W. (1999). SNMP, SNMP v2, SNMP v3, and RMON 1 and 2 (Third Edition). Reading, MA: Addison-Wesley.
78. Stewart, R., Ramalho, M., Xie, Q., Tuexen, M., & Conrad, P. (2004). Stream control transmission protocol (sctp) partial reliability extension. RFC 3758, May 2004.
79. Thorup, M. (2006). Confidence intervals for priority sampling. In Proceedings of ACM SIGMETRICS/Performance 2006 (pp. 252–263). Saint-Malo, France, 26–30 June 2006.
80. van der Merwe, J., Cáceres, R., Chu, Y.-H., & Sreenan, C. (2000). mmdump: a tool for monitoring internet multimedia traffic. SIGCOMM Computer Communication Review, 30(5), 48–59.
81. Waldbusser, S. (2000). Remote network monitoring management information base. RFC 2819, available from http://www.ietf.org/rfc, May 2000.
82. Zseby, T., Molina, M., Duffield, N.G., Niccolini, S., & Raspall, F. (2009). Sampling and filtering techniques for ip packet selection. RFC 5475, March 2009.
83. Zseby, T., Zander, S., & Carle, G. (2001).
Chapter 11
Measurements of Control Plane Reliability and Performance

Lee Breslau and Aman Shaikh

11.1 Introduction

The control plane determines how traffic flows through an IP network. It consists of routers interconnected by links and routing protocols implemented as software processes running on them. Routers (or more specifically routing protocols) communicate with one another to determine the path that packets take from a source to a destination. As a result, the reliability and performance of the control plane is critical to the overall performance of applications and services running on the network. This chapter focuses on how to measure and monitor the reliability and performance of the control plane of a network.

The original Internet service model supported only unicast delivery. That is, a packet injected into the network by a source host was intended to be delivered to a single destination. Multicast, in which a packet is replicated inside the network and delivered to multiple hosts, was subsequently introduced as a service. While certain multicast routing protocols leverage unicast routing information, unicast and multicast have very distinct control planes. They are each governed by a different set of routing protocols, and measurement and monitoring of these protocols consequently take different forms. Therefore, we cover unicast and multicast control plane monitoring separately in Sections 11.2 and 11.3, respectively.

We start Section 11.2 with a brief overview of how unicast forwarding works, describing different routing protocols and how they work to determine paths between a source and a destination.
We then look at two key components of performance monitoring: instrumentation of the network for data collection in Section 11.2.2, and strategies and tools for data analysis in Section 11.2.3. More specifically, the instrumentation section describes what data we need to collect for route monitoring, along with mechanisms for collecting it. The analysis section focuses on various techniques and tools that show how the data is used for monitoring the performance of the control plane. While the focus of the section is on management and operational aspects, we also describe some of the research enabled by this data, which has played a vital role in enhancing our understanding of control plane behavior and performance in real life. We follow this up with a description of the AT&T OSPF Monitor [1] in Section 11.2.4 as a case study of a route monitor in real life. In Section 11.2.5, we describe control plane monitoring of MPLS, which has been deployed in service provider networks in the last few years and is a key enabler of Traffic Engineering (TE) and Fast Re-route (FRR) capabilities, as well as new services such as VPN and VPLS.

L. Breslau and A. Shaikh, AT&T Labs – Research, Florham Park, NJ, USA; e-mail: breslau@research.att.com; ashaikh@research.att.com
C.R. Kalmanek et al. (eds.), Guide to Reliable Internet Services and Applications, Computer Communications and Networks, DOI 10.1007/978-1-84882-828-5_11, © Springer-Verlag London Limited 2010

Section 11.3 follows a similar approach in its treatment of multicast. We begin with the motivation for, and a historical perspective on, the development and deployment of multicast. In Section 11.3.1, we provide a brief overview of the multicast routing protocols commonly in use today, PIM and MSDP. We then outline some of the challenges specific to monitoring the multicast control plane in Section 11.3.2. Section 11.3.3 provides detailed information about multicast monitoring.
This includes an overview of early multicast monitoring efforts, a discussion of the information sources available for multicast monitoring, and a discussion of specific approaches and tools used in multicast monitoring. At the end of the chapter, in Section 11.4, we provide a brief summary and avenues for future work.

11.2 Unicast

In this section, we focus on monitoring of unicast routing protocols. We begin by providing a brief overview of how routers forward unicast packets and the routing protocols used for determining the forwarding paths, before delving into the details of how to monitor these protocols.

11.2.1 Unicast Routing Overview

Let us start with a description of how routing protocols enable the forwarding of unicast packets in IP networks. With unicast, each packet contains the address of the destination. When the packet arrives at a router, a table called the Forwarding Information Base (FIB), also known as the forwarding table, is consulted. This table allows the router to determine the next-hop router for the packet, based on its destination address. Packets are thus forwarded in a hop-by-hop fashion, requiring a look-up in the forwarding table of each router along the way to the destination.

The forwarding table typically consists of a set of prefixes. Each prefix is represented by an IP address and a mask that specifies how many leading bits of a destination address need to match the address of the prefix. For example, a prefix represented as 10.0.0.0/16 would match a destination address whose first 16 bits are the same as the first 16 bits of 10.0.0.0 (i.e., 10.0). Thus, the address 10.0.0.1 matches this prefix, as do 10.0.0.2 and 10.0.1.1. It is possible, and often the case, that more than one prefix in a FIB matches a given (destination) address. In such a case, the prefix with the longest mask length is used for determining the next-hop router.
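The matching rule just stated can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the dictionary-based FIB, and the next-hop labels are assumptions made for the example, and the linear scan stands in for the specialized lookup structures that real forwarding engines use.

```python
import ipaddress


def longest_prefix_match(fib, destination):
    """Return (prefix, next_hop) for the most specific matching FIB entry.

    fib maps prefix strings such as "10.0.0.0/16" to next-hop labels.
    Returns None when no prefix matches the destination address.
    """
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in fib.items():
        network = ipaddress.ip_network(prefix)
        if addr in network:
            # Keep the matching entry with the longest mask length.
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, next_hop)
    if best is None:
        return None
    return str(best[0]), best[1]


# Hypothetical FIB; the next-hop names are invented for illustration.
example_fib = {
    "10.0.0.0/16": "router-A",
    "10.0.0.0/24": "router-B",
}
```

With this FIB, a call such as `longest_prefix_match(example_fib, "10.0.1.1")` falls back to the /16 entry because the /24 entry does not cover that address.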
For example, if a FIB contains 10.0.0.0/16 and 10.0.0.0/24, and the destination is 10.0.0.1, prefix 10.0.0.0/24 is used for forwarding the packet even though both prefixes match the address. For this reason, IP forwarding is said to be based on the longest matching prefix.

Routers run one or more routing protocols to construct their FIBs. Every routing protocol allows a router to learn the network topology (or some part of it) by exchanging messages with other routers. The topology information is then used by a router to determine next hops for various prefixes, i.e., the FIB.

Learning Topology Information

Depending on how much topology information each router learns, routing protocols can be divided into two main classes: distance-vector and link-state. In a distance-vector routing protocol, at each step, every router learns the distance of each adjacent router to every prefix. Every prefix is connected to one or more routers in the network. The distance from a router to a prefix is the sum of the weights of the individual links on the path, where the weight of every link is assigned in the configuration file of the associated router. A router, upon learning distances from its neighbors, chooses as its next hop the neighbor that offers the shortest total distance to a given prefix (the neighbor’s distance plus the weight of its link to the neighbor), and subsequently propagates this distance to all other neighbors.

When a router comes up, it only knows about its directly connected prefixes (e.g., prefixes associated with point-to-point or broadcast links). The router propagates information about these prefixes to its neighbors, allowing them to determine their routes to these prefixes. The information then spreads further, and ultimately all routers in the network end up with next hops for these prefixes. In a similar vein, the newly booted router also learns about other prefixes from its neighbors, and builds its entire FIB.
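The exchange described above can be sketched as a simple round-based simulation. This is a minimal sketch under stated assumptions, not a model of any particular protocol: the function name, the data layout, and the three-router example are hypothetical, all routers are assumed to advertise in lock-step rounds, and withdrawals and link failures are not handled.

```python
INF = float("inf")


def distance_vector(links, attached):
    """Simulate distance-vector convergence.

    links:    {(u, v): weight}, meaning u reaches neighbor v over a link
              with the given weight (list both directions explicitly).
    attached: {router: [directly connected prefixes]}.
    Returns   {router: {prefix: (distance, next_hop)}}.
    """
    routers = set(attached)
    for u, v in links:
        routers.update((u, v))
    # A router initially knows only its directly connected prefixes.
    tables = {r: {p: (0, None) for p in attached.get(r, [])} for r in routers}

    changed = True
    while changed:
        changed = False
        # Snapshot of what every router advertises in this round.
        adverts = {r: dict(t) for r, t in tables.items()}
        for (u, v), weight in links.items():
            for prefix, (dist, _) in adverts[v].items():
                # u's distance via v = v's distance + weight of link u->v.
                candidate = dist + weight
                if candidate < tables[u].get(prefix, (INF, None))[0]:
                    tables[u][prefix] = (candidate, v)
                    changed = True
    return tables


# Hypothetical topology: A-B and B-C have weight 1, A-C has weight 5.
example_links = {
    ("A", "B"): 1, ("B", "A"): 1,
    ("B", "C"): 1, ("C", "B"): 1,
    ("A", "C"): 5, ("C", "A"): 5,
}
example_tables = distance_vector(example_links, {"C": ["10.0.0.0/16"]})
```

In this example router A first hears about the prefix directly from C at distance 5, and one round later switches to B once B advertises its distance of 1.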
Distance-vector protocols essentially implement a distributed version of the Bellman-Ford shortest-path algorithm [2]. RIP [3] is an example of a distance-vector protocol. EIGRP, a Cisco-proprietary protocol, is another example. It contains mechanisms (an algorithm called DUAL [4]) to prevent the forwarding loops that can form during network changes, when routers can become inconsistent in their views of the topology. A subclass of distance-vector protocols, called path-vector protocols, includes the actual path to the destination along with the distance in the updates sent to neighbors. The inclusion of the path helps in identifying and avoiding loops that could otherwise form during convergence. BGP [5] is an example of a path-vector protocol.

With link-state routing protocols, each router learns the entire network topology. The topology is conceptually a directed graph: each router corresponds to a node in this graph, and each link between neighboring routers corresponds to a unidirectional edge. As with distance-vector protocols, each link also has an administratively assigned weight associated with it. Using the weighted topology graph, each router computes a shortest-path tree with itself as the root, and applies the results to compute next hops for all possible destinations. Routing remains consistent as long as all the routers have the same view of the topology. The view of the topology is built in a distributed fashion, with each router describing its local connectivity (i.e., the set of links incident on it along with their weights) in a message, and flooding this message to all routers in the network. OSPF [6] and IS-IS [7] are examples of link-state protocols.

Autonomous Systems (ASes) and Hierarchical Routing

The Internet is an inter-network of networks. By design, these networks are envisioned to be administered by independent entities. In other words, the Internet is a collection of independently administered networks.
Roughly speaking, such networks are known as Autonomous Systems (ASes). Each autonomous system consists of a set of routers and links that are usually managed by a single administrative authority. Every autonomous system can run one or more routing protocols of its choice to route packets within the system. RIP, EIGRP, OSPF, and IS-IS are typically used for routing packets within an AS and are, therefore, known as intradomain or Interior Gateway Protocols (IGPs). In addition, a routing protocol is needed to forward packets between ASes. BGP is used for this purpose and is known as an interdomain or Exterior Gateway Protocol (EGP).

Next, we present an overview of BGP and OSPF, as they come up frequently in the subsequent discussions. For details on other routing protocols, please refer to [8].

11.2.1.1 BGP Overview

As mentioned in Section 11.2.1, BGP is the de facto routing protocol used to exchange routing information between ASes. BGP is a path-vector protocol (a subclass of distance-vector protocols). In path-vector protocols, a router receives routes from its neighbors that describe their distance to prefixes, as well as the path used to reach the prefix in question. Since BGP is used to route packets between ASes, the path is described as the sequence of ASes traversed along the way to the prefix; this sequence is known as an ASPath. Thus, every route update received at a router contains the prefix and the ASPath indicating the path used by the neighbor to reach the prefix. The distance is not explicitly included; rather, it implicitly equals the number of ASes in the ASPath.

Apart from ASPath, BGP routes also contain other attributes. These attributes are used by a router to determine the most preferred route from all received routes to a destination prefix. Figure 11.1 shows the steps of a decision process that a BGP-speaking router follows to select its most preferred route. The process is run independently for each prefix, and starts with all the available routes for the prefix in question. At every step, relevant attributes of the routes are compared. Routes with the most preferred values pass on to the next step, while other routes are dropped from further consideration. At the end of the decision process, a router ends up with a single route for every prefix, and uses it to forward data traffic.

1. Highest Local Preference
2. Shortest ASPath Length
3. Lowest Origin Type
4. Lowest MED
5. Prefer Closest Egress (based on IGP distance)
6. Arbitrary Tie Breaking

Fig. 11.1 The decision process used by BGP to select the best route to every prefix. Vendor-dependent steps are not included

Note that the second step of the decision process compares the ASPath lengths of the routes that survived the first step, keeping the ones with the shortest ASPaths and discarding the rest. We will not go into the details of the other steps, except to point out that if faced with more than one route in step 5, the router selects the route(s) that minimize the IGP distance a packet will have to travel to exit its AS. This process of preferring the closest egress is known as hot-potato or closest-egress routing.

A router forms BGP sessions with other routers to exchange route updates. The two ends of a session can belong either to the same AS or to different ASes. When the session is formed between routers in the same AS, it is known as an internal BGP (IBGP) session. In contrast, when the routers are in different ASes, the session is known as an external BGP (EBGP) session. For example, in Fig. 11.2, which shows multiple interconnected ASes and the routers in them, solid lines depict IBGP sessions, whereas dashed lines represent EBGP sessions. The EBGP sessions set up between routers in neighboring ASes allow them to exchange routes to various prefixes. The routes learned over EBGP sessions are then distributed using IBGP sessions within an AS. For example, AS 2 in Fig.
11.2 learns routes from ASes 1, 3, and 4 over EBGP sessions, which are then distributed among its routers over IBGP sessions.

In order to disseminate all routes learned via EBGP to every router, routers inside an AS like AS 1 need to form a full mesh of IBGP sessions. A router receiving a route update over an EBGP session propagates it to all other routers in the mesh; however, route updates received over IBGP sessions are not forwarded to other routers in the mesh (see [9] for full details). An IBGP full mesh does not scale for ASes with a large number of routers. To improve scalability, large ASes use an IBGP hierarchy such as route reflection [10]. Route reflection allows the re-announcement of some routes learned over IBGP sessions. However, it gains this scalability at the cost of reducing the number of candidate routes learned at each router. For example, AS 2 in Fig. 11.2 employs a route reflector hierarchy.

Fig. 11.2 Example topology with multiple ASes (AS 1 to AS 4) and BGP sessions (legend: IBGP session, EBGP session, BGP router, BGP route reflector)

11.2.1.2 OSPF Overview

As noted in Section 11.2.1, OSPF is a link-state protocol, which is widely used to control routing within an Autonomous System (AS).¹ With link-state routing protocols, each router learns the entire view of the network topology represented as a weighted graph, uses it to compute a shortest-path tree with itself as the root, and applies the results to construct its forwarding table. This assures that packets are forwarded along the shortest paths, in terms of link weights, to their destinations [11]. We will refer to the computation of the shortest-path tree as an SPF computation, and the resultant tree as an SPF tree.

For scalability, an OSPF network may be divided into areas, forming a two-level hierarchy as shown in Fig. 11.3.
Fig. 11.3 An example OSPF topology, the view of the topology from router G, and the shortest-path tree calculated at G (panels: OSPF Network Topology; Topology View of Router G; Shortest Path Tree at G; legend: border router, AS border router). Although we show the OSPF topology as an undirected graph here for simplicity, the graph is directed in reality

Area 0, known as the backbone area, resides at the top level of the hierarchy and provides connectivity to the non-backbone areas (numbered 1, 2, etc.). OSPF assigns each link to one or more areas.² The routers that have links to multiple areas are called border routers. For example, routers C, D, and G are border routers in Fig. 11.3. Every router maintains a separate copy of the topology graph for each area to which it is connected. The router performs the SPF computation on each such topology graph and thereby learns how to reach nodes in all adjacent areas. A router does not learn the entire topology of remote areas. Instead, it learns the total weight of the shortest paths from one or more border routers to each prefix in remote areas. Thus, after computing the SPF tree for each area, the router learns which border router to use as an intermediate node for reaching each remote node.

¹ Even though an IGP like OSPF is used for routing within an AS, the boundary of an IGP domain and an AS do not have to coincide. An AS may consist of multiple IGP domains; conversely, a single IGP domain may span multiple ASes.
² The original OSPF specification [6] required each link to be assigned to exactly one area, but a recent extension [12] allows a single link to be assigned to multiple areas.
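The SPF computation referred to above is commonly carried out with Dijkstra's algorithm. The following is a minimal sketch for a single area: the function name and the small graph are hypothetical (the graph is not the topology of Fig. 11.3), and the sketch keeps only one first hop per destination, so it ignores both areas and equal-cost paths.

```python
import heapq


def spf(graph, root):
    """Compute a shortest-path tree rooted at `root`.

    graph:  {node: {neighbor: weight}}, a directed weighted graph.
    Returns {destination: (cost, first_hop)}, where first_hop is the
    neighbor of the root on the chosen shortest path (None for the root).
    """
    best = {}
    done = set()
    heap = [(0, root, None)]
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if node in done:
            continue  # a cheaper entry for this node was already processed
        done.add(node)
        best[node] = (cost, first_hop)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor in done:
                continue
            # When leaving the root, the first hop is the neighbor itself.
            hop = neighbor if node == root else first_hop
            heapq.heappush(heap, (cost + weight, neighbor, hop))
    return best


# Hypothetical single-area topology with administratively assigned weights.
example_graph = {
    "G": {"E": 1, "F": 2},
    "E": {"G": 1, "H": 1},
    "F": {"G": 2, "H": 3},
    "H": {"E": 1, "F": 3},
}
example_tree = spf(example_graph, "G")
```

The `first_hop` values are what a router would install as next hops in its forwarding table for the corresponding destinations.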
In addition, the reachability of external IP prefixes (associated with nodes outside the OSPF domain) can be injected into OSPF (e.g., X and Y in Fig. 11.3). Roughly, reachability to an external prefix is determined as if the prefix were a node linked to the router that injects the prefix into OSPF. The router that injects the prefix into OSPF is called an AS Border Router (ASBR). For example, router A is an ASBR in Fig. 11.3.

Routers running OSPF describe their local connectivity in Link State Advertisements (LSAs). These LSAs are flooded reliably to other routers in the network. The routers use LSAs to build a consistent view of the topology as described earlier. Flooding is made reliable by mandating that a router acknowledge the receipt of every LSA it receives from every neighbor. The flooding is hop-by-hop and hence does not itself depend on routing. The set of LSAs in a router’s memory is called the link-state database and conceptually forms the topology graph for the router.

Two routers are neighbor routers if they have interfaces to a common network (i.e., they have a direct path between them that does not go through any other router). Neighbor routers form an adjacency so that they can exchange LSAs with each other. OSPF allows a link between the neighbor routers to be used for forwarding only if these routers have the same view of the topology, i.e., the same link-state database for the area the link belongs to. This ensures that forwarding data packets over the link does not create loops. Thus, two neighbor routers make sure that their link-state databases are in sync by exchanging out-of-sync parts of their link-state databases when they establish an adjacency.

11.2.2 Instrumentation for Route Monitoring

As mentioned, routers exchange information about the topology with other routers in the network to build their forwarding tables.
As a result, understanding control plane dynamics requires collecting these messages and analyzing them. In this section, we focus on the collection aspect, leaving analysis for the next section. We first focus on how to instrument a single router, before turning our attention to the network-wide collection of messages.

11.2.2.1 Collecting Data from a Single Router

Even though the kind of information exchanged in routing messages varies from protocol to protocol, the flow of messages through individual routers can be modeled in the same manner, as depicted in Fig. 11.4. Every router receives messages from its neighbors from time to time. These messages are sent by neighbors in response to events occurring in the network or the expiration of timers; again, the exact reasons are protocol specific. As described in Section 11.2.1, each message describes some aspect of the network topology or reachability to a prefix, along with a set of attributes. Upon receiving the message, the router runs its route selection procedure, taking the newly received message into account. The procedure can change the best route to one or more prefixes in the FIB. A router also sends messages to neighbors as network topology and/or reachability to prefixes change; the trigger and contents of the messages depend on the protocol.

Given this, understanding the routing dynamics of a router requires instrumenting the router to collect (i) incoming messages over all its links, (ii) the changes induced to the FIB, and (iii) outgoing messages to all the neighbors. Some protocols such as BGP allow routers to apply import policies to incoming messages; applying these policies results in either the dropping of messages or modifications to their attributes. In such a scenario, it might be beneficial to collect incoming messages before and after the application of import policies. In a similar vein, BGP
In a similar vein, BGP Incoming Routing Message Outgoing Routing Message Router Incoming Routing Message Route Selection Process Incoming Routing Message Best Route Outgoing Routing Message Fig. 11.4 Message flow through a router FIB 11 Measurements of Control Plane Reliability and Performance 365 applies export policies to outgoing messages before they are sent to neighbors in which case messages can be collected before and after the application of export policies. Ideally, one would like the router to “copy” every incoming and outgoing message, as well as changes to the FIB to a management station. In reality, no standardized way for achieving this exists, and as a result no current router implementations support it. Despite this, one could get an approximate version of the required information in several different ways. One such way is to use splitters to read messages directly off a link. Unfortunately, this option is often impractical, expensive, and does not scale beyond a few routers and links. For this reason, this option is rarely used in practice. Another option is to log into the router through its CLI (Command Line Interface) or query SNMP MIBs [13] to extract the required information. Routers and (routing protocols running on them) often store a copy of the most recently received and transmitted messages in memory and allow them to be queried via CLI or SNMP MIBs. Thus, a network management station can periodically pull the information out of a router. Unfortunately, it is almost impossible to capture every incoming/outgoing message this way since even the most frequent polling supportable by routers fall far short of the highest frequency at which routing messages are exchanged. Even so, this option is used in practice at times since it provides a fairly inexpensive and practical way of getting some information about the routing state of a router. 
For example, the Peer Dragnet [14] tool uses information captured via the CLI to analyze inconsistent routes sent by EBGP peers of an AS.

A third option for collecting routing messages is to establish a routing session with a router, just as any other router would. This forces the router to send messages as it would to any other router.³ Obviously, this approach does not give information about incoming messages and changes to the FIB. Even for an outgoing message, the management station does not necessarily receive the message at the time the router sends it to other neighbors. Despite this, the approach provides valuable information about route dynamics. For distance-vector protocols, the outgoing message is usually the route selected by the router, and for link-state protocols, these messages describe updates to the topology view of the router. As a result, this approach is used quite extensively in practice. For example, RouteViews [15] and RIPE [16] collect BGP updates from several ASes and their routers, and the OSPF Monitor described in [1], and later in Section 11.2.4, takes the same approach for OSPF.

One serious practical issue with this approach is the potential injection of routing messages from the management station, which could disrupt the functioning of the control plane. For protocols that allow import policies (e.g., BGP), one could apply a policy to drop any incoming messages from the management station, but for other protocols (e.g., OSPF, IS-IS) the only way to protect against the injection of messages is to rely on the correctness of the software running on the management station.

³ A router running a distance-vector protocol sends its selected route for a given prefix to all its neighbors, except the next hop of the route when split horizon [8] is implemented. It is this selected route that we are interested in, and will receive, at the management station.

11.2.2.2 Collecting Network-Wide Data

In Section 11.2.2.1, we discussed ways in which routing messages can be collected from a single router. In this section, we expand our focus to the entire network. The key question we focus on is: how many routers does one need to collect routing messages from? The naive answer is: from all routers of the network. Indeed, if the aim is to learn about each and every message flowing between routers and the exact state of routers at every instant of time, then there is no choice but to collect messages from all routers. In reality, collecting messages from all routers is extremely challenging due to scale issues. Thus, in practice the answer depends on the kind of routing protocol and the analysis requirement. Let us go into some details.

The kind of routing protocol, whether link-state or distance-vector, plays a major role in deciding how many routers one needs to collect data from. In a link-state protocol, every router learns the entire view of the network topology, and so collecting messages from even a single router is enough to determine the overall state of the network topology. As we will see later in Sections 11.2.3 and 11.2.4, even this seemingly “limited” data enables a rich set of management applications. Some examples, which we discuss in more detail in subsequent sections, are: (i) the ability to track the network topology and its integrity (against design rules) in real time, (ii) the ability to detect events such as router/link up/downs and link weight changes as they unfold, (iii) the ability to determine how forwarding paths evolve in response to network events, and (iv) the ability to determine the workload imposed by the routing messages. We should emphasize here that for all these applications, the data provides the “view” from the router from which the data is being collected at that point in time. Other routers’ views can be somewhat different due to message propagation and processing delays.
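As an illustration of application (ii), the sketch below derives link events by comparing two successive topology snapshots as seen from a single monitored router. It is a minimal sketch under stated assumptions: the function name, the event labels, and the snapshot format (a dictionary of directed links and weights) are hypothetical and are not taken from any particular monitoring tool.

```python
def diff_topology(before, after):
    """Compare two topology snapshots and list the link-level changes.

    before, after: {(from_router, to_router): weight}.
    Returns a list of (event, link, old_weight, new_weight) tuples.
    """
    events = []
    # Links present earlier but missing now.
    for link in sorted(before.keys() - after.keys()):
        events.append(("link_down", link, before[link], None))
    # Links that newly appeared.
    for link in sorted(after.keys() - before.keys()):
        events.append(("link_up", link, None, after[link]))
    # Links present in both snapshots whose weight changed.
    for link in sorted(before.keys() & after.keys()):
        if before[link] != after[link]:
            events.append(("weight_change", link, before[link], after[link]))
    return events


# Hypothetical snapshots taken at two points in time.
snapshot_1 = {("A", "B"): 1, ("B", "A"): 1, ("B", "C"): 5}
snapshot_2 = {("A", "B"): 1, ("B", "A"): 3, ("C", "D"): 2}
example_events = diff_topology(snapshot_1, snapshot_2)
```

Any event list produced this way reflects only the monitored router's view, and is therefore subject to the propagation and processing delays just mentioned.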
The exact nature of these delays, how they are affected by other events in the network, and their implications for the analysis/application at hand are poorly understood. Our belief is that these delays are small (on the order of milliseconds) in most cases, and thus can be safely ignored for all practical purposes.

The story is different for distance-vector protocols, since every router gets only a partial view of the topology: the distance of prefixes from neighbors. As a result, one often needs views from multiple, if not all, routers. The exact set depends on the network configuration and on the kind of analysis being performed. For example, if one wants to learn external routes coming into an AS, it suffices to monitor BGP routes from the routers at the edge of the network. In fact, numerous studies on BGP dynamics, inter-AS topology, and relationships between ASes have been carried out based on BGP data collected from a fairly small set of ASes at RouteViews and RIPE. Although the completeness and representativeness of these studies are debatable, there is no doubt that such studies have tremendously increased awareness about BGP and its workings in the Internet. Furthermore, by combining routing data collected from a subset of routers with other network data, one can often determine the routing state of other routers, at least in steady state once routing has converged after a change. For example, a paper by Feamster and Rexford [17] describes a methodology to determine BGP routes at every router inside an AS based on routes learned at the edge of the network and the configuration of IBGP sessions.

11.2.3 Applications of Route Monitoring

In this section, we demonstrate the utility of the data collected by route monitors. We first describe the basic functionality enabled by the data. We then describe how this basic functionality can be used in various network management tasks.
Finally, we describe how the data has been used in advancing the understanding of the behavior of routing protocols in real life.

11.2.3.1 Information Provided by Route Monitors

Routing State and Dynamics

Route monitors capture routing messages, and so they naturally provide information about the current state of routing and how it evolves over time. This information is useful for a variety of network management tasks such as troubleshooting and forensics, capacity planning, trending, and traffic engineering, to name a few. For link-state protocols, the routing messages provide information about the topology (i.e., the set of routers, links, and link weights), whereas for distance-vector protocols, the information consists of route tables (i.e., the set of destinations and the next hop and distance from the router in question). Both pieces of information are useful. Furthermore, calculating routing tables from the topology is straightforward: one just needs to emulate the route calculation for every router in the topology. Going in the other direction, from routing tables to topology, is easy if information from all routers (running the distance-vector protocol) is available. In practice, though, information is often collected from a subset of routers, in which case deriving a complete topology view may not be possible.

End-to-End Paths

Knowing what path traffic takes in the network (from one router to another) is crucial for network management tasks such as fault localization and troubleshooting. For example, a link failure can affect the performance of all paths traversing the link. If the only way of detecting such failures is through end-to-end active probing, then knowing the paths allows operators to quickly localize the problem to the common link. Routing messages collected by route monitors allow one to determine these paths and how they evolve in response to routing events. Note that active probes (e.g., traceroute) also allow one to determine end-to-end paths in the network.
However, tracking path changes in response to network events using active probing suffers from major scalability problems. First of all, the number of router pairs in a large network can be in the range of hundreds of thousands to millions. This makes probing every path at a fine time scale prohibitively expensive. A second problem arises due to the use of multiple equal-cost paths (known as ECMP) between router pairs. ECMP arises when more than one path with the smallest weight exists between a router pair. Most intradomain protocols such as OSPF use all such paths by spreading data traffic across them.⁴ Since service providers often have redundant links in their networks, router pairs are more likely to have multiple paths than not. ECMP unfortunately exacerbates the scalability problem for active probing. Furthermore, engineering probes so that all equal-cost paths are covered is next to impossible, since how routers spread traffic across multiple paths is almost impossible to determine a priori.

11.2.3.2 Utility of Route Monitors in Network Management

The data provided by route monitors and the basic information gleaned from it aid several network management tasks such as troubleshooting and forensics, network auditing, and capacity planning. Below we provide a detailed account of how this is done for each of these three tasks.

Network Troubleshooting and Forensics

Route monitors provide a view into routing events as they unfold. This view can be in the form of topology, routing tables, or end-to-end paths, as mentioned in the previous sections; which form proves useful often depends on the specific troubleshooting task at hand. For example, if a customer complains about loss of reachability to certain parts of the Internet, looking at BGP routes and their history can provide clues about the causes of the problem.
Similarly, if performance issues are seen in some parts of the network, knowing what routing events are happening and how they are affecting paths can provide an explanation for the issues. Note that the route monitors’ utility stems not only from the current view of routing they provide (after all, operators can always determine the current view by logging into routers), but also from the historical data they provide, which allows operators to piece together the sequence of events leading to a problem. Routers do not store historical state, and so cannot provide such information. Going back to the example of a customer complaining about lost reachability, it is rarely enough to determine the current state of the route, especially if no route exists to the prefix. To effectively pinpoint the problem, the operator might also need to know the history of route announcements and withdrawals for the prefix, and that data can only be provided by route monitors. Figure 11.5 shows a snapshot of a tool that allows operators to view the sequence of BGP route updates captured by a monitor deployed in a tier-1 ISP.

Network Auditing and Protocol Conformance

Another use of route monitors is for auditing the integrity of the networks and conformance of routing protocols to their specifications. To audit the integrity of the network, one needs to devise rules against which the actual routing behavior can be checked. For example, network administrators often have conventions and rules about weights assigned to links.

Footnote 4: The exact algorithm for spreading traffic across ECMPs is implemented in the forwarding engine of routers.
BGP Route History for 0.0.0.0/0 and its Subnets

Count  Time (GMT)               Router    Event     Prefix           ASPath                         Local Pref  Origin  MED  Next-hop
1      Wed Apr 1 18:32:50 2009  10.0.0.1  WITHDRAW  192.168.0.0/24   ----                           --
2      Wed Apr 1 18:32:50 2009  10.0.0.1  ANNOUNCE  172.16.3.0/23    65001 65010 65145              90          IGP     0    10.0.1.3
3      Wed Apr 1 18:32:52 2009  10.0.0.1  ANNOUNCE  10.1.123.0/12    65001 65126                    80          IGP     25   10.0.1.8
4      Wed Apr 1 18:32:55 2009  10.0.0.1  ANNOUNCE  192.168.3.0/18   65001 65324 65002 65121 65084  80          IGP     0    10.0.2.1
5      Wed Apr 1 18:32:58 2009  10.0.0.1  ANNOUNCE  192.168.0.0/24   65001 65223 65145              65          IGP     100  10.0.1.1
6      Wed Apr 1 18:33:31 2009  10.0.0.1  ANNOUNCE  172.23.4.0/21    65001 65132                    90          IGP     10   10.0.2.1
7      Wed Apr 1 18:33:44 2009  10.0.0.1  ANNOUNCE  10.231.34.64/20  65001 65010 65192 65034        65          IGP     12   10.0.1.45
8      Wed Apr 1 18:33:47 2009  10.0.0.1  ANNOUNCE  192.168.0.0/24   65001 65023 65145              90          IGP     0    10.0.1.1
9      Wed Apr 1 18:34:08 2009  10.0.0.1  ANNOUNCE  172.22.73.0/25   65001 65420 65321 65005        110         IGP     10   10.0.1.109
10     Wed Apr 1 18:34:21 2009  10.0.0.1  ANNOUNCE  172.172.72.0/21  65001 65014 65105              70          IGP     0    10.0.2.12

Fig. 11.5 Screen-shot of a tool to view BGP route announcement/withdrawals

It then becomes necessary to monitor the network for potential deviations (that happen intentionally or due to mistakes) from these rules. Since (intradomain) routing messages provide current information about link weights, they provide a perfect source for checking whether the network’s actual state conforms to the design rules. Checking that the network state matches the design rules is especially crucial during maintenance windows when a network undergoes significant change. Similar to network auditing, routing messages can also be used to verify that protocol implementations conform to the specifications. At the very least, one could check whether the message format is correct as per the specifications.
Another check is to compare the rate and sequence of messages against the expected behavior. The “Refresh LSA bug” caught by the OSPF Monitor [1], where OSPF LSAs were being refreshed much faster than the recommended value [6], is an instance of this.

Capacity Planning

Capacity planning, where network administrators determine how to grow their network to accommodate growth, is another task where routing data is extremely useful. In particular, the data allows planners to see how routing traffic is growing over time, which can then be used to predict resources required in the future. As such, the growth of two parameters is very important: the number of routes in the routing table, and the rate at which routing messages are disseminated. The former has significant bearing on the memory required on the routers, whereas the latter affects the CPU (and sometimes bandwidth) requirements for routers. For service providers, accurately knowing how long the current CPU/memory configuration on routers can last, and when upgrades will be needed, is extremely important for operational and financial planning. The growth patterns revealed by routing data play a key role in forming these estimates. These estimates also allow service providers to devise optimization techniques to reduce resource consumption. For example, consider layer-3 MPLS VPN [18] service, which allows enterprise customers to interconnect their (geographically distributed) sites via secure, dedicated tunnels over a provider network. Over the last few years, this service has witnessed a widespread deployment. This has led to tremendous growth in the number of BGP routes a VPN service provider has to keep track of, resulting in heavy memory usage on its routers. Realizing this scalability problem, Kim et al.
[19] have proposed a solution that allows a service provider to trade off direct connectivity between sites (e.g., from any-to-any to a more restricted hub-and-spoke where traffic between two sites now has to go through one or more hub sites) against the number of routes that need to be stored. The data collected by the route monitors was crucial in this work: first, to realize that there is a problem, and next, to evaluate the efficacy of the scheme in realistic settings. In particular, Kim et al. show a 90% reduction in memory usage while limiting path stretch between sites to only a few hundred miles, and extra bandwidth usage to less than 10%.

11.2.3.3 Performance Assessment of Routing Protocols

Routing data is key to understanding how routing protocols behave and perform in real life. We have already talked about one aspect of this behavior above, namely conformance to the specifications. Here we discuss other aspects of performance, such as stability and convergence, which are key to quantifying the overall performance of the routing infrastructure. For example, numerous BGP studies detailing its behavior in the Internet have been enabled by the data collected by RouteViews [15], RIPE [16], and other BGP monitors. We briefly describe some studies to illustrate the point. Route updates collected by BGP monitors have led to several studies analyzing the stability (or lack thereof) of BGP routing in the Internet (see footnote 5). Govindan and Reddy [20] were the first to study the stability of BGP routes back in 1997, a couple of years after commercialization of the Internet started. Their study analyzed BGP route updates collected from a large ISP and a popular Internet exchange point (where several service providers are interconnected to exchange routes and traffic).
The study found clear evidence of deteriorating stability of BGP routes, which it attributed to the rapid growth of the Internet (a doubling of the number of ASes and prefixes in about 2 years). Subsequently, Labovitz et al. [21] observed a higher than expected number of BGP updates in the data collected at five US public Internet exchange points. The truly surprising aspect of their study was the finding that about 99% of these updates did not indicate real topological changes, and had no reason to be there. The authors found that some of these updates were due to bugs in the BGP software of a router vendor at that time. Fixing of these bugs by the vendor led to an order of magnitude reduction in the volume of BGP route updates [22].

Convergence, the time taken by a routing protocol to recalculate new paths after a network change, is another critical performance metric. Labovitz et al. [23] were the first to systematically study this metric for BGP in the Internet. They found that BGP often took tens of seconds to converge, an order of magnitude more than what was thought at that time. The problem, as they showed, stems from the inclusion of ASPath in BGP route announcements (i.e., the very thing that makes BGP a path-vector protocol). The purpose of including the ASPath is to prevent loops and the “count-to-infinity” problem (see footnote 6) that BGP’s distance-vector brethren (e.g., RIP) suffer from. However, this leads to “path exploration” as shown by Labovitz et al., where routers might cycle through multiple (often transient) routes with different ASPaths before settling on the final (stable) routes, thereby lengthening convergence times.

Footnote 5: The term stability refers to the stability of BGP routes, which roughly corresponds to how frequently they undergo changes.
Several ways of mitigating this problem have been proposed since then, essentially by including more information in BGP routes [24–28], but none of them have seen deployment to date. Mao et al. [29] tied together the stability and convergence aspects of BGP, which had until then been explored independently, by showing how route flap damping (RFD) [30], used for improving stability of BGP, could interact with path exploration to adversely impact convergence of BGP. RFD is a mechanism that limits propagation of unstable routes, thereby mitigating the adverse impact of persistent flapping of network elements and mis-configurations. This improves the overall stability of BGP, and RFD was a recommended practice [31] in early 2000. Unfortunately, as Mao et al. showed, RFD can also suppress relatively stable routes by treating route announcements received during path exploration as evidence of instability of a route. Specifically, the study showed that a route needs to be withdrawn only once and then re-announced for RFD to suppress it for up to an hour in certain circumstances. This work, coupled with a manifold increase in router CPU processing capability, resulted in a recommendation by RIPE [32] to disable RFD.

Routing data is not only valuable in analyzing the performance of each protocol separately, but also useful for understanding how protocols interact with one another, as Teixeira et al. [33] did by focusing on how OSPF distance changes in a tier-1 ISP affected BGP routing. Their study showed that despite the apparent separation between intra- and interdomain routing protocols, OSPF distance changes do affect BGP routes due to what is known as “hot-potato routing” (see footnote 7). The extent of the impact depended on several factors, including the location and timing of a distance change. Even more surprisingly, BGP route updates resulting from such changes could lag by as much as a minute in some cases, resulting in large delays in convergence.
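The hot-potato effect can be illustrated with a minimal Python sketch. It assumes a router that has already narrowed its choice to a set of equally good BGP routes and then picks the egress with the smallest IGP distance. The router names, the distances, and the `pick_egress` helper are hypothetical, and breaking ties by name is a simplification made only to keep the example deterministic.

```python
from typing import Dict, Iterable


def pick_egress(igp_distance: Dict[str, int], egresses: Iterable[str]) -> str:
    """Pick the egress router with the smallest IGP distance (hot-potato).

    Ties are broken by router name; this is an assumption of the sketch,
    not a description of BGP's remaining tie-breaking steps.
    """
    return min(egresses, key=lambda e: (igp_distance[e], e))


# Hypothetical OSPF distances from one ingress router to two egress routers
# that both offer equally good BGP routes for the same prefix.
distance = {"egress-east": 10, "egress-west": 15}
before = pick_egress(distance, ["egress-east", "egress-west"])

# An OSPF distance change makes egress-east farther away. No BGP attribute
# changed, yet the selected egress (and hence the BGP route) moves.
distance["egress-east"] = 20
after = pick_egress(distance, ["egress-east", "egress-west"])

print(before, "->", after)
```

The point of the sketch is only that a purely intradomain change can alter the outcome of interdomain route selection; it says nothing about the timing effects reported in the study.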
In closing, these and numerous other studies have not only enhanced our knowledge of how routing protocols behave in the Internet, but have also led to improvements in their performance (such as reduction in unwarranted BGP updates or disabling of RFD as mentioned earlier).

Footnote 6: With distance-vector protocols, two or more routers can get locked into a cyclical dependency where each router in the cycle uses the previous router as a next-hop for reaching a destination. The routers then increment their distance to the destination in a step-wise fashion until all of them reach infinity, which is termed “counting to infinity”. For more details, refer to [8].

Footnote 7: As explained in Section 11.2.1.1, hot-potato routing refers to BGP’s propensity to select the shortest way out of its local AS to a prefix when presented with multiple equally good routes (i.e., ways out of the AS). This allows an AS to hand off data packets as quickly as possible to its neighboring AS, much like a hot potato.

11.2.4 Case Study of a Route Monitor: The AT&T OSPF Monitor

Several route monitoring systems are available, both as academic/research endeavors and as commercial products. RouteViews [15] and RIPE [16] collect BGP route updates from several ISPs and backbones around the world. The data is used extensively for both troubleshooting and academic studies of the interdomain routing system. The corresponding web sites also list several tools for analysis of the data. On the intradomain side, a paper by Shaikh and Greenberg [1] describes an OSPF monitor. The paper provides a detailed description of the architecture and design of the system and follows it up with a performance evaluation and deployment experience. On the commercial side, Packet Design’s Route Explorer [34] and Packet Storm’s Route Analyzer [35] are route monitoring products.
The Route Explorer provides monitoring capability for several routing protocols including OSPF, IS-IS, EIGRP and BGP, whereas Route Analyzer provides similar functionality for OSPF. Out of the various route monitoring systems mentioned above, we focus on the OSPF Monitor described by Shaikh and Greenberg [1] as a case study in this section, since the paper provides extensive details about system architecture, design, functionality, and deployment. This is something not readily available for other route monitoring systems, especially the architecture and design aspects, which are key to understanding how control plane monitoring is realized in practice. From here on, we will refer to the OSPF Monitor described in [1] as the AT&T OSPF Monitor, and go into details of the system in terms of data collection and analysis aspects next.

The AT&T OSPF Monitor separates data (specifically, LSA) collection from data analysis. The main reasoning behind this is to keep data collection as passive and simple as possible due to the collector’s proximity to the network. The component used for LSA collection is called an LSA Reflector (LSAR). The data analysis, on the other hand, is divided into two components: LSA aGgregator (LSAG) and OSPFScan. The LSAG deals with LSA streams in real time, whereas OSPFScan provides capabilities for off-line analysis of the LSA archives. This three-component architecture is illustrated in Fig. 11.6. We briefly describe these three components now.

The LSAR supports three modes for capturing LSAs: the host mode, the full adjacency mode, and the partial adjacency mode. With the host mode, which only works on broadcast media such as an Ethernet LAN, the LSAR subscribes to a multicast group to receive LSAs being disseminated. This is a completely passive way of capturing LSAs, but it suffers from reliability issues and slow initialization of the link-state database, and it only works on broadcast media.
With the full adjacency mode, the LSAR establishes an OSPF adjacency with a router to receive LSAs. This allows the LSAR to leverage OSPF’s reliable flooding mechanism, thereby overcoming the disadvantages of the host mode. However, the main drawback of this approach is that instability of the LSAR or its link to the router can trigger SPF calculations in the entire network, potentially destabilizing the network. The reason for the SPF calculation stems from the fact that with a full adjacency, the router includes a link to the LSAR in its LSA sent to the network. The partial adjacency mode of collecting LSAs provides a way to circumvent this problem while retaining all the benefits of having an adjacency. In this mode, the LSAR establishes an adjacency with a router, but allows the adjacency to proceed only to a stage where LSAs can be received over it from the router, without the router including it in the LSA that it sends to the network. To keep the LSAR-router adjacency in the intermediate state, the LSAR describes its own Router-LSA (see footnote 8) to the router during the link-state database synchronization process but never actually sends it out to the router. As a result, the database is never synchronized, the adjacency stays in OSPF’s loading state [6], and is never fully established. Keeping the adjacency in the loading state protects the network from the instability of the LSAR or its link to the router.

Fig. 11.6 The architecture of the AT&T OSPF monitor described in [1]. (Figure content: LSAR 1 and LSAR 2, each with an LSA cache, collect LSAs from Areas 0, 1 and 2 of the OSPF domain and “reflect” them over TCP connections to the LSAG, which performs real-time monitoring; LSAs are also kept in an LSA archive used by OSPFScan for off-line analysis.)

Having described data collection by the LSAR, let us now turn our attention to the LSAG, which processes LSAs in real time. The LSAG populates a model of the OSPF network topology as it processes the LSAs.
The model captures elements such as OSPF areas, routers, subnets, interfaces, links, and the relationships between them (e.g., an area object consists of a set of routers that belong to the area, a router object in turn consists of a set of interfaces belonging to the router, etc.). Using the model as a base, the LSAG identifies changes (such as router up/down, link up/down, link cost changes, etc.) to the network topology and generates messages about them. Even though there are only about five basic network events, about 30 different types of messages are generated by the LSAG because of how broadcast media (such as Ethernet) are supported in OSPF, how a change in one area propagates to other areas, and how external information is redistributed into OSPF. In addition to identifying changes to the network topology, the LSAG also identifies elements that are unstable, and generates messages about such flapping elements. The LSAG also generates messages for non-conforming behavior, such as when refresh LSAs are observed too often. Apart from using the topology model to identify changes, the LSAG also uses it to produce snapshots of the topology periodically and when network changes occur. One use of these snapshots is for performing an audit of link weights as described in Section 11.2.3.2.

Footnote 8: A Router-LSA in OSPF is originated by every router to describe its outgoing links to adjacent routers along with their associated weights.

Finally, we turn our attention to OSPFScan, which supports off-line analysis of LSA archives. One thing worth mentioning about the AT&T OSPF Monitor is that the capabilities supported by OSPFScan for off-line analysis are mostly a superset of the ones supported in real time by the LSAG, the underlying idea being that anything that can be done in real time can also be performed off-line as a playback.
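The kind of change identification described above, whether done in real time or as an off-line playback, can be pictured as a comparison of two successive topology snapshots. The following Python sketch is not the AT&T implementation; the dictionary-based snapshot format, the `diff_topology` function, and the message wording are assumptions chosen for illustration, and only link up/down and weight changes are covered.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot format: directed link (from_router, to_router) -> weight.
Link = Tuple[str, str]
Topology = Dict[Link, int]


def diff_topology(old: Topology, new: Topology) -> List[str]:
    """Compare two snapshots and return one message per detected change."""
    messages = []
    # Links present before but missing now.
    for src, dst in sorted(old.keys() - new.keys()):
        messages.append(f"LINK DOWN {src} -> {dst}")
    # Links that appeared since the previous snapshot.
    for src, dst in sorted(new.keys() - old.keys()):
        messages.append(f"LINK UP {src} -> {dst} weight {new[(src, dst)]}")
    # Links present in both snapshots whose weight changed.
    for link in sorted(old.keys() & new.keys()):
        if old[link] != new[link]:
            src, dst = link
            messages.append(
                f"WEIGHT CHANGE {src} -> {dst} {old[link]} -> {new[link]}"
            )
    return messages


earlier = {("r1", "r2"): 10, ("r2", "r3"): 5}
later = {("r1", "r2"): 20, ("r3", "r4"): 1}
for message in diff_topology(earlier, later):
    print(message)
```

A real monitor works incrementally from individual LSAs rather than from whole snapshots, so this sketch only conveys the idea of deriving change messages from a topology model.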
In terms of processing of LSAs, OSPFScan follows a three-step process: parse the LSA, test the LSA against a user-specified query expression, and analyze the LSA according to user interest if it satisfies the query. The parsing step converts each LSA record into what is termed a canonical form, to which the query expression and subsequent analysis are applied. The use of a canonical form makes it easy to adapt OSPFScan to support LSA archive formats other than the native one used by the LSAR. The query language resembles C-style syntax; an example query expression is “areaid == '0.0.0.0'”. When a query is specified, OSPFScan matches every LSA record against the query, carrying out subsequent analysis for the matching records, while filtering out the non-matching ones. For example, the expression above would result in the analysis of only those LSAs that were collected from area 0.0.0.0. In terms of analysis, OSPFScan provides the following capabilities:

1. Modeling Topology Changes. Recall that OSPF represents the network topology as a graph. Therefore, OSPFScan allows modeling of OSPF dynamics as a sequence of changes to the underlying graph, where a change represents addition/deletion of vertices/edges to this graph. Furthermore, OSPFScan allows a user to analyze these changes by saving each change as a single topology change record. Each such record contains information about the topological element (vertex/edge) that changed along with the nature of the change. For example, a router is treated as a vertex, and the record contains the OSPF router-id to identify it. We should point out that the topology change records and LSAG message logs essentially describe the same thing, but the former is geared more for computer processing, whereas the latter is aimed at humans.

2. Emulation of OSPF Routing. OSPFScan allows a user to reconstruct a routing table of a given set of routers at any point of time based on the LSA archives.
For a sequence of topology changes, OSPFScan also allows the user to determine changes to these routing tables. Together, these allow calculation of end-to-end paths through the OSPF domain at a given time, and show how a path changed in response to network events over a period of time. The routing tables also facilitate analysis of OSPF’s impact on BGP through hot-potato routing [33].

3. Classification of LSA Traffic. OSPFScan allows various ways of “slicing-and-dicing” LSA archives. For example, it allows isolating LSAs indicating changes from the background refresh traffic. As another example, it also allows classification of LSAs (both change and refresh) into new and duplicate instances. This capability was used in a case study that analyzed one month of LSA traffic for an enterprise network [36].

11.2.5 MPLS

Recall that MPLS has been deployed widely in service provider networks over the last few years. It has played a key role in evolving the best-effort service model of IP networks by enabling traffic engineering (TE), fast reroute (FRR), and class of service (CoS) differentiation. In addition, MPLS has also allowed providers to offer value-added services such as VPN and VPLS. Unlike traditional unicast forwarding in IP networks, where routers match the destination IP address to the longest matching prefix, MPLS uses a label switching paradigm. Each (IP) packet is encapsulated in an MPLS header, which contains, among other things, the label that a router uses to determine the outgoing interface. The value of the label changes along every hop. Thus, while determining the outgoing interface, the router also determines the label with which it replaces the incoming label of the packet. This means that a router running MPLS has to maintain an LFIB (Label Forwarding Information Base), which contains a mapping between incoming labels and (outgoing interface, outgoing label) pairs.
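The label-swapping lookup can be made concrete with a short Python sketch. The LFIB is modeled here as a dictionary from incoming label to an (outgoing interface, outgoing label) pair, as in the description above; the router names, interface names, label values, and the `swap_label` and `trace_labels` helpers are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical LFIB layout: incoming label -> (outgoing interface, outgoing label).
Lfib = Dict[int, Tuple[str, int]]


def swap_label(lfib: Lfib, in_label: int) -> Tuple[str, int]:
    """Return the (outgoing interface, outgoing label) for an incoming label."""
    if in_label not in lfib:
        raise KeyError(f"no LFIB entry for label {in_label}")
    return lfib[in_label]


def trace_labels(path: List[str], lfibs: Dict[str, Lfib], label: int) -> List[int]:
    """Follow a packet through the routers in `path`, swapping the label at each hop."""
    seen = [label]
    for router in path:
        _interface, label = swap_label(lfibs[router], label)
        seen.append(label)
    return seen


# Made-up LFIBs for two transit routers along one path.
lfibs = {
    "R1": {17: ("if-to-R2", 22)},
    "R2": {22: ("if-to-R3", 35)},
}
print(trace_labels(["R1", "R2"], lfibs, 17))
```

The sketch covers only the swap operation at transit routers; encapsulation at the first router and header removal at the last router are left out.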
The sequence of routers an MPLS packet follows is known as an LSP (Label Switched Path). The first router along the LSP encapsulates a packet into an MPLS header, while the last router removes the MPLS header and forwards the resulting packet based on the underlying header. The LFIB used for MPLS switching is populated by its control plane. This is done by creating and distributing mappings between a label and an FEC, or Forwarding Equivalence Class. An FEC is defined as a set of packets that need to receive the same forwarding treatment inside an MPLS network. A router running MPLS first generates a unique label for each FEC it supports, and uses one of the control plane protocols to distribute the label-FEC mappings to other routers. The dissemination of this information allows each router to determine incoming and outgoing labels and the outgoing interface for each FEC, and thereby populate its LFIB. MPLS currently uses three routing protocols for distributing label-FEC mappings: LDP (Label Distribution Protocol) [37], RSVP-TE (Resource reSerVation Protocol with Traffic Engineering extensions) [38], and BGP [39, 40].

With LDP, a router exchanges label-FEC mappings with each of its neighbors using a persistent session. FECs, in the case of LDP, are generally IP prefixes. The labels learned from the neighbors allow the router to determine the mapping between incoming and outgoing labels. To determine the outgoing interface, LDP relies on the IGP (such as OSPF, IS-IS etc.) running in the underlying IP network. Thus, LSPs created by LDP follow the paths calculated by the IGP from the source router to the destination prefix. RSVP, on the other hand, is used for “explicitly” created and routed LSPs between two end points; the path need not follow the IGP path. The first router of the LSP initiates path setup by sending an RSVP message. The message propagates along the (to be established) LSP to the last router. Every intermediate router processes the message, creating an entry in its LFIB for the LSP.
RSVP also allows reservation of bandwidth along the LSP, making it ideal for TE and CoS routing. Finally, BGP is used for distributing prefix to label mappings (mostly) in the context of VPN services. With VPNs, different customers of a VPN service provider can use overlapping IP address blocks, and the BGP-distributed label to prefix mapping allows a provider’s egress edge router to determine which customer a given packet belongs to.

The flow of control messages through individual routers running LDP and RSVP-TE can be modeled in the same manner as traditional unicast routing protocols, as shown in Fig. 11.4. Thus, to monitor these protocols, one needs to collect incoming messages, outgoing messages, and changes occurring to the LFIB at every router. As a result, various techniques described in Section 11.2.2 for data collection apply to these protocols as well. One caveat applies to RSVP though, since it does not have a notion of a protocol session. Given this, it is not possible to collect information about RSVP messages through a session with an RSVP router. To collect information about RSVP dynamics thus requires some mechanism for routers to send messages to a monitoring session when tunnels are set up and torn down; SNMP traps defined in RFC 3812 [41] provide such a capability. Once routing data is collected from LDP or RSVP routers, it can be used in similar fashion as described in Section 11.2.3. For example, knowing the label binding messages sent by LDP routers allows an operator to know whether LSPs are established correctly. As another example, knowing the size of an LFIB (i.e., the number of LSPs traversing a router) and how it is evolving can be a key parameter in capacity planning.

11.3 Multicast

Throughout its relatively brief but rapidly evolving history, the Internet has primarily provided unicast service. A datagram is sent from a single sender to a single receiver, where each endpoint is identified by an IP address.
Many applications, however, involve communication between more than two entities, and often the same data needs to be delivered to multiple recipients. As examples, software updates may be distributed from a single server to multiple recipients, and streaming content, such as live video, may be transmitted to many receivers simultaneously. When the network layer only supports one-to-one communication, it is the responsibility of the end systems to replicate data and transmit multiple copies of the same packet. This solution is inefficient both with respect to processing overhead at the sender and bandwidth utilization within the network. Multicast [42], on the other hand, presents an efficient mechanism for network delivery of the same content to multiple destinations. In IP multicast, the sender transmits a single copy of a packet into the network. The network layer replicates the packet at appropriate routers in the network such that copies are delivered to all interested receivers and at most one copy of the packet traverses any network link.

Multicast is built around the notion of a multicast group, which is a 32-bit identifier taken from the Class D portion of the IP address space (224.0.0.0 – 239.255.255.255). In multicast packets, the group address is contained in the destination IP address field in the header. Receivers make known their interest in receiving packets sent to the group address via a group membership protocol such as IGMP [43], and multicast routing protocols enable multicast packets to be delivered to the interested receivers.

Multicast was first proposed in the 1980s and was deployed on an experimental basis in the early 1990s. This early deployment, known as the MBone [44] (for Multicast Backbone), consisted of areas of the Internet in which multicast was deployed.
These areas were connected together using IP-in-IP tunnels enabling multicast packets to traverse unicast-only portions of the Internet. The predominant applications used in the MBone, videoconferencing and video broadcast, primarily supported small group collaboration and broadcast of technical meetings and conferences. After rapid initial growth, the MBone peaked and then began to flounder. The technology, while initially promising, did not find its way into service provider networks. Several reasons have been given for this. These include the lack of a clear business model (i.e., who would be charged for packets that are replicated and delivered to many receivers), security concerns (i.e., the original any-to-any IP multicast service model allowed any host in the network to transmit packets to a multicast group), and concerns about manageability (i.e., lack of tools to monitor, troubleshoot and debug this new technology). More recently, deployment of network layer multicast service within IP networks has been increasing. This deployment has occurred primarily in enterprise networks, in which some of the earlier concerns with multicast (e.g., security, business model) are more easily mitigated. Common multicast applications in enterprise networks include software distribution and dissemination of financial trading information. The deployment of multicast within enterprise networks has also driven deployment in service provider networks in order to support the needs of Virtual Private Network (VPN) customers who use multicast in their networks. The Multicast VPN solution defined for the Internet [45, 46] requires customer multicast traffic to be encapsulated in a second instance of IP multicast for transport across the service provider backbone. Finally, the widespread deployment of IPTV, an application that benefits greatly from multicast service, is creating further growth of IP multicast. 
Forwarding multicast packets within a network makes use of a separate FIB from the unicast FIB and depends on a new set of routing protocols to create and maintain these FIB entries. As such, the set of tools used to monitor unicast routing cannot be used. In this section, we review the basics of multicast routing, identify issues that make monitoring and managing multicast more difficult than monitoring unicast routing, and finally describe tools and strategies for monitoring this technology.

11.3.1 Multicast Routing Protocols

A multicast FIB entry is indexed by a multicast group and a source specification, where the latter consists of an address and mask. Packets that match the group address and source specification will be routed according to the FIB entry. The FIB entry itself contains an incoming interface over which packets matching the source and group are expected to arrive, and a set of zero or more outgoing interfaces over which copies of the packets should be transmitted. The union of FIB entries pertaining to the same group and source(s) across all routers forms a tree, denoting the set of links over which a packet is forwarded to reach the set of interested receivers. It is the job of multicast routing protocols to establish the appropriate FIB entries in the routers and thereby form this multicast tree. Over the last two decades, several multicast routing protocols have been proposed and in some cases implemented and deployed. These include DVMRP [47], MOSPF [48], CBT [49], MSDP [50], and PIM [51, 52]. In this section, we give an overview of PIM and MSDP as they are the most widely deployed multicast routing protocols.

11.3.1.1 PIM

Protocol Independent Multicast, or PIM, is the dominant multicast routing protocol deployed in IP networks. PIM does not exchange reachability information in the sense that unicast routing protocols, such as OSPF and BGP, do.
Rather, it leverages information in the unicast FIB in order to construct multicast trees, and it is agnostic as to the source of the unicast routing information. There are multiple variants of PIM, including PIM Sparse Mode (PIM-SM), PIM Dense Mode (PIM-DM), Source Specific PIM (PIM-SSM), and Bidirectional PIM (PIM-Bidir). In this section, we present a brief overview of the basic operation of PIM-SM and PIM-SSM, as they are the most commonly deployed variants of PIM, in order to motivate the challenges in multicast monitoring and their solutions.

Before turning to PIM, we discuss one key aspect of multicast trees and the protocols that construct them. Multicast trees can be classified as shared trees or source trees. A shared tree is one that is used to forward packets from multiple sources. In this case, the multicast routing entry is denoted by a group and a set of sources (e.g., using an address and a mask). For a shared tree, the set of sources usually includes all sources, and the routing table entry is denoted by the (*, G) pair, where G denotes the multicast group address and ‘*’ denotes a wildcard (indicating all sources). A source tree, on the other hand, is used to forward packets from a single source, and is denoted as (S, G), where G again refers to the multicast group and S refers to a single source.

PIM-SM uses both shared and source trees, depending on both the variant and how it is configured. In both cases, multicast trees are constructed by sending Join messages from the leaves of the tree (the routers that are directly connected to hosts that want to receive packets transmitted to the multicast group) toward the root of the tree. In the case of a source tree, the root is a source that transmits data to the multicast group and the Join message is referred to as an (S, G) Join. For a shared tree, the root is a special node referred to as a Rendezvous Point, or RP, and the Join message is referred to as a (*, G) Join.
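The multicast FIB entry described above can be illustrated with a minimal Python sketch. It is not taken from any router implementation: the class name MulticastFibEntry, the function lookup, the interface names, and the addresses are all hypothetical, and the rule that the entry with the longest source mask wins (so that an (S, G) entry is preferred over a (*, G) entry) is an assumption made for illustration.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class MulticastFibEntry:
    """One multicast FIB entry, indexed by a group and a source specification."""
    group: ipaddress.IPv4Address
    source: ipaddress.IPv4Network      # 0.0.0.0/0 models (*, G); a /32 models (S, G)
    incoming_interface: str            # where matching packets are expected to arrive
    outgoing_interfaces: set = field(default_factory=set)  # zero or more copies go out here

    def matches(self, src, grp):
        return grp == self.group and src in self.source


def lookup(fib, src, grp):
    """Return the matching entry with the longest source mask, or None.

    Preferring the most specific source is an assumption of this sketch.
    """
    src = ipaddress.ip_address(src)
    grp = ipaddress.ip_address(grp)
    candidates = [entry for entry in fib if entry.matches(src, grp)]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry.source.prefixlen)


# Hypothetical state at one router: a shared-tree entry and a source-tree entry.
example_fib = [
    MulticastFibEntry(ipaddress.ip_address("239.1.1.1"),
                      ipaddress.ip_network("0.0.0.0/0"), "toward_rp", {"eth1", "eth2"}),
    MulticastFibEntry(ipaddress.ip_address("239.1.1.1"),
                      ipaddress.ip_network("10.0.0.5/32"), "toward_source", {"eth1"}),
]
```

With this state, a packet from 10.0.0.5 to 239.1.1.1 matches both entries and is handled by the source-tree entry, while a packet from any other source to the same group falls back to the shared-tree entry.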
The RP for a group, which can be configured statically at each router or determined by a dynamic protocol such as BSR [53], must be agreed upon by all routers in a PIM domain (footnote 9).

PIM Join messages are transmitted hop-by-hop toward the root of the tree. At each router, the next hop is determined using the unicast FIB. Specifically, the Join message is transmitted to the next hop on the best route (as determined by the unicast routing table) toward the root (i.e., source or RP). As such, the Join message follows the shortest path from the receiver to the root of the tree. At each hop, the router keeps track of the neighbor router from which the Join message was received and the neighbor router to which it was forwarded. The latter is denoted as the upstream neighbor in the multicast FIB and the former is denoted as a downstream neighbor. When subsequent multicast data packets are received from the upstream neighbor, they will be forwarded to the downstream neighbor. When a router receives a subsequent (*, G) or (S, G) Join message for a FIB entry that already exists, the router from which the Join message is received is added to the list of downstream neighbors. However, the Join message need not be forwarded upstream as a Join message will have already been forwarded toward the root of the tree. In this way, Join messages from multiple downstream neighbors are merged, and when data packets are received, they will be replicated with a copy forwarded to each downstream neighbor. PIM uses soft state, so that Join messages are retransmitted hop-by-hop periodically, and state that is not refreshed is deleted when an appropriate timer expires.

In PIM-SM, all communication begins on a shared tree. Last hop routers transmit Join messages toward the RP, forming a shared tree with the RP at the root and last hop routers as leaves. This process is depicted in Steps 1–3 in Fig.
11.7a, in which router R2 transmits a Join message toward the RP. This message is then forwarded by R1 to the RP. R3 subsequently transmits a Join message toward the RP, which is received by R1 and not forwarded further. When a source wants to transmit packets to the group, it encapsulates these packets in PIM Register messages transmitted using unicast to the RP. The RP decapsulates these packets and transmits them on the shared tree, so that they are delivered to all routers that joined the tree. The RP then sends an (S, G) Join message toward the source, building a source tree from the source to the RP. Steps 4–5 in Fig. 11.7a depict a Register message from a source S to the RP followed by a subsequent Join from the RP to S. Once this source tree is established, packets are sent using native multicast from the source to the RP and from the RP to the leaf routers, as shown in Fig. 11.7b. When multiple sources have data to send to the multicast group, each will send PIM Register messages to the RP, which in turn will send PIM Join messages to the sources, thereby creating multiple (S, G) trees.

While all communication in PIM-SM begins on shared trees, the protocol allows for the use of source trees. Specifically, when a last hop router receives packets from a source, it has the option to switch to a source tree for that source. It does this by sending an (S, G) Join toward the source, joining the source tree (just as the RP did in the description above). Once it has received packets on the source tree, it then sends a Prune message for the source on the shared tree, indicating that it no longer wants to receive packets from that source on the shared tree. The Join messages needed to switch from the shared to source tree are shown in Fig. 11.7c, and the resulting flow of data packets is shown in Fig. 11.7d. Source trees allow for more efficient paths from the source to receiver(s) at the expense of higher protocol and state overhead.

Fig. 11.7 Example PIM Operation: (a) Sequence of control messages for shared tree creation. (b) Resulting flow of data packets. (c) Sequence of control messages for switchover to source tree. (d) Resulting flow of data packets.

Footnote 9: A PIM domain is defined as a contiguous set of routers all configured to operate within a common boundary. All routers in the domain must map a group address to the same RP.

PIM-SSM (Source Specific Multicast) does away with the need for RPs, thereby simplifying multicast tree construction and maintenance while using a subset of the PIM-SM protocol mechanisms. PIM-SSM only uses source trees. The source of traffic is known to hosts interested in joining the multicast group (e.g., via an out-of-band mechanism). These receivers signal their interest in the group via IGMP, and their directly connected routers send (S, G) Join messages toward the source, thereby building a source tree rooted at the sender.

11.3.1.2 MSDP

In PIM-SM, there is a single RP that acts as the root of a shared tree for a given multicast group. (Note that a single router may act as an RP for many groups.) This provides a mechanism for rendezvous and subsequent communication between sources and receivers without either having any pre-existing knowledge of the other. However, there are two situations in which multiple RPs for a group may be desirable. The first involves multicast communication between domains. Specifically, two or more service providers may wish to enable multicast communication between them.
If there is only a single RP for a group, failure of the RP in one provider’s network may impact service in the other’s network, even if all of the sources and receivers are located in the latter’s network. Service providers may not be willing to depend on a critical resource (e.g., the RP) located in another service provider’s network for what may be purely intradomain communication. Further, even without RP outages, performance may be suboptimal if purely intradomain communication is required to follow interdomain paths. That is, a multicast tree between senders and receivers in one ISP’s network may traverse another ISP’s network. Thus, each provider may wish to have an RP located within its own domain.

The second situation in which multiple RPs may be useful involves communication within a single PIM domain. Specifically, redundant RPs provide a measure of robustness, and this can be implemented using IP anycast [54]. Each RP is configured with the same IP address, and the RP mapping mechanism identifies this anycast address as the RP address. Each router wishing to join a shared tree sends a (*, G) Join message toward the RP address. By virtue of anycast routing, which uses unicast routing to route the message to the “closest” RP, the router will join a shared tree rooted at a nearby RP. As a result, multiple disjoint shared trees will be formed within the domain. Similarly, when a source transmits a PIM Register message to an anycast RP address, this message will only reach the nearest RP. As such, sources and receivers will only communicate with those subsets of routers closest to the same RP, and the required multicast connectivity will not be achieved.

The problem of enabling multicast communication when multiple RPs exist for the same group (whether within or between domains) is solved by the Multicast Source Discovery Protocol (MSDP) [50].
MSDP enables multicast communication between different PIM-SM domains (e.g., operated by different service providers) as well as within a PIM-SM domain using multiple anycast RPs. MSDP-speaking RPs form peering relationships with each other to inform each other of active sources. Upon learning about an active source for a group for which there are interested receivers, an RP joins the source tree of that source so that it can receive packets from the source and transmit them within its own domain or on its own shared tree.

We give a brief overview of MSDP. Each RP forms an MSDP peering relationship with one or more other RPs using a TCP connection. These MSDP connections form a virtual topology among the various RPs. RPs share information about sources as follows. For each source from which it receives a PIM Register message, an RP transmits an MSDP Source-Active (SA) message to its MSDP peers. This SA message, which identifies a source and the group to which it is sending, is flooded across the MSDP virtual topology so that it is received by all other MSDP-speaking RP routers. Upon receipt of an SA message, an RP (in addition to flooding the message to its other MSDP peers) determines whether there are interested receivers in its domain. Specifically, if the RP has previously received a Join message for the shared tree indicated by the group in the SA message, the RP will transmit a PIM Join message to the source. In this way, the RP joins the source tree rooted at the source in question, receives multicast packets from it, and multicasts these packets on the shared tree rooted at the RP. Thus, multicast communication is enabled when multiple RPs exist for the same group, whether within or across domains.
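The SA flooding and join decision described above can be sketched as a small Python simulation. This is a simplified illustration rather than the protocol itself: the function flood_source_active, the RP names, and the data layout are hypothetical, and duplicate SA messages are suppressed here with a global set of already-reached RPs, whereas real MSDP limits flooding with peer-RPF forwarding rules.

```python
from collections import deque


def flood_source_active(peerings, origin_rp, group, groups_with_receivers):
    """Flood an SA message for `group` from `origin_rp` across the MSDP virtual topology.

    peerings: dict mapping each RP to the RPs it has an MSDP peering with.
    groups_with_receivers: dict mapping each RP to the set of groups for which
        it has previously received a (*, G) Join.
    Returns (reached, joining): the RPs that received the SA message, and the
    subset that would send a PIM Join toward the source.
    """
    reached = {origin_rp}
    joining = set()
    queue = deque([origin_rp])
    while queue:
        rp = queue.popleft()
        for peer in peerings.get(rp, ()):
            if peer in reached:
                continue  # simplification: real MSDP uses peer-RPF checks instead
            reached.add(peer)
            # An RP joins the source tree only if it has interested receivers.
            if group in groups_with_receivers.get(peer, set()):
                joining.add(peer)
            queue.append(peer)  # the peer floods the SA to its own peers in turn
    return reached, joining


# Hypothetical topology: three RPs peered in a chain, receivers only behind RP-C.
example_peerings = {"RP-A": ["RP-B"], "RP-B": ["RP-A", "RP-C"], "RP-C": ["RP-B"]}
example_receivers = {"RP-B": set(), "RP-C": {"239.1.1.1"}}
example_reached, example_joining = flood_source_active(
    example_peerings, "RP-A", "239.1.1.1", example_receivers)
```

In this example the SA message originated by RP-A reaches RP-B and RP-C, but only RP-C, which has receivers on its shared tree for the group, would join the source tree.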
11.3.2 Challenges in Monitoring Multicast

In the early days of multicast, one of the often cited reasons for its slow deployment was the difficulty of monitoring and managing the service; commercial routers implemented the protocols, but network operators had little way of knowing how the service was working when they deployed it. While this was by no means the only impediment to its deployment, it did present a significant challenge to network operators. To some degree, the problems cited early on with multicast management remain true today. Before turning to specific tools and techniques used to monitor and manage multicast in order to provide a stable and reliable network service, we identify some of the generic challenges for managing the technology, while deferring some of the protocol-specific issues to Section 11.3.3.

While multicast is by no means a new technology, it is not yet mature. Because it has only been deployed in a significant way in the last few years, the body of experience and knowledge that exists for unicast service does not yet exist for multicast. This manifests itself in two related ways. First, engineers and operators in many cases are unfamiliar with the technology and face a steep learning curve in troubleshooting and monitoring multicast. Second, due to rather limited deployment experience, the kinds of tools that have evolved in the unicast world and that have been essential in route monitoring do not yet exist for multicast.

Putting aside the relative newness of the technology, there are aspects of multicast that make it inherently more challenging to manage than unicast. Most obviously, the nature of what constitutes a route followed by a packet has changed. In unicast routing, the path taken by a packet from source to destination consists of a sequence of routers (usually no more than 20 or 30).
This path is easily identifiable (e.g., using tools such as traceroute) and can be presented to a network operator in a way that is easy to understand. In multicast routing, a packet no longer traverses an ordered sequence of routers, but rather follows a tree of routers from a source to multiple destinations. The tree can be very large, consisting of hundreds of routers. Identifying the tree becomes more challenging, and perhaps more significantly, presenting it to a network operator in a useful manner is difficult.

In addition to being large, multicast trees are not static. That is, they are driven by application behavior, and the set of senders and receivers may change during the lifetime of an application. As such, branches may be added to and pruned from multicast trees over time, and these changes can happen on short timescales. Thus, understanding the state of multicast is made more difficult by the dynamic nature of the multicast trees.

Finally, the multicast routing state used to forward a packet from a source to a set of receivers can be data driven. That is, the state may not be instantiated until an application starts sending traffic or expresses interest in receiving it. In contrast, with unicast routing, the FIB entries used to route a packet from a source to a destination are independent of the existence of application traffic. Thus, routing table entries can be queried (either directly with SNMP or indirectly with a utility like traceroute) in order to discover or verify a route. With multicast the analogous routing state may not exist until applications are started. Using PIM-SM as an example, the shared tree from the RP to receivers is formed as a result of receivers joining a multicast group. Similarly, the state needed to route a packet from a source to the RP is not created until the source sends a PIM Register message to the RP and the RP subsequently sends an (S, G) Join to the source.
Given this, answering a question such as “how would packets be routed from the source to receivers” (as one might want to do in advance of a streaming broadcast) is problematic.

Given the inherent difficulties in monitoring and managing multicast routing, there exists a need for new tools, methods and capabilities to assist in this process. We now turn to the challenges of monitoring specific protocols and the ways in which these challenges can be met.

11.3.3 Multicast Route Monitoring

Multicast routing involves complex protocols. In order to understand, troubleshoot and debug the state of multicast in a network, operators need to be able to answer several key questions. These include:

What is the FIB entry for a particular source and group at a router?
What is the multicast tree for an (S, G) or (*, G) pair?
What route will a packet take from a source to one or more receivers? (As will be explained below, this question differs subtly from the preceding one.)
Are multicast trees stable or dynamic?
Are packets transmitted by source S to group G being received where they should be?
Is multicast routing properly configured in the network?

Answering these and other questions about multicast requires a new set of management tools and capabilities. In this section, we describe how monitoring tools can be used to answer these questions. Before doing so, we briefly review the network management capabilities developed during earlier experiences with multicast.

11.3.3.1 Early MBone Tools

The MBone grew from a few dozen subnets in 1992 to over 3,000 four years later [55]. At its inception, it connected a small community of collaborating researchers, but it expanded to include a much broader set of users and applications. It was initially maintained by a few people who knew administrators at all the participating sites. Therefore, monitoring and debugging of the infrastructure developed in an ad hoc manner.
As the MBone grew, it faced an increasing set of management challenges. To meet these challenges, the researchers who managed and used it developed a broad set of tools. While we avoid an exhaustive review of these tools, we give a few representative examples here, which encompass both application and network layer tools.

mrinfo discovered the multicast topology by querying multicast routers for their neighbors.
mtrace was used to discover the path packets traversed to reach a receiver from a source.
rtpmon was an application-level monitoring tool that provided end-to-end performance measurements for a multicast group.
The DVMRP Route Monitor [56] monitored routing exchanges between multicast routers in the MBone.

The tools mentioned here, and the many others that were developed (see [57, 58] for a more complete list), provided great value to the early MBone users. They addressed real problems and allowed operators and users to understand, monitor, and troubleshoot the experimental network. While in many cases they provided insight and lessons that inform current efforts, they are unable to form the basis for a current multicast management strategy. Many of the tools use RTCP and monitor application performance. Others were built specifically to monitor mrouted, the public domain multicast routing daemon used in the early MBone. Neither kind of tool supports the needs of large ISPs to monitor their multicast infrastructure. Instead, today’s multicast management and monitoring strategy must be built around tools that work in the context of the multi-vendor commercial routers managed by the ISPs.

11.3.3.2 Information Sources

While the earlier experience with the MBone provided some valuable insight as to the challenges with managing multicast, it also showed the need for tools that worked with commercial routers and that could be deployed by service providers at scale.
Such tools must work in the confines of the capabilities available on the routers that support multicast. We discuss the options for gathering information about multicast in this section, in order to motivate the kinds of solutions described later. As described in Section 11.2.3, route monitors provide enormous capability with respect to monitoring unicast routing. BGP monitors peer with BGP speaking routers to collect routing updates and thereby monitor network reachability and s